Find Top Ten Words in an Earnings Call Transcript

Word count analysis on stock Earnings Conference Call Transcripts is where most people start. What are the top 10 words used in the call for a specific company? Do they talk about “Customer”, “Shortage” or “Hurricane”? Tracking such trends year over year can reveal the overall company direction: whether the company ever had a tough year, what caused it, and how management dealt with it.

Let’s see an example by analyzing Amazon’s earnings calls for the year 2018.

Import Libraries and Set Path

We’ll start with importing some Python libraries.

import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Next, let’s point our interpreter at the folder containing the index.txt file.

path_to_index = 'C:/PATH_TO_INDEX.TXT/'
os.chdir(path_to_index)

Read Data Into Pandas DataFrame

Read the file into a pandas DataFrame and check out the data types.
Note how we define the separator, header row and index column.
Call time, stock symbol and name are strings, while reporting year and quarter are integers.

df = pd.read_csv('index.txt', sep='|', header=0, index_col=0)
print(df.dtypes)
call_time            object
stock_symbol         object
stock_name           object
reporting_year        int64
reporting_quarter     int64
dtype: object

Now, each index value corresponds to a file in the /transcripts/ folder containing the actual conference call text.
So let’s get the indexes corresponding to symbol ‘AMZN’ and reporting year 2018.
Note how stock_symbol is a string and reporting_year is an integer in our filter.
There are four quarters in a year, so we receive 4 indexes as a result.

amzn_2018 = df[(df['stock_symbol']=='AMZN') & (df['reporting_year']==2018)].index.to_list()
print(amzn_2018)
[122363, 126342, 130428, 134348]
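
As an aside, the same filter extends naturally to the year-over-year comparison mentioned in the introduction. Here is a minimal sketch, assuming the index file actually covers the earlier years (the year range below is illustrative):

# map each reporting year to its transcript file indexes (years are illustrative)
indexes_by_year = {
    year: df[(df['stock_symbol'] == 'AMZN') & (df['reporting_year'] == year)].index.to_list()
    for year in [2016, 2017, 2018]
}

Each year’s list of indexes can then be fed through the same counting pipeline below, one year at a time.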

Prepare Text Into Corpus

Next, we read the actual transcript texts into a corpus that scikit-learn can consume.

texts = []
for file_index in amzn_2018:
    # each transcript lives in the transcripts/ folder as <index>.txt
    with open('transcripts/' + str(file_index) + '.txt', 'r') as f:
        texts.append(f.read())

The Count Logic

And finally, we use CountVectorizer to convert all the text into a matrix of words and their counts.
Because transcripts are raw recordings, the text still contains words like “the”, “and”, etc., which appear frequently in any document. We remove them by passing the parameter stop_words='english'.

Next we will put the matrix into a DataFrame and sort it by count in descending order.
Let’s see what we come up with.

cv = CountVectorizer(stop_words='english')
cv_fit = cv.fit_transform(texts)
word_list = cv.get_feature_names_out()  # get_feature_names() was deprecated and removed in newer scikit-learn
count_list = cv_fit.toarray().sum(axis=0)  # total count of each word across all four calls

df_words = (pd.DataFrame({'word': word_list, 'count': count_list})
            .sort_values(by=['count'], ascending=False)
            .reset_index(drop=True))
df_words.head(10)
        word  count
0       year    140
1   question    110
2      prime    100
3     growth     98
4    quarter     97
5       just     95
6      think     95
7         ve     86
8    revenue     86
9  customers     72

Lo and behold! In its 2018 quarterly earnings calls, Amazon spoke about “Prime”, “Growth”, “Revenue” and “Customers”.
We also see some less relevant words like “year”, “question” and “quarter”. The default stop-words in scikit-learn are quite generic, but given the nature of our data, it is safe to say these words appear frequently across all transcripts while carrying much less meaning than the others. This is a perfect example of why data cleansing is a task-specific exercise, so feel free to augment the stop-word list to fit your needs, as sketched below.
“ve” is clearly a left-over from “I’ve” and “We’ve”, and can also be safely added to a custom stop-word list.
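
Here is a minimal sketch of such a cleanup. The extra words below are our own picks based on the output above, not a definitive list, so adjust them to your task. We extend scikit-learn’s built-in ENGLISH_STOP_WORDS set and pass the combined list to CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# call-specific filler words we chose to drop (illustrative, not exhaustive)
extra_stop_words = ['year', 'quarter', 'question', 'just', 'think', 've']
custom_stop_words = list(ENGLISH_STOP_WORDS) + extra_stop_words

cv = CountVectorizer(stop_words=custom_stop_words)
cv_fit = cv.fit_transform(texts)

Re-running the counting and sorting steps above with this vectorizer should push the domain-relevant terms to the top of the list.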

Sources

View the complete code on GitHub.

Earnings Conference Call Transcripts are available here: download now.