Why NLP Development Matters
Since August of 2015, the presidential hopefuls from both major political parties have been joining in the primary debates to jockey for the two coveted positions in the general presidential election later this fall. The debates have been spirited and full of rich information about each of the candidates. Back in February, the folks at About Techblog did an analysis of the candidates’ language use in the debates up to that time (see Analyzing the Language of the Presidential Debates). We thought it would be interesting to parse through all of the data, including the primary debates that have occurred since About Techblog did their analysis, using our own NLP development techniques, along with the following tools:
- NLTK (Natural Language Toolkit)
The American Presidency Project has transcribed presidential debates going back to 1960. All transcripts of the primary debates in this election cycle can be found here: Presidential Debates. In order to extract structure from this unstructured data, we performed the following NLP tasks:
- Sentence detection
- Lemmatization (finding the base, or dictionary, form of each word)
- Dependency parsing
Once the data is processed, we then begin slicing and dicing using reporting tools. First, below we show a simple word cloud in which each word is given a size depending on its frequency in the data relative to all of the other words in the data. We used a stop list to filter out words of high frequency that do not add rich semantic information.
A word cloud like this is useful for getting a very broad view of the general ideas and themes in the data. For example, one would expect presidential candidates to talk frequently about the “people” who will be voting. From here we need to dig a little deeper to figure out what about “people” is being discussed.
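The counts behind a word cloud of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop list, the regular-expression tokenizer, and the sample text are hypothetical stand-ins, not the ones used for the debate transcripts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "we", "for", "are"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Return each non-stop word's share of all non-stop words in the text."""
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

sample = "The people want jobs. We work for the people and the people vote."
freqs = word_frequencies(sample)

# The largest shares would be drawn as the largest words in the cloud.
for word, share in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {share:.2f}")
```

Using relative rather than raw frequency keeps word sizes comparable across data sets of different lengths.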
Next, we show a cloud of the linguistic relationships produced by the dependency parse phase. Each relationship consists of two words. The word on the left is the more important word in the relationship (in linguistics, this is called the “head”), and the word on the right depends on it. Each relationship also has a type; the types shown below include Subject, Direct Object, and Modifier.
When dealing with customer-generated text data, it is vital to be able to view the data at more granular levels. This relationship cloud provides more context because it takes into account words that are linguistically or semantically related in various ways. For example, the fact that “Clinton > Hillary” (or “Hillary Clinton”) shows up so frequently indicates that most of the candidates from both parties feel that she is the most likely Democratic candidate, and so they expend most of their energy comparing their own positions to hers. She would not be expected to use her own name in her comments with any notable frequency. Also in this cloud, we see that “people” from the previous word cloud actually refers to “people > American” (or “American people”). This is also expected, since presidential candidates need to be seen as focusing on the individual rights of voters, and so they call them out by using the phrase “American people”. Given that other relationships are also high frequency (such as “Street > Wall”, “care > health”, “Security > Social”, and “East > Middle”), we can reasonably infer that the candidates and the moderators who formulate the questions and topics of discussion believe these are issues that the American people are concerned about.
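To make the “head > dependent” notation concrete, here is a small Python sketch of how such relationships could be counted once a dependency parser has produced them. The triples below are made-up examples in a hypothetical format (head, dependent, type); a real parser's output would look different and would need to be mapped into this shape.

```python
from collections import Counter

# Hypothetical parser output: (head, dependent, relation_type) triples.
relations = [
    ("Clinton", "Hillary", "Modifier"),
    ("people", "American", "Modifier"),
    ("Clinton", "Hillary", "Modifier"),
    ("cut", "taxes", "Direct Object"),
    ("people", "American", "Modifier"),
    ("Clinton", "Hillary", "Modifier"),
]

def relation_counts(triples, keep_types=None):
    """Count 'head > dependent' pairs, optionally restricted to some types."""
    counts = Counter()
    for head, dependent, rel_type in triples:
        if keep_types is None or rel_type in keep_types:
            counts[f"{head} > {dependent}"] += 1
    return counts

counts = relation_counts(relations)
print(counts.most_common(2))

# Restricting to one relationship type, e.g. only direct objects.
print(relation_counts(relations, keep_types={"Direct Object"}))
```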
Next, we start looking at each candidate’s individual linguistic profile. How much each candidate speaks, the content of what they say, and the relative complexity of their speech may indicate a number of their attributes (such as education level or the voter demographic that they are trying to attract). First, we show below a report that indicates the raw amount of speech produced by each of the candidates (measured as the number of words spoken in all of the debates combined).
Obviously, the total number of words spoken is directly correlated with the amount of time spent talking. Clinton and Sanders top the list because they have been the primary Democratic candidates and the only two debate participants since February, a period that covers more than half of the Democratic debates. On the Republican side, the more recent debates have had at least four participants (Cruz, Rubio, Kasich, and Trump). So, to get a true picture of who talks the most, we need to normalize the data: we divide the number of words each candidate spoke by the average number of words spoken in the debates in which that candidate participated (in other words, the number of words one can expect a candidate to say). The report below shows this normalization.
This is an interesting report because Clinton spoke the most words overall, yet Trump has by far the highest normalized word count. In other words, relative to the amount that one can expect a candidate to speak in a debate, Trump, Rubio, and Clinton got much more than their fair share of time. Those at the top of the y-axis in this report are the front-runners in their respective primaries/caucuses (except for Rubio, who ended his campaign in late March). The fact that Trump had the highest overall normalized word count may be an indication of his dominance in the overall candidate field. On the other hand, comparing this report to the relationship cloud above, when candidates mention other candidates they seem to be mentioning Clinton most of the time. This is an indication that the other candidates see Clinton as dominant and, therefore, feel they need to clearly demonstrate how their own platforms differ from hers.
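One way to read the normalization described above is sketched below in Python. It rests on an assumption about the exact formula: for each debate, the expected word count per candidate is taken to be that debate's total words divided by its number of participants, and a candidate's normalized count is their actual total divided by the sum of those expectations. The debate names and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical word counts per debate: {debate_id: {candidate: words_spoken}}.
debates = {
    "debate_1": {"A": 3000, "B": 2000, "C": 1000},
    "debate_2": {"A": 2500, "B": 2500},
}

def normalized_word_counts(debates):
    """Ratio of words actually spoken to words expected, per candidate."""
    actual = defaultdict(int)
    expected = defaultdict(float)
    for counts in debates.values():
        # Expected share (assumption): an equal split among participants.
        per_candidate_average = sum(counts.values()) / len(counts)
        for candidate, words in counts.items():
            actual[candidate] += words
            expected[candidate] += per_candidate_average
    return {c: actual[c] / expected[c] for c in actual}

ratios = normalized_word_counts(debates)
for candidate, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{candidate}: {ratio:.2f}")
```

A ratio above 1 means a candidate spoke more than an equal split would predict, regardless of how many debates or opponents they had.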
Context in NLP Development
Word clouds, linguistic relationships, and word counts are all very interesting, and each contains important information about the debates, parties, and candidates. To understand more about each of the candidates, however, we need to start looking at how they use language. For the following reports we removed stop words and then lemmatized each word (found the root form of the word). The following report shows the vocabulary size (the number of distinct words used) of each candidate. To keep the report readable, we only show the candidates who were still in the running before the May 3 primaries/caucuses.
With a straight count of distinct words there may be a bias due to the amount of time each candidate was allowed to talk relative to the number of candidates in the debate and the style of each debate. To help overcome this bias, we can plot a chart with the vocabulary size on the y-axis and the count of words on the x-axis and then fit a line to the data points using linear regression. In the chart below, you can see that Cruz’s vocabulary is well above expectation, Clinton’s vocabulary is above expectation, Sanders’ vocabulary is at or slightly below expectation, and Trump and Kasich have vocabularies that are below expectation.
This report further accentuates the fact that Clinton and Trump do most of the talking in their respective debates, yet Trump’s vocabulary falls below what his word count would predict.
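The idea of comparing each vocabulary against a fitted line can be sketched with ordinary least squares in plain Python. The candidate labels and numbers here are invented for illustration; only the technique (fit a line of vocabulary size against word count, then look at the sign of each residual) follows the description above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: candidate -> (total words spoken, distinct lemmas used).
candidates = {
    "A": (10000, 1500),
    "B": (20000, 2300),
    "C": (30000, 3500),
    "D": (40000, 4300),
}

xs = [words for words, _ in candidates.values()]
ys = [vocab for _, vocab in candidates.values()]
slope, intercept = fit_line(xs, ys)

# Residual > 0: vocabulary above expectation; residual < 0: below expectation.
residuals = {
    name: vocab - (slope * words + intercept)
    for name, (words, vocab) in candidates.items()
}
for name, r in residuals.items():
    print(f"{name}: {r:+.0f}")
```

A straight line is a simplification: vocabulary typically grows more slowly as word count increases, so a curve may fit larger data sets better.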
ML in NLP Development
Lastly, in order to understand which topics the candidates prefer to talk about, we ran k-means clustering in R. The clustering method groups the sentences into distinct categories based on the frequency counts of the linguistic relationships (as in the relationship cloud above). The categories are not predefined, so this method allows us to automatically detect major topics in the debates without any prior knowledge. The content of the resulting clusters can then be used to analyze and discover previously unknown trends in the debates.
Our clustering method grouped the sentences into 15 clusters based on the frequency counts of a set of common and recurring linguistic relationships. We chose 15 clusters to keep the processing time manageable. The optimal number of clusters is expected to be much larger than 15, but this set of clusters is a very good start to demonstrate what is possible.
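For readers unfamiliar with k-means, the following is a compact pure-Python sketch of the standard algorithm (Lloyd's iteration). It is not the R code used for this analysis; the toy vectors, the choice of two clusters, and the fixed random seed are all assumptions made to keep the example small and repeatable.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (cluster label per vector, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no vector changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids

# Toy data: each row is one sentence, each column the count of one
# (hypothetical) linguistic relationship in that sentence.
vectors = [[5, 0], [6, 0], [5, 1], [0, 5], [0, 6], [1, 5]]
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

In practice the number of clusters is a trade-off: more clusters can separate finer topics but increase run time, which is the trade-off behind the choice of 15 here.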
Clustering unstructured text data has unique challenges because the data consists of many non-numeric values such as lemmas, part-of-speech tags, and linguistic relationships. We converted these values to frequency counts so that they can be used as numeric variables in statistical models. However, this results in a very large data set with highly correlated variables. We used several methods to reduce the number of variables, and we are currently exploring additional ways to reduce it further.
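The conversion from non-numeric values to frequency counts can be illustrated as follows. This Python sketch assumes each sentence has already been reduced to a list of relationship strings; the sentences and the `count_matrix` helper are hypothetical.

```python
from collections import Counter

# Hypothetical parsed sentences: each is a list of "head > dependent" strings.
sentences = [
    ["growth > economic", "people > American"],
    ["jobs > create", "jobs > create", "people > American"],
    ["growth > economic"],
]

def count_matrix(sentences):
    """Build one row per sentence and one column per distinct relationship."""
    vocabulary = sorted({rel for sent in sentences for rel in sent})
    rows = []
    for sent in sentences:
        counts = Counter(sent)
        rows.append([counts[rel] for rel in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_matrix(sentences)
print(vocabulary)
for row in rows:
    print(row)
```

Every distinct relationship becomes a column, which is why the number of variables grows so quickly on real data and why most cells are zero.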
We removed any linguistic relationship that contained one of the top 100 most common words in the English language. Because the goal of clustering is to identify topics in the data, we did not want to include variables for common/frequent linguistic relationships. When we removed these relationships the clusters became more meaningful.
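A filter of this kind might look like the Python sketch below. The ten-word set is a hypothetical stand-in for a real list of the 100 most common English words, and the "head > dependent" string format is an assumption.

```python
# Hypothetical stand-in for a "top 100 most common English words" list.
COMMON_WORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i"}

def drop_common(relations, common_words=COMMON_WORDS):
    """Keep only relationships in which neither word is a common word."""
    kept = []
    for rel in relations:
        head, dependent = (part.strip().lower() for part in rel.split(">", 1))
        if head not in common_words and dependent not in common_words:
            kept.append(rel)
    return kept

relations = ["care > health", "have > I", "Street > Wall", "be > to"]
print(drop_common(relations))
```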
Interpreting the NLP & ML Results
Next, we inspected the statistical distribution of the linguistic relationship frequencies. This histogram shows the frequency of linguistic relationships on the x-axis and a probability density measure on the y-axis. The shape of this distribution shows that the majority of the linguistic relationships in the data occur fewer than five times. Assuming that a relationship that appears in the data only a few times will not be a useful variable, we reduced the set of linguistic relationships again by keeping only the relationships whose frequencies fell to the right of the mean.
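Keeping only the relationships to the right of the mean amounts to a one-line filter. The Python sketch below uses invented counts to show the effect: because the distribution is heavily skewed toward rare relationships, a few frequent ones pull the mean up and most relationships are dropped.

```python
from collections import Counter

def keep_above_mean(counts):
    """Keep relationships whose frequency is strictly above the mean."""
    mean = sum(counts.values()) / len(counts)
    return {rel: n for rel, n in counts.items() if n > mean}

# Hypothetical relationship frequencies.
counts = Counter({
    "people > American": 40,
    "Street > Wall": 12,
    "care > health": 9,
    "plan > new": 2,
    "issue > big": 2,
    "idea > bold": 1,
})

kept = keep_above_mean(counts)
print(kept)
```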
The following plot shows only two of the clusters, plotted against principal components that summarize the variables. The topic of cluster 15 is economic growth, and the topic of cluster 12 is jobs. This illustrates how clustering can be used to automatically detect information without any previous knowledge of the content of your data. Sentiment analysis of these clusters would reveal information about the candidates’ views of the economy and job creation.
The following table provides a quick summary of the content of the clusters. The clusters are listed in no particular order. There is noise in the data, but some clusters have useful topics. These clusters can be used as starting points for a categorization model into which the sentences can be classified.
|Cluster|Topic|
|---|---|
|1|The President of the United States|
|3|The middle class|
|6|The American people|
|7|Money and college tuition|
|8|Very hard (difficult)|
|9|A trillion dollars|
|11|A billion dollars|
The topic of cluster 14 is beating Hillary Clinton. The sentences in this cluster consist of candidates stressing that they can beat Hillary Clinton in the polls and in the presidential election. This cluster indicates again that the other candidates in the race view Clinton as their main competition for the presidency.
NLP Development Next Steps
So where is the business value in doing text analytics via NLP development on presidential debate data? These analyses and reports show the possibilities that exist when dealing with any kind of data. High-level reports such as word clouds help you see overall themes and give you a starting point for your analysis. These higher-level reports lead to questions about more granular aspects of the data. As we get more granular (such as in the word count, normalized word count, and vocabulary size reports, and later with clustering), certain aspects of the data begin to emerge that we may not have considered otherwise. For example, the fact that Trump talks more than expected but uses a smaller vocabulary than expected leads us to believe that he may be speaking to a certain demographic with a less sophisticated vocabulary (since we know that he is highly educated, his true vocabulary is likely much larger than what he uses during the debates). We could keep going to deeper and deeper levels of NLP development, but these kinds of results show what is possible. Your opportunity is to understand what your customers are saying about you and to you. This will help you improve their experiences with you.
In the near future, we look forward to enriching this analysis with sentiment analysis and to improving our clustering efforts. Both will help shed more light on the content of this data.