A data-driven look into editorials about the 2020 Democratic Primary

Jozsef Meszaros
6 min read · Apr 14, 2020


I was curious about how editorials may have shaped the discussion about the 2020 Democratic Primary. These often strongly worded pieces are read by millions of people daily, and their writers enjoy large followings on social media.

As a disclaimer: I am a neuroscientist who works with large datasets (including text) and once spent three years in law school where I read a lot. Those are my credentials. But because I am sensitive to concerns about “armchair” analysis, as recently described by my friend and colleague Tim Requarth, I took a number of prophylactic measures to avoid making some common missteps.

  1. I will not be using any complex models or making future predictions. The only math required to complete this analysis is counting.
  2. I will not be making any grand claims about the historical arc of political journalism or the evolution of language. Instead, I will present very basic graphs that are intended to be illustrative rather than comprehensive.
  3. I will not be making any recommendations. While I do give my own interpretation of the numbers here, I will not use the data to challenge conventional wisdom or urge any particular response.

Here is the one key assumption that I am making:

> An editorial featured on a paper’s front page is widely read, and an editorial with a candidate’s name in its title is about that candidate.

Here’s how I could be wrong: nobody reads editorials from the front page; they only read editorials sent to them by family members. Or, editorials that have a candidate’s name in the title have no real connection to that candidate. The first seems unlikely, and the second can easily be verified by reading a few editorials. So, assuming the quoted assumption above is accurate, here are a few questions you might ask about the data.

How many op-eds about each candidate were featured on the front pages of either the NYT or the Washington Post?

Number of editorials (op-eds) linked on the front pages of the Washington Post and the NYT, every day from January 1 to March 15. The bar graphs show the total number of articles, separated by publication type. The scatter plots show the total number of op-eds about each candidate, with some key primaries labeled in white (Iowa, Nevada, South Carolina, and Super Tuesday). Best-fit lines are shown to indicate trends.
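For the curious, the counting itself really is just counting. Here is a minimal sketch of how tallies like these could be assembled; the records and field names are hypothetical stand-ins for whatever a front-page scrape actually produces, not my actual pipeline.

```python
from collections import Counter
from datetime import date

# Hypothetical records scraped from Internet Archive snapshots of the two
# front pages; the field names are illustrative, not the actual schema.
articles = [
    {"date": date(2020, 2, 3), "title": "Why Sanders Surged in Iowa", "kind": "op-ed"},
    {"date": date(2020, 3, 1), "title": "Biden's South Carolina Firewall", "kind": "news"},
]

CANDIDATES = ["Sanders", "Biden", "Warren", "Bloomberg"]

def count_mentions(articles, kind):
    """Count front-page items of a given kind whose title names each candidate."""
    counts = Counter()
    for article in articles:
        if article["kind"] != kind:
            continue
        for name in CANDIDATES:
            if name.lower() in article["title"].lower():
                counts[name] += 1
    return counts

print(count_mentions(articles, "op-ed"))  # Counter({'Sanders': 1})
```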

In comparison, how many news stories about each candidate were featured on the front pages of either the NYT or the Washington Post?

This data is already interesting. The NYT featured almost as many editorials about Sanders (46) as about the other three candidates combined (49). In total, Sanders had 32 more editorials about him than Biden did. While these editorials came at a steady clip for Sanders, most of Biden’s editorials were written in the week of the South Carolina primary and in the run-up to Super Tuesday. In terms of news stories, Biden and Sanders were about equally likely to be headlined. Warren, on the other hand, was the topic of few editorials and news stories. Bloomberg only garnered attention in the weeks leading up to Super Tuesday, but still had more written about him than Warren did.

What types of words appeared in the op-eds about each candidate?

From here, I limit the analysis to comparisons between Biden and Sanders (the two front-runners). To me, the only other interesting comparison would be between Sanders and Warren, but she simply did not have enough written about her for the result to be meaningful.

A word about wordclouds: they are often made from the most common words that appear in a text. But for texts with highly overlapping content, like articles about political candidates, the most common words won’t be informative, because they appear in nearly all of the articles and cannot be used to distinguish two candidates. Imagine “democrat”, “primary”, “victory”, and so on.

My solution: produce a wordcloud for words frequently used in articles about one candidate but infrequently used in articles about the other candidate (for lexical analysis aficionados this uses the logic of tf-idf).
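Here is a minimal sketch of that logic. The scoring function and the toy corpora are my own illustrations, not the exact code behind the figures; it simply rewards words that are frequent in one candidate’s editorials and rare in the other’s, then hands the scores to the wordcloud package.

```python
from collections import Counter
from wordcloud import WordCloud

def distinctive_frequencies(texts_a, texts_b, min_count=2):
    """Score each word by how much more often it appears in corpus A than in
    corpus B: tf-idf-like 'frequent here, rare elsewhere' logic."""
    counts_a = Counter(w for t in texts_a for w in t.lower().split())
    counts_b = Counter(w for t in texts_b for w in t.lower().split())
    return {
        w: n / (counts_b.get(w, 0) + 1)  # +1 smoothing avoids division by zero
        for w, n in counts_a.items()
        if n >= min_count
    }

# Toy corpora standing in for the scraped editorial texts (illustrative only).
sanders_texts = ["socialist movement rally", "socialist revolution rally"]
biden_texts = ["moderate coalition rally", "moderate delegates rally"]

scores = distinctive_frequencies(sanders_texts, biden_texts)
# "socialist" scores 2.0 (absent from the other corpus); "rally" only 0.67.
cloud = WordCloud(width=400, height=200).generate_from_frequencies(scores)
# cloud.to_file("sanders_vs_biden.png")
```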

A comparison of the words appearing in editorials about each candidate. The size of the word corresponds to how many times it appeared in the combined text of all the editorials. The color of the text corresponds to whether it is “positive” or “negative” as determined by two standard dictionaries used for textual analysis.

Some comments: for the analysis, words are first converted into their singular form, as in the case of “Jews”. None of the editorials use the word “Jew”, but the words “Jewish” or “Jews” appear in about 1 out of 10 articles about Sanders. These articles were either describing Sanders as Jewish or describing concerns that Jews had about his policies. For comparison, Bloomberg, who is also Jewish, had that fact mentioned only once in the 36 editorials about him.
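The post doesn’t spell out the singularization step, but nltk’s WordNet lemmatizer is a standard way to do it; a sketch:

```python
from nltk.stem import WordNetLemmatizer
# One-time setup: import nltk; nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# Plural nouns collapse to their singular form before counting, so
# "Jews" and "Jew" end up tallied as the same token.
print(lemmatizer.lemmatize("jews"))    # -> "jew"
print(lemmatizer.lemmatize("voters"))  # -> "voter"
```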

Additionally, a shortcoming of the standard methods for assigning “sentiment” to words is that they don’t consider political context. For example, the adjectives “Soviet”, “Russian”, and “socialist” could read as quite negative depending on the reader, yet the dictionaries leave them unlabeled. On the other hand, “bipartisan” and “incumbent” are not labeled as positive by the standard dictionaries either.
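To make the lookup concrete, here is how such a dictionary check might go using nltk’s opinion_lexicon (the Hu & Liu word lists), one commonly used sentiment dictionary; I’m not claiming it is one of the two dictionaries behind the figure.

```python
from nltk.corpus import opinion_lexicon
# One-time setup: import nltk; nltk.download("opinion_lexicon")

positive = set(opinion_lexicon.positive())
negative = set(opinion_lexicon.negative())

# Politically loaded words tend to fall outside both lists entirely.
for word in ["soviet", "socialist", "bipartisan", "incumbent"]:
    label = "positive" if word in positive else "negative" if word in negative else "unlabeled"
    print(f"{word}: {label}")
```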

But what about the topics?

As a quick refresher of high school grammar, a proper noun is the name of a specific person, place, or thing, written with capital letters. Generally, proper nouns form the topics of discussion. Here you might expect to find terms like “Medicare” or “Iraq”. But the words that appear are actually quite different. From the proper nouns appearing in Sanders’ cloud, you might think he was running in a different era: many of the proper nouns in articles about him were words associated with communism (“Cuba”, “Castro”, “Soviet”, “Sandinista”, “Nicaragua”) or with historical figures and events (“Ronald”, “Nixon”, “Roosevelt”, “Vietnam”).
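If you want to extract proper nouns yourself, part-of-speech tagging with nltk is one standard approach (the NNP and NNPS tags mark proper nouns); whether the analysis here used exactly this method is my assumption.

```python
import nltk
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def proper_nouns(text):
    """Return the tokens tagged NNP or NNPS (proper nouns) by nltk's POS tagger."""
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens) if tag in ("NNP", "NNPS")]

print(proper_nouns("Sanders praised literacy programs in Cuba under Castro."))
# Typically -> ['Sanders', 'Cuba', 'Castro']
```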

How many articles used these words and when were they published?

Based on the results from the wordclouds above, I used an algorithm to find the smallest set of words that are present in the largest fraction of one candidate’s articles and the smallest fraction of the other candidate’s articles. This can be used to see when each candidate acquired his “brand”, so to speak.
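The post doesn’t specify the algorithm, but a greedy search is the natural fit for this kind of set-cover objective. Here is a sketch under that assumption: at each step it picks the word that newly covers the most of one candidate’s articles while appearing in the fewest of the other’s.

```python
def greedy_word_list(own_articles, other_articles, vocabulary, target=0.7):
    """Greedily pick words until at least `target` of own_articles contain one,
    preferring words that cover many of one candidate's articles while
    appearing in few of the other's. A sketch, not the post's exact code."""
    own = [set(text.lower().split()) for text in own_articles]
    other = [set(text.lower().split()) for text in other_articles]
    chosen, covered = [], set()   # covered = indices of own articles hit so far
    candidates = set(vocabulary)
    while candidates and len(covered) < target * len(own):
        def gain(word):
            newly_covered = sum(1 for i, s in enumerate(own)
                                if i not in covered and word in s)
            spillover = sum(1 for s in other if word in s)
            return newly_covered - spillover  # reward coverage, penalize overlap
        best = max(candidates, key=gain)
        if gain(best) <= 0:                   # no word helps more than it hurts
            break
        chosen.append(best)
        covered |= {i for i, s in enumerate(own) if best in s}
        candidates.discard(best)
    return chosen
```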

The timing and quantity of op-eds using selected words. The two heat maps display the number of op-eds that contain the word to the left of each row. A dark line in the heat map for Sanders marks the timing of two key primary contests (Iowa and Super Tuesday). All op-eds to the left of the line were published on or before the day of the primary.
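Heat maps like these are straightforward to build with pandas and seaborn, two of the packages credited below; the tiny table here is illustrative, not the real data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative long-form table: one row per (word, date) pair, where n is the
# number of op-eds published that day containing the word. Not the real data.
df = pd.DataFrame({
    "word": ["socialist", "socialist", "Cuba", "Cuba"],
    "date": ["2020-01-20", "2020-02-03", "2020-02-03", "2020-02-25"],
    "n":    [2, 3, 1, 2],
})

grid = df.pivot_table(index="word", columns="date", values="n", fill_value=0)
ax = sns.heatmap(grid, cmap="viridis", cbar_kws={"label": "op-eds"})
ax.axvline(x=1, color="black")  # mark a primary date; the position is illustrative
plt.show()
```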

For Sanders, at least one of the words in his list could be found in 7 out of 10 articles about him, but in only 3 out of 10 articles about Biden. For Biden, at least one of the words in his list could be found in 7 out of 10 articles about him, but in only 4 out of 10 articles about Sanders. That Biden needed a longer list of words suggests that a consistent description of him was slower to emerge across the editorials.

One apparent trend is that the number of op-eds with Sanders’ words ramps up in the weeks leading to Iowa and then prior to Super Tuesday. The set of words used to describe Biden does not take shape until after Super Tuesday.

That’s all for now. I may follow up with another article if people have suggestions for additional analyses. Until then, it’s back to analyzing molecules and cells for me!

Acknowledgments

All of the software I used for this analysis is free and open-source, so some acknowledgments are in order. First, none of this would have been possible (or at least it would have been much harder) without the Internet Archive. I will be donating to them, and you should too if you can. Second, I am grateful to the many creators and maintainers of the open-source Python packages I used, including seaborn, nltk, pandas, numpy, and wordcloud. They all work perfectly all of the time :)
