Red flags for data-driven electoral trends

“We can know only that we know nothing. And that is the highest degree of human wisdom.” ― Leo Tolstoy

The original title for this article was "A data-driven analysis of elections", and with meticulously collected data from Statistics Iceland I began plotting and fitting trends for it. Sadly, scientific integrity brought those plans down in flames. In good faith, I cannot model trends over time for events that are almost completely causally uncorrelated at the level of detail amassed for this short article. My colleagues at the University have a strong pedigree in statistical studies and the social sciences, which cannot and should not be undermined by a handful of plots produced simply because "big data" makes modeling easy.

“There are three types of lies – lies, damned lies, and statistics.” 

― Benjamin Disraeli

Though part of the cultural hive mind since the early 20th century (often attributed to Mark Twain ~1905), the quip predates the statistics of the modern era, which are computer-aided and bowed by the burden of actionable predictions. The growth of statistical data and record-keeping has hounded human history, mostly as a book-keeping mechanism and sometimes for financial risk accounting, but it was not until manifold improvements in the mathematical foundations of statistics in the latter half of the 20th century that statistical results could be relied upon even slightly.

Fast forward to the Information Age: through the advances of applied mathematicians, computational physicists, chemists and engineers, we find ourselves in the era of modern computer-aided statistics, accurate enough to predict natural calamities like earthquakes, volcanic eruptions, glacial activity and more. Modeling large data trends with appropriate caution has influenced policy and saved countless lives. Integrity is key, as is the often-overlooked meaning of "appropriate caution."

A line chart

A smoothed line chart

With an influx of easily available data and a media frenzy focused on the "magic" of machine learning, it might seem that since Statistics Iceland provides historical records over time, going back to when there was naught to politics but a royal decree, it should be fun and easy to plot trends, and with enough plots and statistical models surely one will hit some percentage of the true results?

The answer is yes. It is possible. In the manner by which a thousand monkeys at a thousand typewriters can write a best-selling novel. In other words, taking some numbers, manipulating them with incredibly flexible models, and then spitting out a number that is "close to" another number has no inherent veracity.
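The monkeys-at-typewriters point can be sketched in a few lines. This is a minimal illustration, not an analysis: the "models" below are pure random guesses, the vote share of 37.5 percent is invented, and `count_lucky_models` is a hypothetical helper name.

```python
import random


def count_lucky_models(actual, n_models, tolerance, seed=0):
    """Count random guesses that land within `tolerance` of `actual`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each "model" knows nothing: it picks a share between 0 and 100 percent.
    guesses = [rng.uniform(0.0, 100.0) for _ in range(n_models)]
    return sum(abs(guess - actual) <= tolerance for guess in guesses)


# Hypothetical vote share in percent, not a real election result.
actual_share = 37.5
hits = count_lucky_models(actual_share, n_models=1000, tolerance=1.0)
print(f"{hits} of 1000 know-nothing 'models' landed within 1 point")
```

With a thousand guesses and a two-point window, a couple of dozen hits are expected by chance alone. Reporting only those hits would look like predictive skill while carrying none.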

“The purpose of computation is insight, not numbers.”

 ― Richard Hamming

Unfortunately, it is impossible to gain insight into social processes without enough a priori understanding of the dependent parameters in a model.

To perform an analysis of election data without cultural information is never a feasible endeavor. At the same time, being completely divorced from any social, moral, or economic interpretation of data can be thought of as an appeal to an impartial observer. It is in this second sense that this article was conceived. The trends collected are from public datasets, and neglect even the most cursory of contextual trappings.

  1. What relevance do election results from before the advent of the mass media have to an election in the next few weeks? None.

  2. What relation do the party lines have with their distant past? Some historical relevance?

  3. Given the changes in population homogeneity, literacy rates, and the proximity of church and state, can data concerning party X being elected in year Y be a true measure of party X's future success? Nope.

This is not to say there is no way to predict outcomes. If the entire population can be polled regularly, and the social pulse of the country is gauged accurately, then the act of voting itself becomes perfunctory, and an approval metric can be accurately predicted. The biggest issue endemic to such solutions is that the polling assumptions fail; either the samples are too small or they are not representative of the larger population. It is then almost always a problem of either comparing apples to oranges, or believing there to be only apples and oranges when there exists an unsampled pineapple majority.
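The pineapple problem can be put into numbers with a short sketch. Every figure below is invented for illustration, and the assumed flaw is that the poll can only reach apple and orange voters.

```python
from collections import Counter

# Hypothetical electorate: all counts here are made up for illustration.
population = Counter({"pineapple": 600, "apple": 250, "orange": 150})


def shares(counts):
    """Return each group's share of the total, in percent."""
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}


# Assumed flaw: the sampling frame contains only apple and orange voters.
reachable = {"apple", "orange"}
frame = Counter({g: n for g, n in population.items() if g in reachable})

true_shares = shares(population)
polled_shares = shares(frame)

print("true apple share:  ", true_shares["apple"])
print("polled apple share:", polled_shares["apple"])
```

Even a complete census of the reachable frame reports apples at 62.5 percent when their true share is 25 percent, and the pineapple majority never appears at all. A larger sample from the same frame would not repair this.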

There are ways to gain insights from the data at Statistics Iceland. I am a huge advocate for the digital part of the digital humanities, and certainly, with expert elicitation and enough time, historical trends can be linked to numbers and shown to manifest in electoral trends. Consider the case of the economist who tracks the financial crisis, along with the mood of the country (via mass media) and predicts that the incumbent government might fall in the next election.

A foreigner here, both culturally and intellectually (in that politics are not my field, and I do not speak of them freely, plus I don't speak Icelandic), I could not have analysed the Statistics Iceland data in tandem with local news any better than I could swim in Fagradalsfjall unscathed.

A basic pie chart

A better pie chart

Some other non-exhaustive indications of a statistical analysis gone wrong are:

Evasive pie charts

These are computed in terms of proportions, which is not a natural way to think of data.

Scaling concerns

Less common in election data, but still, always understand each axis and whether the relationship (linear or otherwise) is logical.

Very good fits

Almost always, a perfect fit for data is a lie or someone selling a lie.

Smoothed curves

Smoothing is very commonly done because reading jittery plots is hard, but it can drive incorrect conclusions, especially when discrete data is smoothed in an unphysical manner.

No logical correlation analysis

Always a sign that the causal inference was inconclusive.
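The warning about very good fits can be made concrete with a small sketch. It assumes six invented vote shares for six consecutive elections and fits them with a Lagrange interpolating polynomial, chosen here only because it is flexible enough to pass through every point. `lagrange_fit` is a hypothetical helper, not a library function.

```python
def lagrange_fit(xs, ys):
    """Return the polynomial passing exactly through every (x, y) pair."""

    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            # Basis weight: 1 at xi, 0 at every other training point.
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total

    return predict


# Hypothetical vote shares (percent) for six consecutive elections.
elections = [0, 1, 2, 3, 4, 5]
share = [30.0, 32.0, 29.0, 33.0, 31.0, 30.0]

model = lagrange_fit(elections, share)
worst_residual = max(abs(model(x) - y) for x, y in zip(elections, share))
next_share = model(6)  # "forecast" for the seventh election

print("worst residual on known elections:", worst_residual)
print("forecast for the next election:", next_share)
```

The fit to the known elections is perfect, yet the forecast for the next one is a vote share above 100 percent. A flawless fit says how flexible the model is, not how much it knows.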

Consider, however, a set of simple graphs of the total percentage of voters in the last couple of parliamentary elections in Figure 1, and what these data visualizations and fallacies look like in real life. Note that we won't provide any logical analysis, and our nice line plots are a perfect "fit" to the points, so that's two down already.
For an overview of charts, data-to-viz.com is fantastic.

A bar chart

I end with a simple appeal to human empathy. To truly derive insight from statistics requires communication. Data science alone cannot replace expert knowledge nor can data scientists expect their colleagues to understand the "obvious" modeling assumptions. Similarly, domain specialists cannot assume unbiased predictions. Every statistic, every graph, every plot is a compromise between absolute truth, human interpretation, digital discretization, and numerical inaccuracy. Remember that:

“Most people use statistics like a drunk man uses a lamppost; more for support than illumination” ― Andrew Lang

For those looking for a simple, rigorously correct answer, I shall offer "42" and my condolences.

Rohit Goswami