What is Big Data?

From Competendo - Digital Toolbox
What is Big Data? Viktor Mayer-Schönberger explains: big data is less a new technology than a new perspective on reality. It can accelerate the process of gaining insights, but it also has problematic aspects.

Accelerating the Human Cognitive Process

Using Internet searches to predict the spread of flu; predicting damage to aircraft engine components; determining inflation rates in real time; catching potential criminals before they even commit the crime: the promises of big data are as astounding as they are complex. Already, an army of service providers has specialized in providing us with big data's "benefits" - or in competently protecting us from them. A lot of money will be made on the basis of this advice, but what big data actually is remains largely unclear.

Many may intuitively equate the term "big data" with huge amounts of data to be analysed. It is undoubtedly true that the absolute amount of data in the world has increased dramatically over the past decades. The best available estimate assumes that the total amount of data increased a hundredfold in the two decades from 1987 to 2007.[1] By way of comparison, the historian Elizabeth Eisenstein writes that in the first five decades after Johannes Gutenberg invented a movable-type printing system, the number of books in the world roughly doubled.[2] And the increase in data is not letting up; at present, the amount of data in the world is estimated to double at least every two years.[3] A common idea is that the increase in the quantity of data will at some point lead to improved quality. However, it seems doubtful that an increase in the quantity of data alone will produce the big data phenomenon that is expected to profoundly change our economy and society.

[...]

The fundamental characteristics of big data may become clearer if we understand that it allows us to gain new insights into reality. Big data is therefore less a new technology than a new, or at least significantly improved, method of gaining knowledge. Big data is associated with the hope that we will understand the world better – and make better decisions based on this understanding. By extrapolating the past and present, we expect to be able to make better predictions about the future. But why does big data improve human insight?

Big Data

A method of gaining insight from quantitative data by establishing statistical correlations and relations between a wide variety of data types and a massive amount of data, facilitated by algorithmic computing. By modelling social reality through statistical approximation, big data fundamentally aims to forecast human behaviour, to understand societal processes, or to influence human activities.


Relatively More Data

In the future, we will collect and evaluate considerably more data relative to the phenomenon we want to understand and the questions we want to answer. It is not a question of the absolute volume of data, but of its relative size. People have always tried to explain the world by observing it, and as a result, the collection and evaluation of data is deeply connected with human knowledge. But this work of collecting and analysing data has always involved a great deal of time and expense. Consequently, we have developed methods and procedures, structures and institutions that were designed to get by with as little data as possible.

In principle, this makes sense when few data points are available, but it has also led to terrible mistakes in some cases. Random sampling, a proven method for drawing conclusions from relatively few data points, has been available to us for less than a century. Its use has brought about great progress, from quality control in industrial production to robust opinion polls on social issues, but random sampling remains a stopgap, lacking the density of detail needed to depict the underlying phenomenon comprehensively. Thus, our knowledge based on samples inevitably lacks detail. Typically, random samples only allow us to answer questions that we had in mind from the very beginning, so knowledge generated from samples is at best a confirmation or refutation of a previously formulated hypothesis. However, if handling data becomes drastically easier over time, we will more often be able to collect and evaluate a full set of data related to the phenomenon we want to study. Moreover, because we will have an almost complete set of data, we will be able to analyse it at any level of detail desired. Most importantly, we will be able to use the data as inspiration for new hypotheses that can be evaluated again and again without having to collect new data.

The following example makes this idea clear: Google can predict the spread of flu using queries entered into its search engine. The idea is that people usually seek information about the flu when they themselves or people close to them are affected by it. A corresponding analysis of search queries and historical flu data over five years did indeed find a correlation.[4] This involved the automated evaluation of 50 million different search terms and 450 million combinations of terms; in other words, almost half a billion concrete hypotheses were generated and evaluated on the basis of the data in order to select not just any hypothesis, but the most appropriate one. And because Google stored not only the search queries and their dates but also where each query came from, it was ultimately possible to derive geographically differentiated predictions about the probable spread of the flu.[5]
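
To make the principle concrete, the following minimal Python sketch screens many candidate search-term time series against official flu counts and keeps the terms that correlate most strongly; every candidate term is, in effect, a hypothesis tested against the data. All numbers and term names are invented placeholders, and the snippet only illustrates the screening idea, not Google's actual procedure.

import numpy as np

rng = np.random.default_rng(0)

weeks = 260                                   # five years of weekly data
flu_cases = rng.poisson(lam=100, size=weeks)  # stand-in for official flu statistics

# Stand-in for the weekly query frequencies of many candidate search terms
candidate_terms = {f"term_{i}": rng.poisson(lam=50, size=weeks) for i in range(10_000)}

def correlation(x, y):
    """Pearson correlation between two equally long series."""
    return float(np.corrcoef(x, y)[0, 1])

# Each candidate term is a hypothesis evaluated directly on the data
scores = {term: correlation(freq, flu_cases) for term, freq in candidate_terms.items()}
best_terms = sorted(scores, key=scores.get, reverse=True)[:45]
print("terms most strongly correlated with flu cases:", best_terms[:5])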

In a much-discussed article from several years ago, the then editor-in-chief of Wired, Chris Anderson, argued that the automated development of hypotheses made human theory-building superfluous.[6] He soon revised his opinion: as much as big data can accelerate the cognitive process through the parametric generation of hypotheses, it has so far had little success with abstract theories. Humans therefore remain at the centre of knowledge creation. Consequently, the results of every big data analysis are interwoven with human theories and thus also with their corresponding weaknesses and shortcomings. So even the best big data analysis cannot free us from the possible distortions that result.[7] In summary, big data not only confirms preconceived hypotheses but also automatically generates and evaluates new ones, accelerating the cognitive process.



On Quantity and Quality

When little data is available, special care must be taken to ensure that the data points collected accurately reflect reality, because any measurement error can falsify the result. This is particularly serious if all the data come from a single instrument that is measuring incorrectly. With big data, on the other hand, there are large collections of data that can be combined technically with relative ease. With so many more data points, measurement errors in one or a handful of them are much less significant. And if the data come from different sources, the probability of a systematic error decreases.
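
A short simulation illustrates the intuition; the numbers below are arbitrary and serve only to show how independent random errors average out, while a single mis-calibrated instrument produces a systematic bias.

import numpy as np

rng = np.random.default_rng(1)
true_value = 20.0

# One instrument, few readings, with a constant calibration bias of +2.0
single_instrument = true_value + 2.0 + rng.normal(0.0, 0.5, size=10)

# Many independent sources, each with its own (larger) random error
many_sources = true_value + rng.normal(0.0, 2.0, size=100_000)

print("mean of the single biased instrument:", round(single_instrument.mean(), 2))  # stays near 22
print("mean of many noisy sources:", round(many_sources.mean(), 2))                 # close to 20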

At the same time, more data from very different sources leads to new potential problems. For example, different data sets may measure reality with different error rates or even depict different aspects of reality, making them not directly comparable. If we were to disregard that and subject them to a joint analysis anyway, we would be comparing apples with oranges. This makes it clear that neither a small amount of highly accurate data nor a large amount of diversely sourced data is inherently superior to the other. Instead, in the context of big data, we are much more often faced with a trade-off when selecting data. Until now, this conflict has rarely arisen, because the high cost of collection and evaluation meant that we typically collected little data. Over time, this has led to a general focus on data quality.

To illustrate this: in the late 1980s, researchers at IBM experimented with a new approach to automated machine translation of texts from one language into another. The idea was to determine statistically which word of one language is typically translated into which word of another. This required training text, which was available to the researchers in the form of the official minutes of the Canadian Parliament in its two official languages, English and French. The result was astonishingly good, but could hardly be improved upon subsequently. A decade later, Google did something similar using all the multilingual texts it could find on the Internet, regardless of the quality of the translations. Despite the very different, and on average probably lower, quality of the translations, the huge amount of data produced a much better result than IBM had achieved with less but higher-quality data.
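
The underlying statistical idea can be caricatured in a few lines: count how often words co-occur in aligned sentence pairs and treat the most frequent co-occurrence as a translation candidate. The three-sentence "parallel corpus" below is invented, and real systems such as IBM's alignment models are far more sophisticated, but the toy example already shows why sheer volume of data matters.

from collections import Counter, defaultdict

# Invented miniature parallel corpus (English, French)
parallel_corpus = [
    ("the house is small", "la maison est petite"),
    ("the house is green", "la maison est verte"),
    ("the book is small", "le livre est petit"),
]

# Count, for each English word, which French words appear in the same sentence pair
cooccurrence = defaultdict(Counter)
for english, french in parallel_corpus:
    for e_word in english.split():
        for f_word in french.split():
            cooccurrence[e_word][f_word] += 1

print(cooccurrence["house"].most_common(3))
# With so little data, function words such as "la" and "est" tie with "maison";
# only far more sentence pairs resolve this kind of ambiguity.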



The End of Causal Monopolies

Common big data analyses identify statistical correlations in the data sets that indicate relationships. At best, they explain what is happening, but not why. This is often unsatisfactory for us, as humans typically understand the world as a chain of causes and effects.

Daniel Kahneman, winner of the Nobel Prize in economics, has impressively demonstrated that the quick causal conclusions humans draw are often incorrect.[8] They may give us the feeling of understanding the world, but they do not sufficiently reflect reality and its causes. The real search for causes, on the other hand, is usually extraordinarily difficult and time-consuming and, especially in complex contexts, succeeds completely only in select cases. Despite considerable investment of resources, this difficulty means that we have achieved a sufficient understanding of causality only for relatively simple phenomena. Moreover, considerable errors creep in simply because researchers become attached to their own hypotheses and set out only to prove them.

[...]

Big data analysis based on correlations could offer advantages here. For example, in data on the vital functions of premature babies, the health informatics specialist Carolyn McGregor and her team at the University of Toronto have identified patterns that indicate a probable future infection many hours before the first symptoms appear. McGregor may not know the cause of the infection, but the probabilistic findings are sufficient to administer appropriate medication to the affected infants. Although the treatment may not be necessary in every individual case, in the majority of cases it saves the infant's life; given the relatively few side effects, it is therefore the pragmatic response to the data analysis.

On the other hand, we have to be careful not to assume that every statistical correlation has a deeper meaning; some may be spurious correlations that do not reflect any causal connection.
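
How easily spurious correlations arise can be shown with purely random data; in the simulation below (all values invented), none of the candidate series has anything to do with the target, yet some of them correlate noticeably with it by chance alone.

import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=52)                # e.g. one year of weekly values

candidates = rng.normal(size=(10_000, 52))  # 10,000 unrelated random series
correlations = np.array([np.corrcoef(series, target)[0, 1] for series in candidates])

print("strongest correlation found by chance:", round(float(np.abs(correlations).max()), 2))
print("series with |r| > 0.5:", int((np.abs(correlations) > 0.5).sum()))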

Findings about the state of reality can also be of significant benefit for research into causal relationships. Instead of merely exploring a certain context on the basis of intuition, a big data analysis based on correlations allows the evaluation of a large number of slightly different hypotheses. The most promising hypotheses can then be used to investigate the causes. In other words, big data can help causal research find the needle of knowledge in the haystack of data.

This alone makes it clear that big data will not stop people from searching for causal explanations. However, the almost monopolistic position of causal analysis in the process of gaining knowledge is diminishing, as the what is more often given priority over the why.



Approximation of Reality

In 2014, science magazines around the world reported an error in Google's flu prediction. For December 2012 in particular, the company had massively miscalculated its forecast for winter flu in the U.S., predicting far too many cases.[9] What happened? After a thorough error analysis, Google admitted that the statistical model used for the flu forecast had been left unchanged since its introduction in 2009. However, because people's search habits on the Internet had changed over the years, the forecast was misleading.

Google should have known this. After all, the Internet company regularly updates many other big data analyses of its various services with new data. An updated version of the model, based on data up to 2011, produced a much more accurate forecast for December 2012 and the following months.
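
The general lesson, that a model calibrated once must be refitted as the behaviour it describes drifts, can be sketched with synthetic data and a deliberately simple linear model; none of this reflects Google's actual methodology.

import numpy as np

rng = np.random.default_rng(3)

weeks = np.arange(520)  # ten years of weekly data
# The relationship between search queries and flu cases drifts over time
queries = 100 + 0.2 * weeks + rng.normal(0.0, 5.0, size=weeks.size)
cases = 50 + (0.5 + 0.001 * weeks) * queries + rng.normal(0.0, 10.0, size=weeks.size)

def fit_and_predict(train_queries, train_cases, new_queries):
    """Fit cases ~ slope * queries + intercept by least squares and predict."""
    slope, intercept = np.polyfit(train_queries, train_cases, deg=1)
    return slope * new_queries + intercept

frozen = fit_and_predict(queries[:260], cases[:260], queries[-1])      # model never updated
updated = fit_and_predict(queries[-104:], cases[-104:], queries[-1])   # refit on the last two years

print("actual:", round(cases[-1], 1))
print("frozen model:", round(frozen, 1), "| regularly updated model:", round(updated, 1))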

This somewhat embarrassing mistake by Google highlights another special feature of big data. Until now, we have tried to make generalizations about reality that are simple and universally valid, and in doing so we have often had to idealize reality. In most cases this was sufficient. However, in trying to understand reality in all its detail, we are now reaching the limits of such idealized conceptions of the world. With big data it becomes clear that idealized simplifications can no longer grasp reality in all its diversity and complexity, and that we must understand each result of an analysis as merely provisional.

Accordingly, we gratefully accept each new data point, hoping that with its help we will come a little closer to reality. We also accept that complete knowledge eludes us, not least because data is always merely a reflection of reality and thus incomplete.



(Economic) Primacy of Data

The premise of big data is that data can be used to gain insights into reality. It is therefore primarily the data, not the algorithm, that is constitutive for gaining knowledge. This is another difference from the "data-poor" past: when little data is available, the model or algorithm carries greater weight, as it must compensate for the lack of data. This also has consequences for the distribution of informational power in the context of big data. In the future, those who merely analyse data will hold less power than those who also have access to the data itself. This development lends substance to the unease many people feel towards organizations and companies that collect and evaluate ever larger amounts of data.


Because knowledge can be drawn from data, there are massive incentives to capture more and more aspects of our reality in data; in other words, to coin a phrase, to increasingly "datify" reality. [...] If the costs of evaluation and storage decrease, then it suddenly makes sense to keep previously collected data on hand and to reuse it for new purposes in the future. As a result, from an economic point of view, there are also massive incentives to collect, store and use as much data as possible, even without an apparent reason, since such data recycling increases the efficiency of data management.

Big data is a powerful tool for understanding the reality in which we live, and those who use this tool effectively benefit from it. Of course, this also means the redistribution of informational power in our society – which brings us to the dark side of big data.


Permanence of the Past, Predicted Future

Since Edward Snowden's revelations about the NSA's machinations, much has been written about the dangers of big data. The first thing usually mentioned is comprehensive monitoring and data collection, but the threat scenario goes beyond the NSA.

If easy availability and inexpensive storage encourage unlimited data collection, then there is a danger that our own past will catch up with us again and again.[10] For one thing, this empowers those who know more about our past actions than we ourselves can perhaps remember. If we were regularly reproached for what we said or did in earlier years, we might be tempted to censor ourselves in the hope of not being confronted with an unpleasant past in the future. Students, trade unionists and activists might feel compelled to remain silent because they fear being punished for their actions in the future, or at least being treated worse. According to psychologists, holding on to the past also prevents us from living and acting in the present. The literature describes, for example, the case of a woman who cannot forget, and whose memory of every day of the past decades paralyses her decisions in the present.[11]

In the context of big data, it is also possible to forecast the future based on analyses of past or present behaviour. This can have a positive impact on social planning, for example when it comes to predicting future public transportation flows. However, it becomes highly problematic if we start to hold people accountable on the basis of big data predictions about their future behaviour alone. That would resemble the Hollywood film "Minority Report" and would call into question our established sense of justice. What is more, if punishment is no longer linked to actual but merely to predicted behaviour, then this is essentially also the end of social respect for free will.

Although this scenario has not yet become reality, numerous experiments around the world already point in this direction. For example, in thirty US states, big data is used to predict how likely it is that a prisoner will re-offend, and thus to decide whether or not they will be released on parole. In many cities in the Western world, decisions about where and when police patrols are deployed are based on big data predictions of the next likely crime. The latter is not an immediate individual punishment, but it may feel like one to people in high-crime areas when the police knock on the door every evening, even if just to ask politely whether everything is all right.

What if big data analysis could predict whether someone will be a good driver before they even take their driving test? Would we then deny such predicted bad drivers a licence even if they could pass the test? And would insurance companies still offer these people a policy if their risk was predicted to be higher? And under what conditions?

All these cases confront us as a society with the choice between security and predictability on the one hand and freedom and risk on the other. But these cases are also the result of the misuse of big data correlations for causal purposes — the allocation of individual responsibility. However, it is precisely this necessary answer to the why that the analysis of the what cannot provide. Forging ahead anyway means no less than surrendering to the dictatorship of data and attributing more insight to big data analysis than is actually inherent in it.



References

  1. Martin Hilbert/Priscilla López, The World's Technological Capacity to Store, Communicate, and Compute Information, in: Science, 332 (2011) 6025, pp. 60–65.
  2. Elizabeth L. Eisenstein, The Printing Revolution in Early Modern Europe, Cambridge 1993, p. 13f.
  3. John Gantz/David Reinsel, Extracting Value from Chaos, 2011 (24.2.2015).
  4. Jeremy Ginsberg et al., Detecting Influenza Epidemics Using Search Engine Query Data, in: Nature, 457 (2009), pp. 1012ff.
  5. Andrea Freyer Dugas et al., Google Flu Trends: Correlation With Emergency Department Influenza Rates and Crowding Metrics, in: Clinical Infectious Diseases, 54 (2012) 4, pp. 463–469.
  6. Chris Anderson, The End of Theory, in: Wired, 16 (2008) 7, https://www.wired.com/science/discoveries/magazine/16-07/pb_theory (24.2.2015).
  7. danah boyd/Kate Crawford, Six Provocations for Big Data, Research Paper, 21.9.2011, https://ssrn.com/abstract=1926431 (24.2.2015).
  8. Daniel Kahneman, Schnelles Denken, langsames Denken [Thinking, Fast and Slow], München 2012.
  9. David Lazer/Ryan Kennedy/Gary King, The Parable of Google Flu: Traps in Big Data Analysis, in: Science, 343 (2014) 6176, pp. 1203ff.
  10. In more detail: Viktor Mayer-Schönberger, Delete – Die Tugend des Vergessens in digitalen Zeiten [Delete: The Virtue of Forgetting in the Digital Age], Berlin 2010.
  11. Elizabeth S. Parker/Larry Cahill/James L. McGaugh, A Case of Unusual Autobiographical Remembering, in: Neurocase, 12 (2006), pp. 35–49.



Viktor Mayer-Schönberger

Professor of internet governance and regulation at the Oxford Internet Institute. His research focuses on the role of information in a networked economy.

Aus Politik und Zeitgeschichte


This text is a shortened, translated and author-approved version of the article "Was ist Big Data? Zur Beschleunigung des menschlichen Erkenntnisprozesses" (What is Big Data? On the Acceleration of the Human Cognitive Process) by Viktor Mayer-Schönberger. It was originally published in German in Aus Politik und Zeitgeschichte (bpb.de), 6.3.2015, under a Creative Commons Attribution-NonCommercial 3.0 licence, which permits non-commercial use, reproduction and distribution of the work.

The Internet, Big Data & Platforms


This text was published within the framework of the project DIGIT-AL - Digital Transformation Adult Learning for Active Citizenship.

Zimmermann, N.: The Internet, Big Data & Platforms (2020). Part of the reader: Smart City, Smart Teaching: Understanding Digital Transformation in Teaching and Learning. With guest contributions by Viktor Mayer-Schönberger, Manuela Lenzen, iRights.Lab and José van Dijck, and contributions by Elisa Rapetti and Marco Oberosler. DARE Blue Lines, Democracy and Human Rights Education in Europe, Brussels 2020.




Related: DIGIT-AL Toolbox (www.dttools.eu)