
Understanding "Big Data"

Minimalist Description:
 Strangely enough, "Big Data" isn't about how big the data is; it's about how you use it. If you're processing it the same ol' way, then it might be Lotsa Data, but it's not Big Data. The thumbprint of a Big Data problem is that you are compelled to handle the data differently than you otherwise would. Maybe there's enough data that you can divine a model from it rather than establish a model from domain expertise. That's roughly what Google Translate does: with a big enough corpus of words, you can figure out the syntax of a language. Maybe there's enough data that you can't afford to scrub it, and with that much data, the errors will get lost in the noise anyway. 

The left index fingerprint is that the data gets used for some purpose no one thought about at the time the data was collected. This re-purposing leads to lots of interesting issues. It may mean that if you are clever enough to think of a new, value-generating use for some storehouse of existing data, you can become a "Big Data" millionaire. It also means that the data probably isn't the best fit for the purpose it is being (re)used for, so it is going to be messy, and you may have to overwhelm that imprecision with massive quantities of data, which may in turn overwhelm standard computing and force you into using highly specialized software and computers to get the job done. That could cut into your millions.

And when you are done, the result--in part because of the messiness, in part because of the magnitude, and in part because you've teased out complex subtleties that can only be seen with huge amounts of data--may not be traceable to cause and effect. You will probably be forced to accept whatever unintuitive relationships the data suggests, without a descriptive model to explain what is happening and without ever understanding why those relationships exist.

And because no causal relationship is forthcoming, the relationships may change without warning, exposing you to risks that you need to prepare for. We can help assess those risks, identify places where they might arise, and protect you from some of the consequences. We can offer validation schemes that help prevent over-fitting (there are countless models generated by careless "Big Data" techniques that predict the stock market's past history with great accuracy but are no better than a roulette wheel at predicting its future). 

This web site says a lot about data science, and that leads to inevitable questions about "Big Data". To begin with, Big Data is only a small part of data science. Most data science issues are only peripherally related to Big Data, but much of the confusion about data science has to do with Big Data. The purpose of this web page is to address some of those confusions, help you figure out whether your problem is a Big Data problem, and if so, what the consequences of that are for your enterprise. 

As a matter of background, S3 Principal Investigators Jamie Lawson and Rajdeep Singh, and our consultant Gene Hubbard, were all early members of Lockheed Martin's Big Data research program, and in 2011, Jamie and Gene wrote Lockheed's first major Big Data state-of-the-science report, for which Rajdeep was a reviewer. At that time, many people, even many of the technologists we interacted with daily, equated Big Data with MapReduce, the algorithm Google introduced in 2004 (though people often confuse the MapReduce algorithm with the Hadoop program that was released a few years later). These were confusions then, and they remain confusions now. We hope to clarify that and other misunderstandings here.

What is "Big Data" anyway?

Most discussions of Big Data center on the "3 Vs": volume, variety, and velocity. Said differently, there is either too much data, or it is coming too quickly, or it is too unstructured to handle through standard methods, and so we resort to special "Big Data" methods to process it. This is a useful view of Big Data. Gorgon Stare provides an ideal example. Gorgon Stare is a sensor developed by the United States Department of Defense for Wide Area Motion Imagery (WAMI). The sensor is intended to report every vehicle in a 100 square mile area at 15 frames per second, with each image pixel representing 6 inches on the ground. That's a 100,000 x 100,000 pixel, or 10,000 megapixel, focal plane, refreshing roughly every 70 milliseconds. The data volume is massive! The velocity is overwhelming, and the imagery is varied and unstructured; individual targets need to be isolated and tracked. Gorgon Stare is a real Big Data problem.

But while the 3 Vs give some insight, what constitutes "Big" varies a lot with time, and we'd strongly prefer a way of understanding Big Data that is a little more stable. In the early years of the Human Genome Project, the 3.3 gigabase genome seemed as overwhelming as Gorgon Stare seems today. Today, we use many of the same processing techniques, like Smith-Waterman and BLAST, that were used back in the day, but we can carry thousands of human genomes on a thumb drive. So is it a Big Data problem? Perhaps in a few years we'll monitor all of the Gorgon Stare data live on our future smart phones, so is that really a Big Data problem?
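For a sense of scale, here is a back-of-the-envelope calculation of the raw Gorgon Stare data rate using the figures above. The pixel count and frame rate come from the text; the 8-bit, uncompressed pixel format is our assumption for illustration, not an official specification.

```python
# Raw data rate for a Gorgon Stare-like sensor, from the figures above.
# Assumption (ours, not an official spec): 8-bit grayscale, no compression.
pixels_per_frame = 100_000 * 100_000   # 100,000 x 100,000 focal plane
frames_per_second = 15                 # one frame roughly every 70 ms
bytes_per_pixel = 1                    # assumed 8-bit grayscale

bytes_per_second = pixels_per_frame * frames_per_second * bytes_per_pixel
print(f"raw rate: {bytes_per_second / 1e9:.0f} GB/s")  # raw rate: 150 GB/s
```

Even before any tracking is attempted, that raw stream is far beyond what a single commodity disk or network link can absorb, which is exactly what pushes the problem out of standard methods.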

As for MapReduce, that algorithm is one way to deal with excess volume. But it is inappropriate for near-real-time problems like Gorgon Stare, which also need to deal with excess velocity. There are techniques like Reactive Programming for high velocity, though Gorgon Stare is still on the outer reaches of what is imaginable. There are other tools, like column family storage, for dealing with variety, though in Gorgon Stare's case, we also need more advanced ways of doing signal processing.

Typically, in volume-limited Big Data problems, the data is too big to keep in one place. Mostly for historical reasons, the standard way to do data processing is to move the data to where the program code is running and process it there. But in many cases, the algorithms are much smaller than the data; MapReduce allows you to move the code to where the data is. If you have enough volume, this is a big win. It is especially useful when there is too much data to keep in one place and you can cheaply move a copy of the code to each place where part of the data is. But fundamentally, there is nothing unique about MapReduce. The S3 team has considerable experience moving code to data and has delivered systems that do this since the late 1990s. The map and reduce operations were well known in the 1950s. They are higher-order list operations and appear in the early literature on functional programming. They arise in a field of mathematics known as Category Theory, which dates to the 1940s. And due to advances in hardware, the problems Google invented MapReduce to solve could in many cases be solved far more effectively today on a laptop computer using standard tools. MapReduce helps to cope with data volumes that are not suitable for standard methods, but what we mean by too much volume changes with technology. The same is true for velocity: what was too fast five years ago is fairly mundane today.
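To make the point that map and reduce are ordinary functional operations rather than something unique to Google's system, here is a toy MapReduce-style word count in plain Python. This is a sketch of the programming model only; a real cluster adds data partitioning, a distributed shuffle, and fault tolerance on top of it.

```python
from functools import reduce

docs = ["big data is not just big", "data is messy"]

# Map: each document emits a stream of (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Reduce: sum the counts per word (the "shuffle" step is implicit here).
def add_count(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_count, mapped, {})
print(word_counts)
# {'big': 2, 'data': 2, 'is': 2, 'not': 1, 'just': 1, 'messy': 1}
```

The same two operations distribute naturally: the map step can run independently on each machine that holds part of the data, which is the whole trick of moving code to data.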

Instead of the 3 Vs (which change over time), we will use four more resilient criteria introduced by Viktor Mayer-Schonberger and Kenneth Cukier in their book Big Data to explain what makes Big Data really Big Data. We think their taxonomy provides a better framework for understanding Big Data because they describe general principles that clearly distinguish it from earlier data science, and these principles won't fundamentally change with time. Their criteria are:

  1. "n=all": Instead of sampling data, analyze the entire population in order to avoid missing outliers that can be very important but don't occur often. For instance, the vast majority of credit card transactions are legitimate. Any sample is likely to miss the fingerprints of a prior fraud that you might see repeated in the next credit card transaction you analyze.
  2. "Messiness": A willingness to trade measurement error for data volume. In other words, sometimes (perhaps often) more but less accurate measurements trump fewer measurements made with more precise instruments. This is fundamental to crowd sourcing. If you ask a thousand drivers on the I-5 freeway to estimate their speed, you are more likely to get an accurate estimate of traffic speed than if you take a couple of measurements with a well-calibrated radar gun. 
  3. "Correlation over Cause": A willingness to ignore cause/effect relationships and accept correlation as good enough. Traditional data science has cautioned against a reliance on correlation alone. But if hemlines really do correlate with stock prices, maybe that's all you need to make successful investments. Of course, favoring correlation doesn't arise from reckless abandon, but instead from deep within the roots of Big Data. Steve Jurvetson, a partner at DFJ Ventures, gave an insightful explanation at a 2015 session of the MIT Bay Area Enterprise Forum (a nexus of venture capital and technology). Jurvetson's session, entitled "Deep Learning: Intelligence from Big Data", focused on Deep Networks, or Deep Learning Networks, which are complex multi-layer learning machines that are common in Big Data. For instance, Google uses Deep Networks to identify subject similarities between YouTube videos. Jurvetson looked at what goes on between the layers of Deep Networks and concluded that while these networks can learn very complex patterns, it is impossible to reverse engineer the pattern; you can't take a network that has learned to recognize pictures of cats and then "run the program backwards" to get a prose explanation of what a cat picture looks like. All of that information is captured in weight coefficients on the network that have been learned through an iterative process. But then Jurvetson noted that other ways of solving similarly difficult problems--genetic programming, cellular automata, etc.--are plagued by this same issue. An antenna designed by genetic programming, he notes, may do a very good job of capturing a signal (better than any human-designed antenna), but engineers can't explain why, because the output of the genetic programming process can't be reasoned about. And Jurvetson speculates that this quality may be inherent in any technology capable of a human-like creative process.
  4. "Re-purposing": The application of existing data to answer questions remote from those the data was originally collected to resolve. This is often a factor in the "messiness" of Big Data. Geo-location data provides a host of examples. For instance, all manner of boats and ships use a system called the Automatic Identification System (AIS) to broadcast their location and read the location of other vessels. The purpose of AIS is to prevent collisions and support emergency response. But historical AIS data can be used to identify seasonal changes in trade winds and fish migrations. The data is messy: it tracks where the fishing boats are, not necessarily where the fish are. But there is lots of data, and if the fishing boats tend to be where the fish are, reasonably good migratory tracks can be obtained.
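The "messiness" trade in criterion 2 can be illustrated with a small simulation. The speed and noise figures below are invented for illustration; the point is that the error of an average shrinks as one over the square root of the number of measurements, so a thousand rough self-reports can beat one careful instrument.

```python
import random

random.seed(42)        # fixed seed so the sketch is reproducible
TRUE_SPEED = 60.0      # mph, unknown to the observers

# One careful instrument: small error (std dev 1 mph, assumed).
radar_reading = TRUE_SPEED + random.gauss(0, 1.0)

# A thousand messy self-reports: large individual error (std dev 10 mph, assumed).
reports = [TRUE_SPEED + random.gauss(0, 10.0) for _ in range(1000)]
crowd_estimate = sum(reports) / len(reports)

# Expected error of the crowd mean: about 10 / sqrt(1000), roughly 0.3 mph,
# which is tighter than the single careful reading's 1 mph.
print("radar error:", abs(radar_reading - TRUE_SPEED))
print("crowd error:", abs(crowd_estimate - TRUE_SPEED))
```

This averaging trick only works for unbiased noise; the risks section below covers what happens when the reports are systematically off.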

One salient difference between the Mayer-Schonberger criteria and the 3 Vs is that the 3 Vs are about what we can't do with current technology: when we have to employ special methods. The Mayer-Schonberger criteria, by contrast, are about what we can do with current technology that we couldn't do a few years ago. The Mayer-Schonberger criteria are transforming data science. But they also come with great responsibilities and a need to interpret results and make decisions with care and with expertise. The Mayer-Schonberger criteria are also not universal. As an example, "n=all" doesn't apply to most Big Data sensor problems because the population n is infinite. The sensors inherently sample the environment. In the case of Gorgon Stare, the vehicles can be observed at any point in time; Gorgon Stare simply chooses to sample the scene at 70 ms intervals. We don't get all of the observations, only the ones Gorgon Stare chooses to sample. But what we do with Gorgon Stare data will tend to fit Big Data now and for the foreseeable future: it will be messy, with tracks starting and stopping because of the data collection and not because a vehicle suddenly disappeared; the vehicle tracks will arise from correlation (if a vehicle disappears at point A, we will look for it at point B because that fits a pattern seen in millions or billions of other tracks, not because of a deliberate model); and the data produced by Gorgon Stare will no doubt be re-purposed for understanding traffic patterns, positioning gas stations, and other uses its designers weren't concerned with.

Risks of the Big Data Approach
As mentioned above, the Big Data approach is somewhat orthogonal to traditional data science and has a number of pitfalls. Here are some risks with the Big Data approach that practitioners and stakeholders need to be aware of:

  1. "n=all": Processing everything raises the risk of failing to distinguish true outliers from errors. The reason data scientists traditionally removed outliers was that they were more likely to be errors: someone typed a number in wrong, or read a meter wrong, or there was a data transmission error. With enough data, true outliers will exist, and so we don't want to dismiss them. But at the same time, we don't want simple errors to dominate decision making. As an example within this class of problem, Harvard professors Carmen Reinhart and Ken Rogoff published what came to be accepted as "scientific truth" on the relationship between government debt and economic growth. Their results indicated that countries experienced sharply slower growth once their debt-to-GDP ratio exceeded 90%. Governments made policy decisions based on that finding. But in 2013, a graduate student at the University of Massachusetts discovered that Reinhart and Rogoff's findings resulted from a spreadsheet error. Much of the recent economic misery in Western Europe can be attributed to government policies that were strongly influenced by that spreadsheet error. This does not mean that the Big Data approach is wrong. It just means that you need to know what you are doing and how to interpret results. 
  2. "Messiness": Measurement bias cannot be overcome by volume, and its effects can be more dangerous with more data. There is a difference between measurement error and bias. Importantly, large volumes of data can overcome measurement error: the standard error of the mean shrinks with the inverse of the square root of the number of measurements. So four measurements made with cheap thermometers can give a more reliable average than one measurement with an expensive thermometer. That's really useful to know if the expensive thermometer costs ten times as much as the cheap one. But volume does not overcome bias. The spread shrinks by the same formula; it's just that the average is wrong. So a large number of biased measurements gives an estimator with very high certainty that is nonetheless wrong! This is a case where more data can make things worse. If we have one measurement of temperature, and it's biased by 2 degrees, the actual temperature may still fall within the error limits given by the resulting estimator. But if we had a million temperature measurements, each biased by 2 degrees, the variance would collapse to almost zero, and the estimator would exclude the true temperature from its error limits with high certainty. Applying this to the crowd sourcing problem: if people who are stuck in traffic tend to think they are going much slower than they actually are, then a crowd-sourced estimator of traffic speed would be worse than useless without correcting for that bias, and with twice as many measurements it would be even more confidently wrong. This exact problem arises in practice. Sinan Aral, an Associate Professor of Information Technology and Marketing at MIT's Sloan School of Management, found that crowd-sourced reviews suffered from significant upward bias, making crowd-sourced restaurant reviews unreliable. Professor Aral's findings were published in the journal Science. 
  3. "Correlation over Cause": A willingness to ignore cause can open results up to "overfitting". Big Data environments tend to make measurements of many variables available, and with enough variables, you can fit just about any historical data. Many models give a near-perfect fit to historical stock market data but fail to predict future stock prices any better than spinning a roulette wheel, because they lack a causal connection to stock market performance. Traditionally, data scientists have tried to build models with just a few variables to avoid this overfitting, and they have tried to choose variables where a causal connection could be explained. The Big Data approach doesn't mean that we can dismiss causality. It means that correlation can indicate causal relationships that we don't understand or that are opaque to us, but we still need to interpret the results with care. There are programs at DARPA right now aimed at bringing causality back into the Big Data world view: basically, using Big Data to discover and demonstrate cause. 
  4. "Re-purposing": Using data for purposes other than those it was collected for may induce unwanted bias. For instance, data on service satisfaction taken from callers who dialed a customer service hotline, re-purposed to identify potential future purchases, may not reflect the preferences of the buying public at large. Further, the inappropriate re-purposing of data may lead to an inappropriate intrusion into people's privacy. For instance, given the amount of data available on IMDB, the movie database, it is possible to uniquely identify a supposedly anonymous reviewer from just six movie reviews. This has a number of chilling effects: data scientists can profit from decoding people's online identities, and the sources of the data themselves become more apt to lie and to use "e-chaff" and other deceptions to disguise their identities, in the process adding a lot of noise and bias to the data. 
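The thermometer example from the "Messiness" risk above can be simulated directly. The temperature, bias, and noise values below are invented for illustration; the point is that as the sample grows, the confidence interval tightens around the wrong value.

```python
import random
import statistics

random.seed(0)
TRUE_TEMP = 20.0   # degrees, the real value
BIAS = 2.0         # every thermometer reads 2 degrees high (assumed)

def biased_reading():
    # Bias plus random noise (std dev 1 degree, assumed).
    return TRUE_TEMP + BIAS + random.gauss(0, 1.0)

for n in (10, 100_000):
    sample = [biased_reading() for _ in range(n)]
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / n ** 0.5
    # The interval mean +/- 2*stderr shrinks with n, but stays centered
    # near 22, so it becomes ever more certain of the wrong answer.
    print(n, round(mean, 3), round(stderr, 4))
```

With ten readings the error bars are wide enough that 20 degrees might still be inside them; with a hundred thousand, the estimator confidently excludes the truth.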
To reiterate, this does not mean that the Big Data approach is wrong. But it does mean that applying it blindly, or interpreting results without skill, can be very dangerous.
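To see how easy it is to "fit" history with enough candidate variables, here is a small simulation of the overfitting risk above. Every series is pure noise, so any fit found is spurious by construction: we scan 200 random predictor series, keep the one that best correlates with past "returns", and watch it fail on future ones.

```python
import random

random.seed(3)

def corr(xs, ys):
    """Pearson correlation, computed from scratch to stay self-contained."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 30 past and 30 future "returns", plus 200 candidate predictors: all noise.
past = [random.gauss(0, 1) for _ in range(30)]
future = [random.gauss(0, 1) for _ in range(30)]
predictors = [[random.gauss(0, 1) for _ in range(30)] for _ in range(200)]

# Cherry-pick the predictor that best "explains" the past.
best = max(predictors, key=lambda p: abs(corr(p, past)))

# In-sample the fit looks impressive; out of sample it is typically just
# noise again, because there was never any causal connection to find.
print("in-sample |corr|:     ", abs(corr(best, past)))
print("out-of-sample |corr|: ", abs(corr(best, future)))
```

Held-out validation of exactly this kind, scoring a model on data it never saw, is the standard defense against the overfitting described above.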

Big Data and the DIKW Pyramid
Most common Big Data examples are data fusion problems, where the objective is to take stuff that can be measured (data) and turn it into models of things that are useful elements of business process (information). This isn't always the case, but it usually is. So the term "Big Data Analytics" as it is commonly used is really a misnomer. It should be "Big Data Fusion", and what's under the hood of most Big Data algorithms is much more closely related to data fusion. A consequence of this is that the results of Big Data processing do not really answer enterprise business questions. They still require analytics to answer business questions, and control to usefully produce value for the enterprise. Analytics that interpret the models resulting from Big Data Fusion need to account for the differences between the traditional data science approach and the Big Data approach. They need to account for the possibilities of overfitting, bias, etc., as described above, and when "n=all" is in force, they need to regard their statistics as population statistics rather than sample statistics.

S3 Data Science
S3 Data Science, copyright 2015.