So much data, so little time


Don’t be fooled by powerful computers and fancy algorithms. You still must know what matters and how to measure it correctly.
by Daniel E. Stimpson, Ph.D.

Have you ever wondered why Nigerian scammers tell people they’re from Nigeria? Did you know this is simply their solution to a common big data problem also faced by many institutions, including the U.S. Army? But I’m getting ahead of myself.

These days we’re hearing so much about big data, machine learning and artificial intelligence that it’s becoming an article of faith that these hold the key to unlock every door. We’re told that, with enough computing power, we can overcome virtually every obstacle in our path. But what has really changed about the fundamental enterprise of institutional learning and innovation? The fact is, not as much as you might think.

Data is just what we call the digits and symbols that represent information. It’s the understanding of the underlying information that matters. Yet today, more than ever, data masquerading as useful information can flood decision-makers. If it is not skillfully filtered and processed, voluminous data can give the impression of meaning, while much of the most relevant information is misplaced and obscured. In fact, there is nothing to be gained by an information deluge. Professor Alan Washburn of the Naval Postgraduate School said it this way: “Information is only useful to a decision process if a decision-maker has the power to use it to make smarter decisions.” This remains true no matter how big the data gets and how flashy software becomes.

We must not confuse a computer’s ability to win a game of chess or drive a car in formalized settings with human cognition. Gaming and driving are impressive programming accomplishments, but they don’t require any true intelligence, understanding or thinking from the computer. The computer is only processing 1s and 0s. That’s it. It has no “understanding” of anything it is doing, why it matters, or what makes one set of 1s and 0s more important than another. The computer simply processes the digits it is given, exactly as it is instructed to. It never wonders why, gets bored, or has a sudden insight. The program designer is doing all the thinking, making all the value judgments, and deciding if any of the 1s and 0s being produced have meaning or worth. There’s nothing our modern computer science mystics have done to change this.

The challenge of quickly amassing useful information is not new. Long before the present big data mania, the 19th-century theorist Carl von Clausewitz, in his posthumously published 1832 book “On War,” described a pervasive characteristic of war as a “fog” of uncertainty in which a military force must operate. However, it’s not just the operational military that lacks useful information when it is needed and must take action under uncertainty. This condition characterizes every aggressive, forward-thinking organization engaged in ambitious undertakings.

During a lecture at the University of Virginia in the 1960s, Nobel Prize-winning economist Ronald Coase said, “If you torture the data long enough, nature will always confess.” This remains a central concern of the big data revolution. Employing powerful computers to churn through mountains of data does not guarantee increased insight. The basic rules of research still apply. We must begin with a question and be clear about what we are trying to accomplish. Otherwise, we can become lost in the data mountains, following the computer on a digital path to nowhere, in a high-tech version of the blind leading the blind.

“Garbage in, garbage out” (GIGO) expresses the fundamental principle that computer algorithms can only produce results as good as the data that feeds them. The simple fact is that no algorithm can create quality information from garbage data. And increasing the quantity of such data doesn’t improve matters.

The sheer volume of data that computers can process today makes the GIGO problem increasingly acute. In 1979, David Leinweber of the RAND Corp. prepared a note for the U.S. Department of Energy that illustrates this principle clearly and succinctly. His hand-drawn chart, published before the advent of computer graphics, demonstrates the inescapable tradeoff between increasing model complexity (called model specificity (S)) and measurement error (M) in mathematical models. Leinweber’s chart shows how increasing model complexity can increase a model’s explanatory power by reducing unexplained variation or mathematical error (eS). But Leinweber’s illustration also shows that this comes at a price. It turns out that, the more calculations we do on imprecise measurements, the more the measurement error (eM) is compounded. This is shown by the eM line rising from left to right.

Leinweber’s model-error illustration

Leinweber’s model-error illustration shows that the minimum total model error (eT) results from the trade-off between specification error (eS), which decreases as model specificity (S) increases, and measurement error (eM), which grows with increased computation on imperfect data. (Source: “Models, Complexity, and Error: A RAND Note Prepared for the U.S. Department of Energy,” by David Leinweber, 1979.)


It is also important to understand that, even if we can attain perfect data, model error can increase with model size and complexity as the result of a wide range of distorting influences. For example, even with today’s high-powered computers, many important problems remain too large to ever solve or at least take too long to solve in the available time, so solutions can only be approximated by a sequence of mathematical shortcuts. This difficulty becomes greatly exaggerated when complex mathematics are performed on sparse or inaccurate data.

Taken together, we see the inherent tradeoff between more and less data and greater and lesser specification in calculations. Just as Goldilocks found, there is a place where models and data usage are not too hot and not too cold. Importantly, observe that the best model is somewhere well short of maximizing either the data usage or the model complexity.
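The shape of this trade-off can be sketched in a few lines of code. The functional forms below are illustrative assumptions, not Leinweber’s actual curves: specification error is assumed to fall as specificity grows, compounded measurement error is assumed to rise with it, and so their sum bottoms out at an interior “Goldilocks” point rather than at either extreme.

```python
# Illustrative sketch of Leinweber's trade-off. The curve shapes are
# assumptions chosen for simplicity: eS shrinks with specificity S,
# eM grows with S, and total error eT = eS + eM has an interior minimum.

def specification_error(s, a=4.0):
    """eS: unexplained variation, assumed to shrink as specificity grows."""
    return a / s

def measurement_error(s, b=0.25):
    """eM: compounded measurement error, assumed to grow with specificity."""
    return b * s

def total_error(s):
    """eT: the sum the modeler actually pays."""
    return specification_error(s) + measurement_error(s)

# Evaluate over a range of specificities and find the interior optimum.
grid = [s / 10 for s in range(1, 201)]
best = min(grid, key=total_error)
print(f"minimum total error at S = {best:.1f}")  # prints: minimum total error at S = 4.0
```

Note that the optimum sits well inside the range: pushing specificity to its maximum makes total error worse, exactly as the eT curve suggests.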

This demonstrates Leinweber’s contention that the only valid reason to add more data and complexity to a mathematical model is to increase the accuracy of its result. In practice, getting this right requires domain knowledge and mathematical skill, not just the latest software package. And getting the model right matters greatly. As Leinweber wrote, “Important policy decisions should not be based on noise.” Depending on the data in question, there may be so much noise that reliable inferences are impossible.

In his book “Antifragile,” scholar and statistician Nassim Nicholas Taleb more recently said it this way: “As we acquire more data, we have the ability to find many, many more statistically significant correlations. Most of these correlations are spurious and deceive us when we’re trying to understand a situation. Falsity grows exponentially the more data we collect. The haystack gets bigger, but the needle we are looking for is still buried deep inside.”
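Taleb’s point is easy to demonstrate with purely random data. In the sketch below (standard library only), the number of variable pairs grows with the square of the number of variables, so “significant-looking” correlations pile up even though every one of them is spurious; the |r| > 0.36 cutoff is an assumed approximation of the p < 0.05 threshold for 30 observations.

```python
# With purely random data, more variables means more pairs, and more pairs
# means more chance correlations that clear a conventional significance bar.
import random
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
n_obs = 30
counts = {}
for n_vars in (10, 50):
    # Every series is independent noise, so any "finding" is spurious.
    data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    counts[n_vars] = sum(
        1 for a, b in combinations(data, 2) if abs(pearson(a, b)) > 0.36
    )
    print(f"{n_vars} random variables -> {counts[n_vars]} spurious correlations")
```

The haystack grows quadratically while the needle count stays at zero, which is the exponential-falsity effect Taleb describes.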

According to John P.A. Ioannidis, professor of medicine and health research at Stanford University, in his paper “Why Most Published Research Findings Are False,” this concern is not just theoretical: “There is increasing concern that in modern [medical] research, false findings may be the majority or even the vast majority of published research claims.” Further, according to a 2016 survey by the premier science journal Nature, 52 percent of researchers believe there is a “significant crisis” because the majority of published findings in many research fields cannot be duplicated. Only 3 percent stated there was no crisis at all. So before you seek the help of supercomputers and modern analytics, pay close attention to the quality of your information and the complexity of your approach.

Computer science has now entered the Zettabyte Era. A zettabyte is a measure of digital information equaling 10²¹ (or 1,000 billion billion) bytes. According to Cisco Systems, global data volume exceeded one zettabyte in 2012 and internet traffic exceeded one zettabyte in 2016.

For a sense of how large this is, it has been estimated that printing one zettabyte in book form would require paper from roughly three times the number of trees on Earth today. By 2020, the world data quantity is expected to exceed 40 zettabytes. This is a truly staggering number. According to the National Oceanic and Atmospheric Administration, 4.5 zettabytes is about equal to the number of ounces of water in all the world’s oceans.

Still, reliable information is the lifeblood of any process of understanding. In fact, in our information age, high-quality data should be thought of as a strategic asset and a force multiplier. But, as the late David A. Schum wrote in “The Evidential Foundations of Probabilistic Reasoning,” our current methods for gathering, storing, retrieving and transmitting information far exceed in number and effectiveness our methods for putting it to use and drawing conclusions. And modern machine learning, in many cases, can make this problem worse by finding unimportant correlations that can distract from the real issue being addressed.

Carpenters teach an important lesson about the importance of having good information before taking action: “Measure twice; cut once.” The same is true of any data collection effort. But good data is often much harder to obtain than we might expect. Frequently, the necessary data is never recorded when it could have been, or it’s recorded for a different purpose or without the care and precision necessary for the current problem. While there are myriad potential obstacles to attaining reliable information, here are a few of the most common missteps:

  • Taking inexact measurements.
  • Using improper and inconsistent collection procedures.
  • Inaccurate data recording and retrieval.
  • Measuring the wrong things.
  • Poor data management, access and security.
  • Information hiding (dishonesty and fraud).

Good data collection requires planning, dedicated effort and long-term care to avoid all these sources of error. With limited resources, this is a management effort that requires leadership and clear priorities, because we can’t collect quality data on everything. Rather, we need to focus carefully on what we really need to know. Best-selling novelist W. Bruce Cameron wrote, “Not everything that can be counted counts. Not everything that counts can be counted.” This reminds us to focus on meaningful, accurate measurement of our objectives, not measurement for measurement’s sake.


Flashy software and more data may not be what is needed to improve decision support. (Image by ArtHead/Getty Images)


The importance of sober thought about the effort required to satisfy data requirements is a major theme in Thomas Sowell’s landmark book, “Knowledge and Decisions.” Sowell, an economist and social theorist, points out that, most of the time, we grossly underestimate the cost of the information collection required to make informed, top-down decisions in complex environments. Consequently, the extent to which the processes we design depend on detailed information is an important concern that deserves significant resources and effort upfront, rather than being assumed away only to produce cost overruns or disappointing results later. Unfortunately, this occurs all too often.

A great practical example of the principle of proper focus comes from Cormac Herley. He asked, “Why do Nigerian scammers say they are from Nigeria?” His counterintuitive insight is that criminals have a big data problem just like the rest of us. For them, finding gullible victims is a “needles in a haystack” problem. Why? Because the number of people receptive to their scam is small. Like the rest of us, crooks have limited time and energy, and they need to quickly filter out the vast number of people who are unlikely to give up their money to focus on those who most likely will. Otherwise, they will spend too much time on the hard targets and never get to the soft ones. By making themselves very obvious, they filter out all but the easiest victims, i.e., those who don’t question why someone from Nigeria needs their help. This enables them to concentrate their limited time and energy on the most trusting, highest payoff population.

Just like Nigerian scammers, those facing a large, complex problem can’t afford to focus their limited time and resources on noise or fruitless pursuits. They must learn how to carefully discriminate and sift through the mountains of potential data to find the information that matters most. So, before you hire that whiz kid with the machine-learning algorithms, get your information collection process straight.

Occam’s razor is a philosophical principle from the 14th century that is just as true today as ever. In short, it states that when two or more explanations have equal explanatory power, the simpler one is preferred. Alternatively, it can be expressed this way: The more assumptions an explanation requires, the more likely it is to be false. It’s really just a sophisticated version of the popular idiom “keep it simple, stupid,” or KISS.

This is not to say everything is simple. Rather, just as it was seven centuries ago, the right amount of simplification is critical to our ability to construct accurate models of reality and solve meaningful problems. Again, this is shown by the eT line in Leinweber’s model-error illustration.

There are many examples of simple models outperforming complex ones in this digital age. Take two provided by Nobel laureate Daniel Kahneman, who points out that predicting marriage stability does not require complicated measures of people’s psychology, finances, religion or myriad other considerations. Rather, a simple formula actually can work much better.

It turns out that if we simply take the frequency of lovemaking and subtract the frequency of quarrels between a husband and wife, we have an excellent predictor of the long-term prospects of their relationship. If this number is positive, they are in good shape, while a negative number spells trouble. Kahneman also offers the example of a model for predicting the value of highly collectible, expensive Bordeaux wines. Here, a very simple model with just three variables (summer temperature, previous winter rainfall and rainfall during harvest) predicts a wine’s value with 90 percent accuracy across a horizon of multiple decades.
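The marriage heuristic above reduces to a one-line scoring function. In this toy sketch, the couples and their frequencies are invented purely for illustration:

```python
# Kahneman's marriage-stability heuristic as described above: lovemaking
# frequency minus quarrel frequency, with a positive score read as a
# healthy long-term outlook. Couple data is hypothetical.
def stability_score(lovemaking_per_month, quarrels_per_month):
    """Positive means good shape; negative spells trouble."""
    return lovemaking_per_month - quarrels_per_month

couples = {"A": (9, 3), "B": (4, 8)}  # (lovemaking, quarrels) per month
for name, (love, quarrel) in couples.items():
    score = stability_score(love, quarrel)
    outlook = "good shape" if score > 0 else "trouble"
    print(f"couple {name}: score {score:+d} -> {outlook}")
    # prints: couple A: score +6 -> good shape
    #         couple B: score -4 -> trouble
```

The point is not the arithmetic, which is trivial, but that someone first had to discover that these two quantities, out of everything measurable about a marriage, are the ones that matter.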

Don’t be mistaken; simple models that work are not generally simple to develop. They require thorough understanding of the often complex phenomena being represented. In other words, someone has to do the hard work of figuring out what matters most among everything under consideration. Then, they have to figure out how to measure correctly. Until these occur, no model, whether simple or complex, is likely to help. Here, every leader should take note. In most cases, being unable to assemble a straightforward model of your problem is a strong indicator that you don’t fully understand what exactly you are trying to solve.

Remember, too much information is as bad as too little. Big data analytics can open the aperture so we see more than ever before. They can challenge our paradigms and reveal things previously hidden from us. But this depends on the accuracy and precision of the information we feed our algorithms. If done well, combining great computer power with vast data provides great opportunities. If done poorly, it can lead to enormous confusion and spectacular mistakes. So, when the data miners come knocking, remember you need to already have intimate understanding of the problem you are trying to solve and you must have already recorded reliable information. Only then should you release them to begin work. Also remember that the powers of technology are not magical solutions to solve every ill. They are just one of many tools available to address complex problems. So stay humble, stay in charge, and don’t be easily dazzled.

DANIEL E. STIMPSON, Ph.D. is an operations research systems analyst in the U.S. Army Director of Acquisition Career Management (DACM) Office and an associate professor of operations research at George Mason University. He holds a master’s degree and a Ph.D. in operations research from the Naval Postgraduate School and George Mason University, respectively. Before joining the Army DACM office, he retired from the Marine Corps after 24 years of enlisted and officer service. He has also been an operations research systems analyst with the Center for Naval Analyses, George Mason University research faculty, the Joint Improvised Explosive Device Defeat Organization, and Headquarters Marine Corps.

This article is published in the Summer 2019 issue of Army AL&T magazine.
