
Data Science Explained

Minimalist Description: Data science is about refining data to help make wise decisions. Data science and simulation science are like the right hand and the left hand of decision support. In fact, they both have a Control component because at the highest level of refinement, data science and simulation science use the same tools. But data science is data-driven. Instead of using notional representations of the real world to make decisions, data science supports decisions through direct input from the real world. There are three different kinds of refinement methods in data science, each of which transforms things at one level in the knowledge hierarchy into things a level higher up:
  • Data Fusion—Makes data relevant by transforming it into information. The things we can measure and the things we want to know are usually different. Data fusion turns measurements into models of useful variables.
  • Analytics—Transform information into knowledge that answers questions crucial to the enterprise. Analytics interpret the information and make sense of it.
  • Control—Defines decisions and courses of action. Control tells us what to do to be effective.
Data fusion, analytics, and control are closely related, but in most businesses, they have different stakeholders. The executives sit at the top of the pyramid, and so they are more interested in control. The information workers are toward the bottom of the pyramid, and so they are more interested in data fusion. Without the taxonomy we've shown here, the enterprise looks too complicated, and whole stakeholder groups are likely to be overlooked. If you work with S3, many of our questions will be about what kind of data science problems you want to solve and which levels in the knowledge hierarchy are involved in each problem. Answers to those questions help us identify the stakeholders. Once we know who they are, we have a variety of proven tools to capture their individual needs. Then, and only then, does it make sense to do the heavy math.

Details:
Data is the lifeblood of most organizations. If you are reading this, chances are that you already know that. Maybe your data is deliberate, like readings from electricity usage meters. Maybe it's inadvertent, like the time between clicks on a website, or the difference in average decision time between customers who accept an offer and those who decline it. Regardless, the processes and tools that made your data possible are probably expensive.

 

The payoff for collecting and managing that data comes from transforming it into actionable business intelligence that improves decision making and enterprise effectiveness, and that payoff can be huge. This isn't a new idea. Alvin Toffler, author of Future Shock and The Third Wave, has been describing this for decades.


"We are interrelating data in more ways, giving them context, and thus forming them into information; and we are assembling chunks of information into larger and larger models and architectures of knowledge. ...knowledge has gone from being an adjunct of money power and muscle power, to being their very essence. It is, in fact, the ultimate amplifier. This is the key to the powershift that lies ahead, and it explains why the battle for control of knowledge and the means of communication is heating up all over the world." Alvin Toffler, PowerShift, 1991.


This "powershift" is the subject of data science, and the subtlety of Toffler's message is that it's not the data that serves as the amplifier of power, but the knowledge that's generated downstream of the data. Data science is the fulcrum that delivers this power.



The Data/Information/Knowledge/Wisdom (DIKW) Pyramid


The Knowledge Hierarchy or Data / Information / Knowledge / Wisdom Pyramid that Alvin Toffler alluded to in PowerShift has long been used by management scientists to describe the transformation of enterprise data into effective enterprise actions. The DIKW pyramid is also extremely useful in understanding what data science is, and how it works. Just as raw materials get assembled into components in an industrial factory, and those components eventually get assembled into finished products that are sold, the DIKW pyramid describes how raw data gets transformed and assembled into decisions and actions. 


  • Data: The raw materials of data science. Data usually doesn't directly model the things we want. Instead, it's the stuff we can directly observe. But those observations can be combined to model the things we want.
  • Information: Models of the variables of interest to the enterprise. It doesn't answer business questions but it gives us "situational awareness" of the enterprise. It tells us the state of things.
  • Knowledge: Answers questions of business interest, particularly, how the situation would change if we made some decision; knowledge is the power to project the impact of business choices.
  • Wisdom: The ability to influence the situation in the most positive and effective way.


The main job of any enterprise—whether they mow lawns, teach kids, trade stocks, or defend the nation—is to elevate data through the pyramid and realize the payoff. The salient differences between any two enterprises are the data they start with, the goals they want to achieve, and the methods or resources they use to stretch their data into knowledge and wisdom.


The Pyramid gives us a direction to go, but not a way to get there. Between each pair of layers in the Pyramid is a tool that lets us refine what we have into something richer and more suited to supporting decisions. Data Science is the toolbox that carries those tools. 



The Data Science Toolbox


Data science is the process of understanding your data and maximizing the value you get from it. S3 does data science, and we take a comparatively broad view of data science, one covering the entire DIKW Pyramid. There are three boundaries between the layers in the pyramid, each of which confronts a different kind of problem and a different kind of complexity, and there are, roughly speaking, three different kinds of data science to deal with those problems, each of which has its own unique tools and methods. 

  • Data Fusion—Raw data measures the stuff that can be observed in the world. But typically, we can't directly see or measure the business variables we are interested in. A temperature reading doesn't model the weather. But “fusing” many temperature readings from different places and times gives a weather model that predicts temperatures over a region over a course of days. Data fusion usually results in a model or a set of models that describe the business variables.
  • Analytics—While the weather map provides information about prevailing winds, it doesn't interpret that information and make sense of it for us. Is there a storm off the coast? How fast is it moving? Is it big enough to threaten coastal communities? These kinds of questions are the business of analytics.
  • Control—Knowing that a tropical storm currently threatens coastal residents doesn't tell us how to respond. Do we need to evacuate? If so, when? What are the best exit routes? What is the plan? Control is about turning enterprise knowledge into decisions and actions.
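
The three refinement steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not S3's method: the `Reading` type, the station names, the temperature values, the 3-degree threshold, and the advisory text are all hypothetical, and a real fusion model would be far richer than an hourly average.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    """One raw observation (hypothetical structure)."""
    station: str
    hour: int
    temp_c: float


def fuse(readings: list[Reading]) -> dict[int, float]:
    """Data fusion: combine raw readings into a model of a variable
    of interest, here the mean regional temperature for each hour."""
    by_hour: dict[int, list[float]] = {}
    for r in readings:
        by_hour.setdefault(r.hour, []).append(r.temp_c)
    return {hour: mean(temps) for hour, temps in sorted(by_hour.items())}


def analyze(regional_temp: dict[int, float], drop_threshold_c: float = 3.0) -> dict:
    """Analytics: answer a question about the information.
    Here: did the region cool by at least the (assumed) threshold
    between the first and last observed hour?"""
    hours = sorted(regional_temp)
    drop = regional_temp[hours[0]] - regional_temp[hours[-1]]
    return {"drop_c": drop, "rapid_cooling": drop >= drop_threshold_c}


def control(knowledge: dict) -> str:
    """Control: turn knowledge into a course of action."""
    return "issue cold-weather advisory" if knowledge["rapid_cooling"] else "no action"


# Made-up readings from two stations at two times of day.
readings = [
    Reading("A", 0, 20.0),
    Reading("B", 0, 22.0),
    Reading("A", 6, 16.0),
    Reading("B", 6, 18.0),
]

information = fuse(readings)      # data -> information
knowledge = analyze(information)  # information -> knowledge
action = control(knowledge)       # knowledge -> decision
```

Each function sits at one boundary of the pyramid, which is why the three stages can have different stakeholders and different tools while still feeding one another.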

Often, data science gets confused with analytics. Analytics are an important part of data science, but they aren't the whole story. And analytics are in the middle transition of the DIKW pyramid. A lot of processing has to occur before analytics become useful, and a lot of processing has to occur afterward in order to convert the analytics into a course of action. 



Data Science and the Enterprise


Most enterprises:

  1. Already have data, in some cases without even knowing it.
  2. Pay a lot for it, because it is the consequence of all of their business processes.
  3. Don't get nearly the payoff from their data that they should.

Our job is to provide the services, tools and methods to help you move data through your unique DIKW pyramid and in that process, realize the data's latent value. Sometimes that's a Big Data problem, but let's not get carried away by Big Data just yet. S3 does Big Data, and we have a statement on Big Data elsewhere on this website. But most businesses have more "Little Data" problems than Big Data problems. One quick way to distinguish the two is to ask whether your problem is about using the data for the reason you collected it. If it is, it's almost certainly a Little Data problem, as Big Data is mostly about re-purposing existing data. The more important question is "what can you do with the data available to you?" We will work with you on that question. We will do as much as we can to use the data you have; we will recommend data that you could or should collect, and we will re-purpose data, where useful, to maximize the contribution of data science to enterprise value.


These are not easy questions to answer well, and data science is not a magic bullet. There are important new tools and understanding available that weren't around even two years ago, but we will probably have to work hard together to get the value you want. Understanding your DIKW pyramid will help us work together. It gives all of us a framework for understanding each other and for understanding what we have (e.g. data, information, or knowledge) and it motivates the questions we will ask. If we know where you are in the DIKW pyramid and where you want to get to, that helps immensely in making the data science work for you. If you have data and want information, we will work with you to define the business variables you want to model and identify how your data can be fused to estimate those variables. If you have information and want knowledge, we will work with you to capture the business questions you are trying to answer and the analytics to answer them. Knowing these things defines the types of problems we will be solving together, and the types of tools that are appropriate to solve those problems.
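
The idea that your position in the pyramid determines the kind of data science you need can be written down as a small lookup. The sketch below is an illustration only; the function name `refinement_steps` and the lowercase level names are assumptions made for this example.

```python
# Pyramid levels, bottom to top, and the refinement step that
# crosses each boundary between adjacent levels.
LEVELS = ["data", "information", "knowledge", "wisdom"]
TRANSITIONS = ["data fusion", "analytics", "control"]


def refinement_steps(have: str, want: str) -> list[str]:
    """Return the kinds of data science needed to climb from the
    level you have to the level you want (hypothetical helper)."""
    start, goal = LEVELS.index(have), LEVELS.index(want)
    if goal < start:
        raise ValueError("the target level must not be below the starting level")
    return TRANSITIONS[start:goal]


steps_to_knowledge = refinement_steps("data", "knowledge")
steps_to_wisdom = refinement_steps("information", "wisdom")
```

Here `steps_to_knowledge` is `["data fusion", "analytics"]` and `steps_to_wisdom` is `["analytics", "control"]`, which matches the cases described above.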



Communication and Systems Engineering Tools


We think it is important to understand data science in terms of all three of its components (data fusion, analytics, and control), the relationships between those components, and the different kinds of problems each component corresponds to. Some of the biggest mistakes and misunderstandings occur when the data science gets mismatched to the problem, for instance when data gets mistaken for information. The DIKW pyramid is a simple but powerful tool to organize that thinking and to keep the stakeholder and the data scientist on the same page. But it isn't our only tool. Our work in systems engineering has given us experience with a number of tools that we use regularly to communicate with customers and stakeholders about the problem and the solution. These tools include:


  • Heilmeier Questions: On the Research Portfolio page, we talk more about Heilmeier questions. The Heilmeier questions were developed so that technical peers could talk to each other about problems and solutions in a very efficient way. A set of answers to the Heilmeier questions is called a "Heilmeier Catechism". It's a very useful tool for describing the deep nature of a problem, why it is relevant, and how a solution will work. The Heilmeier Catechism is a very good discriminator. If we can't answer these questions, we probably can't succeed, and if we can answer them, we probably can succeed.
  • Quad/Pentacharts: Quad charts and pentacharts boil most of the stuff of a Heilmeier Catechism down into a single chart. They are formal charts. We create them by filling in a template, but it can be a very hard job because you need to really understand the problem and solution to get the chart right. The "aha moment" on a project often comes when building a pentachart because it forces you to answer hard questions. Once you have these charts, they help the team maintain focus and they can be briefed to other stakeholders so that they have the big picture.
  • Unified Modeling Language: The Unified Modeling Language (UML) is a set of formally defined diagrams that describe different aspects of a system. They can describe data flows and synchronization, decision points, components and relationships (a superset of entity/relationship diagrams), use cases, swim lanes, etc. UML diagrams are easy to do wrong, but hard to do right. If they are done right, even a casual reader can understand the system they describe. This means there is less confusion between the stakeholders, the data scientists, and the system. We have been using UML for 15 years and we think we're pretty good at it.
  • Zachman Framework: The Zachman framework is really just a table. It captures answers to the questions of why, how, what, where, who, and when across the different levels of the system, from an initial statement of scope to the functioning system under maintenance. It identifies what needs to be done and who is responsible. The Zachman framework is a powerful management tool and checklist for technical work.
  • Etc: There are many other systems engineering tools we draw on. In addition, when it's required, we work with group facilitators like Big Picture Solutions who have their own powerful tools and techniques for achieving clarity and consensus.
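
Because the Zachman framework is essentially a table, its use as a checklist is easy to sketch. The example below is a rough illustration, not an official encoding of the framework: it uses the six interrogatives usually listed for Zachman, only the two levels named above (the levels in between are left out), and made-up answers; `empty_grid` and `open_items` are hypothetical helper names.

```python
# Columns: the interrogatives commonly listed for the Zachman framework.
QUESTIONS = ["what", "how", "where", "who", "when", "why"]

# Rows: only the first and last levels named in the text above.
# A real grid would also include the levels in between.
LEVELS = ["scope", "functioning system"]


def empty_grid(levels: list[str], questions: list[str]) -> dict:
    """Build a table with one unanswered cell per (level, question)."""
    return {(level, q): None for level in levels for q in questions}


def open_items(grid: dict) -> list[tuple[str, str]]:
    """List the cells nobody has answered yet (the checklist view)."""
    return [cell for cell, answer in grid.items() if answer is None]


grid = empty_grid(LEVELS, QUESTIONS)

# Hypothetical answers captured during a first stakeholder session.
grid[("scope", "why")] = "Get more value from meter data already collected"
grid[("scope", "who")] = "Operations lead"

remaining = open_items(grid)
```

After the two answers are recorded, `remaining` holds the ten cells that still need an owner or an answer.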


The thing all of these systems engineering tools have in common is that they are about communication; they are about finding the essence of complex problems; they are about asking questions and providing a channel to listen. They give you the chance to ask questions of us, and us the chance to ask questions of you. And at the beginning, you might not know the answers. You might never have thought about these questions. In our experience, there is always a huge amount of implicit knowledge in the way people do their jobs. They don't think about it because it has become routine: assumptions about the way data is recorded, time limits on data, and so on. If these things don't get exposed, the resulting data science can be worthless.


We make a solemn promise not to produce artifacts that we don't think are worthwhile or add substantial clarity. We have these tools at the ready, and we use them in the spirit of Einstein's advice that "Everything should be made as simple as possible, but not simpler". 




S3 Data Science, copyright 2015.