Simulation Science Explained

Simulation science is pretty close to what people have traditionally called modeling and simulation, and descriptive models are the basic building blocks of simulation science. A descriptive model is one that makes predictions with abstractions instead of with data. Those abstractions might be physical laws, or business rules, or communications protocols, or rules of engagement in military campaigns. Unlike data science models, descriptive models attempt to describe cause and effect.

The difference between descriptive models and data science models is important. Data science gives us a "most likely" view: a picture of the world as it actually seems to be trending. Among other things, descriptive models can give us a "worst case" view: what might happen if we aren't careful. When we make decisions, we usually want to know both! As we get closer and closer to the events we are trying to predict, the two models converge, or at least they usually should, because there are fewer and fewer bad things that might still happen.

Simulation science has its own DIKW pyramid, and it looks similar to the DIKW pyramid for data science, except that it is about abstracting the world rather than measuring it. In data science, we use fusion to go from the rabble of raw observations to information. In simulation science, we use abstraction to go from the rabble of facts about the world to information about the domain. 

Illustrative example
Let's say we have a pendulum, and we want to predict the swing angle of the pendulum bob at some future point in time. Our data science approach would take in a set of observed swing angles at particular times and output a model, θ(t), by doing some kind of curve fit. Instead of observing the pendulum itself, our simulation science approach observes the forces on it, as described by Newton's Second Law of Motion. We would abstract away insignificant forces like spurious magnetic attraction, leaving gravity as the only significant force being applied to the pendulum. The result is a model of the pendulum as an initial value problem: a differential equation for the swing angle, together with the initial position and angular velocity.
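In its standard textbook form, and assuming a rigid, massless rod of length ℓ and a gravitational acceleration g (symbols introduced here for illustration, not taken from the original), the initial value problem for the swing angle can be written as:

```latex
\ddot{\theta}(t) = -\frac{g}{\ell}\,\sin\theta(t),
\qquad \theta(0) = \theta_0,
\qquad \dot{\theta}(0) = \omega_0
```

Here θ_0 and ω_0 stand for the initial angle and the initial angular velocity. The mass of the bob cancels out of Newton's Second Law, so it does not appear in the equation.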
This constitutes a descriptive model of the problem at the "Information" level in the DIKW pyramid. But it isn't yet knowledge; it doesn't answer the question of where the pendulum bob will be. In order to get there, we need to solve the initial value problem. Typically we would use a numerical model (an approximation). Tools like Mathematica and MATLAB make that relatively straightforward, and the resulting knowledge model could be used to make the same predictions as our data science model.

But the descriptive model is, well, more descriptive. For instance, it tells us that changing the weight of the pendulum bob won't change its behavior, because the motion depends only on the initial position and angular velocity and on the gravitational force. It also tells us that if we moved the pendulum to the Moon, which produces a different gravitational force, the pendulum would behave differently. With our data science approach, we would have to send the pendulum to the Moon, start it moving, and collect observations in order to answer that question. In this case, the additional richness provided by the causal connections in the descriptive model makes the knowledge available at far lower cost.
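The Moon question can be explored numerically without leaving the desk. The following is a minimal sketch rather than the authors' model: it assumes the standard simple-pendulum equation (angular acceleration equal to -(g / length) * sin(angle)), integrates it with a classical fourth-order Runge-Kutta step, and measures the swing period. The function names, the 1 m length, the 0.1 rad release angle, and the gravity values (9.81 m/s^2 for Earth, 1.62 m/s^2 for the Moon) are illustrative assumptions.

```python
import math


def rk4_step(theta, omega, g, length, dt):
    """Advance (angle, angular velocity) by one classical RK4 step."""
    def deriv(th, om):
        # Simple pendulum: the bob's mass does not appear anywhere.
        return om, -(g / length) * math.sin(th)

    k1t, k1w = deriv(theta, omega)
    k2t, k2w = deriv(theta + 0.5 * dt * k1t, omega + 0.5 * dt * k1w)
    k3t, k3w = deriv(theta + 0.5 * dt * k2t, omega + 0.5 * dt * k2w)
    k4t, k4w = deriv(theta + dt * k3t, omega + dt * k3w)
    theta += dt * (k1t + 2 * k2t + 2 * k3t + k4t) / 6.0
    omega += dt * (k1w + 2 * k2w + 2 * k3w + k4w) / 6.0
    return theta, omega


def swing_period(g, length=1.0, theta0=0.1, dt=1e-3):
    """Release the bob from rest at theta0 > 0 and time one full swing.

    The bob reaches the opposite turning point when its angular velocity
    climbs back through zero; that moment is half a period.
    """
    theta, omega, t = theta0, 0.0, 0.0
    while True:
        new_theta, new_omega = rk4_step(theta, omega, g, length, dt)
        if omega < 0.0 and new_omega >= 0.0:
            # Linear interpolation for the instant the velocity hits zero.
            frac = -omega / (new_omega - omega)
            return 2.0 * (t + frac * dt)
        theta, omega, t = new_theta, new_omega, t + dt


earth_period = swing_period(g=9.81)  # assumed Earth surface gravity
moon_period = swing_period(g=1.62)   # assumed lunar surface gravity
print(f"Earth: {earth_period:.3f} s, Moon: {moon_period:.3f} s")
```

Because gravity enters only through the ratio g / length, the two periods should differ by a factor of about the square root of 9.81 / 1.62, and there is no mass parameter to change.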

This example isn't totally outrageous. We've seen problems where actual data collection required a rocket launch or a military exercise. In those cases you only get one chance and you want to use a descriptive model to thoroughly explore possibilities and eliminate as many failure paths as you can before taking the big plunge. Furthermore, in business situations, particularly at the top of the DIKW pyramid where we are trying to answer questions about resource control, we are less interested in how the world is and more concerned with how we can make it better. Descriptive models give us insight into what might result from a change in the world. We can often create a data science model to describe the current situation, and a descriptive model to describe the goal (called the "objective function") and bring the two models together for an optimal outcome. 

Often, a descriptive model comes before a data science model. If collecting data costs a great deal of time, money, or risk, then you want to explore a descriptive model first. You will want to learn as much about the process as you can with a descriptive model before spending the money, so that you can anticipate the range of things that might happen once data collection begins. We encountered an archetypal example of this a number of years ago, when ballistic missiles were a big concern. There's an old and mostly true story that when the Ballistic Missile Early Warning System first went into service, it triggered an alert on a moonrise (October 5, 1960), and we nearly started World War III over it. So it is crucially important to be able to distinguish missiles from space rocks. We were tasked with identifying spectroscopic differences between asteroids and missiles. So we developed a descriptive model of the "plumes" of re-entry vehicles based on the unique chemistry of the thin upper atmosphere, and made predictions about some distinctive plume signatures that only objects launched from Earth and re-entering the atmosphere would have. We confirmed the descriptive model predictions by monitoring an actual re-entry vehicle (one that we controlled), but that required a rocket launch which took years to plan and cost millions of dollars. The descriptive model allowed us to extract as much value as possible from that launch, which produced enough data to support data science experiments for several years thereafter. The Plume program required a considerable amount of software development, but the software costs were dwarfed by the cost of the rocket launch, and so the software more than paid for itself.

Sometimes, a descriptive model comes after a data science model. For instance, the data may indicate a particular relationship, but you need to know the cause. You can build a descriptive model (a domain model that encompasses cause and effect) and test it with simulated data to see if it produces the relationship observed in the data science model. If not, then you have likely missed something in the domain model.

Sometimes a data science model and a descriptive model are developed at the same time, for instance if the data science needs to be verified before it can be fielded. We experienced just such a situation in a program known as Future Combat Systems working with Boeing and the U.S. Army. We were working on a problem known as Distributed Fusion Management, which involved doing data fusion with many powerful sensors in a highly constrained network. For that particular domain, "data fusion" is synonymous with tracking vehicle targets (a track is a model of a vehicle's position and velocity). The sensors and the vehicles they were to be mounted on were in development at the same time, as was the communication technology intended to supply the links between them. And so it was impossible to verify the distributed data fusion in the intended environment. So we developed a descriptive model that simulated the sensors, the vehicle dynamics, and their communications, and embedded the actual distributed data fusion software into the simulation. The simulation included enough visualization so that we could view the changing responsibilities of the different vehicle nodes as they moved--and as the targets moved--through a simulated landscape. The image below is one frame from a long simulation. The vehicle represented by the pink diamond is being tracked by the aerial vehicles represented by the yellow diamonds, which are flying past the target. We found it easier to see the vehicles with the ground details blurred. The distributed fusion is performed by a "process group", which is an autonomous group of network nodes that have connectivity and satisfy certain resource constraints. There are several roles in this particular process group. There is a group manager, who assigns the other roles. There are several fusion nodes that supply sensor data. These have to, as a group, provide sufficient target measurements to get a good fix on the target. 
In the figure, these nodes are connected to the manager by solid lines. We say "as a group" because if all of the sensor nodes lie on the same line, they won't give a good target location. They must have good combined geometry with the target. There are several nodes that serve as a "board of directors". These are connected to the manager via dotted lines. These nodes watch over the manager, and if the manager loses communication for any reason, the board of directors dismisses the manager and elects a new group manager. Distributed Fusion Management needed to continue doing its data science function regardless of node failures anywhere in the system, so this fault tolerance was a major feature. The actual mechanics of all of this were less complex than a complete explanation would suggest. In good systems, a few simple rules result in complex adaptive behavior that is well-suited to the intended function.
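The "board of directors" rule can be illustrated in a few lines. This is a hypothetical sketch, not the Distributed Fusion Management software: the class name, the heartbeat timeout, and the election rule (the reachable board member with the lowest ID takes over) are all assumptions made for illustration, since the original does not say how the board detects a lost manager or chooses a replacement.

```python
from dataclasses import dataclass


@dataclass
class ProcessGroup:
    manager: str            # node ID of the current group manager
    board: list             # node IDs serving as the "board of directors"
    timeout: float = 3.0    # hypothetical: seconds of manager silence tolerated
    last_heard: float = 0.0

    def heartbeat(self, now: float) -> None:
        """Record that the board heard from the manager at time `now`."""
        self.last_heard = now

    def supervise(self, now: float, reachable: set) -> str:
        """Board check: dismiss a silent manager and elect a replacement."""
        if now - self.last_heard <= self.timeout:
            return self.manager          # manager is still communicating
        candidates = sorted(n for n in self.board if n in reachable)
        if not candidates:
            return self.manager          # nobody can take over yet; keep waiting
        new_manager = candidates[0]      # hypothetical rule: lowest ID wins
        self.board.remove(new_manager)   # the elected node leaves the board
        self.manager = new_manager
        self.last_heard = now
        return self.manager


# Example with made-up node IDs: the manager goes silent after t = 1.0.
demo = ProcessGroup(manager="uav-1", board=["uav-3", "uav-2"])
demo.heartbeat(now=1.0)
print(demo.supervise(now=2.0, reachable={"uav-1", "uav-2", "uav-3"}))
print(demo.supervise(now=10.0, reachable={"uav-2", "uav-3"}))
```

The point of the sketch is that one small, local rule is enough to keep the group running through a manager failure, which is the kind of simple rule the text credits for the system's adaptive behavior.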

This combination of a data science model and a descriptive model allowed Distributed Fusion Management to move from the government's Technology Readiness Level 1 (back of the envelope) to Technology Readiness Level 5 ("breadboard validation") in less than three years, which was nearly unprecedented at the time. 
Future Combat Systems aerial target tracking (2006)

If descriptive models are the building blocks of simulation science, simulation is the universal wrench that brings those building blocks together. Those simulations might involve numerical solutions to systems of differential equations. For instance, we were once tasked with identifying why a satellite was moving off track after about a day. We developed a descriptive model using a fairly primitive (fourth-order Runge-Kutta) numerical model of the system of differential equations. Our simple model matched the actual satellite behavior quite well, and we were able to use it as a tool to debug the predictor the government was using. We fairly quickly identified the problem: the predictor equated orbital energy with semi-major axis, which is a mistake in an orbit that is not quite elliptical because of the bulges in the Earth's gravitational field.
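The energy versus semi-major axis issue can be reproduced with a toy propagator. The sketch below is not the model described above. It is a minimal fourth-order Runge-Kutta integrator under stated assumptions: the only "bulge" included is the Earth's J2 oblateness term, the constants are commonly published values, and the orbit (7000 km radius, 45 degree inclination) and all function names are made up for illustration. It computes the semi-major axis from the two-body energy formula a = -mu / (2 * energy) at every step and reports how much that value wanders over roughly one orbit.

```python
import math

MU = 398600.4418   # km^3/s^2, Earth's gravitational parameter
RE = 6378.137      # km, Earth's equatorial radius
J2 = 1.08263e-3    # Earth's oblateness coefficient


def deriv(state, j2):
    """Time derivative of (x, y, z, vx, vy, vz) with an optional J2 term."""
    x, y, z = state[:3]
    r = math.sqrt(x * x + y * y + z * z)
    k = 1.5 * j2 * (RE / r) ** 2
    zr2 = (z / r) ** 2
    base = -MU / r ** 3
    accel = (base * x * (1.0 + k * (1.0 - 5.0 * zr2)),
             base * y * (1.0 + k * (1.0 - 5.0 * zr2)),
             base * z * (1.0 + k * (3.0 - 5.0 * zr2)))
    return state[3:] + accel


def rk4_step(state, dt, j2):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))

    k1 = deriv(state, j2)
    k2 = deriv(shifted(k1, dt / 2.0), j2)
    k3 = deriv(shifted(k2, dt / 2.0), j2)
    k4 = deriv(shifted(k3, dt), j2)
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def energy_based_sma(state):
    """Semi-major axis from the two-body relation a = -mu / (2 * energy)."""
    x, y, z, vx, vy, vz = state
    r = math.sqrt(x * x + y * y + z * z)
    energy = 0.5 * (vx * vx + vy * vy + vz * vz) - MU / r
    return -MU / (2.0 * energy)


def sma_spread(j2, dt=10.0, steps=600):
    """Max minus min of the energy-based semi-major axis, in km."""
    inc = math.radians(45.0)
    v = math.sqrt(MU / 7000.0)   # circular speed at 7000 km
    state = (7000.0, 0.0, 0.0, 0.0, v * math.cos(inc), v * math.sin(inc))
    values = []
    for _ in range(steps):
        values.append(energy_based_sma(state))
        state = rk4_step(state, dt, j2)
    return max(values) - min(values)


print("spread without bulge term (km):", sma_spread(0.0))
print("spread with J2 term (km):", sma_spread(J2))
```

With the J2 term switched off, the energy-based semi-major axis stays essentially constant, as two-body theory says it should. With J2 switched on, the same formula gives a value that oscillates around the orbit, so treating it as a fixed orbital element is no longer safe.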

Moreover, it is becoming clear that there is a need for both data science models and descriptive models of the things we are interested in. A new discipline called "probabilistic programming" is emerging that combines simulation science with data science. It is not yet a mature field, but it is the subject of DARPA's Probabilistic Programming for Advancing Machine Learning (PPAML) program, and it has great potential to unify simulation science and data science. Probabilistic programs allow you to define a descriptive stochastic model of your system and then apply evidence, in the form of observations of the world (data), to the model, so that the model's uncertain quantities are updated to agree with what was actually observed. We are not a part of that DARPA program, but we do use Figaro, one of the probabilistic programming tools being developed in that program.
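The idea of applying evidence to a descriptive stochastic model can be shown without any special tooling. The sketch below does not use Figaro or its API. It is plain Python that mimics one simple inference strategy (likelihood weighting) under stated assumptions: the descriptive model is the small-angle pendulum period formula, the unknown quantity is the local gravity with a uniform prior between 1 and 12 m/s^2, the measurement noise is Gaussian, and the "observations" are synthetic values generated inside the script rather than real data.

```python
import math
import random

LENGTH = 1.0   # m, assumed pendulum length
NOISE = 0.05   # s, assumed standard deviation of a period measurement


def predicted_period(g):
    """Descriptive model: small-angle pendulum period for gravity g."""
    return 2.0 * math.pi * math.sqrt(LENGTH / g)


def posterior_mean_gravity(observed_periods, n_samples=20000, seed=0):
    """Likelihood weighting: sample g from the prior, weight by the evidence."""
    rng = random.Random(seed)
    total_weight = 0.0
    weighted_sum = 0.0
    for _ in range(n_samples):
        g = rng.uniform(1.0, 12.0)                # draw from the prior
        t = predicted_period(g)                   # run the descriptive model
        log_like = sum(-0.5 * ((obs - t) / NOISE) ** 2
                       for obs in observed_periods)
        weight = math.exp(log_like)               # how well g explains the data
        total_weight += weight
        weighted_sum += weight * g
    return weighted_sum / total_weight


# Synthetic evidence: five noisy period measurements generated with an
# assumed lunar gravity of 1.62 m/s^2. These are not real observations.
data_rng = random.Random(1)
observations = [predicted_period(1.62) + data_rng.gauss(0.0, NOISE)
                for _ in range(5)]

estimate = posterior_mean_gravity(observations)
print(f"posterior mean gravity: {estimate:.3f} m/s^2")
```

The descriptive model supplies the cause and effect (gravity determines the period), and the data narrows down which version of that model the world is actually running. Dedicated probabilistic programming tools automate this pattern for far richer models and with more efficient inference.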

S3 Data Science, copyright 2015.