
Data Fusion

Minimalist Description

Data fusion combines data about things into models of those things. Regression is probably the simplest example. We have a set of independent measurements of a thing, each potentially having some error, and we want to give our best estimate of the value of the thing. But fusion isn't always that easy. Let's consider some issues:

  • Independence: Sometimes we have measurements but they are not independent (or we don't know whether they are independent). For instance, two different news agencies each report that your competitor's new product is 10% better than their old product. Did each news agency do its own test, or was one simply re-reporting the other? In the latter case, the second agency offers no new information. So how do we combine measurements if we don't know whether they are independent? How do we avoid the "rumor propagation" problem, where we become overconfident because we've heard the same story from multiple sources?
  • Association: Do we know which measurements to include? Sensors typically don't know what objects a measurement is associated with. They just receive a signal. Figuring out which measurements to combine is often a much harder problem than actually combining them.
  • Error and Bias: Say we have a child with a suspected fever, so we take regular temperature measurements to monitor it. There is error in our measurements, but the more measurements we make, the more confident we are of the child's temperature. Unless, that is, the measurements are biased. More measurements don't help with bias; they only give us more confidence in the wrong value. What if the thermometer tends to read half a degree low? That may give us false confidence that the fever has broken. Now let's say that Aunt Vera is staying with us. We measure the child's temperature roughly hourly, and Aunt Vera measures it in between our measurements, but she uses the thermometer she brought with her, which tends to read half a degree high. Now, unless we correct for the biases, the child's temperature appears to go up and down every half hour.
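
The thermometer scenario in the last bullet can be made concrete with a small simulation. This is only an illustrative sketch: the true temperature, the noise level, and the half-degree offsets are assumed values, and `simulate_readings` is a hypothetical helper rather than part of any library.

```python
import random
import statistics

TRUE_TEMP = 38.5                      # assumed true temperature (deg C), held constant
NOISE_SD = 0.1                        # assumed random measurement error (std. dev.)
BIAS = {"you": -0.5, "vera": +0.5}    # assumed thermometer offsets


def simulate_readings(n, seed=0):
    """Alternate readers every half hour; return (hour, reader, reading) tuples."""
    rng = random.Random(seed)
    readings = []
    for i in range(n):
        reader = "you" if i % 2 == 0 else "vera"
        value = TRUE_TEMP + BIAS[reader] + rng.gauss(0.0, NOISE_SD)
        readings.append((i * 0.5, reader, value))
    return readings


readings = simulate_readings(200)

# Averaging one thermometer's readings shrinks the random error, not the bias.
your_mean = statistics.mean(v for _, r, v in readings if r == "you")
vera_mean = statistics.mean(v for _, r, v in readings if r == "vera")

# The uncorrected series swings by about a degree between consecutive readings.
raw_spread = statistics.pstdev(v for _, _, v in readings)

# Subtracting each thermometer's bias (assumed known here) removes the swing.
corrected = [v - BIAS[r] for _, r, v in readings]
corrected_mean = statistics.mean(corrected)
corrected_spread = statistics.pstdev(corrected)

print(f"your mean:        {your_mean:.2f}")
print(f"Vera's mean:      {vera_mean:.2f}")
print(f"raw spread:       {raw_spread:.2f}")
print(f"corrected mean:   {corrected_mean:.2f}")
print(f"corrected spread: {corrected_spread:.2f}")
```

The sketch assumes the biases are known. In practice they would have to be estimated, for example by comparing both thermometers against a common reference.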

Even regression can be nuanced. Typically, when we solve a regression problem we assume a model. Take the most trivial case, where we want to know the weight of a sack of apples and we find it by weighing the sack on four different scales and taking the mean. The model is that the weight is a single number. If we are trying to find the relationship between the weight of cows and their food intake, the regression model might be a linear one. But there are some regression problems where we don't know what the model is. Scientific discovery is full of examples. Johannes Kepler had lots of data on the planets. Somehow, from that data, he figured out that their orbits were elliptical. That's a case of what we call "gray matter fusion". But could a computer, given the same data, figure out that the orbits were elliptical? It turns out that it can. This problem is known as "symbolic regression", and it is often tackled with a technique known as "Genetic Programming", which uses a process modeled on Darwinian evolution to search for the best model and fit that model to the data.
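
The two assumed-model cases above can be sketched in a few lines. All numbers below are made up for illustration, and `fit_constant` and `fit_line` are hypothetical helper names; the formulas are the standard ordinary least-squares solutions.

```python
from statistics import mean


def fit_constant(ys):
    """Least-squares fit of the model y = c; the minimizer is the mean."""
    return mean(ys)


def fit_line(xs, ys):
    """Ordinary least-squares fit of the model y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Sack of apples weighed on four scales (made-up readings, kg).
scale_readings = [2.31, 2.28, 2.35, 2.30]
apple_weight = fit_constant(scale_readings)

# Cow weight versus daily food intake (made-up data).
food_kg_per_day = [18.0, 20.0, 22.0, 24.0, 26.0]
cow_weight_kg = [545.0, 590.0, 640.0, 675.0, 730.0]
slope, intercept = fit_line(food_kg_per_day, cow_weight_kg)

print(f"sack of apples: {apple_weight:.2f} kg")
print(f"cow weight ~ {slope:.2f} * intake + {intercept:.2f}")
```

In both cases the model's form is chosen up front and only its parameters are fitted. Symbolic regression differs in that the form itself is part of the search.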

Multi-Target Tracking (MTT) is another classic data fusion problem that illustrates many important characteristics of data fusion problems. Consider the case where there are some number of (moving) objects that you would like to track, and you have several sensors providing data about those objects. This leads to two problems:
  1. Association: Which sensor measurements are associated with which objects?
  2. Estimation: Where do we expect the object to be at a given time? 
It turns out that association and estimation are intimately connected. If you associate a different set of sensor measurements with an object, you get a different estimator. And whatever estimators you have affect how you associate the next measurements. Traditional techniques for solving the MTT problem imposed a hard separation between association and estimation, treating the solution as a sort of assembly line where measurements were sorted into different baskets and then an estimator was produced for each basket. State-of-the-art techniques accept that association and estimation are tightly bound, and allow a measurement to be divided among many baskets, essentially saying that any of several objects might be responsible for that measurement.
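
The idea of dividing one measurement among several baskets can be sketched as follows. This is a simplified one-dimensional illustration in the spirit of probabilistic data association, not any particular production tracker: it assumes Gaussian track estimates, equal prior odds for each track, no clutter or missed detections, and a made-up noise variance. The function names are hypothetical.

```python
import math

MEAS_VAR = 1.0   # assumed measurement noise variance


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def association_weights(z, tracks):
    """Probability that each track produced measurement z (equal priors, no clutter)."""
    likelihoods = [gaussian_pdf(z, m, v + MEAS_VAR) for m, v in tracks]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


def soft_update(z, tracks):
    """Share z among all tracks, weighting each Kalman update by its association weight."""
    updated = []
    for (m, v), w in zip(tracks, association_weights(z, tracks)):
        gain = v / (v + MEAS_VAR)      # scalar Kalman gain
        shift = gain * (z - m)         # correction if z truly belonged to this track
        new_mean = m + w * shift
        # Moment-match the mixture of "updated" (prob. w) and "not updated" (prob. 1 - w).
        new_var = w * (1.0 - gain) * v + (1.0 - w) * v + w * (1.0 - w) * shift ** 2
        updated.append((new_mean, new_var))
    return updated


tracks = [(0.0, 1.0), (4.0, 1.0)]   # (position estimate, variance) for two tracks
z = 1.5                             # one ambiguous measurement between them
weights = association_weights(z, tracks)
updated = soft_update(z, tracks)

print("association weights:", weights)
print("updated tracks:", updated)
```

A hard assignment would give the whole measurement to the nearer track and leave the other untouched. Here both estimates move toward the measurement, each in proportion to how plausibly it explains it.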

Sometimes the objects want to be tracked, as in a busy harbor where no vessel wants to collide with another. Even the smallest boats will use radar reflectors to make sure that they are seen, and bigger boats will broadcast their own measurements that effectively say "I'm here. I'm here." In other situations the objects don't want to be tracked, and they may use stealth to avoid being observed or decoys to create false measurements. These complications don't relieve us of the task of solving the problem. We still need to estimate the locations of the targets, and there are some very advanced methods that allow us to make good estimates about an object even in the face of deception.

Bayes nets, or Probabilistic Graphical Models (PGMs), are another important class of data fusion method. In a PGM, a graph captures the relationships between statistically dependent objects. Associated with each object is a probabilistic model. As we get more information from the world (i.e. observations) about an object, messages are sent to the graph nodes representing the dependent objects. These objects update their probability models through Bayes' rule and send the update on to the objects that depend on them. The graph hence captures everything that is known about the world of those objects. The combination of applying Bayes' rule and sending update messages is known as the Belief Propagation algorithm. Belief propagation is well defined when there are no loops in the graphical model. But some of the most interesting problems have loops. For instance, in the MTT problem, we've said that association and estimation are coupled. An update to a target's estimated position may change the way that observations (including past observations) are associated with it. The associations update the positions, which update the associations, which update the positions, and so on. Running belief propagation on such a graph is called "Loopy Belief Propagation". It is not particularly well defined: it is not guaranteed to converge, and there are some problems for which it doesn't work, or at least doesn't work very well. For other problems, Loopy Belief Propagation works well, though it remains a bit ad hoc. We need rules for things like when to stop sending messages. This is an area of open research, and a very interesting one. Some of these problems are addressed by tools like Figaro, the probabilistic programming language, which can be thought of as a general tool for data fusion.
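
For the loop-free case, the message passing described above can be sketched on a three-node chain X1 → X2 → X3 with binary states, where evidence arrives at X3 and is propagated back to X1. The probabilities are made-up values and the helper names are hypothetical; a brute-force enumeration is included as a check that the messages give the same answer.

```python
from itertools import product

STATES = (0, 1)
prior_x1 = [0.7, 0.3]              # P(X1), assumed
trans = [[0.9, 0.1], [0.2, 0.8]]   # P(next = j | prev = i), assumed for both links
evidence_x3 = [0.1, 0.6]           # P(observation | X3), assumed likelihoods


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def message_to_parent(child_belief):
    """Sum-product message: sum the child's state out of P(child | parent) * child_belief."""
    return [sum(trans[i][j] * child_belief[j] for j in STATES) for i in STATES]


# Messages flow from the observed leaf X3 back toward X1.
msg_x3_to_x2 = message_to_parent(evidence_x3)
msg_x2_to_x1 = message_to_parent(msg_x3_to_x2)

# Bayes' rule at X1: the posterior is proportional to prior times incoming message.
posterior_x1 = normalize([prior_x1[i] * msg_x2_to_x1[i] for i in STATES])


def brute_force_posterior_x1():
    """Check by enumerating every joint assignment of (X1, X2, X3)."""
    scores = [0.0, 0.0]
    for a, b, c in product(STATES, repeat=3):
        scores[a] += prior_x1[a] * trans[a][b] * trans[b][c] * evidence_x3[c]
    return normalize(scores)


brute_force = brute_force_posterior_x1()
print("message passing:", posterior_x1)
print("brute force:    ", brute_force)
```

On a chain like this, each message is sent once and the result is exact. With loops there is no natural last message, which is why stopping rules become a design question.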

S3 Data Science, copyright 2015