#### Minimalist Summary

Automated calculation has been a major subject of human endeavor since the dawn of written history. This is literally true. We have another page on this web site that discusses the history of devices for automatic calculation--the "hardware"--and that history began with the cuneiform tablet in ancient Sumer. In the archaeological ruins along the Tigris and Euphrates we find clay cuneiform tablets, the earliest form of writing, capturing multiplication tables and aids for predicting the positions of planets and stars right beside *The Epic of Gilgamesh* and contracts to buy beer. So automated calculation has been with us for a while. This essay addresses the "software" component: the algorithms of calculation, and why calculation with floating-point numbers is just a different kind of animal than computation with integers.

The short answer to the question of why calculation with floating-point numbers is so hard is that floating-point "numbers"--the computer's representation of real numbers--are not really numbers at all. They are intervals on the real number line, and intervals of varying length at that, so operations on them may have unexpected results that algorithms, and the software that implements them, need to guard against. Two algorithms that would describe mathematically identical things if floating-point numbers behaved like real numbers may produce radically different results when real numbers are represented by floating-point numbers (we will show some examples, of course). Often, the guards that protect algorithms from the effects of floating-point representation lead to unfortunate trade-offs that limit other desirable qualities: speed, parallelizability, and the ability to reason about the algorithm, whether by humans or machines.

Floating-point calculations are important because our lives, quite literally, depend on them. They are the "man behind the curtain" in the guidance systems for commercial aircraft and the control systems for the brakes in your car. Floating-point calculations are important because they drive the economy: the value of stocks and other securities on trading markets is determined by these operations. Floating-point calculations are important because they are what computers spend a great deal of their time doing. Supercomputers like America's Titan and China's Sunway TaihuLight machines are built not to sort lists or run web sites, but to do floating-point calculations. The power of supercomputers is even measured in FLOPS (FLoating-Point Operations Per Second). Sunway TaihuLight performs an astonishing hundred million billion floating-point operations per second, and even at that rate it might take hours to, say, simulate the formation of a star from the gases thrown off by a supernova. So indeed, floating-point operations let us see the universe in higher and higher fidelity. They allow us to find natural resources with less risk. They let us take humans out of harm's way. Floating-point operations allow us to predict the weather, design effective drugs, and understand nuclear ignitions.

But floating-point calculations are also fraught with problems. So much so, in fact, that it's a wonder they work at all. We will discuss floating-point representation, work through a couple of useful examples, and relate these to the problems of interval arithmetic and roundoff error. We will introduce "machine epsilon", which describes the best-case accuracy we can expect from calculations. We will explain what it means for a problem to be well-conditioned or ill-conditioned. We will informally define stability and provide some formal examples of stability "regions" when we work through an Initial Value Problem whose dynamics are defined by a stiff differential equation. We will discuss conditioning more completely in the context of stiff differential equations, and relate the condition number to the stability of various solution methods. We will describe the differences between explicit and implicit methods for Initial Value Problems conceptually, in terms of stability, and in terms of evolving the problem state from initial value to final value, particularly in the case of a stiff system. Along the way, we make friends with linear systems, and with rule-rewrite systems that alleviate some of the issues floating-point calculations raise.

There's a classic textbook from the 1960s by Gene Isaacson and Herb Keller called *Analysis of Numerical Methods*. That's pretty deep in the primordial history of automated calculation. The Isaacson and Keller book has long since outlived its usefulness as a classroom text, but it still finds its way onto the bookshelves of every puckish computational scientist (it's certainly on our bookshelf) because of a clever trick that happens even before the book gets started. When you take the first letter of every sentence in the Preface and string them together, it spells out: "Down with computers and their lackeys". We show the first page, below. If you're not convinced, seek it out in a good technical library, or find the Dover paperback reprint. That's nowhere near as fun as an original hardbound edition, but it will still be a good conversation piece that sends your friends scurrying to find similarly dark hidden messages in the front matter of their favorite tomes.

Regardless, if Gene Isaacson and Herb Keller--two people who knew as much about automated calculation as anyone in the world--found the subject so troublesome that they decreed: "Down with computers!", we think the people who rely on automated calculation for their health, for their livelihood, for their security, should take the subject very seriously and make sure that a good job is done of it.

This essay begins with a gory explanation of why Keller and Isaacson held such a negative view of computer-based calculation, and why nothing has really changed. And like Keller and Isaacson, we also talk about things that can be done to deal with these issues. In a couple of places we give some rationale as to why, if you have some calculation problems that need solving, we think *S3 Data Science* is a right-good choice to solve them.

**No Calculation without Representation**

When we talk about "calculation", we mean computation with floating-point numbers. The field is often called numerical analysis, but we call it automated calculation to emphasize that our goal is not just to analyze the pathology of some algorithms but to actually make them do useful calculations. The field of computer science deals mostly with so-called "discrete" problems. That is, problems whose inputs and outputs are (or could be) described as integers: What sequence should this collection of shop tasks be performed in? What position would the element "Foo" occupy if the list were sorted? Which street intersections should I place my cops on to have a view of all streets in the precinct with the minimum number of beat cops? Those problems deal with integers, and so they are discrete. Floating-point numbers on a computer are very different from integers. They are also different from pencil-and-paper real numbers, and that's what got Herb Keller's knickers in a twist.

Computer science has a companion field called computational science that deals with these problems of automated calculation. Thirty years ago, when university computer science departments were typically tucked into the basements of the math department, people could often be seen wandering to-and-fro between computer science and computational science, usually with their heads leaning up to the sky and bumping into walls, but nonetheless crossing over between the discrete and the floating-point. That's not so true today. Now that computer science is off on its own in most universities, very few people make the trek from discrete computation to floating-point computation, or vice versa. Their heads are still cast to the sky and they are still bouncing off walls, but they are not throwing caution to the wind to experience the joys of floating-point from a discrete past, or of the discrete from a floating-point past. We have good relationships with universities and even spend a bit of time in the classroom to keep sharp. We see very few students taking courses in both *Combinatorial Algorithms* and *Numerical Algorithms*. This is a shame because the two subjects complement each other so nicely. At *S3 Data Science*, we feel lucky to have rich skills in both areas, and we think we can put those skills together to help you not only solve calculation problems, but also understand them better.

Given a fixed number of on/off bits, a computer can only represent a finite quantity of things. Let's assume for purposes of this essay that we are working with 32-bit floating-point numbers, or "single precision". We usually use double-precision, 64-bit floating-point numbers for real work, but for current purposes, the magnitudes of the numbers and figures in our descriptions will be much easier to understand if we stick to 32 bits.

With 32 bits we can only represent 2^{32}, or about 4 billion different "numbers". The maximum value of a 32-bit floating-point number (in IEEE representation) is about 3.402823×10^{38}. That's billions of billions of billions. And there are about as many numbers between 0 and 1 as there are between 1 and 3.402823×10^{38}, and as many negative numbers as positive ones, so something has to give somewhere, and it's the business of the floating-point representation to make that compromise in a way that preserves the integrity of calculations as much as possible.

An IEEE single-precision floating-point number is represented in mantissa/exponent form with a sign bit, an 8-bit exponent, and a 23-bit mantissa, also called the "fraction" because it is treated as a fractional value:

The "value" of the entire floating-point "number" is computed as

(-1)^{sign}×(1+fraction)×2^{exponent-127}

where the fraction is the 23-bit mantissa read as a binary fraction, and 127 is the exponent "bias".

In the example in the figure above (taken from Wikipedia), the exponent field (in green), interpreted as a simple binary integer, takes on the value 124. So the value of the exponent part of the overall number is 2^{124-127}=2^{-3}. The mantissa, read as a binary fraction, is 2^{-2}=^{1}/_{4}, so the "fractional value" according to the formula above is 1+^{1}/_{4}=^{5}/_{4}, and the sign is (-1)^{0}, or 1, giving a final "value" of ^{5}/_{4}(2^{-3})=^{5}/_{32}=0.15625. Let's call that number *q*. Similarly, our largest number *f*_{max} is (1-2^{-24})×2^{128}≈3.4×10^{38}, and our smallest positive normal number *f*_{min} is 2^{-126}≈1.17549×10^{-38}.

But there is another number worth looking at. Consider the next number larger than *q*. Call it *q*^{+}. That number can be found by setting the lowest-order bit of the mantissa to 1. So *q*^{+} is 0.15625+2^{-23}×2^{-3}, or 0.15625+2^{-26}. We cannot distinguish any number between *q* and *q*^{+}. Those numbers exist on the real number line and in pencil-and-paper arithmetic, but we cannot distinguish them in the 32-bit floating-point system. So the set of bits that we've called a value is not really a value. That set of bits is a representation of the half-closed interval [*q*,*q*^{+}) of length 2^{-26}, the numbers indistinguishable from *q*, and we label that interval 0.15625. It is only through laziness that we think of those bits as being an actual real number. Let's summarize:
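These bit-field claims are easy to check mechanically. Here is a short Python sketch of ours (not from the essay) that emulates 32-bit floats by round-tripping Python's native doubles through the standard `struct` module; it decodes the fields of *q* = 0.15625 and then bumps the lowest mantissa bit to land on *q*^{+}:

```python
import struct

def f32_bits(x):
    """The IEEE 754 single-precision bit pattern of x, as a 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b):
    """Interpret a 32-bit integer as an IEEE 754 single-precision float."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

q = 0.15625
b = f32_bits(q)
sign = b >> 31                # 0
exponent = (b >> 23) & 0xFF   # biased exponent field: 124
mantissa = b & 0x7FFFFF       # 23-bit fraction field

value = (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)
print(value)  # 0.15625

# The next representable number q+ is one unit in the last place away:
q_plus = bits_f32(b + 1)
print(q_plus - q)  # 1.4901161193847656e-08, i.e. 2**-26
```

The gap printed at the end is exactly the 2^{-26} interval width derived in the text.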

*Floating-point "numbers" are not numbers at all, but intervals:* The key to understanding floating-point numbers on a computer is that each value doesn't represent a specific number in the way we usually think of numbers, but instead represents an interval on the real number line. And so all of those floating-point operations like addition and multiplication are operations on intervals, not on actual numbers.

So let's look at an example of why these intervals create such a problem. Let's say that we have a bunch of intervals of width 2, labeled by the leftmost value of the interval, as in the "stair step" figure below.

Now let's choose an actual real number in the first interval and multiply it by an actual real number in the third interval. The answer must be another interval because 32-bit floating-point intervals are the coin of this realm. They are the only tools we have to give answers. But which interval do we choose, remembering that we cannot distinguish between the different actual numbers represented by the interval? The blue numbers in the figure above represent one possibility, and they result in a product whose value falls in the first interval, labeled "0.0". The red numbers represent another possibility. But the result of multiplying the red numbers would select the sixth interval, labeled "10.0". Other choices would select the second, third, fourth or fifth intervals. This is unsettling. But it gets worse.
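The stair-step scenario can be played out in a few lines. This toy model (our own construction, with particular values chosen in the spirit of the figure's blue and red numbers) snaps every real number to the left endpoint of its width-2 interval:

```python
import math

def snap(x):
    """Map a real number to the label of its width-2 interval [2n, 2n+2)."""
    return 2.0 * math.floor(x / 2.0)

blue = (0.1, 4.1)  # snap to the intervals labeled 0.0 and 4.0
red = (1.9, 5.9)   # snap to the same two intervals

print(snap(blue[0]), snap(blue[1]))  # 0.0 4.0
print(snap(red[0]), snap(red[1]))    # 0.0 4.0

# Indistinguishable inputs, yet the true products land five intervals apart:
print(snap(blue[0] * blue[1]))  # 0.0  (0.1 * 4.1 = 0.41)
print(snap(red[0] * red[1]))    # 10.0 (1.9 * 5.9 = 11.21)
```

Both pairs of inputs are identical once snapped, yet one product belongs in the interval labeled 0.0 and the other in the interval labeled 10.0.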

**These intervals are of different sizes: **This is clear from the way the representative "labels" are computed. For instance, when we computed *q*^{+} in the example above, we turned the lowest-order bit of the mantissa on to find the next "number" larger than *q*, and then that step got scaled by 2^{-3}, because that's what came out of the exponent part of the calculation. The exponent field determined the width of the interval associated with *q*. If there had been different bits in the exponent, the interval would have been a different width. As a consequence (setting the "subnormal" numbers aside), the narrow end of the scale sits down near 0: the gap between 0 and *f*_{min} is about 1.2×10^{-38} wide, and the gaps between consecutive tiny numbers are narrower still. The widest intervals, up near *f*_{max}, are on the order of 10^{31} wide. Even taking the interval at 0 as our yardstick, that's a range of difference of about 10^{69} between the widest and narrowest intervals.

To put it into perspective, if the narrowest interval is the width of a fly's eyelash, then the widest interval is vastly wider than the entire universe. And now we want to make calculations that take a number from somewhere in the "fly's eyelash" interval and a number from somewhere in the "bigger than the universe" interval, and perform a meaningful operation on them. That's not going to work out so well. It's probably good that the "fly's eyelash" is the interval associated with 0 because, as the stair step example shows, it's not good to have numbers of significant magnitude that are indistinguishable from 0. To be sure, when we move to 64-bit floating-point numbers, we have 4 billion times as many intervals, but the same problems occur, and the difference in the ranges is even more vast. What we can guarantee (over/underflow aside) is that if we get a result from some addition or multiplication, say *a*×*b*=*c*, then there is some actual real number in the interval of *a* and some other actual real number in the interval of *b* that, when multiplied together, yield exactly *c*. We don't know what those real numbers are, but they exist. That's still not very satisfying when the intervals vary from a fly's eyelash to the size of the universe.

Let's say we multiply a 32-bit floating-point number close to 1 by a number in the largest interval. The result can be off by the width of that interval, about 10^{31} in absolute terms, because all real numbers in that interval are represented by the same floating-point number on the computer. All of this leads to some counter-intuitive issues like:
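These interval widths can be measured directly. The sketch below (again emulating 32-bit floats from Python doubles via `struct`, our own illustration) reports the gap between a number and the next representable 32-bit float at three magnitudes:

```python
import struct

def next_f32_up(x):
    """The next 32-bit float strictly above x (assumes x > 0 and finite)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

for x in [1.0, 1e10, 3.4e38]:
    # Width of the gap between x's float32 value and the next one up:
    print(x, next_f32_up(x) - x)
# Near 1 the gap is about 1.2e-07; near the top of the range it is on the
# order of 1e+31 -- the "fly's eyelash" versus "wider than the universe".
```

The same bit-bump trick from the *q*^{+} example gives the interval width at any magnitude.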

**Floating-point addition and multiplication are not commutative on a computer.**

**Floating-point addition and multiplication are not associative on a computer either!**

Let's look at these problems for a moment. For review, floating-point addition and multiplication are both commutative and associative with pencil and paper. The assertion that they are *not* commutative or associative on a computer should be disturbing. But this is easy to illustrate with an overflow example, which we show with this notional program.

a = smallest_positive_floating_point_number

b = largest_positive_floating_point_number

c = a * a * b * b

So let's assume that operations are performed left-to-right, which is the normal convention. When *a*×*a* is calculated, we get an underflow error. Now let's scramble the order of the operands a bit to get another version of *c*.

c = a * b * a * b

Here, *a*×*b* is about 1, and that times *a* is *a*, and that times *b* is about 1. So we get an underflow the first way, and about 1 the second way. (Strictly speaking, a single IEEE operation is commutative--*a*×*b* and *b*×*a* produce identical bits--but the freedom to reorder the operands of a *chain* of multiplications, which real numbers grant us through commutativity and associativity together, is lost.) That's one way to show that reordering floating-point multiplication on a computer is not harmless. We can show how associativity fails in almost the same way. Associativity is the property that we can parenthesize an expression any way we want and get the same answer. If we parenthesize the first expression in the natural way that most programming languages work, it looks like this:

c = (((a * a) * b) * b)

As before, that gives us an underflow error when we multiply *a*×*a*. Now let's try a different parenthesization:

c = ((a * (a * b)) * b)

It's almost the same except we calculate *a*×*b* first, and that avoids the underflow: *a*×*b* is about 1, and *a*×1 is *a*, which multiplied by *b* is about 1. So floating-point multiplication on a computer isn't even associative. We could construct similar examples to show that floating-point addition has the same issues. We've chosen the boundary case of over/underflow because it is an easy illustration, but we could have chosen lots of other cases without overflow and demonstrated much the same thing.
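This whole argument runs verbatim on a real machine. Here is a minimal Python sketch using native 64-bit floats (the same phenomenon as the essay's 32-bit example, just with the 64-bit endpoints from `sys.float_info`):

```python
import sys

a = sys.float_info.min  # smallest positive normal double, about 2.2e-308
b = sys.float_info.max  # largest finite double, about 1.8e308

# Grouped left-to-right, a*a underflows to zero and the zero is permanent:
c1 = ((a * a) * b) * b
print(c1)  # 0.0

# Reordered so that a*b (a number of modest size) is formed first:
c2 = ((a * b) * a) * b
print(c2)  # a finite, nonzero number (about 16)
```

Two parenthesizations of the "same" product: one is exactly zero, the other a perfectly ordinary number.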

This has a somewhat unexpected implication in parallel computing. Most parallelization is based on associativity. Basically, if some long chain of operations is associative, you can chunk it up any way you want, pass the pieces to different processors for computation, then assemble the results. This assumption underlies Map-Reduce. Associativity is the defining requirement of a "monoid" (a set with an associative operation--not to be confused with the "monad"), which is functional programming's most powerful tool for parallelizing algorithms; the reduce step of Map-Reduce is a fold over a monoid. Monoids allow the programmer to fully define a computation without defining the order of the steps. That's a mind-bending concept to some programmers. But defining a computation is a very different thing from defining the order of operations. Mathematicians do this all the time. Let's give an example:

*j*(*n*) = 1+2+3+...+*n* (the sum of the first *n* integers)

This function completely expresses a computation. The result *j*(*n*) is completely describable from the expression. But the expression doesn't specify that we compute the sum from lowest to highest, or that we do the computation in one loop as opposed to cutting it into two pieces, passing (1+...+*n*/2) off to one processor to compute and ([(*n*/2)+1]+...+*n*) off to another processor to compute, and then assembling the result, which would be faster if we had two cores to operate with. We can do all of that because *j*(*n*) operates on integers rather than floating-point numbers. Integer addition is associative on a computer, so the runtime system can carry out either of those implementations without quarrel from us, because both are consistent with the computation as expressed. The programmer packages up a big computation (without specifying order of operations) and lets the system figure out how to order the operations. That's pretty much what these monoid-based frameworks do. But they assume that the operators are associative, and we've just shown that even simple floating-point multiplication isn't really associative when done on a computer. Unfortunately, floating-point addition and multiplication are *almost* commutative and *almost* associative, and if we aren't careful we can be lulled into forgetting the *almost* part. That's why Herb Keller and Gene Isaacson cried out "Down with computers"!
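The two-core chunking described above can be sketched in a few lines of Python (our own illustration; *j*(*n*) is just the sum of the first *n* integers):

```python
n = 1000
half = n // 2

# One sequential pass:
total = sum(range(1, n + 1))

# Chunked as if for two cores: (1 + ... + n/2) and ((n/2)+1 + ... + n):
chunked = sum(range(1, half + 1)) + sum(range(half + 1, n + 1))

# Integer addition is associative on a computer, so any chunking agrees
# with the closed form n(n+1)/2:
print(total, chunked, n * (n + 1) // 2)  # 500500 500500 500500
```

Because the operands are integers, every chunking yields the identical answer; that is exactly the guarantee floating-point arithmetic fails to provide.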

A consequence of all of this is that some numerical processes--some calculations--are unstable when you express them in this chunk-it-any-way-you-like style. Gram-Schmidt orthogonalization is an excellent example. Orthogonalization is important in lots of applications: image processing, solving systems of equations, and so on. There is a naive way to implement Gram-Schmidt (usually called *classical* Gram-Schmidt), but it turns out not to be a very stable algorithm. It can be easily expressed as a parallel reduction, making it highly parallelizable, but it's unstable. That is to say, the errors in the floating-point calculations are not well bounded by the magnitudes of the numbers they work on. If the condition number of the matrix being orthogonalized is high, the resulting floating-point errors can overwhelm the result.

There is a stable version of Gram-Schmidt (*modified* Gram-Schmidt) that fixes this stability problem by a small change in the order of operations. The result is remarkably stable, but it cannot be parallelized as a simple chunked reduction: the fix relies on a very careful specification of the order of operations, so the computation cannot be handed out in arbitrary pieces. As a result, it is only a little bit parallelizable. This has deep repercussions, because our processors aren't getting much faster (we are nearing the limits of how small we can make things, and hence how far signals must travel, and hence how fast a single core can be). Processors are getting more parallel, not faster. Independence of operations is a requirement for parallelism; if the order of operations is essential to correctness, more processor cores don't help. The good news is that associativity is only one way to get independence of operations. When we give up associativity, we may still be able to find some other place where the order of instructions doesn't affect the result. The bad news is that those places tend to be much smaller and harder to exploit, and some otherwise really nice calculation algorithms become slowpokes in a multicore world.
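To make the classical-versus-modified distinction concrete, here is a minimal pure-Python sketch (our own construction, not production code) of both variants run on a classic family of nearly parallel vectors. The only difference between the two is whether each projection coefficient is computed against the original vector or against the partially reduced one, yet the orthogonality of the results differs enormously:

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    nrm = math.sqrt(dot(v, v))
    return [x / nrm for x in v]

def gram_schmidt(vectors, modified):
    qs = []
    for v in vectors:
        w = list(v)
        for q in qs:
            # Classical GS projects the *original* vector onto each q;
            # modified GS projects the *partially reduced* vector instead.
            coeff = dot(q, w) if modified else dot(q, v)
            w = [wi - coeff * qi for qi, wi in zip(q, w)]
        qs.append(normalize(w))
    return qs

# Three nearly parallel vectors: a classically troublesome, ill-conditioned case.
e = 1e-8
vs = [[1.0, e, 0.0, 0.0],
      [1.0, 0.0, e, 0.0],
      [1.0, 0.0, 0.0, e]]

q_cgs = gram_schmidt(vs, modified=False)
q_mgs = gram_schmidt(vs, modified=True)

# Loss of orthogonality between the 2nd and 3rd output vectors:
print(abs(dot(q_cgs[1], q_cgs[2])))  # about 0.5 -- catastrophic
print(abs(dot(q_mgs[1], q_mgs[2])))  # near roundoff level
```

Mathematically the two variants are identical; in floating point, the classical version's "orthogonal" vectors here are at roughly 60 degrees to each other, while the modified version stays orthogonal to roundoff.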

So how bad is all of this in real life? The numbers we used in the prior examples were really large or really small. What happens if the numbers are of reasonable, everyday magnitude? Consider this simple function, which would evaluate to exactly 0 for every value of *k* in a world without floating-point error. Basically, the program subtracts a calculation of *k*/10 (a sum of *k* copies of ^{1}/_{10}) from the directly computed value of *k*/10:

*f*(*k*) = *k*/10 - (^{1}/_{10} + ^{1}/_{10} + ... + ^{1}/_{10})    (*k* terms in the sum)

Here is a straightforward program in the Julia language to compute *f(k)*.

We chose the Julia language because it's optimized for calculation, so there should be no surprises (note that Julia defaults to 64-bit floating-point numbers so we've suffixed the numbers with "f0" to put Julia in 32-bit floating-point mode). When we run this program on values of k ranging from 1 to 2^{32}, we do get a surprise, shown in the chart below. Namely, *f(k)*≠*0* for *k*≥8!

As *k* climbs to 32000, the residual approaches 1. And 32000 is not a big number, at least not for a computer. As *k* gets into the billions, the residual approaches *k*/10, which seems to be a limit; this suggests that the sum in the function is approaching a constant value as *k* grows. The picture looks the same for 64-bit floating-point numbers (just remove the "f0" in the code), albeit with a smaller slope. And you will get the same results if you write the code in C++ or Java or Fortran. This isn't really a problem of programming language. These issues are inherent in the IEEE floating-point standard, and that's the best floating-point number system we have. Other ways of doing floating-point calculation are worse.
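Readers without Julia at hand can reproduce the flavor of this experiment in Python by rounding every intermediate result to single precision with the `struct` module (a sketch of ours; the exact *k* at which the residual first becomes nonzero can shift slightly depending on how the sum and *k*/10 are formed):

```python
import struct

def f32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def f(k):
    """k/10 minus k accumulated copies of 0.1, everything in single precision."""
    s = f32(0.0)
    for _ in range(k):
        s = f32(s + f32(0.1))
    return f32(f32(float(k) / 10.0) - s)

for k in [1, 5, 7, 10, 100, 32000]:
    print(k, f(k))
# The residual is exactly 0 for the first few values of k, becomes nonzero
# within the first handful of additions, and grows steadily from there.
```

The residual starts at exactly zero and then drifts away, precisely the death-by-accumulation pattern the chart shows.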

So here we are. With an incredibly simple calculation and incredibly small numbers, the results are already worthless. How do we explain such catastrophic failure to our parents and children? Or equally important, why should we believe that *any* floating-point calculation done on a computer works? If you were an astronaut boarding a spacecraft on a $1 trillion mission to Mars, how would you feel if you were strapped into the cockpit, the hatch locked from the outside and the thrusters rumbling, and you look at your launch checklist clipboard and saw the figure above? In fact, the figure above would be a pretty useful thing to have on the launch checklist, with the warning: "check and recheck your numerics, because computers don't always calculate the way you think they should!"

As a shameless plug for our own mother ship, *S3 Data Science*, if you were strapped into that rocketship, would you feel more comfortable if *S3* developed the code for the launch protocols, with our understanding of these issues and their solutions and decades of work with these kinds of calculations and their pathology, or would you rather that the code for your launch protocols be developed by that hacker kid down the street who didn't need to learn any of this stuff in college 'cause he just loves programming computers and even built the web site for his Nintendo club? We think we are the better choice for that job, but we need to give you a bit more reason to share our confidence. That's the subject of the rest of this essay.

Before we get going here, let's make a couple of observations about the experiments we did with the *k*/10 problem. We said that the problem, which led to an ever-expanding residual error, was rooted in the floating-point system and not in our choice of programming language. But a human could analyze the Julia function for *f*(*k*) and realize that it should always, mathematically, return 0. The calculation it defines is one of many ways to say "zero" mathematically. If we did that analysis and cleaned up the code, the problem would go away. These kinds of issues are prevalent enough that there are programming approaches, sometimes called "rule-rewrite systems", that try to do the analysis for you. *Mathematica* is our favorite tool for this kind of work. When we write the same function in *Mathematica*, it tries to simplify the expression internally without any help from us. This is another way in which the programmer defines the computation but not the order of instructions.

As you can see in the sample above, *Mathematica* analyzed the description of *f*(*k*), recognized that the sum was just a way to calculate *k*/10, then simplified the subtraction and concluded that *f*(*k*) is always exactly 0.0 no matter what value of *k* is input (note Out[116]). One cannot always work in a rule-rewrite system. Sometimes, for instance, we need to build our code into a piece of hardware. And rule-rewrite systems sometimes can't find simplifications that help. But these systems can act as important guard rails that prevent a lot of numerical accidents. We use *Mathematica* for a lot of prototyping work because it reduces errors. The reason this worked for this problem was that we gave *Mathematica* an expression that it could reason about (*Sum[1.0/10.0, {i,1,k}]*). To reiterate an earlier point, where possible, we want to specify the computation without specifying the steps (but we need to do so in a way that cannot be misinterpreted). The "Sum" expression in *Mathematica* does just that. But if we had specified the steps of the operation in the form of a loop, as in the figure below, *Mathematica* would be just as bad as Julia or C or Java at error accumulation, because the explicit ordering of steps leaves nothing to simplify. This is one reason we prefer to program at the highest level of abstraction that is practical. It leads to programs that are easier to reason about, by both humans and computers.

Another observation about the *k*/10 problem is that the residual errors of *f*(*k*) are accumulated. It's death by a billion cuts. For *k*<8, *f*(*k*) is exactly 0, just as it would be if there were no floating-point errors. It's only when we go through the loop over and over, accumulating the approximation of *k*/10, that we also accumulate error. So how do we go from a residual of exactly 0 to not exactly 0 when *k*=8? There are two causes, and both are illustrated in the stair step figure above. The first kind of problem arises when the "label" on an interval isn't the actual value of the number we are interested in. For example, the real number 1.4 lies in the interval labeled "0.0" in the stair step figure, so 0.0 and 1.4 are in some sense equal, because they map to the same interval. This can cause real confusion: if we add 1.4 to some number *x*, the result is still *x*, because 0.0 stands in for 1.4. The second issue arises when the result of some operation gets mapped to the "wrong" interval, where by "wrong" we mean that the floating-point operation disagrees with pencil-and-paper arithmetic on real numbers. In the case of the *k*/10 problem, the label "0.0" actually is the value of the real number we are interested in, so our error accumulates when the results of the floating-point operations in *f*(*k*) fall outside the interval labeled "0.0".

This raises a potent question: when are two numbers distinct? There is an important property of a floating-point system called "machine epsilon", or ε, which captures this for the specific real number 1: ε is the smallest number such that the computed value of 1+ε differs from 1. That's something we can compute (a very good approximation of) by writing a little program that does a fixed-point iteration, each time cutting ε in half. We will illustrate this, initially at least, with a piece of code that finds ε recursively. Recursion is neat because it allows you to define something as a mathematical assertion rather than as a process, as we would do with a loop. Mathematical assertions are easier to reason about philosophically and mathematically. In other words, it's easier to prove a mathematical assertion correct than it is to prove a process correct. Here is that function, again in the Julia language:

function machine_epsilon_tail_recursive()
    helper(eps) = 1.0f0 + eps/2.0f0 == 1.0f0 ? eps : helper(eps/2.0f0)
    return helper(1.0f0)
end

This function uses a special kind of recursion called "tail recursion", with an inner function that threads ε along as an "accumulator". This is a common pattern in functional programming. It is a tail recursion because the last thing the inner function does is call itself. When a function is tail recursive, the compiler or runtime system can detect that and rewrite it as a loop. So the programmer can write the function recursively, and the machine can unroll the recursion and execute it as a loop instead. Loops do not, in and of themselves, grow the stack, so the loop is more efficient for the machine, and the recursion is more efficient for the programmer. We get the best of both worlds. This is a lot like what *Mathematica* did when it saw the "Sum" function. It's another example of how the programmer can express a computation without specifying the order of the instructions, leaving that task to the machine. In fact, optimizing compilers do this all the time in little ways. We just don't think about it.

However, in the case of Julia, tail calls are not optimized, or at least not always optimized. They may be optimized in the future, and there are discussions about this on their Wiki, but they do not appear to be optimized at present, except when LLVM, which Julia relies on, does its own tail-call optimization. This is good stuff to know about. In the Scala and Scheme languages, tail calls are optimized. In Java and C, they are not. In Julia and Clojure, they are in a gray area. So in the case of the Julia language, it is probably better to compute ε with a loop as follows, though both functions return exactly the same result:

function machine_epsilon_loop()
    eps = 1.0f0
    while 1.0f0 + eps/2.0f0 != 1.0f0
        eps = eps/2.0f0
    end
    return eps
end

If you don't know how your floating-point system behaves, this function can be used to get an approximate measure of ε. This is especially useful in, say, C language programs where you may not know whether the compiler is using 32-bit or 64-bit or even 128-bit floating-point representation. The C specification says only that floats must be at least single precision (32-bit) and doubles must be at least double precision; they may be longer. Here, we have, again, forced Julia into single precision mode and both functions return

**ε = 1.1920929x10^{-7}**

If we know more about our floating-point system, we can figure out ε analytically. In the case of our 32-bit representation, we can set the exponent part to 0 (i.e. 2^{0}=1) and then switch only the lowest-order mantissa bit on, similar to what we did for *q*^{+}. This gives the smallest floating-point "number" distinctly larger than 1, which is precisely what we mean by 1+ε. That gives ε as 2^{-23}, or ε=1.1920929x10^{-7}, which agrees precisely with what our program gave us (and also with the IEEE standard).
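This agreement is easy to check for yourself. Here is a minimal sketch (ours, in Python rather than the Julia used above; Python floats are 64-bit IEEE doubles, so the same halving loop finds the double-precision ε = 2^{-52} instead of the single-precision value):

```python
def machine_epsilon():
    """Halve eps until 1 + eps/2 rounds to exactly 1.

    Python floats are IEEE 754 doubles, so this finds the
    double-precision epsilon, 2**-52, by the same halving loop
    used in the Julia version above."""
    eps = 1.0
    while 1.0 + eps / 2.0 != 1.0:
        eps /= 2.0
    return eps

print(machine_epsilon() == 2.0 ** -52)  # the loop agrees with the analytic value
print(2.0 ** -23)                       # the single-precision epsilon from the text
```

The analytic argument (one low-order mantissa bit with a zero exponent) and the loop agree exactly, just as they do in the 32-bit case.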

What all of this means is that a single single-precision floating-point multiplication is good to 6+ decimal places. That is more accurate than almost any sensor. But all of this is subject to the constraint that roundoff errors are not allowed to accumulate. Those 6+ decimal places are as good as we can do (without going to double precision), and the hard work is in developing methods that do not allow error to accumulate. Once again, this can be counter-intuitive.

Say, for instance, that we are simulating a nonlinear system whose dynamics are known. One obvious way to get more accuracy is to make the timestep smaller. And if, as you make the timestep smaller and smaller, the system seems to be converging to a particular state, it is tempting to say that that's the right state; that's the state the system would converge to with infinitely small timesteps and no floating-point error. But smaller timesteps mean more of them, and that means more opportunities for error to accumulate; more opportunities for death by a billion cuts. Sometimes smaller timesteps can cause you to converge to the *wrong* answer. In the *k*/10 example above, *k* can be construed as the number of timesteps in some system. As the chart shows, the error in the final system state increases with the number of timesteps, because of error accumulation. It seems to be converging to the conclusion that the summation part of the function vanishes as a fraction of *k*/10. But that isn't true. We know by inspection that the two parts of the function are mathematically equivalent. The key is to find better methods, not smaller timesteps.
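The effect is easy to reproduce. A minimal sketch (ours, in Python; the name `repeated_sum` is our own): adding 0.1 to an accumulator *k* times is mathematically identical to computing *k*/10, but each addition rounds, and the gap between the two grows with *k*:

```python
def repeated_sum(k):
    """Add 0.1 to an accumulator k times, rounding at every step."""
    total = 0.0
    for _ in range(k):
        total += 0.1
    return total

# k/10 rounds once; the loop rounds k times, and the gap grows with k.
for k in (10, 1000, 100000):
    print(k, abs(repeated_sum(k) - k / 10))
```

More steps, more roundings, more accumulated error: exactly the "death by a billion cuts" described above.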

**I just checked in to see what condition my condition was in**

Let's pose a problem; a fairly typical problem in fact. We have two sensors, call them Sensor #1 and Sensor #2, and we want to use them to find out where a target (called Target) is located. Our sensors are "angle only" sensors, meaning that they can tell what direction something is in, but they can't tell how far it is. And like most sensors, there is some error involved. The sensors are good to a few milliradians but they can't give the precise angle. So, where would we want to put the sensors to get the best target viewing conditions?

We would probably want the sensors to view the target at right angles, as in the figure above, to get the best viewing conditions, because that would give us an area of uncertainty for the combined estimate --the gray area in the figure above--where the longest line drawn through the area of uncertainty was as short as possible. That distance is essentially the length of the longer of the red and green lines, each of which measures a dimension of the area of uncertainty. Since the sensors are viewing at right angles to one another, the red and green lines are of about the same length. One thing to observe here is that the corners of the area of uncertainty for the combined estimate can be found by solving a linear system. So there is a matrix down there somewhere. We've drawn a picture instead of showing a matrix, but there is a matrix hiding in the shadows.

Note that this figure employs a data fusion method called "covariance intersection", and that deserves a warning. Covariance intersection is not a particularly good way to combine the measurements, but it is really good for explaining some principles. In essence, it takes the intersection of the areas of uncertainty for each of the sensors and projects that as the area of uncertainty for the combined estimate.

But what if we can't view the target from ideal conditions? What if, for instance, the target is a satellite in a distant orbit and our viewing angles are limited by where we can put the sensors on the Earth, as in the figure below?

Here the green line is substantially longer than the red line. That indicates that these viewing conditions are not as good as before. The ratio of the longest-to-shortest dimensions of the uncertainty area is called the "condition number" of the problem. If the condition number is small, we call the problem "well-conditioned". If the condition number is large, we call the problem "ill-conditioned". What constitutes a large condition number depends on lots of things. And there's usually little we can do about it. If we are trying to find the position of a distant star rather than a satellite, the area of uncertainty is going to be skinnier still, and the condition number is going to be much higher, and the only way to reduce the condition number would be to get the sensors further apart. We might be able to put the sensors on orbiting satellites if we had enough money, but if the star is billions of times farther from Earth than Earth's diameter, even with satellites it's an ill-conditioned problem. The condition number is a part of the problem, not typically part of its solution. It characterizes the numerical difficulty of the problem.

Remember, we said there was a matrix lurking in the shadows, so let's bring it out and shine some light on it. Pretty much any problem has a matrix lurking, and for similar reasons, pretty much any problem has a condition number. If we know the values of that matrix, call it *A*, then the condition number of the problem, usually denoted *κ(A),* is the ratio of the largest singular value of *A* to the smallest singular value of *A*, or *s*_{max}(A)/s_{min}(A). The example problem has two dimensions. Most real world problems have more dimensions and so they will have more singular values. Singular value decomposition is too much to bite off here, but suffice it to say:

- The singular values will be non-negative real numbers, and further,
- There are ways of computing the singular values if we had to, but
- Typically, we don't compute a full singular value decomposition to find the condition number because that's slow. Instead we compute good approximations of the condition number using something called Arnoldi iteration, which avoids the expense of a full singular value decomposition, and
- The exact value of the condition number is seldom very useful, so an approximation is good enough to gauge the problem's difficulty, and
- If any of the singular values is 0, then *A* is not just ill-conditioned, it is singular, and there is no solution to linear systems involving it. That would be like placing Sensor #1 and Sensor #2 in exactly the same place and pointing them in the same direction.

Finally, and probably most importantly, **the condition number tells us how much precision we lose when solving the associated linear system**. We said above that when working with single-precision floating-point arithmetic, we can expect 6+ digits of accuracy in individual operations. When solving a linear system (one of the big hammers in automated calculation) we may lose as many as log_{10}(κ(A)) digits of precision. So if the condition number is 1 million, which it might be if we were using terrestrial sensors to locate a star, our 6+ digits of single-precision accuracy are down to 0 digits after solving the linear system. Like the condition number itself, that's not something we have much control over. It's baked in. This is one reason we usually use double-precision floating point. But it's important to keep in mind that even if your floating-point machine provides 16 digits of precision, if your sensors only give you 3, once you solve an ill-conditioned linear system you may not have much precision left. We certainly don't want to nest linear-system solutions inside one another. This is yet another reason people pay lots and lots of money for sensors that give them just one more order of magnitude of precision. Once again, these are all reasons why Isaacson and Keller cried "Down with computers!"
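To make this concrete, here is a sketch (ours, in Python; the function names `cond_2x2` and `bearing_matrix` are our own, and for a 2x2 matrix the singular values have a closed form, so no library solver is needed). It builds a plausible matrix for our two angle-only sensors and reports κ(A) and the digits of precision lost:

```python
import math

def cond_2x2(a, b, c, d):
    """Condition number of [[a, b], [c, d]]: the ratio of its singular values,
    computed in closed form from the eigenvalues of A^T A."""
    p = a * a + c * c          # entries of A^T A
    q = a * b + c * d
    r = b * b + d * d
    tr, det = p + r, p * r - q * q
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    s_max = math.sqrt((tr + disc) / 2.0)
    s_min = math.sqrt(max((tr - disc) / 2.0, 0.0))
    return s_max / s_min

def bearing_matrix(theta1, theta2):
    """Each angle-only sensor constrains the target to a line of sight;
    row i is the unit normal to sensor i's line."""
    return (math.sin(theta1), -math.cos(theta1),
            math.sin(theta2), -math.cos(theta2))

for t2 in (math.pi / 2, 0.01):           # right angles vs nearly parallel
    kappa = cond_2x2(*bearing_matrix(0.0, t2))
    print(kappa, math.log10(kappa), "digits lost")
```

Sensors at right angles give κ ≈ 1 (essentially no digits lost); nearly parallel lines of sight drive κ into the hundreds, and log_{10}(κ) digits of our budget are gone before we even look at the data.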

Certainly, if someone has collected data with 3 digits of accuracy, formed it into a single-precision linear system with a condition number of 100, solved that system, and reported 7 significant digits in their results, you might want to ask a few probing questions.

We haven't said what to do with the condition number yet. A partial answer is that the condition number helps us choose the right method, and we will demonstrate that in the next section. If a system is very well conditioned, just about any correct method will work well. Some of those methods may be highly parallelizable, or just fast in general. Some may be easier to implement than others. These are all features we'd like to take advantage of. If we know that a system is poorly conditioned, our choices are fewer and we may have to sacrifice some of those features.

**Finally a really stiff problem**

It's time for an example. There are lots of candidates. We could choose a problem from optimization, or linear algebra, or signal processing. Of course, each of these areas overlaps. The problem we chose is to simulate a system described by a simple differential equation. Under the hood the calculations to simulate such things involve linear systems and optimization so the whole toolbox will be employed.

We understand most of the world through differential equations. They describe, amongst other things, the way systems evolve through time; their dynamics. That makes them important in simulations. Differential equations are one of the main reasons that we do calculations, and they usually involve a whole lot of calculations so we want them to be automated calculations.

These problems often present themselves as "initial value problems", or IVPs. In an IVP, we know the initial state of the system, and we know its dynamics. Then we project the system through time, from its initial state, via those dynamics. Unfortunately, in all but the most trivial cases, no analytic solution is available for the differential equation, and those dynamics need to be applied by numerical means. That is, by a principled approximation. Typically, these simulations evolve in small timesteps (denoted here as *Δt*). We see where the system is and where it is headed, and we apply that heading for a small amount of time and do it again. There are lots of ways to push a system through its dynamics, and they all have different properties. Choosing the right method is critical. And then given the right method, choosing the right timestep is critical. These problems have a condition number, and there is a matrix lurking. The matrix is the Jacobian, and the condition number depends on the singular values of that matrix. Each method has a stability region which defines the timestep sizes that can be used without exponential error growth for the canonical problem

y' = λy

Here λ can be interpreted as an eigenvalue (i.e., of the Jacobian). Eigenvalues are closely related to singular values; for normal matrices the singular values are exactly the magnitudes of the eigenvalues, and in general they give the same order-of-magnitude picture. So if we know the (canonical) stability region for a method, we can divide through by the magnitude of the largest eigenvalue of the Jacobian for our particular problem to get the timesteps that will lead to stable behavior. Here is the stability region for the Forward Euler method, one of the simplest methods for evolving systems of ordinary differential equations, plotted in the complex plane.

In most cases, we are only concerned with the intersection between the stability region and the negative part of the real axis. We focus on the negative part because for negative λ values and positive Δt values, Δtλ (the horizontal axis) is negative; that puts us to the left of the origin in the figure above. Positive λ values correspond to systems that violate the principle of conservation of energy, so they are less interesting. So in the case of Forward Euler, the relevant intersection is the interval (-2, 0), and the stable timesteps satisfy the inequality

Δt < -2/λ
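We can watch this inequality at work with a short sketch (ours, in Python; the function name `forward_euler` is our own). For y' = λy with λ = -100, the bound says any Δt below 2/100 = 0.02 is stable, and anything above it lets the error grow exponentially:

```python
def forward_euler(lmbda, dt, steps, y0=1.0):
    """Apply the Forward Euler update y_{n+1} = y_n + dt*lmbda*y_n repeatedly."""
    y = y0
    for _ in range(steps):
        y += dt * lmbda * y
    return y

lam = -100.0                               # the true solution decays to 0
print(forward_euler(lam, 0.001, 1000))     # dt < 0.02: decays as it should
print(forward_euler(lam, 0.03, 1000))      # dt > 0.02: |1 + dt*lam| = 2, blows up
```

The first run shrinks toward zero; the second doubles in magnitude at every step, which is exactly what "exponential error growth" means.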

Sometimes, system dynamics operate over both short and long timescales. Take an aircraft wing as an example. The dominant force may be the force of the wind over the wing, and that force may vary only slowly. But the wing may pick up a vibration from the fuselage, and that vibration may oscillate very fast by comparison to the wind force. Such a system is called a "stiff" system, and it is characterized by a Jacobian with both small-magnitude eigenvalues (associated here with the wind forces) and large-magnitude eigenvalues (associated with the vibration). If this stiffness leads to a largest-magnitude eigenvalue λ_{max} = 1000000, which is quite plausible, then the safe timestep is at most 2/λ_{max} = 0.000002. Such a small timestep makes the system very difficult to simulate. Possible workarounds include:

- *Ignoring the vibration because it is small.* But even though the wind force dominates our wing, ignoring the vibration will likely drive the simulated system to a completely different state from its actual state, because the effects of the vibration accumulate.
- *Accepting a very tiny timestep and living with it.* That may cause the system to run very slowly, and worse, it may cause errors to accumulate because so many more calculations need to be performed. As we've shown above, more calculations means more opportunity for roundoff to accumulate; more opportunities for death by a billion cuts.
- **Finding a better method, one with a stability region we can work with.** This is the right answer. It may not be the easiest answer, but it's the right one.

There are two main classes of methods to approach IVPs:

- *Explicit methods:* These methods use past calculations and the system dynamics to push forward to the next step. The Forward Euler method is an explicit method.
- *Implicit methods:* These methods use past calculations, dependencies on future calculations, and the system dynamics to push forward to the next step.

The description above of the implicit methods leaves something to be desired, so we should elaborate. Suppose the current state of a simulated system is *y*_{t}, and that we have simulated that system through *y*_{t-2}, *y*_{t-1}, *y*_{t}, etc. We would like a simulated state for *y*_{t+1}. We could develop an equation for *y*_{t+1} in terms of the prior points. For the example dynamics *y'* = -*y*^{2}, the Forward Euler update (the simplest of explicit methods) involves only *y*_{t} and looks like this:

y_{t+1} = y_{t} - Δt y_{t}^{2}

More accurate explicit methods take advantage of *y*_{t-1}, *y*_{t-2}, and so on. Regardless, this is an **explicit** method because *y*_{t+1} is expressed in terms of quantities that have already been calculated. We can simply calculate *y*_{t+1} by evaluating the right-hand side. But what if we wrote the update in terms of the future points as well? We can't know them, but we can develop an equation for *y*_{t+1} looking forward as easily as we could looking backward. The simplest case is the so-called Backward Euler method. That update equation looks like this:

y_{t+1} = y_{t} - Δt y_{t+1}^{2}

The difference is subtle. But in the Backward Euler update equation, *y*_{t+1} occurs on both sides of the equation. That is what makes the Backward Euler method an **implicit** method. We can't simply calculate *y*_{t+1} from the right hand side. We have to solve for *y*_{t+1}. In the case of the Backward Euler method, that means solving a quadratic equation (the *y*_{t+1}^{2} on the right hand side makes it quadratic). In more accurate implicit methods the equation to be solved is more difficult. Usually we end up applying Newton iteration, which means we have to solve some linear systems and that brings all of the machinery and all of the issues from the previous section (condition number, etc.) into play. We may or may not carry Newton's method to convergence, depending on the method. We may do just a couple of Newton iteration steps, or some other fixed-point iteration, but almost never, in the real world, do we get too far from solving linear systems, sometimes over and over again.
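For this particular equation the quadratic can even be solved in closed form, which makes a side-by-side sketch easy (ours, in Python; the function names are our own). The Backward Euler update rearranges to Δt·Y² + Y - y_{t} = 0, whose relevant root is Y = (-1 + √(1 + 4Δt·y_{t})) / (2Δt), the one that tends to y_{t} as Δt → 0:

```python
import math

def forward_euler_step(y, dt):
    """Explicit: y_{t+1} = y_t - dt*y_t^2; the right-hand side is already known."""
    return y - dt * y * y

def backward_euler_step(y, dt):
    """Implicit: y_{t+1} = y_t - dt*y_{t+1}^2. Rearranged as
    dt*Y^2 + Y - y_t = 0 and solved for Y = y_{t+1} in closed form,
    keeping the root that reduces to y_t as dt -> 0."""
    return (-1.0 + math.sqrt(1.0 + 4.0 * dt * y)) / (2.0 * dt)

y_f = y_b = 1.0
dt, steps = 0.1, 100
for _ in range(steps):
    y_f = forward_euler_step(y_f, dt)
    y_b = backward_euler_step(y_b, dt)

# The exact solution of y' = -y^2 with y(0) = 1 is y(t) = 1/(1 + t)
print(y_f, y_b, 1.0 / (1.0 + steps * dt))
```

Here both methods track the exact solution; the point is that the explicit step only evaluates a formula, while the implicit step has to *solve* one. In higher dimensions, that solve becomes the Newton iteration and the linear systems described above.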

Implicit methods are considerably slower because they have to solve systems of equations. The advantage of doing the extra work of the Newton fixed-point iteration is that for stiff systems, the implicit methods tend to expand the stability region, meaning that we can simulate the system's evolution without engulfing it in error or resorting to extremely small timesteps. Here is the stability region for the Backward Euler method.

This may look odd, as the stability region contains the whole complex plane except for a hole on the right. This is actually not that unusual. Stability regions are often very oddly shaped. The important point is that the whole of the negative real axis is contained in the stability region with this implicit method. It is insensitive to stiffness, at least from a stability perspective. It may not be accurate, but errors will not grow exponentially. There are more accurate implicit methods, of course, and the ones we are interested in share this property of being insensitive, or less sensitive, to stiffness.
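For the canonical test problem y' = λy, the Backward Euler update can again be solved in closed form, and the algebra shows why the whole negative real axis is stable. A quick sketch (ours, in Python; the function name is our own):

```python
def backward_euler_linear(lmbda, dt, steps, y0=1.0):
    """For y' = lmbda*y, the implicit update y_{n+1} = y_n + dt*lmbda*y_{n+1}
    solves to y_{n+1} = y_n / (1 - dt*lmbda). For any lmbda < 0 and any
    dt > 0, that factor has magnitude below 1, so the iteration always decays."""
    y = y0
    for _ in range(steps):
        y /= (1.0 - dt * lmbda)
    return y

# A stiff lambda with a timestep far outside Forward Euler's (-2, 0) interval:
# the implicit method still decays toward zero instead of exploding.
print(backward_euler_linear(-100.0, 1.0, 50))
```

The decay may not be *accurate* at such a large timestep, but it is stable, which is exactly the insensitivity to stiffness described above.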

So what does this mean for real problems? Let's look at a very simple ordinary differential equation that exhibits stiffness:

y' = y^{2} - y^{3}, and we will let y(0) = 0.01 for our initial condition

This is about the simplest IVP we could write down that has any interesting qualities (i.e. it's nonlinear and stiff). Now let's simulate that system with an explicit method, and an implicit method. For purposes of display, we've also run an extremely high precision method that takes a large amount of compute time in order to get the actual solution.
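Although we can't reproduce the charts here, the experiment itself is easy to sketch (ours, in Python; function names are our own, and the implicit step uses the Newton iteration described earlier rather than the high-precision reference method):

```python
def f(y):                      # the dynamics: y' = y^2 - y^3
    return y * y - y * y * y

def fprime(y):                 # its derivative, needed by Newton's method
    return 2.0 * y - 3.0 * y * y

def explicit_step(y, dt):
    """Forward Euler: evaluate the known right-hand side and step."""
    return y + dt * f(y)

def implicit_step(y, dt, iters=25):
    """Backward Euler: solve Y = y + dt*f(Y) for Y by Newton iteration."""
    Y = y                      # initial guess: the current state
    for _ in range(iters):
        g = Y - y - dt * f(Y)  # residual of the implicit equation
        Y -= g / (1.0 - dt * fprime(Y))
    return Y

def simulate(step, dt, t_end, y0=0.01):
    y = y0
    for _ in range(int(round(t_end / dt))):
        y = step(y, dt)
    return y

# Both runs should climb from y(0) = 0.01 and settle at the equilibrium y = 1
print(simulate(explicit_step, 0.5, 200.0))
print(simulate(implicit_step, 0.5, 200.0))
```

At this modest timestep both methods reach the equilibrium; the differences the text describes (ringing and lag in the explicit method) emerge as the timestep grows toward the stability limit.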

The first thing to notice is that the implicit method gives almost an exact copy of the actual solution. The second thing to notice is that when the system begins to expand, the explicit method is late. It doesn't expand until about 15 time units later. This is a form of hysteresis. Because explicit methods rely only on past data, they tend to be late to catch the wave. Implicit methods are sometimes out in front of the wave by just a bit because they use implicit future values. That's what we see here. We also see, prominently, that the explicit method never really settles down even though the real system settles down nicely after about *t*=105. The explicit method is constantly generating variation that just isn't there in the real system. That is an artifact of the numerical method employed, and that artifact can obviously be avoided by a better choice of method (the implicit method proves that). A smaller timestep minimizes the ringing (figure below), but the hysteresis can pretty much never be wrung out of the explicit method. It will always lag.

None of this means that implicit methods are "better" than explicit methods. That's not the point. Implicit methods are just better for this problem! There are lots of problems where the additional speed of the explicit methods more than pays for itself. If you are trying to predict the position of a satellite in stable orbit around the Earth, the equations are not very stiff and explicit methods like the standard Runge-Kutta-Fehlberg (aka RK45) will probably do a very good job. If we complicate the problem so that instead of one satellite it's two or three, and instead of small satellites they are large gravitational bodies that exert significant force on one another (an n-body problem) then it depends, because n-body systems sometimes experience high frequency oscillations. Here is an example of a complex planar 3-body problem in *Mathematica*.

This system has a great deal more stiffness than a stable satellite orbit. All of the bodies start along relatively smooth, non-stiff paths. But soon the red and blue bodies begin to influence each other strongly. Then the green and blue bodies influence one another. Finally, the green and red bodies fall into very tight interaction as the blue body smoothly diverges. There are low-frequency, mid-frequency, and high-frequency behaviors, and an adaptive, stiffness-sensitive method is most appropriate. But with only slight changes in system parameters (see the masses m_{1}, m_{2}, m_{3}, on the sliders at the top of the figure) the system exhibits smooth behavior. And returning to the dilemma of our astronaut tucked in his capsule with thrusters rumbling, "all systems go", he probably wants to know with very high certainty which of these pictures best reflects his upcoming journey. Other kinds of analysis, like an examination of the phase space to understand the impact of the initial conditions on the attractor the system finally settles in, can be an excellent complement to the simulation, and may help to steady our poor astronaut's nerves.

All this brings us back to a common refrain and final conclusion. This stuff is not easy. It wasn't easy for Gene Isaacson and Herb Keller, it isn't easy for people like John Butcher and Nick Trefethen who have followed them, and it's not easy for us. But the consequences of getting things wrong can be great, and those who depend on automatic calculation owe it to themselves to get it right.

This material is not learned very well by just sitting around programming stuff. The hacker kid down the street is probably not going to succeed at this. For one thing, these are only partly programming problems. Their solution requires great intellectual preparation and practice. It requires taking some very difficult and advanced courses, and participating in community...and it also requires sitting around programming stuff. We think *S3 Data Science* is pretty uniquely capable in these areas, and we think that if you have some difficult problems in automated calculation, we are likely to be able to rise to the task.

S3 Data Science, copyright 2016.