# The Taylor Series and Beyond

*posted by Mosaic Data Science*

In the modern science of data analytics, sometimes oldies are goodies. I once took an optimization class where the answer to every question posed by the professor was “the Taylor series,” referring to a popular numerical method that will be 300 years old next year. Brook Taylor’s 1715 formulation, which can be traced back even further to James Gregory in the seventeenth century, is the foundation of a great many of today’s numerical methods, of which one of the most powerful is nonlinear batch least squares.

Depending on your perspective, the Taylor series can be both mundane and profound. Its basic idea is that the value of a function, say y = x^2, at a particular point, say x = 3.1, can be computed as the value at some other point, say x = 3, with an adjustment to account for the difference. The slope of y = x^2 at x = 3 is 6, so between x = 3 and x = 3.1, the function will increase about 6 × 0.1, or 0.6. So if y is 9 at x = 3 (3^2 is 9), then y is about 9 + 0.6, or 9.6, at x = 3.1. My calculator tells me the exact value is 9.61.
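The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names `f`, `df`, and `first_order_estimate` are made up for this example and do not come from any library.

```python
def f(x):
    """The example function y = x^2."""
    return x ** 2


def df(x):
    """Its slope (first derivative), 2x."""
    return 2 * x


def first_order_estimate(x0, x):
    """Value at x0 plus slope at x0 times the distance moved."""
    return f(x0) + df(x0) * (x - x0)


estimate = first_order_estimate(3.0, 3.1)  # about 9.6
exact = f(3.1)                             # about 9.61
print(round(estimate, 4), round(exact, 4))
```

The gap between the two numbers is the part of the function that the slope alone cannot see.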

If you draw this out with pencil and paper you will see the idea is quite simple. If you know the slope, or derivative, of a function, then you can approximate nearby values of the function. I once knew a fellow who could tell you the value of functions, such as the square root or cosine, at arbitrary values faster than you could punch them into a calculator.

But the Taylor series is a bit more profound when you consider higher order derivatives. When we used the first derivative, or slope, of y = x^2 above, we could approximate nearby values fairly accurately. But if we also account for the second derivative, the slope of the slope, we get the answer exactly: the second derivative of y = x^2 is 2, so the extra adjustment is ½ × 2 × 0.1^2, or 0.01, which brings 9.6 up to 9.61.

In fact we could compute the exact value of y = x^2 at any point, knowing only the function value at any given point, and its first and second derivative. Simply put, what the Taylor series tells us is that for well-behaved functions such as y = x^2, the entire function can be described in terms of information at only a single point in the function.
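As a rough check of that claim, here is a small Python sketch that rebuilds y = x^2 everywhere from three numbers known at a single point: the value, the slope, and the second derivative. The function name `second_order_estimate` is hypothetical, chosen just for this example.

```python
def second_order_estimate(x0, x):
    """Rebuild x^2 at x from information known only at x0."""
    value = x0 ** 2    # function value at x0
    slope = 2 * x0     # first derivative at x0
    curvature = 2.0    # second derivative, constant for x^2
    h = x - x0
    return value + slope * h + 0.5 * curvature * h ** 2


# Expand about x0 = 3 and evaluate far away from it.
for x in (-5.0, 0.0, 3.1, 100.0):
    print(x, second_order_estimate(3.0, x), x ** 2)
```

The two printed columns should agree up to floating-point rounding, however far x is from 3. That only works because every derivative of x^2 beyond the second is zero.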

In practice there are many applications in which there are multiple output and input variables (y1, y2, …, x1, x2, …) and the function derivatives cannot be analytically derived. In such applications the Taylor series can nonetheless be useful.

Consider the problem of modeling an aircraft flight using radar tracking data. If you compute the trajectory using flight segments such as climbs, descents and turns, then you can derive the radar measurements. The difference between your derived measurements and the actual measurements indicates your modeling error. You may have errors in the start and stop times of the segments, or in the aircraft performance, such as the rate of climb, during the segments.

Using the Taylor series idea you can perturb each of the modeling parameters to estimate the slope of each of the output variables (the errors in your derived radar measurements) with respect to each of the input variables (the flight parameters).
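To make the perturbation idea concrete, here is a minimal Python sketch. It assumes a deliberately simple, made-up model (level flight until `start_time`, then a climb at a constant `climb_rate`) and invented altitude measurements; neither is the actual radar model described above. Each parameter is nudged up and down by a small step, and the change in each error gives the slope.

```python
def predicted_altitude(params, t):
    """Hypothetical one-segment model: level flight, then a constant-rate climb."""
    start_time, climb_rate = params
    return climb_rate * max(0.0, t - start_time)


def errors(params, times, measured):
    """Derived measurement minus actual measurement, one per time."""
    return [predicted_altitude(params, t) - y for t, y in zip(times, measured)]


def sensitivities(params, times, measured, step=1e-4):
    """Central-difference slope of every error with respect to every parameter."""
    rows = [[0.0] * len(params) for _ in times]
    for j in range(len(params)):
        up = list(params)
        up[j] += step
        down = list(params)
        down[j] -= step
        e_up = errors(up, times, measured)
        e_down = errors(down, times, measured)
        for i in range(len(times)):
            rows[i][j] = (e_up[i] - e_down[i]) / (2 * step)
    return rows


times = [0.0, 10.0, 20.0, 30.0]      # seconds (illustrative)
measured = [0.0, 11.0, 29.0, 52.0]   # made-up altitudes, not real radar data
rows = sensitivities([5.0, 2.0], times, measured)
for t, row in zip(times, rows):
    print(t, [round(v, 6) for v in row])
```

Each row holds the slopes for one measurement time: the first column is the sensitivity to the start time, the second to the climb rate. Before the climb starts, both are zero, which is a hint that measurements there carry no information about those parameters.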

You can then adjust the flight parameters in the direction that reduces the radar measurement errors, until those errors are minimized. This method was suggested by Donald Marquardt in 1963 and traces back to Kenneth Levenberg in 1944. There is no guarantee of success and you must check your answer carefully, but the Marquardt-Levenberg algorithm is robust and has been successfully applied to a wide range of problems. In the aircraft trajectory example, the result is a description of the flight, not in terms of a long list of radar measurements, but rather a short list of meaningful performance parameters. These can then be used, for example, to predict future trajectories.
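The adjustment loop can be sketched as follows. This is a toy version under stated assumptions, not a production implementation: the two-parameter climb model (`ceiling`, `k`), the synthetic noise-free measurements, and the damping schedule (multiply or divide by 10) are all choices made for this illustration. The slopes come from forward differences, and the damping is scaled by the diagonal of JᵀJ.

```python
import math


def model(params, t):
    """Hypothetical climb profile: altitude approaches `ceiling` at rate `k`."""
    ceiling, k = params
    return ceiling * (1.0 - math.exp(-k * t))


def residuals(params, times, measured):
    return [model(params, t) - y for t, y in zip(times, measured)]


def jacobian(params, times, measured, rel_step=1e-6):
    """Forward-difference slopes; columns[j][i] = d residual_i / d param_j."""
    base = residuals(params, times, measured)
    columns = []
    for j, p in enumerate(params):
        h = rel_step * max(1.0, abs(p))
        bumped = list(params)
        bumped[j] = p + h
        shifted = residuals(bumped, times, measured)
        columns.append([(s - b) / h for s, b in zip(shifted, base)])
    return base, columns


def levenberg_marquardt(params, times, measured, lam=1e-3, max_iter=100):
    """Two-parameter damped least squares with a simple accept/reject rule."""
    params = list(params)
    cost = sum(r * r for r in residuals(params, times, measured))
    for _ in range(max_iter):
        r, (c0, c1) = jacobian(params, times, measured)
        # Normal-equation pieces: A = J^T J and g = J^T r
        a00 = sum(x * x for x in c0)
        a01 = sum(x * y for x, y in zip(c0, c1))
        a11 = sum(y * y for y in c1)
        g0 = sum(x * ri for x, ri in zip(c0, r))
        g1 = sum(y * ri for y, ri in zip(c1, r))
        # Solve the damped 2x2 system (A + lam * diag(A)) step = -g
        d00, d11 = a00 * (1.0 + lam), a11 * (1.0 + lam)
        det = d00 * d11 - a01 * a01
        step = [(-g0 * d11 + g1 * a01) / det, (g0 * a01 - g1 * d00) / det]
        trial = [p + d for p, d in zip(params, step)]
        trial_cost = sum(x * x for x in residuals(trial, times, measured))
        if trial_cost < cost:
            # Errors went down: keep the step and trust the linearization more.
            params, cost, lam = trial, trial_cost, lam / 10.0
            if max(abs(d) for d in step) < 1e-12:
                break
        else:
            # Errors went up: reject the step and damp harder.
            lam *= 10.0
    return params


times = list(range(20))
measured = [model([10.0, 0.2], t) for t in times]  # synthetic, noise-free
fit = levenberg_marquardt([8.0, 0.4], times, measured)
print([round(p, 4) for p in fit])
```

With noise-free synthetic data the fit should land close to the parameters used to generate the measurements. Real data will not be this kind, and a different starting guess may lead somewhere else, which is why checking the answer matters.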

## Mark L. Stone

3 years ago

There are much better, more robust algorithms available to solve nonlinear least squares than Marquardt-Levenberg – this is not 1963. Among other crimes, as in the stories below, Marquardt-Levenberg ignores higher order terms, often to great peril. In this case, it uses an approximation for the Hessian of the objective function which ignores higher order terms in the Hessian. Those terms can be crucial if the residuals at the optimum are not small, and ignoring them can cause degraded performance far from the optimum, possibly resulting in non-convergence. And of course, the usual Marquardt-Levenberg estimate for the covariance of the estimated parameters is not only based on the potentially very inaccurate linearization inherent in use of the inverse of the Hessian; that inaccuracy is further compounded by the approximation used by Marquardt-Levenberg in calculating the Hessian. Instead, computationally intensive bootstrapping can be used to estimate the uncertainty distribution of the estimated parameters and the model’s predictions. Also, you really need to think about global optimization, recognizing that if the model is non-convex, as most nonlinear least squares problems are, Marquardt-Levenberg and other local optimization algorithms may find a local optimum which is not globally optimal. And for global optimization, I’m not talking about genetic algorithms and other heuristic junk often falsely advertised as being global optimization algorithms, or just randomly trying several starting values with a local optimization algorithm.

Two funny Taylor series stories. Lightning does indeed strike twice.

1. May 1980: An M.I.T. Physics undergrad I knew excitedly told me that his Bachelor thesis advisor, Philip Morrison (https://en.wikipedia.org/wiki/Philip_Morrison), told him that the results of his Bachelor thesis could be published jointly by them as a refereed letter to the editor of Nature, because his solution to the gravitational time lens problem assigned to him by Prof. Morrison revealed a startling new phenomenon in Cosmology. I asked him to show me his calculation. The calculation, which was simple by the standards of a Math major, such as myself, looked correct to me, until he got to what looked like an elliptic integral of another kind. Then he wrote down the numerical answer, 0.173. I asked him how he numerically evaluated the integral. He told me that he took a three term Taylor series expansion, and integrated term by term. I asked him whether he had bounded the remainder term – he had not. I told him his numerical evaluation could potentially be very inaccurate. I spent several minutes using a hand calculator to apply a standard numerical integration technique with guaranteed error bounds, and got the answer 0.721 +/- 0.001. At that point, I had no idea of the scientific significance of the difference between his and my numerical answers. The next morning, he showed my numerical answer to Morrison, who proclaimed “Oh, that’s just the expected result”. There was no refereed letter to the editor of Nature. The startling new phenomenon in Cosmology was the result of ignoring the remainder term in a Taylor series expansion.

2. January 1986. I was at a briefing to an audience which included the Vice Admiral (3 star) who had to give the final sign off to authorize use of a new model and analysis results for the probability of detecting object X, which was of great importance to the United States Navy. The modeling and analysis was a two year full-time effort by someone who had a Ph.D. in Physics. The model and analysis had already undergone extensive review by government and think tank scientists and analysts, many of whom had Ph.D.s. The modeler/analyst/briefer projected the transparencies showing the model derivation. At one point, he got a formula, in terms of transcendental functions available on any scientific calculator or scientific programming language, such as FORTRAN, for the probability of Y. The probability of Y was one of the key factors which went into the formula for the probability of X, which was the item of ultimate interest. For reasons known only but to God and himself (perhaps because he had seen it done in Physics classes and books, but was only shown the cases where it wound up giving the correct result), he calculated a two term Taylor series expansion for the probability of Y. He then plugged this two term Taylor series for the probability of Y into the formula for the probability of X, which resulted in the final formula for the probability of X. Using the baseline value of the model parameters, the probability of detecting X was shown to be ~0.8. I was rather suspicious of the Taylor series approximation, so I got out my credit card-sized battery operated Casio calculator (which could operate in the darkened conference room) and calculated that using baseline values of the parameters, the two term Taylor series approximation of the probability of Y, which was a factor in the calculation of the probability of X, was ~1.2. Indeed, I could see that using non-baseline values of the parameters, this “probability” could be as high as ~1.8.
Being an analyst full of youthful exuberance, more so than of diplomacy, I proclaimed “We have just witnessed a feat more astounding than the breaking of the sound barrier in aviation, namely, the breaking of the unity barrier in probability”. It turns out that neither the modeler/analyst nor any of the reviewers had ever numerically evaluated the two term Taylor series expansion for the probability of Y, and had no idea it could exceed 1, let alone that it did so using the baseline parameter values. The Vice Admiral stood up, thanked me for pointing out the flaw in the model, and said that he would not sign off on the model and analysis. The project was reassigned.

## Mark L. Stone

3 years ago

By the way, the two term Taylor series for y = x^2 gives the answer exactly only because it is a quadratic function, for which all higher than 2nd order Taylor series terms are exactly zero. That is not the case for functions other than quadratics.
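A quick way to see this, sketched in Python with e^x as a stand-in for a non-quadratic function (the choice of function and the helper names `exp_taylor` and `remainder_bound` are assumptions for this illustration): the truncated series is off by a nonzero amount, and the Lagrange form of the remainder gives a bound on how far off it can be.

```python
import math


def exp_taylor(x, n_terms):
    """Taylor polynomial of e^x about 0, keeping the first n_terms terms."""
    return sum(x ** k / math.factorial(k) for k in range(n_terms))


def remainder_bound(x, n_terms):
    """Lagrange bound on the truncation error of exp_taylor, valid for x >= 0."""
    return math.exp(x) * x ** n_terms / math.factorial(n_terms)


for x in (0.1, 1.0, 2.0):
    error = abs(math.exp(x) - exp_taylor(x, 3))
    print(x, round(error, 6), round(remainder_bound(x, 3), 6))
```

Near the expansion point the error is tiny; further away it grows quickly, and so does the bound. Computing the bound, or just evaluating the approximation numerically, is cheap insurance before trusting a truncated series.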