next up previous
Next: About this document ...

The least-squares method for fitting data

In many experiments, we gather data which is assumed to satisfy some functional relationship between the relevant variables. For example, exponential population growth is described by the function

P(t)=P0(1+r)t,

where P0 is the population in the initial year, r is the annual growth rate, and P(t) is the population after t years (of course, some unit of time other than years could be adopted). To test whether a certain population is growing exponentially, one gathers data points by repeatedly performing a census. The result is a collection of points

\begin{displaymath}(t_j,P_j^{obs}),\ j=1,2,\ldots,k,
\end{displaymath}

where Pjobs is the observed population at time tj. In this notation, the census has been performed k times. The assumed functional relationship is that

\begin{displaymath}P_j^{obs}=P(t_j),\ j=1,2,\ldots,k.
\end{displaymath}

The example of population growth illustrates a common situation that arises after data have been collected: the function that is assumed to explain the data is expressed in terms of some unknown parameters. Without knowing the values of the parameters, we cannot determine whether the collected data agree with the hypothesized relationship between the variables. The usual response to this situation is to determine whether any choice of values for the parameters leads to a model that fits the data well.

We are now faced with this problem: find values of P0 and r so that the hypothesized relationship,

\begin{displaymath}P_j^{obs}=P_0(1+r)^{t_j},\ j=1,2,\ldots,k,
\end{displaymath}

holds ``as nearly as possible''. (From now on, we restrict ourselves to the example of exponential population growth. Extending these ideas to other experiments and hypothesized functional relationships is left to the reader.) To help decide how to define ``as nearly as possible,'' we now look at a specific example.

Table 1 gives the total population of the United States, according to the official US census, from 1790 to 1960.

 
Table 1: US population according to the official national census.
year population
1790 3893874
1800 5084912
1810 6807786
1820 10037323
1830 12785928
1840 16987946
1850 23054152
1860 31183582
1870 38155505
1880 49371340
1890 62116811
1900 74607225
1910 91641195
1920 105273049
1930 122288177
1940 130962661
1950 149877932
1960 178554916
 

We plot these data points in Figure 1, together with the exponential P(t)=P0(1+r)t-1790, where P0 is chosen to be the US population in 1790, and r is chosen arbitrarily to be 0.02 ($2\%$).

As we see from Figure 1, the exponential model does not perfectly fit the data; that is, the data points do not all lie on the graph of the exponential function. Perhaps there are different choices of P0 and rwhich would lead to a better fit, but before we can find such values, we should agree on what is meant by ``better.''

  
Figure 1: The data from Table 1 together with a proposed exponential model ( P0=3893874, r=0.02).
\begin{figure}\begin{center}
\leavevmode
\epsfbox{misfit.eps}
\end{center}\end{figure}

Here is a proposal for measuring how well a model fits the data: define the misfit between the model and the data to be

 \begin{displaymath}
\sum_{j=1}^k\left(P_0(1+r)^{t_j-1790}-P_j^{obs}\right)^2.
\end{displaymath} (1)

This formula adds up the squares of the differences between the predictions of the model and the actual observed population for each data point. When the differences between the model prediction and the data are small, formula (1) will give a small results, while large differences will give a large result in (1). Therefore, at least in some sense, (1) gives some basis for comparing how two different models (that is, two different choices of P0 and r) fit the data.

The particular form of (1) probably raises questions in the reader's mind. In particular: Why is the difference between the model prediction and the observed datum squared? Would not the formula

 \begin{displaymath}
\sum_{j=1}^k\left\vert P_0(1+r)^{t_j-1790}-P_j^{obs}\right\vert
\end{displaymath} (2)

be a more natural choice? Or perhaps we should use the formula

 \begin{displaymath}
\max_{j=1,\ldots,k}\left\vert P_0(1+r)^{t_j-1790}-P_j^{obs}\right\vert,
\end{displaymath} (3)

which judges the goodness of a model by the largest individual error, rather than by totaling all errors.

In fact, an argument can be made for any of the choices (1), (2), (3). Two arguments in favor of using (1), the sum of the squared errors are:

1.
The resulting problem, choosing the parameters P0 and r to minimize (1), is easier to solve (as compared to the problems that must be solved if (2) or (3) is used to measure the misfit).
2.
If it is assumed that the observed data are actually equal to model predictions (with the correct choice of model parameters), except for measurement noise, and if it is assumed that the measurement errrors are normally distributed (that is, the errors can be described by the famous ``bell curve''), then (1) is the natural choice for statistical reasons. (We will not try to explain these reasons here.)
Under other assumptions about the statistical nature of the errors, either (2) or (3) is more natural.

From now on, we will assume that good choices for P0 and r are those which make (1) small. We refer to (1) as the least-squares misfit for the model parameters P0 and r. For the sake of example, we compute the least-squares misfit:

These results suggest that (P0,r)=(3893874,0.025) is a better choice than (P0,r)=(3893874,0.02). This is conclusion is supported by a comparison of Figures 1 and 2.
  
Figure 2: The data from Table 1 together with a proposed exponential model ( P0=3893874, r=0.025).
\begin{figure}\begin{center}
\leavevmode
\epsfbox{misfit1.eps}
\end{center}\end{figure}

For convenience, let us write

\begin{displaymath}J(r,P_0)=\sum_{j=1}^k\left(P_0(1+r)^{t_j-1790}-P_j^{obs}\right)^2.
\end{displaymath}

We can now begin to use the function J to choose the model parameters P0 and r. For our first attempt, suppose we decide to take as P0the population of the US in 1790: P0= 3893874. We can then ask the question: What is the best choice of r? This is a question naturally addressed using the methods of calculus: given a function of one variable, find the value of the variable giving the minimum value of the function. Here the function of one variable is J(r,P0), and the variable is r(remember that we are regarding P0 as fixed). Without using calculus techniques, we can solve the problem from a graph. Figure 3 suggests that there is a unique value of r that minimizes J(r,P0), with P0=3893874.
  
Figure 3: The graph of J(r,P0) with P0 fixed at 3893874.
\begin{figure}\begin{center}
\leavevmode
\epsfbox{misfit2.eps}
\end{center}\end{figure}

By zooming in on this graph, we can see that this optimal value of ris about 0.0235. The graph of the resulting exponential model ( P0=3893874, r=0.0235), along with the data, is given in Figure 4.
  
Figure 4: The best exponential fit to the population data using P0=3893874.
\begin{figure}\begin{center}
\leavevmode
\epsfbox{misfit3.eps}
\end{center}\end{figure}

The fit to data given in Figure 4 is not particularly good, which suggests that the population is not growing in accordance with an exponential model, at least not one in which the initial population is constrained to be the actual initial population (as measured by the US census) in 1790. Perhaps we can find a better exponential model by choosing a different value for the parameter P0.

We really want to mimize J(r,P0) by choosing r and P0 independently. This is the problem of minimizing a function of two variables, and it is also addressed by calculus techniques, in this case, techniques from multivariable calculus. We can also attempt a graphical solution, but, as Figure 5 shows, it is rather difficult to find the lowest point on the surface z=J(r,P0) (or, indeed, to be certain that there is a lowest point).

  
Figure 5: The graph of the surface z=J(r,P0).
\begin{figure}\begin{center}
\leavevmode
\epsfbox{surf.eps}
\end{center}\end{figure}

It turns out that, due to the special structure of J, we can minimize J(r,P0) without knowing multivariable calculus. This is because the function J is, for each fixed r, a quadratic in the parameter P0. We will show below that we can write

\begin{displaymath}J(r,P_0)=a_rP_0^2+b_rP_0+c_r,\ a_r>0,
\end{displaymath}

where ar, br, and cr are real numbers depending on r. Most readers probably know how to find the value of P0 minimizing this quadratic; it is the value

\begin{displaymath}P_0(r)=-\frac{b_r}{2a_r}.
\end{displaymath}

We write P0(r) to indicate this special value of P0. The point (r,P0(r),J(r,P0(r))), as r varies, follows the bottom of the valley discernible in Figure 5. By minimizing $\hat{J}(r)=J(r,P_0(r))$ as a function of r, we can find the bottom of the valley, which corresponds to the desired values of r and P0.

We now perform the algebraic manipulations to show that J(r,P0) is a quadratic in P0 for each fixed r. We have

\begin{eqnarray*}J(r,P_0)&=&\sum_{j=1}^k\left(P_0(1+r)^{t_j-1790}-P_j^{obs}\righ...
...j=1}^k\left(P_j^{obs}\right)^2\right)\\
&=&a_rP_0^2+b_rP_0+c_r,
\end{eqnarray*}


where

\begin{eqnarray*}a_r&=&\sum_{j=1}^k(1+r)^{2(t_j-1790)},\\
b_r&=&-2\sum_{j=1}^kP...
...(1+r)^{t_j-1790},\\
c_r&=&\sum_{j=1}^k\left(P_j^{obs}\right)^2.
\end{eqnarray*}


We then have P0(r)=-br/(2ar).

We now minimize

\begin{displaymath}\hat{J}(r)=J(r,P_0(r))
\end{displaymath}

by examining its graph, which is given in Figure 6. By zooming in on this graph, we find that the minimum value of $\hat{J}$ is attained at $r^*\doteq 0.016906$, and the corresponding value of P0 is

\begin{displaymath}P_0^*P_0(r^*)\doteq 1.07284\cdot 10^7.
\end{displaymath}


  
Figure: The graph of $\hat{J}(r)=J(r,P_0(r))$.
\begin{figure}\begin{center}
\leavevmode
\epsfbox{Jhat.eps}
\end{center}\end{figure}

The corresponding ``best-fit'' exponential is displayed in Figure 7, together with the original population data.
  
Figure: The best exponential fit to the population data ( $r\doteq 0.016906$, $P_0\doteq 1.07284\cdot 10^7$).
\begin{figure}\begin{center}
\leavevmode
\epsfbox{bestfit.eps}
\end{center}\end{figure}

By comparing Figures 2, 3, 4, and 7, it seems apparent that we really did find a better fit to the data by using the least-squares approach. However, it is also fair to question whether the population growth of the United States really follows an exponential trend. One way to investigate this question is to ask whether the model that we have found is able to predict future population growth. Figure 8 show the data, our best-fit exponential, and another data set, consisting of the yearly populations of the United States from 1900 to 1999 (this new data set also comes from the US census). As this figure shows, the best-fit exponential significantly overestimates the population growth after 1960.

  
Figure: The best exponential fit to the population data ( $r\doteq 0.016906$, $P_0\doteq 1.07284\cdot 10^7$), together with further population data. The graph of the best-fit exponential is dashed, the first data set corresponds to the large dots, and the second data set to the small dots.
\begin{figure}\begin{center}
\leavevmode
\epsfbox{bestfit1.eps}
\end{center}\end{figure}

Of course, we could now combine the two data sets and find a new best-fit exponential agreeing with all of the data as closely as possible. However, it may be better to consider a different model of population growth.


 
next up previous
Next: About this document ...
Mark S. Gockenbach
2000-09-28