Next: About this document ...
The least-squares method for fitting data
In many experiments, we gather data which is assumed to satisfy some
functional relationship between the relevant variables. For example,
exponential population growth is described by the function
P(t)=P0(1+r)t,
where P0 is the population in the initial year, r is the annual
growth rate, and P(t) is the population after t years (of course,
some unit of time other than years could be adopted). To test whether
a certain population is growing exponentially, one gathers data points
by repeatedly performing a census. The result is a collection of
points
where Pjobs is the observed population at time tj. In this
notation, the census has been performed k times. The assumed functional
relationship is that
The example of population growth illustrates a common situation that arises
after data have been collected: the function that is assumed to explain
the data is expressed in terms of some unknown parameters. Without knowing
the values of the parameters, we cannot determine whether the collected
data agree with the hypothesized relationship between the variables.
The usual response to this situation is to determine whether any choice
of values for the parameters leads to a model that fits the data well.
We are now faced with this problem: find values of P0 and r so that
the hypothesized relationship,
holds ``as nearly as possible''. (From now on, we restrict ourselves to
the example of exponential population growth. Extending these ideas to
other experiments and hypothesized functional relationships is left to
the reader.) To help decide how to define ``as nearly as possible,''
we now look at a specific example.
Table 1 gives the total population of the United States,
according to the official US census, from 1790 to 1960.
Table 1:
US population according to the official national census.
| year |
population |
| 1790 |
3893874 |
| 1800 |
5084912 |
| 1810 |
6807786 |
| 1820 |
10037323 |
| 1830 |
12785928 |
| 1840 |
16987946 |
| 1850 |
23054152 |
| 1860 |
31183582 |
| 1870 |
38155505 |
| 1880 |
49371340 |
| 1890 |
62116811 |
| 1900 |
74607225 |
| 1910 |
91641195 |
| 1920 |
105273049 |
| 1930 |
122288177 |
| 1940 |
130962661 |
| 1950 |
149877932 |
| 1960 |
178554916 |
|
We plot these data points in Figure 1, together with the
exponential
P(t)=P0(1+r)t-1790, where P0 is chosen to be the
US population in 1790, and r is chosen arbitrarily to be 0.02 (
).
As we see from Figure 1, the exponential model does not perfectly
fit the data; that is, the data points do not all lie on the graph of the
exponential function. Perhaps there are different choices of P0 and rwhich would lead to a better fit, but before we can find such values, we
should agree on what is meant by ``better.''
Figure 1:
The data from Table 1 together with a proposed
exponential model (
P0=3893874, r=0.02).
 |
Here is a proposal for measuring how well a model fits the data: define
the misfit between the model and the data to be
 |
(1) |
This formula adds up the squares of the differences between the predictions
of the model and the actual observed population for each data point.
When the differences between the model prediction and the data are small,
formula (1) will give a small results, while large differences will
give a large result in (1). Therefore, at least in some sense,
(1) gives some basis for comparing how two different models (that is,
two different choices of P0 and r) fit the data.
The particular form of (1) probably raises questions in the reader's
mind. In particular: Why is the difference between the model prediction and
the observed datum squared? Would not the formula
 |
(2) |
be a more natural choice? Or perhaps we should use the formula
 |
(3) |
which judges the goodness of a model by the largest individual error, rather
than by totaling all errors.
In fact, an argument can be made for any of the choices (1),
(2), (3). Two arguments in favor of using
(1), the sum of the squared errors are:
- 1.
- The resulting problem, choosing the parameters P0 and r to minimize
(1), is easier to solve (as compared to the problems that must be
solved if (2) or (3) is used to measure the misfit).
- 2.
- If it is assumed that the observed data are actually equal to model
predictions (with the correct choice of model parameters), except for
measurement noise, and if it is assumed that the measurement errrors are
normally distributed (that is, the errors can be described by the
famous ``bell curve''), then (1) is the natural choice for
statistical reasons. (We will not try to explain these reasons here.)
Under other assumptions about the statistical nature of the errors, either
(2) or (3) is more natural.
From now on, we will assume that good choices for P0 and r are those
which make (1) small. We refer to (1) as the least-squares
misfit for the model parameters P0 and r. For the sake of example,
we compute the least-squares misfit:
- for
P0=3893874 (the actual population in 1790) and r=0.02.
The result is
.
- for
P0=3893874 (the actual population in 1790) and r=0.025.
The result is
.
These results suggest that
(P0,r)=(3893874,0.025) is a better choice
than
(P0,r)=(3893874,0.02). This is conclusion is supported by a
comparison of Figures 1 and 2.
Figure 2:
The data from Table 1 together with a proposed
exponential model (
P0=3893874, r=0.025).
 |
For convenience, let us write
We can now begin to use the function J to choose the model parameters
P0 and r. For our first attempt, suppose we decide to take as P0the population of the US in 1790:
P0= 3893874. We can then ask the
question: What is the best choice of r? This is a question naturally
addressed using the methods of calculus: given a function of one variable,
find the value of the variable giving the minimum value of the function.
Here the function of one variable is J(r,P0), and the variable is r(remember that we are regarding P0 as fixed). Without using calculus
techniques, we can solve the problem from a graph. Figure 3
suggests that there is a unique value of r that minimizes J(r,P0),
with
P0=3893874.
Figure 3:
The graph of J(r,P0) with P0 fixed at 3893874.
 |
By zooming in on this graph, we can see that this optimal value of ris about 0.0235. The graph of the resulting exponential model
(
P0=3893874, r=0.0235), along with the data, is given in Figure
4.
Figure 4:
The best exponential fit to the population data using
P0=3893874.
 |
The fit to data given in Figure 4 is not particularly good, which
suggests that the population is not growing in accordance with an
exponential model, at least not one in which the initial population is
constrained to be the actual initial population (as measured by the US
census) in 1790. Perhaps we can find a better exponential model by
choosing a different value for the parameter P0.
We really want to mimize J(r,P0) by choosing r and P0 independently.
This is the problem of minimizing a function of two variables, and it
is also addressed by calculus techniques, in this case, techniques from
multivariable calculus. We can also attempt a graphical solution, but, as
Figure 5 shows, it is rather difficult to find the lowest point
on the surface
z=J(r,P0) (or, indeed, to be certain that there is a
lowest point).
Figure 5:
The graph of the surface
z=J(r,P0).
 |
It turns out that, due to the special structure of J, we can minimize
J(r,P0) without knowing multivariable calculus. This is because the
function J is, for each fixed r, a quadratic in the parameter P0.
We will show below that we can write
where ar, br, and cr are real numbers depending on r.
Most readers probably know how to find the value of P0 minimizing
this quadratic; it is the value
We write P0(r) to indicate this special value of P0. The point
(r,P0(r),J(r,P0(r))), as r varies, follows the bottom of the valley
discernible in Figure 5. By minimizing
as
a function of r, we can find the bottom of the valley, which corresponds
to the desired values of r and P0.
We now perform the algebraic manipulations to show that J(r,P0) is
a quadratic in P0 for each fixed r. We have
where
We then have
P0(r)=-br/(2ar).
We now minimize
by examining its graph, which is given in Figure 6. By zooming
in on this graph, we find that the minimum value of
is attained
at
,
and the corresponding value of P0 is
Figure:
The graph of
.
 |
The corresponding ``best-fit'' exponential is displayed in Figure
7, together with the original population data.
Figure:
The best exponential fit to the population data
(
,
).
 |
By comparing Figures 2, 3, 4, and
7, it seems apparent that we really did find a better fit
to the data by using the least-squares approach. However, it is also
fair to question whether the population growth of the United States
really follows an exponential trend. One way to investigate this question
is to ask whether the model that we have found is able to predict future
population growth. Figure 8 show the data, our best-fit
exponential, and another data set, consisting of the yearly populations
of the United States from 1900 to 1999 (this new data set also comes
from the US census). As this figure shows, the best-fit exponential
significantly overestimates the population growth after 1960.
Figure:
The best exponential fit to the population data
(
,
), together with further
population data. The graph of the best-fit exponential is dashed, the
first data set corresponds to the large dots, and the second data set
to the small dots.
 |
Of course, we could now combine the two data sets and find a new best-fit
exponential agreeing with all of the data as closely as possible. However,
it may be better to consider a different model of population growth.
Next: About this document ...
Mark S. Gockenbach
2000-09-28