Fitting Functions to Data:
Linear and Exponential Regression
miscellaneous on-line topics for
Finite Mathematics and Calculus
Applied to the Real World

              Return to Main Page
Index of On-Line Topics
Exercises for This Topic
Utility: On-Line Regression Utility
Everything for Calculus
Everything for Finite Math
Everything for Finite Math & Calculus

We have often seen algebraically defined functions used to model relationships between variables. For example, a demand function expresses the demand for an item as a function of the unit price, and a cost function expresses total cost as a function of the number of items manufactured. (See the Topic Summary of Functions for some examples.) These functions are often called models.

In Chapter 1 of Calculus Applied to the Real World we see how to obtain a linear or exponential model from two data points: we only need to find the equation of the line or exponential curve passing through them. However, it often happens that we have many data points that don't quite all lie on one line or exponential curve. The problem then is to find the line, exponential curve, quadratic curve, or other curve that comes closest to passing through all of the points, and so best describes a relationship such as that between price and demand, or cost. In this interactive module, we discuss how we can come up with such a model.

1. Best Fit Line (Regression Line)

We start with an attempt to construct a linear demand function. Suppose that your market research of real estate investments reveals the following sales figures for new homes of different prices over the past year.

Price (Thousands of $)          $150-$169   $170-$189   $190-$209   $210-$229   $230-$249   $250-$269   $270-$289
Sales of New Homes This Year    126         103         82          75          82          40          20

If we simplify the situation by replacing each of the price ranges by a single price in the middle of the range, we get the following table:

Price (Thousands of $)          $160   $180   $200   $220   $240   $260   $280
Sales of New Homes This Year    126    103    82     75     82     40     20

We would like to use these data to construct a demand function for the real estate market. (Recall that a demand function gives demand y, measured here by annual sales, as a function of unit price, x.) Here is a plot of y versus x.

The data suggest, more or less, a straight line, and hence a roughly linear relationship between x and y. Here are several possible "straight line fits."

Q Which line best fits the data?
A We would like the sales predicted by the best-fit line (predicted values) to be as close to the actual sales (observed values) as possible. The differences between the predicted values and the observed values appear as the vertical distances shown in the figure below.

Q Since we want the vertical distances to be as small as possible, why can't we set them all to zero and solve for the slope and intercept of the straight line?
A If this were possible, then there would be a straight line that passes through all the data points. A look at the graph shows that this is not the case.

Q Then why not find the line that minimizes all the vertical distances?
A This is not possible either. The line that minimizes the first two distances is the line that passes through the first two data points, since it makes the distances 0. But this line certainly does not minimize the distance to the third point. In other words, there is a trade-off: making some distances smaller makes others larger.

Q So what do we do?
A Since we cannot minimize all of the distances, we minimize some reasonable combination of them. Now, one reasonable combination of the distances would be their sum, but that turns out to be difficult to work with (because distances are measured in terms of absolute values). Instead, we use the sum of the squares of the distances (no absolute values required). The line that minimizes this sum is called the best fit line, regression line, or least squares line associated with the given data.
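To make the idea concrete, the following Python sketch computes the sum of the squares of the vertical distances for the home-sales data above and two candidate lines. The function name and both candidate lines are hypothetical, chosen only for illustration; neither is claimed to be the best fit.

```python
# Home-sales data from the table above (price in thousands of $, annual sales).
prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]


def sum_squared_distances(m, b, xs, ys):
    """Sum of squared vertical distances between the points and y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines (illustrative choices only).
sse_a = sum_squared_distances(-0.8, 250, prices, sales)
sse_b = sum_squared_distances(-1.0, 290, prices, sales)

print(f"candidate A: {sse_a:.1f}")
print(f"candidate B: {sse_b:.1f}")
print("better fit:", "A" if sse_a < sse_b else "B")
```

The candidate with the smaller sum is the better fit in the least squares sense; the regression line is the one line for which this sum is as small as possible.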

Q How do we obtain the equation of the best fit line?
A Following is the formula for the best fit straight line. To justify it requires some calculus. If you know about partial derivatives, consult the chapter on the calculus of several variables in Calculus Applied to the Real World for a detailed explanation.
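For readers comfortable with partial derivatives, here is a brief sketch of where the formula comes from. The symbol S for the sum of squared vertical distances is our own notation, introduced only for this sketch, and each sum runs over the n data points. The detailed explanation remains the one in the chapter cited above.

```latex
S(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2

\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \bigl( y_i - m x_i - b \bigr) = 0,
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \bigl( y_i - m x_i - b \bigr) = 0

m \sum x^2 + b \sum x = \sum xy,
\qquad
m \sum x + n b = \sum y
```

Solving the last pair of linear equations for m and b yields the slope and intercept given in the box that follows.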

Regression (Best Fit) Line

The best fit line associated with the n points \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) has the form

    \[ y = mx + b \]
where
    \[ \text{slope} = m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]

    \[ \text{intercept} = b = \frac{\sum y - m(\sum x)}{n} \]
Here, \(\sum\) means "the sum of." Thus
    \(\sum xy\) = sum of products = \(x_1y_1 + x_2y_2 + \cdots + x_ny_n\)
    \(\sum x\) = sum of x-values = \(x_1 + x_2 + \cdots + x_n\)
    \(\sum y\) = sum of y-values = \(y_1 + y_2 + \cdots + y_n\)
    \(\sum x^2\) = sum of squares of x-values = \(x_1^2 + x_2^2 + \cdots + x_n^2\)
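The formula translates almost line by line into code. The following is a minimal Python sketch; the function name regression_line and the small data set are our own illustrative choices, not part of the original text.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope and intercept, exactly as in the boxed formula.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Hypothetical data, used only to exercise the function.
m, b = regression_line([1, 2, 3], [2, 4, 9])
print(f"y = {m}x + {b}")
```

Note that the denominator is zero when all the x-values are equal; this sketch does not guard against that case.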

Using the formula above is easy, as the following example shows.

Example 1 Computing a Regression Line by Hand

Find the least squares line associated with the following data:

x    1     2     3     4
y    1.5   1.6   2.1   3.0

Solution In order to apply the formula, it is best to organize the data in a table as shown, with a column for each of x, y, xy, and x^2, and the sum of each column in the last row.

x      y      xy     x^2
1      1.5    1.5    1
2      1.6    3.2    4
3      2.1    6.3    9
4      3.0    12.0   16
Sums   \(\sum x = 10\)   \(\sum y = 8.2\)   \(\sum xy = 23.0\)   \(\sum x^2 = 30\)

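Substituting the correct values from the above table into the formula gives (n = 4)

    \[ m = \frac{4(23.0) - (10)(8.2)}{4(30) - 10^2} = \frac{10}{20} = 0.5 \]

    \[ b = \frac{8.2 - (0.5)(10)}{4} = 0.8 \]

Thus our least squares line is

    \[ y = 0.5x + 0.8 \]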

Before we go on... Here is a plot of the data points and the least squares line.

Notice that the line doesn't pass through even one of the original points, and yet it is the straight line that best approximates them.
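The by-hand table method of Example 1 can be mirrored in a short Python sketch: build the columns x, y, xy, x^2, sum them, and substitute into the formula. The variable names are our own; the data are those of the example. The last step computes the vertical distance from each point to the line, to check the remark that the line misses every data point.

```python
xs = [1, 2, 3, 4]
ys = [1.5, 1.6, 2.1, 3.0]

# Build the rows of the table: x, y, xy, x^2.
rows = [(x, y, x * y, x * x) for x, y in zip(xs, ys)]
for row in rows:
    print("{:>4} {:>5} {:>6.1f} {:>4}".format(*row))

# Column sums, as in the last row of the table.
n = len(rows)
sum_x, sum_y, sum_xy, sum_x2 = (sum(col) for col in zip(*rows))

# Substitute into the formula for slope and intercept.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(f"m = {m:.4f}, b = {b:.4f}")

# Observed value minus predicted value for each data point.
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
print([round(r, 4) for r in residuals])
```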


Let us now return to the data on demand for real estate with which we began this topic.

Example 2 Demand for Homes

Find a linear demand equation that best fits the following data, and use it to predict annual sales of homes priced at $140,000.

x = Price (Thousands of $)          $160   $180   $200   $220   $240   $260   $280
y = Sales of New Homes This Year    126    103    82     75     82     40     20

Solution Here is the table we use to organize the calculations.

x      y      xy        x^2
160    126    20,160    25,600
180    103    18,540    32,400
200    82     16,400    40,000
220    75     16,500    48,400
240    82     19,680    57,600
260    40     10,400    67,600
280    20     5,600     78,400
Sums   \(\sum x = 1{,}540\)   \(\sum y = 528\)   \(\sum xy = 107{,}280\)   \(\sum x^2 = 350{,}000\)

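Substituting these values in the formula gives (n = 7)

    \[ m = \frac{7(107{,}280) - (1{,}540)(528)}{7(350{,}000) - 1{,}540^2} = \frac{-62{,}160}{78{,}400} \approx -0.7929 \]

    \[ b = \frac{528 - (-0.7928571429)(1{,}540)}{7} \approx 249.9 \]

Notice that we used the most accurate value, m = -0.7928571429, that we could obtain on our calculator in the formula for b rather than the rounded value -0.7929. This illustrates the following important general guideline: use the most accurate values available in intermediate calculations, and round only the final answer.

Thus our least squares line is

    \[ y = -0.7929x + 249.9 \]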

We can now use this equation to predict the annual sales of homes priced at $140,000, as we were asked to do. Remembering that x is the price in thousands of dollars, we set x = 140, and solve for y, getting y ≈ 139. Thus our model predicts that approximately 139 homes will have been sold in the range $140,000-$159,000.
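The calculation in Example 2, including the effect of rounding the slope too early, can be checked with a short Python sketch. The variable names are our own; the data are those of the example, and rounding the slope to four decimal places mirrors the rounded value quoted above.

```python
prices = [160, 180, 200, 220, 240, 260, 280]   # x, in thousands of $
sales = [126, 103, 82, 75, 82, 40, 20]         # y, annual sales

n = len(prices)
sum_x = sum(prices)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(prices, sales))
sum_x2 = sum(x * x for x in prices)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept from the full-precision slope ...
b_full = (sum_y - m * sum_x) / n
# ... and, for comparison, from the slope rounded to four decimal places.
m_rounded = round(m, 4)
b_from_rounded = (sum_y - m_rounded * sum_x) / n

predicted = m * 140 + b_full  # predicted sales at x = 140

print(f"m = {m:.10f}")
print(f"b (full-precision m) = {b_full:.4f}")
print(f"b (rounded m)        = {b_from_rounded:.4f}")
print(f"predicted sales at x = 140: {predicted:.1f}")
```

The two intercepts differ slightly, which is the reason for carrying the unrounded slope through the calculation.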

Before we go on... We must remember that these figures were for sales in a range of prices. For instance, it would be extremely unlikely that 139 homes would have been sold at exactly $140,000. On the other hand, the model does predict that, were we to place 139 homes on the market at $140,000, we could expect to sell them all.

Here is the original data, together with the least squares line.


Q If the given data points all happen to lie on a straight line, is this the line we get by the best fit method?
A Yes. This has the following implication: you can use linear regression on a graphing calculator to check your calculations of the equation of a straight line passing through two specified points.
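The same check can be done in code instead of on a graphing calculator. In this Python sketch the two points (1, 3) and (4, 9) are a hypothetical example of our own; the line through them is y = 2x + 1, and the regression formula should reproduce it.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Slope and intercept of the line through two points, computed directly.
(x1, y1), (x2, y2) = (1, 3), (4, 9)
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

# The same line, obtained by regression on the two points.
m, b = regression_line([x1, x2], [y1, y2])
print(slope, intercept)
print(m, b)
```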

Q If the given points do not lie on a straight line, is there a way we can tell how far off they are from lying on a straight line?
A There is a way of measuring the "goodness of fit" of the least squares line, called the coefficient of correlation. This is a number r between -1 and 1. The closer it is to -1 or 1, the better the fit. For an exact fit, we would have r = -1 (for a negative slope line) or r = 1 (for a positive slope line). For a bad fit, we would have r close to 0. The figure below shows several collections of data points with best-fit lines and corresponding values of r.

The correlation coefficient can be calculated with the following formula. (To justify this formula requires a fair knowledge of statistics, so we shall not attempt to do so here.)

Coefficient of Correlation

 
    \[ \text{Coefficient of Correlation} = r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\left[n(\sum x^2) - (\sum x)^2\right]^{0.5} \left[n(\sum y^2) - (\sum y)^2\right]^{0.5}} \]
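As a rough illustration, the formula can be coded as follows. The function name correlation is our own; the first data set is the home-sales data from Example 2, and the second is a hypothetical set of three points lying exactly on a line of negative slope.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation r, computed from the formula above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Home-sales data from Example 2.
prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]
r_homes = correlation(prices, sales)

# Hypothetical points lying exactly on a line of negative slope.
r_exact = correlation([1, 2, 3], [5, 3, 1])

print(f"home-sales data: r = {r_homes:.4f}")
print(f"exact fit:       r = {r_exact:.4f}")
```

For the home-sales data r comes out negative and fairly close to -1, consistent with a downward-sloping line that fits reasonably well.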


2. Best Fit Exponential Curve (Regression Exponential Curve)

Q Now we know how to fit a straight line to given data. What about an exponential curve, of the form

    \[ q = Ar^t \, ? \]

A The idea is to convert an exponential curve to a linear one, using the logarithm, as follows.

Start with the exponential function

    \[ q = Ar^t \]

and take the logarithm of both sides:

    \[ \log q = \log(Ar^t) \]

The properties of logarithms give

    \[ \log q = \log A + \log(r^t) \]

or

    \[ \log q = \log A + t \log r \]

This expresses log(q) as a linear function of t, with

    \[ \text{slope} = m = \log r, \qquad \text{intercept} = b = \log A \]

Therefore, if we find the best-fit line using log(q) as a function of t, the slope and intercept will be given as above, and so we can obtain r and A by

    \[ r = 10^m, \qquad A = 10^b \]

To summarize,

Exponential Regression

To obtain a best-fit exponential curve of the form

    \[ q = Ar^t: \]
  1. Take y = log(q), and x = t.
  2. Find the regression line y = mx + b for the data (x, y).
  3. The desired exponential model is
      \[ q = Ar^t, \]
    where \(r = 10^m\), and \(A = 10^b\).
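The three steps above can be sketched in Python as follows. The function names and the test data are our own illustrative choices: the data are generated from q = 3(2^t), so the procedure should recover A = 3 and r = 2 up to rounding error. The sketch assumes every q-value is positive, since the logarithm is otherwise undefined.

```python
import math


def regression_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def exponential_regression(ts, qs):
    """Return (A, r) for the best-fit curve q = A * r**t."""
    # Step 1: take y = log(q) (base 10) and x = t.
    xs = list(ts)
    ys = [math.log10(q) for q in qs]
    # Step 2: find the regression line y = m*x + b.
    m, b = regression_line(xs, ys)
    # Step 3: convert back, r = 10^m and A = 10^b.
    return 10 ** b, 10 ** m


# Hypothetical data lying exactly on q = 3 * 2**t.
A, r = exponential_regression([0, 1, 2, 3], [3, 6, 12, 24])
print(f"q = {A:.4f} * {r:.4f}^t")
```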


Example 3 Sales of Compaq

Revenues from sales of Compaq computers are shown in the following table, where t represents time in years since 1990.* Obtain an exponential regression model for the data.

t = Year (1990 = 0)        0    2    4    7
R = Revenue ($ billion)    3    4    11   25

* Data are rounded. Source: Company Reports/The New York Times, January 27, 1998, p. D1.

Solution Since we need to model log(R) as a linear function of t, we first make a table with x = t and y = log(R), and then calculate the regression line, y = mx + b.

x (= t)         0          2          4         7
y (= log(R))    0.477121   0.602060   1.04139   1.39794

Instead of doing this calculation by hand as we did in the above examples, you can do it automatically using the on-line regression utility. Just enter the x- and y-values in the table, and press the "y = mx+b" button. (Yes, that utility does exponential regression as well, but we would like you to know how it works!)

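The linear regression model we obtain is

    \[ y = 0.13907x + 0.42765 \]

Thus, the desired exponential model is

    \[ R = Ar^t, \]

where \(r = 10^m = 10^{0.13907} \approx 1.3774\), and \(A = 10^b = 10^{0.42765} \approx 2.6770\).

This gives our revenue model as

    \[ R = 2.6770(1.3774)^t \]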

Before we go on... Go to the on-line regression utility, enter the original data (before you took the logarithms) and press the "y = a(b^x)" button. What do you find?
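The hand calculation in Example 3 can also be cross-checked offline with a short Python sketch. The variable names are our own; the data are those of the example, and the logarithms are base 10, as in the table above.

```python
import math

years = [0, 2, 4, 7]        # t, years since 1990
revenue = [3, 4, 11, 25]    # R, in $ billion

# Work with x = t and y = log(R).
xs = years
ys = [math.log10(R) for R in revenue]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Regression line y = m*x + b for the logged data.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

# Convert back to the exponential model R = A * base**t.
base = 10 ** m
A = 10 ** b

print(f"y = {m:.5f}x + {b:.5f}")
print(f"R = {A:.4f} * ({base:.4f})^t")

# Compare the model with the observed revenues.
for t, R in zip(years, revenue):
    print(t, R, round(A * base ** t, 1))
```

The name base is used in place of r only to avoid confusion with the coefficient of correlation.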


3. Other Forms of Regression

At the on-line regression utility, you can also find best-fit curves of the following forms:

On the TI-83, you will find all of these, as well as the following


Last Updated: February 1998
Copyright © 1998 Stefan Waner and Steven R. Costenoble