{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1 Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For many physical systems, the effect you’re investigating has a simple dependence on a single cause. The simplest interesting dependence is a linear one, where the cause (described by $x$) and the effect (described by $y$) are related by\n", "\n", "\\begin{equation}\n", "y = mx + b. \\tag{5.1}\n", "\\end{equation}\n", "\n", "As you probably recognize, this is a linear relationship. If you plot $y$ vs. $x$, the resulting graph would be a straight line. \n", "\n", "For example, suppose that you are traveling by car and watching the speedometer closely to stay at a constant speed. If you were to record the odometer reading as a function of the amount of time that you’ve been on the road, you would find that a graph of your results was a straight line. In this example, it would be helpful to rewrite that general linear equation (5.1) to fit the specific physical situation and give physical interpretations of all the symbols. We could write the equation as\n", "\n", "\\begin{equation}\n", "d = vt + d_0, \\tag{5.2}\n", "\\end{equation}\n", "\n", "where $d$ is the odometer reading at time $t$, $v$ is the speed, and $d_0$ is the initial odometer reading (at $t=0$). \n", "\n", "Suppose your odometer works correctly, but your speedometer isn’t working properly so that the number the needle is pointing to is not really the speed of the car. It’s working well enough that if you keep the needle pinned at 60 mph, your car is traveling at some constant speed, but you just can’t be confident that the constant speed is in fact 60 mph. You could determine your speed by recording the value that the odometer registers at several different times. 
If both your odometer and your clock were ideal measuring devices, able to register displacement and time without experimental uncertainty, a graph of the odometer values versus time would lie along a perfectly straight line. The slope of this line would be the true speed corresponding to your chosen constant speedometer reading. \n", "\n", "Of course, your time and distance measurements will always include some experimental uncertainty. Therefore, your data points will not all lie exactly on the line. The purpose of this chapter is to describe a procedure for finding the slope and intercept of the straight line that “best” represents your data in the presence of the inevitable experimental uncertainty of your measurements. The process of determining such a “best fit” line is called linear regression. Note that it is far better to use a best fit line to a set of data instead of calculating the speed using single measurements of the distance and time. Linear regression allows us to use multiple measurements at once. As you'll see in the next section, it will also allow you to determine the uncertainty of the speed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that you want to fit a set of data points $(x_i,y_i)$, where\n", "$i = 1,2,\\ldots,N$, to a straight line, $y=ax+b$. This involves choosing the parameters $a$ and $b$ to minimize the sum of the squares of the differences between the data points and the linear function. The differences are usually defined in one of the two ways shown in figure 5.1. If there are uncertainties in only the $y$ direction, then the differences in the vertical direction (the gray lines in the figure below) are used. If there are uncertainties in both the $x$ and $y$ directions, the orthogonal (perpendicular) distances from the line (the dotted red lines in the figure below) are used.\n", "\n", "