{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Histogramming and Binning Data with Python" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 1. Histogramming" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "When a measurement is made numerous times, it is often useful to bin (or group) the data\n", "and make a histogram. For example, if the time that it takes a sphere to roll down a ramp\n", "was measured one hundred times, then a histogram of the times would show how they are\n", "distributed. The **`hist`** function from the pylab library is useful for making histograms. The example below makes a histogram from an array of 24 numbers. \n", "You can add labels to the histogram just as for other graphs." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "" }, "execution_count": 1, "metadata": { "image/png": { "height": 250, "width": 364 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "figure()\n", "hist(t)\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The first line imports the pylab library, which makes the **`hist`** function available.\n", "\n", "As with other plotting commands, the **`figure`** and **`show`** functions are also needed.\n", "\n", "By default, the histogram will have 10 bins. If no additional arguments are passed, the **`hist`** function decides where to put the boundaries of the bins: it uses equal-width bins running from the smallest value to the largest value in the data." 
] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The **`color`** argument can be used to set the color of the bars in the histogram.\n", "Alternatively, the **`edgecolor`** and **`facecolor`** arguments separately set the colors of\n", "the edges and middle of the bars in the histogram, respectively. Some of the color\n", "options are:\n", "
    \n", "r = red
    \n", "g = green
    \n", "b = blue
    \n", "k = black
    \n", "c = cyan
    \n", "m = magenta
    \n", "y = yellow
    \n", "w = white\n", "
\n", "\n", "The default is for the **`edgecolor`** to be the same as the **`facecolor`**. The bins stand out better if the **`edgecolor`** is black. \n", "\n", "The **`facecolor`** argument can also be set to \"None\" so that the bars only have outlines. Alternatively, you can set **`fill`** to False. This is useful if you want to plot data on top of the histograms as shown further below." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "" }, "execution_count": 2, "metadata": { "image/png": { "height": 250, "width": 364 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "figure()\n", "hist(t, facecolor='b', edgecolor='k')\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The **`hist`** function returns the number of events in each bin, the edges of the bins, and\n", "things called patches (which will not be discussed further). These values can be captured\n", "by providing three variable names for them as follows." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 2. 1. 1. 2. 4. 6. 5. 1. 1. 
1.]\n", "[ 2.74 3.206 3.672 4.138 4.604 5.07 5.536 6.002 6.468 6.934 7.4 ]\n" ] }, { "data": { "image/png": "" }, "execution_count": 3, "metadata": { "image/png": { "height": 250, "width": 364 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "figure()\n", "events, edges, patches = hist(t, edgecolor='k')\n", "print(events)\n", "print(edges)\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The array `events` contains the numbers of occurrences in the 10 bins. The array `edges`\n", "contains 11 elements. (The first 10 elements are the lower edges of the bins and the final element is the upper edge of the final bin.) The bins are the same width, but the edges may end up in unusual places. A number is included in a bin if it is greater than or equal to its lower edge and less than its upper edge; the final bin also includes its upper edge." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "If you set the **`density`** argument to “True”, the function will make an area-normalized\n", "histogram. For each bin, the height on the histogram is the probability density, which is\n", "the number of events in the bin divided by the product of the total number of events and the width of the\n", "bin. The area of each bin in the histogram is the probability of an event being in that bin,\n", "so the total area is one. With this option, the probability density is returned instead of the\n", "number of events. Compare the example below with the previous example." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.1788269 0.08941345 0.08941345 0.1788269 0.35765379 0.53648069\n", " 0.44706724 0.08941345 0.08941345 0.08941345]\n", "[ 2.74 3.206 3.672 4.138 4.604 5.07 5.536 6.002 6.468 6.934 7.4 ]\n" ] }, { "data": { "image/png": "" }, "execution_count": 4, "metadata": { "image/png": { "height": 250, "width": 373 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "figure()\n", "events, edges, patches = hist(t, density=True, edgecolor='k')\n", "print(events)\n", "print(edges)\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "You can control the number of bins by setting the **`bins`** argument to an integer, but this doesn’t control the locations of the edges. Choosing an appropriate number of bins is important. If there are too few or too many bins, the histogram won’t show how the events are distributed very well. For example, the same example data is histogrammed below with 3 and 30 bins." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "" }, "execution_count": 5, "metadata": { "image/png": { "height": 250, "width": 370 } }, "output_type": "execute_result" }, { "data": { "image/png": "" }, "execution_count": 5, "metadata": { "image/png": { "height": 250, "width": 373 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "figure()\n", "hist(t, bins=3, edgecolor='k')\n", "\n", "figure()\n", "hist(t, bins=30, edgecolor='k')\n", "\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "If you want to have control over the number and location of the bins, you can make the\n", "**`bins`** argument an array. If you want *N* bins, the array will have (*N* + 1) elements. The\n", "first *N* elements are the lower edges of the bins and the final element is the upper edge of\n", "the final bin. Usually the bins have equal widths, but they can be made unequal. The array can be made with the **`linspace`** function from the numpy library, which pylab also makes available.\n", "You must specify the first element of the array (the lower edge of the first bin), the last\n", "element of the array (the upper edge of the final bin), and the number of elements in the\n", "array (one more than the number of bins). The example below produces 10 bins\n", "(not 11) starting at 0 and ending at 10. For the example data, some of the bins are\n", "empty and aren't displayed." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "" }, "execution_count": 6, "metadata": { "image/png": { "height": 250, "width": 370 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "from numpy import linspace\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "bins = linspace(0, 10, 11)\n", "figure()\n", "hist(t, bins, edgecolor='k')\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "It is also possible to set the upper and lower limits of the bins using the **`range`** argument.\n", "Values outside of the specified range are ignored. The following example does the same\n", "as the previous example because the default number of bins is 10." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]\n" ] }, { "data": { "image/png": "" }, "execution_count": 7, "metadata": { "image/png": { "height": 250, "width": 370 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "figure()\n", "events, edges, patches = hist(t, range=(0.0,10.0), edgecolor='k')\n", "print(edges)\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Note that in all of the examples above the center of each bin is placed midway between the edges, which define what values are counted in that bin. If the values being histogrammed are all integers, it makes more sense to shift the bins to the left so that they are centered over integers. 
Setting **`align`** to \"left\" will center each bar on the left edge of its bin, which centers the bars over the integers." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "" }, "execution_count": 8, "metadata": { "image/png": { "height": 250, "width": 375 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "N = array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])\n", "\n", "figure()\n", "hist(N, range=(0.0,10.0), edgecolor='b',align='left')\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "If the bins aren't filled, you can graph points (using **`scatter`**) or curves (using **`plot`**) on the same figure. If the bins are filled, they can hide the points or curves." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "" }, "execution_count": 9, "metadata": { "image/png": { "height": 250, "width": 375 } }, "output_type": "execute_result" } ], "source": [ "from pylab import *\n", "N = array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])\n", "bins = linspace(0, 10, 11)\n", "\n", "x = array([2,3,4,5,6,7])\n", "y = array([1,2,3,12,3,2])\n", "\n", "figure()\n", "hist(N, bins, edgecolor='b', fill=False,align='left')\n", "scatter(x,y,c='g')\n", "show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 2. Binning Data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Sometimes data is binned before it is analyzed. For example, a set of decay times could\n", "be binned before fitting the data to an exponential function. The **`histogram`** function\n", "from the numpy library can be used to bin data without making a plot. The **`histogram`** function is similar to the **`hist`** function described in the previous section. 
The **`range`** and **`bins`** arguments can be used, but it doesn’t return patches. Associating the locations of the bins and the numbers of events in them is a little tricky\n", "because the `edges` array is one element longer than the `events` array. \n", "\n", "If you're counting the occurrences of integers, the lower edges are the appropriate thing to use. In the example below, the **`resize`** function makes an array called\n", "`lower` which has a length one less than the length of the `edges` array, so it just contains the lower edges." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]\n", "[ 0 0 2 2 4 13 2 1 0 0]\n" ] } ], "source": [ "from numpy import *\n", "N = array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])\n", "\n", "events, edges = histogram(N,range=(0.0,10.0))\n", "lower = resize(edges, len(edges)-1)\n", "\n", "print(lower)\n", "print(events)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "For non-integer data, it makes more sense to associate the number of events with the center of the bin. For example, the number\n", "of events with values of `t` between 0 and 1 should be associated with 0.5. The example\n", "below will make an array called `tmid` which is the same length as `events` and contains\n", "the values of `t` in the middle of the bins. Again, the **`resize`** function makes an array called\n", "`lower` which contains the locations of the lower edges of the bins because the final element is dropped.\n", "An array containing the difference between consecutive elements of the `edges` array is returned by the function **`diff`**. \n", "Adding half of the difference between the edges to the\n", "lower edge gives the value in the middle of a bin. Note that \"`diff(edges)`\" is the same\n", "length as `lower`. 
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5]\n", "[ 0 0 2 2 4 13 2 1 0 0]\n" ] } ], "source": [ "from numpy import *\n", "t = array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,\n", " 2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,\n", " 3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])\n", "\n", "events, edges = histogram(t,range=(0.0,10.0))\n", "lower = resize(edges, len(edges)-1)\n", "tmid = lower + 0.5*diff(edges)\n", "\n", "print(tmid)\n", "print(events)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Additional Documentation" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Further information is available at: \n", "http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.hist \n", "http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2 (SageMath)", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 0 }