{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab D: The $\\chi^2$ Distribution
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Introduction**\n", "\n", "One of the metaphors we introduced for hypothesis testing was \"innocent until proven guilty.\" Specifically, we assume that a parameter has a particular value, and stick to this assumption unless there is strong evidence against it.\n", "\n", "Thus far we have seen this paradigm at work in the setting of **means**, **proportions**, **differences of means**, and **differences of proportions.** In all four cases, the evidence consists of a single number, a *point estimate* calculated from a sample. If the point estimate is close to the assumed value of the parameter, we \"do nothing\" (i.e. continue to presume the null *innocent*), but if it's far away, we \"reject the null\" (i.e. *convict* the null of being wrong).\n", "\n", "We now want to extend this basic paradigm to the setting of categorical data involving two or more categories. Here it will turn out that the \"evidence\" is not a direct estimate of a parameter value, but rather a composite score which takes the value 0 if the data exactly match the null hypothesis, and grows as the data deviate from that hypothesis.\n", "\n", "To illustrate the ideas, we borrow from the Alameda County jury information in Table 7.2 of the text (p. 464).\n", "\n", "\n", "**The data**\n", "\n", "A total of 1453 individuals from over 10 jury selection pools were polled as to their race or ethnic background. This information was then compared against demographic statistics from the latest census. 
The results were as follows:\n", "\n", "\n", "| Race | White | Black | Hispanic | Asian | Other |\n", "|------|-------|-------|----------|-------|-------|\n", "| Number in jury pools | 780 | 117 | 114 | 384 | 58 |\n", "| Census percentage | 54% | 18% | 12% | 15% | 1% |\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Pause for reflection #1:** *For each racial/ethnic category, calculate the proportion of the people in the jury pools that belong to this category. Comment on how your proportions align or don't align with the percentages given by the census. Without using any statistical techniques whatsoever, does it seem that your data are consistent with the hypothesis that \"members of the jury are selected without regard to race or ethnic background\"?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Pause for reflection #2:** *Now suppose that the above hypothesis were true, i.e. that jurors were selected without regard to race or ethnicity. How many individuals in each category would you expect? (You might get a fractional number; that's OK!) Explain your reasoning.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The $\\chi^2$ statistic is given by\n", "\n", "$$\n", "\\chi^2 = \\sum_{\\mbox{categories}} \\frac{(\\mbox{observed}-\\mbox{expected})^2}{\\mbox{expected}}\n", "$$\n", "\n", "In this case, the sum runs over the five racial/ethnic categories. Note that the observed counts are given in the table, and you just calculated the expected counts in the previous \"Pause for reflection\".\n", "\n", "**Pause for reflection #3:** *Calculate the $\\chi^2$ statistic for this data.*\n", "\n", "Note that the bigger the $\\chi^2$ statistic is, the more evidence there is *against* the hypothesis that the jury members were selected without regard to race or ethnicity. 
Without knowing how $\\chi^2$ is distributed, however, it is impossible to say whether the value you got is \"big\" or \"small\", i.e. strong evidence against the null or weak evidence against it.\n", "\n", "It turns out that the distribution of the $\\chi^2$ statistic can be described mathematically, much like the normal or $T$ distributions. As with the $T$ distribution, the $\\chi^2$ statistic has an associated number of *degrees of freedom*, and there is a whole family of $\\chi^2$ distributions, one for each number of degrees of freedom. The following plot shows a few:\n", "\n", "\n", "https://github.com/carltoews/teaching/tree/master/statistics/images/Chi-square.png\n", "\n", "\n", "\n", "For a test like this one, which compares the counts in a single set of categories against fixed percentages, the number of degrees of freedom is the number of categories minus one.\n", "\n", "**Pause for reflection #4:** *How many degrees of freedom does the $\\chi^2$ statistic have in this example? Find the appropriate curve on the graph above and sketch it in your notebook, paying attention to scale and shape. Put a hashmark at the value of your statistic, and shade in the region under the curve corresponding to \"as or more extreme\". (Note that the area of this region will give us the $P$-value for our statistic.)*\n", "\n", "Now we can use R to calculate this $P$-value exactly. The relevant command is `pchisq`.\n", "\n", "**Pause for reflection #5:** *See if you can figure out how to use the `pchisq` command to calculate the $P$-value of your statistic. Consult the R help function if necessary (i.e. type `?pchisq`). 
Record the $P$-value in your notebook.*\n", "\n", "**Pause for reflection #6:** *Use your $P$-value to draw an appropriate conclusion about the hypothesis that jurors are chosen without regard to race or ethnicity.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.3.2" } }, "nbformat": 4, "nbformat_minor": 0 }