---
title: "Navigating Data"
author: "JJB + Course"
date: "01/25/2019"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Travis-CI

Regarding "insert Travis-CI badge here": this means you should insert the following into the `README.md` file:

```
[![](https://travis-ci.com/stat385-sp2019/hw01-.svg)](https://travis-ci.com/stat385-sp2019/hw01-)
```

# Getting Help

## Example: Open the Help Documentation

To open a function's help documentation in the lower-right "Help" panel of _RStudio_, type one of the following into the _R_ **Console** (lower-left side):

```{r example-help-calls, eval = FALSE}
?function_name
help(function_name)
```

where `function_name` is the name of the function. For example, say we want to understand how the `median()` function works. In the _R_ **Console**, we would use either:

```{r help-question-mark-median}
?median
```

or

```{r help-function-median}
help(median)
```

See the slides for an annotated version of _how_ to read the help documentation.

**NB** You can _run_ the above code chunks without _knitting_ the document by using the `Cmd/Ctrl + Enter` keyboard shortcut.

**NB** is short for "note well".

Also, consider using the `example(function_name)` function to automatically run the examples found at the bottom of a function's help file. For example, the `median` function's examples can be run with:

```{r}
example(median)
```

## Exercise: Try getting help!

Request the help documentation for `mean` with `help(function_name)`.

```{r help-mean}
### Code
```

Run the examples in the `mean` function's help documentation with `example(function_name)`.

```{r example-mean}
### Code
```

# Data Structures

## Example: Vectors

```{r 1d-vectors}
# Vector of character elements
character_values = c("James", "summer", "Hi guys!")

# Vector of numeric elements
numeric_values = c(3.14, 8.2, -1.4123, 0.333)

# Vector of integer elements
integer_values = c(4L, -7L, 52L, 98L)

# Create sequences: 1, 2, ... , 9, 10
integer_sequence = 1L:10L
#                    ^ colon operator
```

## Example: Combining Data

Combining decimal number expressions

```{r combining-decimal-values}
numeric_values = c(6.1, 5.5, 5.2, 5.9)
numeric_values

class(numeric_values)
typeof(numeric_values)
```

Combining character/string expressions

```{r combining-character-values}
character_values = c("M", "F", "F", "M")
character_values
```

Combining whole or integer number expressions

```{r combining-integer-values}
integer_values = c(1L, 2L, 3L, 55L)
integer_values

# Verify integer
class(integer_values)
```

If the above code chunk is blank in your copy, fill it in. When done, remove `eval = FALSE` from the code chunk options if it is present.

# Structured Data in _R_

## Example: Using a data.frame

Here we first combine expressions together and _then_ assign the combinations to the columns of a `data.frame`.

```{r dataframe-construction-subjects}
subject_heights = data.frame(
  id     = c(1, 2, 3, 55),
  sex    = c("M", "F", "F", "M"),
  height = c(6.1, 5.5, 5.2, 5.9)
)
```

## Example: Retrieving Data **Dim**ensions

We can individually retrieve the number of **rows** and **columns** with:

```{r data-show-individual-rows-columns}
num_rows = nrow(subject_heights)
num_columns = ncol(subject_heights)

num_rows
num_columns

subject_heights
```

If we want to access _both_ values simultaneously, we would use:

```{r data-show-combined-rows-and-columns}
dim_info = dim(subject_heights)
dim_info
```

The values correspond to the number of _rows_ and _columns_, respectively.

We can reuse the vectors created earlier in the document instead of retyping the values.

```{r redo-dataframe-construction-subjects, eval = FALSE}
# Reuse the vectors defined in the "Combining Data" chunks
subject_heights = data.frame(
  id     = integer_values,
  sex    = character_values,
  height = numeric_values
)
```

### Exercise: Construct data

Please create the `data.frame` for `twtr_stock_prices`.

_Hint_: to create a code chunk quickly, use `Cmd + Opt + I` (macOS) or `Ctrl + Alt + I` (Windows/Linux).

Please create the `data.frame` for `champaign_weather`.

### Exercise: Extract the **Dim**ensions

Retrieve the dimensions of `twtr_stock_prices` using **functions that retrieve only one dimension at a time**.

```{r twtr_stock-dimensions-two-functions}

```

Retrieve the dimensions of `champaign_weather` using **only one function**.

```{r dims-champaign-weather}

```

## Example: Dynamically Using Data Attributes

We can protect against a _bad_ data set by dynamically retrieving its attributes and its name.

```{r calc_obs, echo = FALSE}
# Substitute your dataset where you see: `mtcars`
data_nobs = nrow(mtcars)                 # N Observations
data_nvars = ncol(mtcars)                # N Variables
data_name = deparse(substitute(mtcars))  # NSE Magic
```

Did you know that there are `r data_nobs` observations and `r data_nvars` variables contained within the `r data_name` data set?

# Brackets [ ] Data Access

## Example: Retrieving a Single Value

```{r retrieve-single-value}
# By position
subject_heights[3, 2]

# By position & variable
subject_heights[3, "sex"]

# By boolean logic
subject_heights[c(FALSE, FALSE, TRUE, FALSE), c(FALSE, TRUE, FALSE)]
```

## Example: Positions Matter

It's common to mix up row and column positions.

```{r mix-up-column-and-row}
# Correct: row 3, column 2
subject_heights[3, 2]

# Incorrect: row 2, column 3
subject_heights[2, 3]
```

## Example: Retrieve Multiple Values

```{r get-multiple-values}
# By position
subject_heights[3, ]

# By logical (sex holds "M"/"F", so compare against a string)
subject_heights[subject_heights$sex == "F", ]

# By variable
subject_heights[, "sex"]
```

## Example: Removing values

```{r drop-values}
# Negative position: drop row 2
subject_heights[-2, ]

# Negative column: drop column 1
subject_heights[, -1]
```

# Data Overviews

For the next example, we'll use some sample data that has already been converted to an _R_ object.

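As background, an `.rds` file stores a single _R_ object; it is written with `saveRDS()` and read back with `readRDS()`. The chunk below is a minimal sketch of that round trip. The data frame `demo_heights` and the temporary file path are made up for illustration and are not the course's actual file.

```{r rds-round-trip-sketch}
# Hypothetical data, for illustration only (not the course's file)
demo_heights = data.frame(
  id     = c(1L, 2L, 3L),
  height = c(6.1, 5.5, 5.2)
)

# Write the object to a temporary .rds file, then read it back
demo_path = tempfile(fileext = ".rds")
saveRDS(demo_heights, file = demo_path)
restored_heights = readRDS(demo_path)

# Compare the restored object with the original
all.equal(demo_heights, restored_heights)
```

The chunk that follows applies the same `readRDS()` call to the course-provided file.
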
```{r read-subject-heights}
subject_heights = readRDS("subject_heights.rds")
```

## Example: Glancing at Data

```{r glance-data}
# Display the first 3 rows
head(subject_heights, n = 3)

# Show the last 5 rows
tail(subject_heights, n = 5)
```

## Example: Summaries

Summaries are a useful way to check how the data was loaded.

```{r summarize-data}
summary(subject_heights)
```

- Min: Minimum or lowest value.
- 1st Qu: First quartile; 25% of the values fall at or below it.
- Median: Second quartile; 50% of the values fall at or below it.
- Mean: The average of all values in the variable.
- 3rd Qu: Third quartile; 75% of the values fall at or below it.
- Max: Maximum or highest value.
- NA's: Number of missing values.

To obtain an improved summary, consider checking out the `skimr` package. One notable author of the package is Michael Quinn, who graduated from the Master's program in UIUC's Dept. of Statistics.

```{r skimr-demo}
# Uncomment to install
# install.packages("skimr")
library("skimr")

# View alternative summary() output
skim(subject_heights)
```
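
The quartiles that `summary()` reports for a numeric variable can also be computed directly with `quantile()`. The chunk below is a small sketch using a standalone vector, `demo_values`, that reuses the four heights typed in earlier in this document; it assumes nothing about the contents of the `.rds` file.

```{r quantile-sketch}
# Heights from the earlier data.frame example, as a plain numeric vector
demo_values = c(6.1, 5.5, 5.2, 5.9)

# Six-number summary: Min, 1st Qu., Median, Mean, 3rd Qu., Max
summary(demo_values)

# The same quartiles, requested explicitly by probability
demo_quartiles = quantile(demo_values, probs = c(0.25, 0.50, 0.75))
demo_quartiles

# The mean is reported by summary() but is not a quantile
mean(demo_values)
```

Here `probs` gives the fraction of values at or below each cut point, which is what the 1st Qu, Median, and 3rd Qu entries describe.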