---
title: "Navigating Data"
author: "JJB + Course"
date: "01/25/2019"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Travis-CI

The instruction "insert Travis-CI badge here" means you should add the following line to the `README.md` file, replacing `<GH_USERNAME>` with your GitHub username:

```
[![](https://travis-ci.com/stat385-sp2019/hw01-<GH_USERNAME>.svg)](https://travis-ci.com/stat385-sp2019/hw01-<GH_USERNAME>)
```

# Getting Help

## Example: Open the Help Documentation

To access a function's help documentation in the lower-right "Help" panel of
_RStudio_, type into the _R_ **Console** (lower-left side): 

```{r example-help-calls, eval = FALSE}
?function_name
help(function_name)
```

where `function_name` is the name of the function.

For example, let's say we want to understand how the `median()` function works.
In the _R_ **Console**, we would run either:

```{r help-question-mark-median}
?median
```

or

```{r help-function-median}
help(median)
```

See the slides for an annotated version of _how_ to read the help documentation.

**NB** You can _run_ the above code chunks without _knitting_ the document by
using the `Cmd/Ctrl + Enter` keyboard shortcut.

**NB** is short for the Latin _nota bene_, meaning "note well".

Also, consider using the `example(function_name)` function to automatically run
examples found at the bottom of a function's help file. For example, the
`median` function's examples can be run with:

```{r}
example(median)
```
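
When the exact function name is not known, a name search can help locate it. The sketch below is an addition to the course material: it relies on base _R_'s `apropos()` and `args()`, and the pattern `"^med"` is only an illustrative choice.

```{r sketch-find-functions}
# Find attached objects whose names start with "med"
matches = apropos("^med")
matches

# Show the arguments median() accepts without opening the help page
args(median)
```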

## Exercise: Try getting help!

Request the help documentation for `mean` with `help(function_name)`.

```{r help-mean}

### Code

```

Run the examples in the `mean` function's help documentation with `example(function_name)`.

```{r example-mean}

### Code

```

# Data Structures

## Example: Vectors

```{r 1d-vectors}
# Vector of character elements
character_values = c("James", "summer", "Hi guys!")

# Vector of numeric elements
numeric_values = c(3.14, 8.2, -1.4123, 0.333)

# Vector of integer elements
integer_values = c(4L, -7L, 52L, 98L)

# Create sequences: 1, 2, ... , 9, 10
integer_sequence = 1L:10L
                   # ^ colon operator
```

## Example: Combining Data

Combining decimal number expressions:

```{r combining-decimal-values}
numeric_values = c(6.1, 5.5, 5.2, 5.9)
numeric_values 

class(numeric_values)
typeof(numeric_values)
```

Combining character/string expressions:

```{r combining-character-values}
character_values = c("M", "F", "F", "M")
character_values
```

Combining whole or integer number expressions:

```{r combining-integer-values}
integer_values = c(1L, 2L, 3L, 55L)
integer_values

# Verify integer
class(integer_values)
```

Fill the above code chunk in. When done, remove the `eval = FALSE` from the code chunk options.
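
One caveat when combining values with `c()`: a vector holds a single type, so mixed inputs are converted to the most general one. A minimal sketch, with variable names made up for illustration:

```{r sketch-mixed-types}
# Integer + decimal: everything becomes numeric (double)
mixed_numbers = c(1L, 2.5)
typeof(mixed_numbers)

# Number + string: everything becomes character
mixed_text = c(1, "F")
typeof(mixed_text)
mixed_text
```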


# Structured Data in _R_ 

## Example: Using a data.frame

Here we first combine expressions together and _then_ assign each
combination to a column of a `data.frame`:

```{r dataframe-construction-subjects}
subject_heights = data.frame(
  id     = c(1, 2, 3, 55),
  sex    = c("M", "F", "F", "M"),
  height = c(6.1, 5.5, 5.2, 5.9)
)
```
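
How `data.frame()` stores text columns depends on the _R_ version: before _R_ 4.0.0, strings were converted to factors by default, while newer versions keep them as characters. The sketch below, which uses the hypothetical name `demo_heights`, sets `stringsAsFactors` explicitly so the result is the same either way.

```{r sketch-strings-as-factors}
demo_heights = data.frame(
  id     = c(1, 2, 3, 55),
  sex    = c("M", "F", "F", "M"),
  height = c(6.1, 5.5, 5.2, 5.9),
  stringsAsFactors = FALSE  # keep `sex` as character, not factor
)

# Compact view of each column's type
str(demo_heights)
class(demo_heights$sex)
```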

## Example: Retrieving Data **Dim**ensions 

We can individually retrieve the number of **rows** and **columns** with:

```{r data-show-individual-rows-columns}
num_rows = nrow(subject_heights)
num_columns = ncol(subject_heights)

num_rows
num_columns

subject_heights
```

If we want to access _both_ values simultaneously, we would use:

```{r data-show-combined-rows-and-columns}
dim_info = dim(subject_heights)

dim_info
```

The values correspond to the number of _rows_ and _columns_ respectively.
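
Because `dim()` returns an ordinary integer vector, each dimension can be pulled out by position. A small sketch using a stand-in data frame (`demo_heights` and `demo_dim` are names invented here):

```{r sketch-dim-elements}
demo_heights = data.frame(
  id     = c(1, 2, 3, 55),
  height = c(6.1, 5.5, 5.2, 5.9)
)

demo_dim = dim(demo_heights)

demo_dim[1]  # number of rows, same as nrow(demo_heights)
demo_dim[2]  # number of columns, same as ncol(demo_heights)
```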

Instead of re-typing the values, we can re-use the vectors created earlier in the document.

```{r redo-dataframe-construction-subjects, eval = FALSE}
subject_heights = data.frame(
  id     = integer_values,
  sex    = character_values,
  height = numeric_values
)
```

### Exercise: Construct data

Please create the `data.frame` for `twtr_stock_prices`.

_Hint:_ To create a code chunk quickly, use `Cmd/Ctrl + Opt/Alt + I`.


Please create the `data.frame` for `champaign_weather`.


### Exercise: Extract the **Dim**ensions

Retrieve the dimensions of `twtr_stock_prices` using **functions that retrieve only one dimension at a time**.

```{r twtr_stock-dimensions-two-functions}

```


Retrieve the dimensions of `champaign_weather` using **only one function**.

```{r dims-champaign-weather}

```

## Example: Dynamically Using Data Attributes

We can protect against a _bad_ data set by dynamically
retrieving its attributes and name instead of typing them by hand.

```{r calc_obs, echo = FALSE} 
# Substitute your dataset where you see: `mtcars`
data_nobs = nrow(mtcars)                 # N Observations
data_nvars = ncol(mtcars)                # N Variables
data_name = deparse(substitute(mtcars))  # NSE Magic
``` 

Did you know that there are `r data_nobs`
observations and `r data_nvars` variables 
contained within the `r data_name`  data set?
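
The same idea can be wrapped in a function so that it works for any data set. This is only a sketch: `describe_data()` is a hypothetical helper, not part of base _R_ or the course code.

```{r sketch-describe-data}
describe_data = function(data) {
  # Capture the name of the object passed in by the caller
  data_name = deparse(substitute(data))

  paste0(
    "The ", data_name, " data set has ", nrow(data),
    " observations and ", ncol(data), " variables."
  )
}

describe_data(mtcars)
describe_data(iris)
```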

# Brackets [ ] Data Access


## Example: Retrieving a Single Value

```{r retrieve-single-value}
# By position
subject_heights[3, 2]

# By position & variable
subject_heights[3, "sex"]

# By boolean logic
subject_heights[c(FALSE, FALSE, TRUE, FALSE),
                c(FALSE, TRUE, FALSE)]
```

## Example: Positions Matter

It's common to mix up row and column positions. The row index always comes
first: `[row, column]`.

```{r mix-up-column-and-row}
# Correct: row 3, column 2
subject_heights[3, 2]

# Incorrect: row 2, column 3
subject_heights[2, 3]
```
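
Mixing up the positions can also fail outright when the swapped index does not exist. The sketch below uses a stand-in data frame with 4 rows and 3 columns (`demo_heights` is a name invented for this example) and `tryCatch()` so that the error does not stop knitting.

```{r sketch-out-of-range}
demo_heights = data.frame(
  id     = c(1, 2, 3, 55),
  sex    = c("M", "F", "F", "M"),
  height = c(6.1, 5.5, 5.2, 5.9),
  stringsAsFactors = FALSE
)

# Row 4, column 3 exists
demo_heights[4, 3]

# Row 3, column 4 does not: there are only 3 columns
swapped = tryCatch(
  demo_heights[3, 4],
  error = function(e) conditionMessage(e)
)
swapped
```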

## Example: Retrieve Multiple Values

```{r get-multiple-values}
# By position
subject_heights[3, ]

# By logical: keep rows where the condition is TRUE
subject_heights[
  subject_heights$id == 3, 
]

# By variable
subject_heights[, "sex"]
```
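
Logical conditions can be combined with `&` (and) or `|` (or) to filter on several variables at once. A minimal sketch on a stand-in data frame (the name `demo_heights` and the 5.3 cutoff are chosen only for illustration):

```{r sketch-combined-conditions}
demo_heights = data.frame(
  id     = c(1, 2, 3, 55),
  sex    = c("M", "F", "F", "M"),
  height = c(6.1, 5.5, 5.2, 5.9)
)

# Rows where sex is "F" AND height is above 5.3
is_tall_female = demo_heights$sex == "F" & demo_heights$height > 5.3
demo_heights[is_tall_female, ]
```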

## Example: Removing values

```{r drop-values}
# Negative row position: drop row 2
subject_heights[-2, ]


# Negative column position: drop column 1
subject_heights[, -1]
```
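
Negative indexing returns a new object and leaves the original untouched, so the result has to be assigned to keep it. A brief sketch with invented names:

```{r sketch-drop-and-assign}
demo_heights = data.frame(
  id     = c(1, 2, 3, 55),
  height = c(6.1, 5.5, 5.2, 5.9)
)

# Drop rows 1 and 2, and store the result under a new name
demo_trimmed = demo_heights[-c(1, 2), ]

nrow(demo_heights)  # original still has all of its rows
nrow(demo_trimmed)
```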

# Data Overviews

For the next example, we'll use some sample data that has already been converted to an _R_ object.

```{r read-subject-heights} 
subject_heights = readRDS("subject_heights.rds")
``` 

## Example: Glancing at Data

```{r glance-data}
# Display the first 3 rows
head(subject_heights, n = 3)

# Show the last 5 rows
tail(subject_heights, n = 5)
```
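
When `n` is omitted, `head()` and `tail()` show 6 rows; a negative `n` instead returns everything _except_ that many rows. A sketch using the built-in `mtcars` data set as a stand-in:

```{r sketch-head-tail-defaults}
# Default: first 6 rows
head(mtcars)

# Negative n: all rows except the last 30
head(mtcars, n = -30)

# Negative n with tail(): all rows except the first 30
tail(mtcars, n = -30)
```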


## Example: Summaries

Summaries are a quick way to check how the data was loaded.

```{r summarize-data}
summary(subject_heights)
```

- Min.: Minimum or lowest value.
- 1st Qu.: First quartile; 25% of the values fall at or below it.
- Median: Second quartile; 50% of the values fall at or below it.
- Mean: The average of all values in the variable.
- 3rd Qu.: Third quartile; 75% of the values fall at or below it.
- Max.: Maximum or highest value.
- NA's: Number of missing values, shown only when the variable has any.
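
The individual entries can be reproduced with dedicated functions, which helps when only one statistic is needed. A sketch on a small vector of heights (`demo_height` is an invented name, and the last value is deliberately missing):

```{r sketch-summary-by-hand}
demo_height = c(6.1, 5.5, 5.2, 5.9, NA)

min(demo_height, na.rm = TRUE)
quantile(demo_height, probs = c(0.25, 0.75), na.rm = TRUE)
median(demo_height, na.rm = TRUE)
mean(demo_height, na.rm = TRUE)
max(demo_height, na.rm = TRUE)

# Number of missing values
sum(is.na(demo_height))
```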

To obtain an improved summary, consider checking out the `skimr` package.
One notable author of the package is Michael Quinn, who graduated from
UIUC's Dept. of Statistics Master's program.

```{r skimr-demo}
# Uncomment to install
# install.packages("skimr")
library("skimr")

# View alternative summary() output 
skim(subject_heights)
```