---
title: "Data Oddities"
author: "JJB + Course"
date: "02/13/2019"
output:
   html_document:
     toc: true
     toc_float:
       collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Vector Review

## Example: Making an _atomic_ vector

We've already made _atomic_ vectors through our use of the `c()` or
combine function to merge together multiple expressions. However,
we never referred to it directly as an _atomic_ vector.

```{r example-atomic}
x = c(104, 0, 12, 237, -5)
```
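
To make the term concrete, here is a minimal sketch of what "atomic" implies: every element shares one type. The `*_demo` variables are hypothetical names used only for illustration.

```{r sketch-atomic-single-type}
# Hypothetical example vectors (not from the course material)
num_vec_demo = c(104, 0, 12, 237, -5)
chr_vec_demo = c("a", "b")

# Atomic vectors hold elements of exactly one type
is.atomic(num_vec_demo)
typeof(num_vec_demo)
typeof(chr_vec_demo)

# A list is a vector, but not an atomic one
is.atomic(list(1, "a"))
```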

### Exercise: Creating a vector

Create a character variable called `my_name`. 

In the first entry, put your **first** name. 

In the second entry, put your **last** name. 

As an added bonus, try to concatenate the values together.

_Hint_ Look at the `paste()` function.

## Example: Recreating a vectorization with a constant.

Let's try to _recreate_ how _R_ performs a vectorized operation when the two
vectors do _not_ have the same length. To do so, we'll need to use the `rep()`
function, which allows us to replicate values.

```{r repeat-obs}
# Create a vector with the number 42 repeated five times
rep(42, times = 5)  

# Create a vector by repeating elements until a fixed length is reached
rep(c(1, 2), length.out = 3) 

# Create a vector with each element repeated a set number of times.
rep(c(3, 4), each = 2)
```

So, let's replicate a constant:

```{r vectorization-constant}
constant = 5

# Vector
x = c(1, 2, 3, 4)
x

# Replicate a constant
y = rep(constant, times = length(x))
y

x + y 

x + constant
```

### Exercise: Determining recycling properties of vectorization for differing lengths

Vectorization works out well when the two vectors are of the same length. 

e.g. 

```{r length-of-vec}
length(x)
length(y)
```

What happens when the length of `x` differs from the length of a new vector, say `z`? 

- Case 1: the longer length is a multiple of the shorter length
- Case 2: the longer length is _not_ a multiple of the shorter length

```{r cases-vectorization}
x = c(1, 2, 3, 4)
z_1 = c(1, 2)
```

_Hint_ There are two cases at play here. 

### Exercise: Recreate recycling property of vectorization for an uneven vector

Consider a vector `a` that is defined as:

```{r a-defined}
a = c(8, 9)
```

Using the `rep()` function, figure out a way to recreate:

```{r example-uneven}
x = c(1, 2, 3, 4)
x + a
```

Is your solution robust? What happens if `a` changes to:

```{r changed-a-value}
a = c(8, 9, 10)
```

# Vectors and Lists

## Example: Vector Properties

All vectors have four properties we can query: type, length, attributes, and class.

```{r vector-numeric}
x = c(1, 2, 3, 4)

typeof(x)
length(x)
attributes(x)
class(x)
```
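
For a plain vector, `attributes()` returns `NULL`. As a small sketch of how the properties change once metadata is attached, consider a vector with names. The `scores_demo` vector is a hypothetical example.

```{r sketch-vector-properties-named}
# Hypothetical named vector used only for illustration
scores_demo = c(first = 1.5, second = 2.5, third = 3.5)

typeof(scores_demo)      # storage type
length(scores_demo)      # number of elements
attributes(scores_demo)  # a list holding the `names` attribute
class(scores_demo)       # class used for method dispatch
```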

### Exercise: Determine a vector's properties

Create a vector called `my_ints` that contains the 
following _integer_ numbers:

42, 188, 69, 0, -1

_Hint_ Unlike `numeric`s, `integer` numbers in _R_ must have
an _L_ immediately following the number, e.g. `7L`. 

## Example: Checking an _atomic_ vector's data type

We can _check_ whether `x` is a `numeric` vector by using a variant of
`is.*()`. 

```{r check-ints}
is.numeric(x)
```

### Exercise: Verify _atomic_ vector data type

Verify the integer vector created is indeed an `integer` and not a
`numeric`.

## Example: Creating a List (Generic Vector)

Lists can contain mixed types of data. They are very helpful for returning
_multiple_ objects or working with semi-structured data. 

```{r}
x = list(
  c(1, 2, 3),
  "text",
  c(1.3, 2.5),
  list(
    c(TRUE, FALSE)
  ),
  list(
    c(-1),
    c(-5)
  )
)

# Only has a length
length(x)

# Notice dim() returns NULL, as a list has no dimensions
dim(x)
```
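
Printing a nested list can get long. One option, sketched here on a smaller hypothetical list called `nested_demo`, is to inspect the structure with `str()` and the type of each top-level element with `sapply()`.

```{r sketch-list-structure}
# Hypothetical nested list used only for illustration
nested_demo = list(
  c(1, 2, 3),
  "text",
  list(c(TRUE, FALSE))
)

# Compact overview of the nested structure
str(nested_demo)

# Type of each top-level element
sapply(nested_demo, typeof)
```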

Another example of list creation to show versatility.

```{r list-v2-creation}
character_vec = c("toad", "movie",
                  "stats", "green")
numeric_vec = c(1, 2, 3, 4)
integer_vec = c(1L, 2L, 3L, 4L)
logical_vec = c(TRUE, T, FALSE, F)

list_vec = list(
  char = character_vec,
  num  = numeric_vec,
  int  = integer_vec,
  bool = logical_vec
)

list_vec
```


### Exercise: Create a List-ception.

Create another `list` called `my_list`. Make the first element in the 
list a vector containing your _weight_ and the _second_ element `list_vec`,
which was the `list` created in the prior example.


```{r}
weight = c(220, 136, 150)

my_list = list(
  weight = weight, 
  listception = list_vec
)

my_list

# Question:
# How do we access list content inside of a list?
my_list[["weight"]]
my_list[["listception"]]
my_list[["listception"]][["bool"]]
my_list[[2]][[4]]

```


## Example: Mixed (Heterogeneous) Function Return

The `list` data structure is also very convenient for returning multiple
values within an _R_ function.

```{r func-list-return}
return_list = function(a, b, c) {
                             # ||  Unnamed value
  list(element1 = a, toad = b, c )
       # ^^^^^       ^^^^  Named values 
}

# Emphasize the mixing of data types
out = return_list(1:3, c("a", "b"), c(2 + 3i, 4 - 1i))
out
```

We could have achieved the above using:

```{r}
# Names are kept so that later lookups such as `out$element1` still work
out = list(element1 = 1:3, toad = c("a", "b"), c(2 + 3i, 4 - 1i))
```

The emphasis, however, was on how a list could be beneficial
in the scope of a function.
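
As a slightly more practical sketch, a function can bundle several summaries of differing lengths into one returned list. The helper `summarize_values()` is a hypothetical name, not part of the course material.

```{r sketch-summary-list-return}
# Hypothetical helper: returns several summaries of a numeric vector at once
summarize_values = function(values) {
  list(
    n      = length(values),
    center = mean(values),
    range  = range(values)
  )
}

result_demo = summarize_values(c(2, 4, 9))
result_demo$n
result_demo$center
result_demo$range
```

Each summary is retrieved by name, so the caller does not need to remember the order in which values were returned.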

## Example: Accessing List Items

Lists work differently with the `[]` operator than the atomic vectors seen
previously. To retrieve an item itself, use `$` or `[[]]`.

```{r}
# Retrieve by name the list item called `element1`
out$element1

# Retrieve by position the list item called `element1`
out[[1]]

# Update value
out$element1 = c(5L, 42L, -2L) 

# Alternatively, we could use:
# out[[1]] = c(5L, 42L, -2L) 

# View values
out[[1]]
```


## Example: Preserving vs. Simplifying List Structure

There are two "rules" that indicate the underlying structure of a 
subset object in _R_.

1. Preserving: Retain parent data structure.
1. Simplifying: Simplify data structure to most basic type

These rules are "automatic" in nature.

```{r}
# Single brackets retain the list structure around item
out[2]

# Double brackets remove the list structure around item
out[[2]]

# Dollar signs also remove the list structure around item
out$toad
```
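
One way to confirm which rule applied is to check the class and length of each result. This is a minimal sketch on a hypothetical list, `pets_demo`.

```{r sketch-preserve-vs-simplify}
# Hypothetical list used only for illustration
pets_demo = list(toad = c("a", "b"), count = 1:3)

# Preserving: still a list, of length one
class(pets_demo["toad"])
length(pets_demo["toad"])

# Simplifying: the character vector stored inside the list
class(pets_demo[["toad"]])
length(pets_demo[["toad"]])
```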

Let's explore what happens when we subset a `list` with
the single bracket operator.

```{r who-i-am-list-example}
who_i_am = list(name = "James",
                job = "Instructor",
                pay = "Not enough",
                course = 385,
                "drones")

who_i_am[["name"]]

# Notice in environment it's a value
my_name = who_i_am[["name"]]

# Notice we've created another "Data" object
# that can be inspected via the magnifying glass.
my_name_list = who_i_am["name"]
```

We can see a similar pattern of simplification of data structures
during other subset operations.

```{r simplication-df}
a = data.frame(x = 1:10, y = 2:11)

# Returns a vector
a[, 1]

# Returns a data frame with one column
a[, 1, drop = FALSE]


# Errors:
# Reduced to an atomic vector, so we cannot use $ to extract.
# a[, 1]$x

# This would work still as we have maintained the underlying data structure of a data.frame
a[, 1, drop = FALSE]$x
```


Unfortunately, the preservation of list structure doesn't quite
follow the same semantics established when subsetting other data structures.

```{r simplification-matrix}
my_mat = matrix(1:10, nrow = 5)

my_mat

my_mat[1, ]

my_mat[1,, drop = FALSE]
```
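
A quick way to see the difference is to compare the dimensions of each result. This sketch uses a hypothetical matrix, `mat_demo`, with the same shape as the one above.

```{r sketch-matrix-drop-dims}
# Hypothetical matrix used only for illustration
mat_demo = matrix(1:10, nrow = 5)

# Simplified to an atomic vector, so there are no dimensions
dim(mat_demo[1, ])

# Preserved as a 1 x 2 matrix
dim(mat_demo[1, , drop = FALSE])
```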

### Exercise: Identify Simplifying or Preserving Subsets

```{r simplify-or-preserve}
my_list = list(val = c(1, 2) , 2) 
my_list[[1]]
my_list$val

my_matrix = matrix(c(1, 0, 1, 0), nrow = 2)
my_matrix[1,, drop = TRUE] 
```

# Coercion

## Example: Implicit Coercion

When values of different types are combined in an atomic vector, _R_ will automatically convert them to a common type, as it is a "weakly" typed language.

```{r}
c(TRUE, "PIE")

c(TRUE, 42.5, 45 + 1i , "PIE")

c(TRUE, FALSE, 4.5+1i)
```

## Example: Coercion Hierarchy

Logical as the base

```{r}
logical_vec = c(TRUE, FALSE, T, F)
logical_vec
```

Logical to Integer

```{r logical-to-int}
int_vec = c(logical_vec, 42L)
int_vec
```

Integer to Numeric

```{r int-to-num}
numeric_vec = c(int_vec, 32.9)
numeric_vec
```

Numeric to Complex

```{r num-to-complex}
complex_vec = c(numeric_vec, 8.0 + 1.0i)
complex_vec
```

Complex to Character

```{r complex-to-character}
character_vec = c(complex_vec, "toad")
character_vec
```
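
The printed vectors only hint at the type change. As a sketch, the same climb up the hierarchy can be confirmed with `typeof()` at each step. The `*_demo` names are hypothetical and chosen so they do not overwrite the vectors above.

```{r sketch-coercion-typeof}
# Each step adds one value of the next type in the hierarchy
lgl_demo = c(TRUE, FALSE)
int_demo = c(lgl_demo, 42L)
dbl_demo = c(int_demo, 32.9)
cpl_demo = c(dbl_demo, 8 + 1i)
chr_demo = c(cpl_demo, "toad")

# Report the type reached after each step
sapply(
  list(lgl_demo, int_demo, dbl_demo, cpl_demo, chr_demo),
  typeof
)
```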

## Example: Explicit Coercion

There are times when we want to directly ensure values are of a
different type. One way to accomplish this is to explicitly cast a
value.

Force to character

```{r character-coercion}
as.character(c(TRUE, 1, 9.8))    
```

Force to integer

```{r integer-coercion}
as.integer(c(5.3, 8.8))  
```

Force to logical

```{r logical-coercion}
as.logical(c(1L, 0L))  
```

Force to numeric

```{r numeric-coercion}
as.numeric(c(42L, 58L))
```
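
Explicit coercion can lose information without raising an error, so it helps to know the edge cases. The sketch below uses made-up values to show truncation, failed conversions, and numeric-to-logical behavior.

```{r sketch-coercion-edge-cases}
# Converting to integer drops the fractional part (truncation toward zero)
as.integer(c(5.3, 8.8, -2.7))

# Values that cannot be converted become NA (R also emits a warning,
# suppressed here to keep the output tidy)
suppressWarnings(as.integer(c("12", "toad")))

# Any non-zero number becomes TRUE
as.logical(c(0, 1, -3))
```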


### Exercise: Determining End Types

Consider the following vector construction statements. Determine what the
final data type is.

```{r coercise-test, eval = FALSE}
# What's the class?
c(1, 2, 3)

c(1L, 2L, 3L)

c(TRUE, 0L)

c(FALSE, "toad", 3)

TRUE + 1

```

# Special Values

## Example: Special Computational Values

_R_ has certain values built in to handle different computational issues
that arise. These values are "Special" due to the unique behaviors 
associated with each value.

```{r special-values-computational}
NaN    # Not a Number appears if computation doesn't make sense.
Inf    # Positive Infinity
-Inf   # Negative Infinity

# Sample computations
1L / 0L

0L / 0L

Inf - Inf
```
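
Since comparisons with these values can behave unexpectedly, base _R_ provides dedicated checks. This is a minimal sketch on a hypothetical vector, `special_demo`.

```{r sketch-detect-special-values}
# Hypothetical vector mixing ordinary and special values
special_demo = c(1, NaN, Inf, -Inf, NA)

is.nan(special_demo)       # TRUE only for NaN
is.infinite(special_demo)  # TRUE for Inf and -Inf
is.finite(special_demo)    # TRUE only for ordinary numbers
is.na(special_demo)        # TRUE for both NA and NaN
```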


## Example: Attempt to Override Special Value

What happens if we accidentally reassign a value? 

```{r special-values-override-attempt}
# Create a variable to hold a value temporarily
TEMP = FALSE
# Reassign "FALSE" to always be TRUE
`FALSE` = TRUE
# Reassign "TRUE" to always be FALSE
`TRUE` = TEMP
```

Did it work?

```{r show-regular-true}
TRUE
```

What happens if we use the back ticks, e.g. ` ` ?

```{r show-backtick-true}
`TRUE`
```

## Aside: Modifying Base

Reserved words such as `TRUE` and `FALSE` cannot be changed. The shortcuts
`T` and `F`, however, are ordinary variables defined in the base package, so
a binding placed ahead of it on the search path can mask them. Doing so is
_highly_ discouraged in practice.

```{r inject-uncertainty, eval = FALSE}
# Small scope change to how `T` and `F` resolve

# Randomly decide on a value
true_or_false = function() {
  runif(1) < 0.5
}

# Mask the `T` and `F` shortcuts (these are variables, not reserved words)
makeActiveBinding(quote(T), true_or_false,
                  as.environment("Autoloads"))

makeActiveBinding(quote(F), true_or_false,
                  as.environment("Autoloads"))

# Enjoy uncertainty
set.seed(881)
T
F
T
```

# Missingness

## Example: Data with missingness

If we did not record data, we use the value `NA` to indicate missingness. 
This allows us to retain similar lengths for each data structure.

```{r inserted-missingness-subject-heights}
subject_heights_na = data.frame(
  id = c(1, 2, 3, 55),
  sex = c("M", "F", NA, NA),
  height = c(6.1, NA, 5.2, NA)
  )
```
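
Missing values propagate through most computations. The sketch below, on a hypothetical `heights_demo` vector, shows the default behavior and the `na.rm` argument that many summary functions accept.

```{r sketch-na-propagation}
# Hypothetical heights with one missing observation
heights_demo = c(6.1, NA, 5.2)

# Any computation touching an NA is itself unknown
mean(heights_demo)

# Ask R to drop missing values before computing
mean(heights_demo, na.rm = TRUE)
```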


### Exercise: Inserting Missingness - Twitter Stocks

Insert missingness into the Twitter stock price data as shown in the slides:

```{r insert-missingness-twtr}
twtr_stock_prices = data.frame(
  time     = as.POSIXct(
    c("09:30 AM", NA, "09:50 AM", "10:00 AM"),
    format = "%I:%M %p"),
  price    = c(22.40, 22.38, 22.46, NA)
)

twtr_stock_prices
```

### Exercise: Inserting Missingness - Champaign Weather

Insert missingness into the Champaign weather data as shown in the slides:

```{r insert-missingness-champaign}
champaign_weather = data.frame(
  date = as.Date(
    c("1/21", "1/22", "1/23", "1/24", "1/25", "1/26", "1/27"),
    # Month/day; with no year supplied, the current year is assumed
    format = "%m/%d"
  ),
  temp = c(44, 46, NA, 26, 37, 44, NA),
  rain = c(NA, TRUE, TRUE, FALSE, NA, FALSE, FALSE),
  wind = c(NA, 19, NA, NA, 14, NA, 12)
  )
  
champaign_weather
```

## Example: Missing in Action

```{r missing-data}
original_iq = data.frame(Age = c(18, 19, 19, 22, 25, 28, 30),
                         IQ = c(112, 108, 94, 87, 132, 79, 103))

mcar_iq = data.frame(Age = c(18, 19, 19, 22, 25, 28, 30),
                     IQ = c(NA, 108, 94, 87, NA, 79, NA))

mar_iq = data.frame(Age = c(18, 19, 19, 22, 25, 28, 30),
                    IQ = c(NA, 108, 94, 87, NA, 79, NA))

mnar_iq = data.frame(Age = c(18, 19, 19, 22, 25, 28, 30),
                     IQ = c(112, 108, NA, NA, 132, NA, 103))
```

## Example: Determining if Data is Missing

```{r determine-missingness}
data_with_missing = data.frame(Age = c(18, 19, 19, 22, 25, 28, 30),
                               IQ = c(NA, 108, 94, 87, NA, 79, NA))

checked_data = is.na(data_with_missing)
```
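
Calling `is.na()` on a `data.frame` returns a logical matrix of the same shape, which can then be summarized. This sketch uses a smaller hypothetical data set, `iq_demo`, to count missing values per column.

```{r sketch-count-missing}
# Hypothetical data used only for illustration
iq_demo = data.frame(Age = c(18, 19, 19, 22),
                     IQ  = c(NA, 108, 94, NA))

# Logical matrix: TRUE wherever a value is missing
missing_demo = is.na(iq_demo)
missing_demo

# Number of missing values in each column
colSums(missing_demo)

# Is anything missing at all?
any(missing_demo)
```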

## Example: Imputing Values for Missing Data

Imputation is the assignment of a value where data are missing. 
There are enough strategies for imputing values to fill an entire
subfield of statistics. Here, we're going to use the `median()`. 

Always begin with making a _copy_ of the `data.frame` you wish to manipulate.

```{r create-data-copy}
# Copy data
imputed_df = data_with_missing 

# Verify copy is correct
all.equal(imputed_df, data_with_missing)
```

Create a logical index of the missing observations in IQ

```{r create-index-for-imputation}
index_na = is.na(data_with_missing$IQ)
index_na
```

Impute (or set missing values to) the median of the data

```{r show-imputation}
# na.rm = TRUE is required; the median of a vector containing NA is NA
imputed_df[index_na, "IQ"] = median(data_with_missing$IQ, na.rm = TRUE)

imputed_df
```
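
If the same imputation is needed for several columns, the steps can be wrapped in a small function. This is only a sketch: `impute_median()` is a hypothetical helper, not a base _R_ function, and it relies on `na.rm = TRUE` because the median of a vector containing `NA` is itself `NA`.

```{r sketch-impute-helper}
# Hypothetical helper (not part of base R): replace NAs with the median
# of the observed values
impute_median = function(values) {
  values[is.na(values)] = median(values, na.rm = TRUE)
  values
}

impute_median(c(NA, 108, 94, 87, NA, 79, NA))
```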

## Example: Retrieving Complete Cases

Remove any row containing an `NA`.

```{r omit-missing-data}
data_present = na.omit(data_with_missing)
```

By subsetting with logicals

```{r retrieve-nonmissing-data}
data_present_complete = data_with_missing[complete.cases(data_with_missing), ]
```

Verify both approaches agree

```{r verify-subset-approaches}
all.equal(data_present, data_present_complete, check.attributes = FALSE)
```

**Note:** Setting `check.attributes = FALSE` ignores the additional
information that `na.omit()` attaches to its result. In particular, that information lets you retrieve the row indices of the "incomplete" rows.
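
As a minimal sketch of retrieving that information, the dropped rows are stored in the `na.action` attribute of the result. The `iq_na_demo` data set is a hypothetical example.

```{r sketch-na-action-attribute}
# Hypothetical data with one incomplete row
iq_na_demo = data.frame(Age = c(18, 19, 22),
                        IQ  = c(NA, 108, 87))

present_demo = na.omit(iq_na_demo)

# Row indices that were dropped for being incomplete
attr(present_demo, "na.action")
```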