---
title: "Randomness"
author: "JJB + Course"
date: "02/18/2019"
output:
   html_document:
     toc: true
     toc_float:
       collapsed: false
---

# Randomness

## Example: Random Numbers

```{r}
set.seed(55)
rnorm(1)

set.seed(55)
rnorm(1)
```

## Example: Recall the Modulus operator

```{r example-mod}
12 %% 7
```

```{r create-mod-combination-table}
outer(1:5, 1:5, `*`)
```

```{r format-mod-table}
mod_table = data.frame(outer(1:9, 2:9, `%%`))
colnames(mod_table) = paste0("mod ", seq_len(ncol(mod_table)) + 1)
rownames(mod_table) = paste0("x = ", seq_len(nrow(mod_table)))

knitr::kable(mod_table)
```

## Example: Linear Congruential Method (LCM)

The LCM method **was** a popular method for generating random numbers. The generator provides the ability to create sequences of deterministic values given a seed. For instance, the generator would create:

\[{X_0},{\text{ }}{X_1},{\text{ }}{X_2},{\text{ }}{X_3}, \cdots ,{\text{ }}{X_n}\]

where:

- $X_0$ is the **seed specified**.
- $X_1$ is the **first** random number generated.
- $X_2$ is the **second** random number generated.
- ...
- $X_n$ is the **n**th random number generated.

The generator creates random values by the following _deterministic_ process:

$$
\begin{align*}
X_0 &= \text{seed} \\  
{X_{i + 1}} &= \left( {a{X_i} + c} \right)\bmod m
\end{align*}
$$

where:

- $X_{i}$ is the **previous** random number generated
- $X_{i+1}$ is the **next** random number generated
- $a$ is the multiplier
    - requires $0 < a < m$
- $c$ is the increment or shifted value
    - requires $0 \le c < m$
- $m$ is the modulus 
    - requires $m > 0$.
    
The choices of these values are crucial in ensuring good periodicity. (See slides for bad choices vs. good.)

Generally, LCG works nicely if:

1. $m$ and $c$ are relatively prime. (e.g. only 1 divides both)
1. Only variants of $m$ divide $a - 1$
1. If $m$ is divisible by $4$, then $a-1$ is divisible by $4$.


**Note:** Mersenne-Twister RNG by M. Matsumoto and T. Nishimura (1997), provides sequence generation of $2^{19937}-1$. For more information, please see `?RNGkind`.

```{r lcm-method}
# Setup values
a = 5181215173
c = 12581
m = 2^32

# Initial Seed
seed = as.numeric(Sys.time()) * 1000

# Pre-allocation of numeric vector
x = numeric(length = 2)
  
# Set initial value
x[1] = seed
x[2] = (a * x[1] + c) %% m

x

# Real number
r = x[2]/m 

r
```

**Question:** 
How can we ensure $R_i$ is able to generate values between $\left[ {0,1} \right]$ and not $\left[ {0,1} \right)$?

# Sampling

There are many different ways to break up a sample.

## Example: Sample without Replacement

```{r sample-without-replacement-numbers}
# Set seed for reproducibility
set.seed(90210)

# Sample values without replacement
sample(x = 8, size = 4, replace = FALSE)

# Sample values without replacement
sample(x = 8, size = 8, replace = FALSE)
```

What happens if we sample values **without** replacement with a sample size that is _larger_ than our initial population?

```{r, error = TRUE}
sample(x = 8, size = 10, replace = FALSE)
```

Alternatively, we can use actual vectors for `x` in `sample()`.
In this case, we're sampling from a vector containing two values.

```{r sample-without-replacement-characters}
# Set seed for reproducibility
set.seed(42)

# Sample values without replacement
sample(x = c("heads", "tails"),
       size = 1, replace = FALSE)
```

### Exercise: Die Roll

```{r ex-sample-w-o-replace}
# Set a seed to control reproducibility
set.seed(385)


```


## Example: Sampling with Replacement

```{r sample-with-replacement-numbers}
# Set seed for reproducibility
set.seed(5318008)

# Sample numbers with replacement
sample(x = 10, size = 20, replace = TRUE)
```

```{r sample-with-replacement-characters}
# Set seed for reproducibility
set.seed(376006)

# Sample characters with replacement
y = sample(x = c("heads", "tails"),
       size = 10, replace = TRUE, prob = c(0.4, 0.6))
y
```

Count observations and/or obtain the proportion:

```{r show-frequencies}
table(y)
```

Obtain the proportions of how often the observations show up in the data.

```{r show-proportions}
prop.table(table(y))
```


```{r}
# Sample characters with replacement
y_100 = sample(x = c("heads", "tails"),
       size = 100, replace = TRUE, prob = c(0.4, 0.6))

prop.table(table(y_100))


# Sample characters with replacement
y_1000 = sample(x = c("heads", "tails"),
       size = 1000, replace = TRUE, prob = c(0.4, 0.6))

prop.table(table(y_1000))
```


### Exercise: Creating a "Loaded" die

Create a call to sample such that it can roll a "loaded" six-sided die
twice with probabilities:

1 := 0.1, 2 := 0.3, 3 = 0.1, 4 := 0.15, 5 := 0.15, 6 := 0.2

```{r rng-probs-loaded-dice}
set.seed(8888)

# your code here

probs = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.2)

sum(probs)

sample(6, size = 2, replace = TRUE, prob = 
         c(0.1, 0.3, 0.1, 0.15, 0.15, 0.2))
```

**Question:** What happens if the sum of probability doesn't equal 1? 


```{r rng-probs-loaded-dice}
set.seed(8888)

# your code here

probs = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.9)

sum(probs)

my_sample = 
  sample(6, size = 1e8, replace = TRUE, prob = 
       probs)

prop.table(table(my_sample))
```

# Probability Distributions

## Example: `d`, `p`, `q`, `r` Distribution Functions

Within _R_ there are four different kinds of distribution functions.
These functions are known by: 

| Function           |  Description                                                      |
|:-------------------|:------------------------------------------------------------------|
| `d*(x)`            |  Compute the probability density function height at $x$           |
|--------------------|-------------------------------------------------------------------|
| `p*(x)`            |  Calculate $q$ given $x$ under $P\left(X\le x\right) = q$         |
|--------------------|-------------------------------------------------------------------|
| `q*(q)`            |  Calculate $x$ given $q$ under $P\left(X\le x\right) = q$         |
|--------------------|-------------------------------------------------------------------|
| `r*(n)`            |  Generate $n$ random values from the distribution                 |
|--------------------|-------------------------------------------------------------------|

where `*()` refers to the different kinds of distribution. For instance:

+--------------+-------------+---------------+------------------------------------+
| Distribution |  `*()`      | Parameters    | Parameter Description              |
+==============+=============+===============+====================================+
| Normal       | `norm()`    | `mean = 0`    | Mean or Center of distribution     |
|              |             | `sd = 1`      | Standard deviation or spread       |
+--------------+-------------+---------------+------------------------------------+
| Uniform      | `unif()`    | `min = 0`     | Distribution minimum value         |
|              |             | `max = 1`     | Distribution maximum value         | 
+--------------+-------------+---------------+------------------------------------+

As an example, consider the Normal distributions' `d`, `p`, `q`, and `r`-variants.

```{r example-distribution-calls}
# Normal PDF
dnorm(c(-1, 0, 1))

# Normal CDF
pnorm(c(-1, 0, 1))

# Normal iCDF
qnorm(c(0.15, 0.5, 0.84))

# Normal RNG
set.seed(15)   # Set seed to ensure reproducibility
rnorm(2)
```

### Probability Density Function 

Consider a probability distribution under:

\[f\left( {x|\mu ,\sigma } \right) = \frac{1}{{\sigma \sqrt {2\pi } }}\exp \left( { - \frac{{{{\left( {x - \mu } \right)}^2}}}{{2{\sigma ^2}}}} \right)\]

where:

- $\mu$ is the mean
- $\sigma$ is the standard deviation
- $\sigma^2$ is the variance

```{r standard-normal}
library("ggplot2")

ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = dnorm) + 
  labs(title = "Normal: PDF", y = "f(x)", x = "x") + 
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 16))
```


```{r dnorm-ex}
library("ggplot2")

ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = dnorm)+ 
  stat_function(fun = dnorm, 
                xlim = c(-5, 1.96),
                geom = "area",
                fill = "blue") +
  geom_vline(xintercept = 1.96, color = "orange") +
  labs(title = "Normal: PDF", y = "f(x)", x = "x") + 
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 16))

```

### Cumulative Density Function 

The CDF function is given as:

\[{F_X}\left( x \right) = P\left( {X \le x} \right)\]

The CDF function is shown with the PDF when we see "shaded" area under the graph.

```{r pnorm-ex}
ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = pnorm) +
  geom_point(data = data.frame(x = 1.96, y = pnorm(1.96)), aes(y = y), 
             color = "blue", size = 3) +
  geom_vline(xintercept = 1.96, color = "orange") +
  labs(title = "Normal: CDF", y = "F(x)", x = "x") + 
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 16))
```

### Quantile Function 

The quantile function is defined as:

$$x = F_X^{ - 1}\left( p \right)$$

The quantile approach allow us to convert a probability value (output) 
into an input value for the CDF function. As such, this is often referred
to as the the 'inverse CDF' since:

$$
\begin{align*}
x &= F_X^{ - 1}\left( p \right) \\
{F_X}\left( {F_X^{ - 1}\left( p \right)} \right) &= p
\end{align*}
$$

```{r qnorm-ex}
ggplot(data.frame(x = c(0, 1)), aes(x)) +
  stat_function(fun = qnorm) +
  labs(title = "Normal: Quantile", y = "x", x = "F(x)")  + 
  geom_point(data = data.frame(x = 0.975, y = qnorm(0.975)), aes(y = y), 
             color = "blue", size = 3) +
  geom_vline(xintercept = 0.975, color = "orange") + 
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 16))
```


### Quantile Function - Z-Score

```{r qnorm-ex-zscore}
ggplot(data.frame(x = c(0, 1)), aes(x)) +
  stat_function(fun = qnorm) +
  labs(title = "Normal: Quantile", y = "x", x = "F(x)")  + 
  geom_point(data = data.frame(x = 0.975, y = qnorm(0.975)), aes(y = y), 
             color = "blue", size = 3) +
  geom_vline(xintercept = 0.975, color = "orange") + 
  geom_hline(yintercept = qnorm(0.975), color = "red") + 
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 16))

```

### Exercise: Finding quantiles

Using the `qnorm()` function, determine what the quantiles 
are at probabilities:

- 0.5
- 0.95
- 0.975

**Have you seen these values before?**

```{r find-probs}

```

### Random Number Generation Function 

```{r rnorm-ex}
set.seed(1)

# Generate 1000 values from a 
# Normal distribution centered at mean = 0 and sd = 1
generate_normal_values = rnorm(1000)

# Generate 1000 values from a 
# Normal distribution centered at mean = 0 and sd = 4
generate_normal_values_sd = rnorm(1000, sd = 4)

# Generate 1000 values from a 
# Normal distribution centered at mean = 2 and sd = 9
generate_normal_values_msd = rnorm(1000, mean = 2, sd = 9)

set.seed(1)
ggplot(data.frame(x = rnorm(1000)), aes(x = x)) +
  geom_histogram(aes(y = ..density..), fill = "orange") +
  stat_function(fun = dnorm, color = "blue") +
  geom_vline(xintercept = 0, color = "red") + 
  labs(title = "Normal: Random Numbers", y = "f(x)", x = "Samples")  + 
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 16))
```


### Exercise: Finding quantiles

How can we recreate a **coin flip** for 100 observations using:

- Uniform Distribution
- Binomial Distribution

```{r flip-uniform}

uniform_draws = runif(10)
uniform_draws

# [0, 1]

ifelse(uniform_draws > 0.5,
       "heads", "tails")
```

```{r flip-binomial}
set.seed(999)

# Problematic because the values range from [0, 2]
binom_draws = rbinom(10, size = 2, prob = 0.5)

# We need to ccontrol the randomness such that
# values come from [0, 1]
binom_draws_correct = rbinom(10, size = 1, prob = 0.5)

ifelse(binom_draws_correct, "heads", "tails")
```