--- title: "Randomness" author: "JJB + Course" date: "02/18/2019" output: html_document: toc: true toc_float: collapsed: false --- # Randomness ## Example: Random Numbers ```{r} set.seed(55) rnorm(1) set.seed(55) rnorm(1) ``` ## Example: Recall the Modulus operator ```{r example-mod} 12 %% 7 ``` ```{r create-mod-combination-table} outer(1:5, 1:5, `*`) ``` ```{r format-mod-table} mod_table = data.frame(outer(1:9, 2:9, `%%`)) colnames(mod_table) = paste0("mod ", seq_len(ncol(mod_table)) + 1) rownames(mod_table) = paste0("x = ", seq_len(nrow(mod_table))) knitr::kable(mod_table) ``` ## Example: Linear Congruential Method (LCM) The LCM method **was** a popular method for generating random numbers. The generator provides the ability to create sequences of deterministic values given a seed. For instance, the generator would create: \[{X_0},{\text{ }}{X_1},{\text{ }}{X_2},{\text{ }}{X_3}, \cdots ,{\text{ }}{X_n}\] where: - $X_0$ is the **seed specified**. - $X_1$ is the **first** random number generated. - $X_2$ is the **second** random number generated. - ... - $X_n$ is the **n**th random number generated. The generator creates random values by the following _deterministic_ process: $$ \begin{align*} X_0 &= \text{seed} \\ {X_{i + 1}} &= \left( {a{X_i} + c} \right)\bmod m \end{align*} $$ where: - $X_{i}$ is the **previous** random number generated - $X_{i+1}$ is the **next** random number generated - $a$ is the multiplier - requires $0 < a < m$ - $c$ is the increment or shifted value - requires $0 \le c < m$ - $m$ is the modulus - requires $m > 0$. The choices of these values are crucial in ensuring good periodicity. (See slides for bad choices vs. good.) Generally, LCG works nicely if: 1. $m$ and $c$ are relatively prime. (e.g. only 1 divides both) 1. Only variants of $m$ divide $a - 1$ 1. If $m$ is divisible by $4$, then $a-1$ is divisible by $4$. **Note:** Mersenne-Twister RNG by M. Matsumoto and T. Nishimura (1997), provides sequence generation of $2^{19937}-1$. For more information, please see `?RNGkind`. ```{r lcm-method} # Setup values a = 5181215173 c = 12581 m = 2^32 # Initial Seed seed = as.numeric(Sys.time()) * 1000 # Pre-allocation of numeric vector x = numeric(length = 2) # Set initial value x[1] = seed x[2] = (a * x[1] + c) %% m x # Real number r = x[2]/m r ``` **Question:** How can we ensure $R_i$ is able to generate values between $\left[ {0,1} \right]$ and not $\left[ {0,1} \right)$? # Sampling There are many different ways to break up a sample. ## Example: Sample without Replacement ```{r sample-without-replacement-numbers} # Set seed for reproducibility set.seed(90210) # Sample values without replacement sample(x = 8, size = 4, replace = FALSE) # Sample values without replacement sample(x = 8, size = 8, replace = FALSE) ``` What happens if we sample values **without** replacement with a sample size that is _larger_ than our initial population? ```{r, error = TRUE} sample(x = 8, size = 10, replace = FALSE) ``` Alternatively, we can use actual vectors for `x` in `sample()`. In this case, we're sampling from a vector containing two values. ```{r sample-without-replacement-characters} # Set seed for reproducibility set.seed(42) # Sample values without replacement sample(x = c("heads", "tails"), size = 1, replace = FALSE) ``` ### Exercise: Die Roll ```{r ex-sample-w-o-replace} # Set a seed to control reproducibility set.seed(385) ``` ## Example: Sampling with Replacement ```{r sample-with-replacement-numbers} # Set seed for reproducibility set.seed(5318008) # Sample numbers with replacement sample(x = 10, size = 20, replace = TRUE) ``` ```{r sample-with-replacement-characters} # Set seed for reproducibility set.seed(376006) # Sample characters with replacement y = sample(x = c("heads", "tails"), size = 10, replace = TRUE, prob = c(0.4, 0.6)) y ``` Count observations and/or obtain the proportion: ```{r show-frequencies} table(y) ``` Obtain the proportions of how often the observations show up in the data. ```{r show-proportions} prop.table(table(y)) ``` ```{r} # Sample characters with replacement y_100 = sample(x = c("heads", "tails"), size = 100, replace = TRUE, prob = c(0.4, 0.6)) prop.table(table(y_100)) # Sample characters with replacement y_1000 = sample(x = c("heads", "tails"), size = 1000, replace = TRUE, prob = c(0.4, 0.6)) prop.table(table(y_1000)) ``` ### Exercise: Creating a "Loaded" die Create a call to sample such that it can roll a "loaded" six-sided die twice with probabilities: 1 := 0.1, 2 := 0.3, 3 = 0.1, 4 := 0.15, 5 := 0.15, 6 := 0.2 ```{r rng-probs-loaded-dice} set.seed(8888) # your code here probs = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.2) sum(probs) sample(6, size = 2, replace = TRUE, prob = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.2)) ``` **Question:** What happens if the sum of probability doesn't equal 1? ```{r rng-probs-loaded-dice} set.seed(8888) # your code here probs = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.9) sum(probs) my_sample = sample(6, size = 1e8, replace = TRUE, prob = probs) prop.table(table(my_sample)) ``` # Probability Distributions ## Example: `d`, `p`, `q`, `r` Distribution Functions Within _R_ there are four different kinds of distribution functions. These functions are known by: | Function | Description | |:-------------------|:------------------------------------------------------------------| | `d*(x)` | Compute the probability density function height at $x$ | |--------------------|-------------------------------------------------------------------| | `p*(x)` | Calculate $q$ given $x$ under $P\left(X\le x\right) = q$ | |--------------------|-------------------------------------------------------------------| | `q*(q)` | Calculate $x$ given $q$ under $P\left(X\le x\right) = q$ | |--------------------|-------------------------------------------------------------------| | `r*(n)` | Generate $n$ random values from the distribution | |--------------------|-------------------------------------------------------------------| where `*()` refers to the different kinds of distribution. For instance: +--------------+-------------+---------------+------------------------------------+ | Distribution | `*()` | Parameters | Parameter Description | +==============+=============+===============+====================================+ | Normal | `norm()` | `mean = 0` | Mean or Center of distribution | | | | `sd = 1` | Standard deviation or spread | +--------------+-------------+---------------+------------------------------------+ | Uniform | `unif()` | `min = 0` | Distribution minimum value | | | | `max = 1` | Distribution maximum value | +--------------+-------------+---------------+------------------------------------+ As an example, consider the Normal distributions' `d`, `p`, `q`, and `r`-variants. ```{r example-distribution-calls} # Normal PDF dnorm(c(-1, 0, 1)) # Normal CDF pnorm(c(-1, 0, 1)) # Normal iCDF qnorm(c(0.15, 0.5, 0.84)) # Normal RNG set.seed(15) # Set seed to ensure reproducibility rnorm(2) ``` ### Probability Density Function Consider a probability distribution under: \[f\left( {x|\mu ,\sigma } \right) = \frac{1}{{\sigma \sqrt {2\pi } }}\exp \left( { - \frac{{{{\left( {x - \mu } \right)}^2}}}{{2{\sigma ^2}}}} \right)\] where: - $\mu$ is the mean - $\sigma$ is the standard deviation - $\sigma^2$ is the variance ```{r standard-normal} library("ggplot2") ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = dnorm) + labs(title = "Normal: PDF", y = "f(x)", x = "x") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16)) ``` ```{r dnorm-ex} library("ggplot2") ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = dnorm)+ stat_function(fun = dnorm, xlim = c(-5, 1.96), geom = "area", fill = "blue") + geom_vline(xintercept = 1.96, color = "orange") + labs(title = "Normal: PDF", y = "f(x)", x = "x") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16)) ``` ### Cumulative Density Function The CDF function is given as: \[{F_X}\left( x \right) = P\left( {X \le x} \right)\] The CDF function is shown with the PDF when we see "shaded" area under the graph. ```{r pnorm-ex} ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = pnorm) + geom_point(data = data.frame(x = 1.96, y = pnorm(1.96)), aes(y = y), color = "blue", size = 3) + geom_vline(xintercept = 1.96, color = "orange") + labs(title = "Normal: CDF", y = "F(x)", x = "x") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16)) ``` ### Quantile Function The quantile function is defined as: $$x = F_X^{ - 1}\left( p \right)$$ The quantile approach allow us to convert a probability value (output) into an input value for the CDF function. As such, this is often referred to as the the 'inverse CDF' since: $$ \begin{align*} x &= F_X^{ - 1}\left( p \right) \\ {F_X}\left( {F_X^{ - 1}\left( p \right)} \right) &= p \end{align*} $$ ```{r qnorm-ex} ggplot(data.frame(x = c(0, 1)), aes(x)) + stat_function(fun = qnorm) + labs(title = "Normal: Quantile", y = "x", x = "F(x)") + geom_point(data = data.frame(x = 0.975, y = qnorm(0.975)), aes(y = y), color = "blue", size = 3) + geom_vline(xintercept = 0.975, color = "orange") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16)) ``` ### Quantile Function - Z-Score ```{r qnorm-ex-zscore} ggplot(data.frame(x = c(0, 1)), aes(x)) + stat_function(fun = qnorm) + labs(title = "Normal: Quantile", y = "x", x = "F(x)") + geom_point(data = data.frame(x = 0.975, y = qnorm(0.975)), aes(y = y), color = "blue", size = 3) + geom_vline(xintercept = 0.975, color = "orange") + geom_hline(yintercept = qnorm(0.975), color = "red") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16)) ``` ### Exercise: Finding quantiles Using the `qnorm()` function, determine what the quantiles are at probabilities: - 0.5 - 0.95 - 0.975 **Have you seen these values before?** ```{r find-probs} ``` ### Random Number Generation Function ```{r rnorm-ex} set.seed(1) # Generate 1000 values from a # Normal distribution centered at mean = 0 and sd = 1 generate_normal_values = rnorm(1000) # Generate 1000 values from a # Normal distribution centered at mean = 0 and sd = 4 generate_normal_values_sd = rnorm(1000, sd = 4) # Generate 1000 values from a # Normal distribution centered at mean = 2 and sd = 9 generate_normal_values_msd = rnorm(1000, mean = 2, sd = 9) set.seed(1) ggplot(data.frame(x = rnorm(1000)), aes(x = x)) + geom_histogram(aes(y = ..density..), fill = "orange") + stat_function(fun = dnorm, color = "blue") + geom_vline(xintercept = 0, color = "red") + labs(title = "Normal: Random Numbers", y = "f(x)", x = "Samples") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16)) ``` ### Exercise: Finding quantiles How can we recreate a **coin flip** for 100 observations using: - Uniform Distribution - Binomial Distribution ```{r flip-uniform} uniform_draws = runif(10) uniform_draws # [0, 1] ifelse(uniform_draws > 0.5, "heads", "tails") ``` ```{r flip-binomial} set.seed(999) # Problematic because the values range from [0, 2] binom_draws = rbinom(10, size = 2, prob = 0.5) # We need to ccontrol the randomness such that # values come from [0, 1] binom_draws_correct = rbinom(10, size = 1, prob = 0.5) ifelse(binom_draws_correct, "heads", "tails") ```