---
title: "Transforming Data"
author: "JJB + Course"
date: "02/04/2019"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

# Functions

A function is a piece of code that performs a specified task. It may or may not depend on parameters, and it may or may not return one or more values.

## Example: Add

Here we create a function with _two_ parameters that adds together the values passed in.

```{r my-add-func}
add = function(x, y) {
  return(x + y)
}

add(1, 3)
```

## Example: Hello World!

Consider a common task...

```{r say-hello}
message("Hello World!")
```

How could we always repeat this task _without_ needing to retype the code elsewhere?

Idea: Use a function to **describe** a recipe.

```{r say-hello-consistently}
say_hello_world = function() {
  message("Hello World!")
}

say_hello_world()
```

## Example: Generic Code to Specific Routine

Generic _R_ script values

```{r hidden-meaning}
set.seed(1115)
sample(6, size = 1)
sample(6, size = 1)
sample(6, size = 1)
```

Adding a name to the routine...

```{r roll-a-die}
roll_die = function(num_sides) {
  roll = sample(num_sides, size = 1)
  return(roll)
}

set.seed(1115)
roll_die(6)
```

What happens if we forget to specify a `num_sides` value? E.g., what is `roll_die()`?

Making the function receive default settings...

```{r roll-die-default}
roll_die_default = function(num_sides = 6) {
  roll = sample(num_sides, size = 1)
  return(roll)
}

set.seed(1115)
roll_die_default()

set.seed(1115)
roll_die_default(6)
```

Generalizing to _n_ rolls:

```{r generalized-die-roll}
roll_n_die = function(num_rolls, num_sides = 6) {
  rolls = sample(num_sides, size = num_rolls, replace = TRUE)
  return(rolls)
}

set.seed(1115)
roll_n_die(3, 6)
```

## Exercise: Transforming a Workflow

Clean up the following code by implementing a function that:

1. Generates data from a normal distribution
2. Applies the mean normalization

```{r make-me-a-func}
set.seed(325)
x = rnorm(10)
y = rnorm(10)

x_nmu = (x - mean(x)) / (max(x) - min(x))
x_nmu

y_nmu = (y - mean(y)) / (max(y) - min(y))
y_nmu
```

Extending the copy-and-paste approach to a third vector shows how repetitive and error-prone it becomes:

```{r}
set.seed(325)
x = rnorm(10)
y = rnorm(10)
z = rnorm(10)

x_nmu = (x - mean(x)) / (max(x) - min(x))
x_nmu

y_nmu = (y - mean(y)) / (max(y) - min(y))
y_nmu

# Each variable name must be updated by hand; leaving mean(y) here is an easy slip
z_nmu = (z - mean(z)) / (max(z) - min(z))
z_nmu
```

One possible solution wraps both steps in a function:

```{r}
mean_normalization = function(n) {
  # Generate n draws from a standard normal distribution
  x = rnorm(n)

  # Center by the mean and scale by the range
  x_nmu = (x - mean(x)) / (max(x) - min(x))

  return(x_nmu)
}

set.seed(325)
mean_normalization(10)
```

# Classes and Objects

## Example: Vector Types

```{r view-vectors}
# Vector of numeric elements
w = c(9.5, -3.14, 88.9999, 12.0)
#      ^     ^      ^        ^  decimals

# Vector of integer elements
x = c(1L, 2L, 3L, 4L)

# Vector of logical elements
y = c(TRUE, FALSE, FALSE, TRUE)

# Vector of character elements
z = c("a", "b", "c", "d")
```

## Example: Creating a Data Frame by Hand

```{r viewing-heights}
subject_heights = data.frame(
  id     = c(1, 2, 3, 55),
  sex    = c("M", "F", "F", "M"),
  height = c(6.1, 5.5, 5.2, 5.9)
)
```

## Example: Determine Class and Structure

```{r looking-into-data}
class(subject_heights)
str(subject_heights)
```

## Exercise: Running `str()` and `class()` on `id`

```{r}
id = c(1, 2, 3, 55)
class(id)
str(id)
```

```{r}
id_int = c(1L, 2L, 3L, 55L)
class(id_int)
str(id_int)
```

# Vectorization

## Example: Vectorization and Elements

Calculating multiple points simultaneously.

```{r vectorized-addition}
x = c(1, 2, 3, 4)
y = c(5, 6, 7, 8)

z = x + y
z
```

## Example: Vectorized Binary Operators

_R_ has several built-in **binary** operators that act element-wise on vectors, which speeds up calculations.

```{r example-of-ops}
x = c(1, 2, 3, 4)
y = c(5, 6, 7, 8)

x + y    # Addition
x - y    # Subtraction
x * y    # Multiplication
x / y    # Division
x ^ y    # Exponentiation
x %/% y  # Integer Division
x %% y   # Modulus
```

### Aside: Modulus

The _modulus_ operator computes the remainder term of a division.
$$a \bmod q$$

```{r mod-ex}
12 %% 7                # a = n*q + r  =>  12 = 1*7 + 5

# Remainder of each value in 9:1 (rows) divided by each value in 2:9 (columns)
outer(9:1, 2:9, `%%`)
```

## Example: Recycling

Handling length "mis-matches"...

```{r recycle-process}
a = c(1, 2, 3, 4)
length(a)

b = c(5, 6, 7)
length(b)

a + b
```

## Example: Recycling - Round 2

What happens if the length of the longer vector is an exact multiple of the length of the shorter vector?

```{r expansion-shorter}
c(1, 2, 3, 4) + c(-1, 1)
```

## Exercise: Determining Scalars

Explain what happens when we add a single value to a vector.

```{r whats-a-scalar}
a = 2
x = c(1, 2, 3, 4)

x + a
```

### Exercise: Recycle a value for a Confidence Interval

```{r}
p_hat = 0.6
n = 110
z_crit = qnorm(0.975)

p_hat + c(-1, 1) * z_crit * sqrt(p_hat * (1 - p_hat) / n)
```

## Example: Everything is a Vector

```{r etia}
a = 2
length(a)

a_vec = c(2)
length(a_vec)
```

```{r eq-check}
identical(a, a_vec)
```

# Subsets

Selecting a smaller amount of data.

## Example: Positional Indexes

```{r ex-vector}
ex_vec = c(5, 3, -2, 42)
```

## Example: Retrieving a Single Value

```{r retrieve-first}
ex_vec = c(5, 3, -2, 42)

# Retrieve first element
ex_vec[1]

# Retrieve fourth element
ex_vec[4]

# Retrieve the nth (last) element
last_pos = length(ex_vec)
ex_vec[last_pos]
```

## Example: Retrieve Multiple Values

```{r retrieve-seq}
ex_vec = c(5, 3, -2, 42)

ex_vec[c(2, 3)]
ex_vec[2:3]
```

## Example: Retrieve Multiple Values by Removing Indices

```{r neg-seq}
ex_vec = c(5, 3, -2, 42)

ex_vec[-c(1, 4)]
```

## Example: Named Access Retrieval

```{r named-access}
# Create example vector
ex_vec = c(5, 3, -2, 42)

# Set the element names
names(ex_vec) = c("a", "b", "c", "d")

# Select element "b"
ex_vec["b"]

# Retrieve the element names
names(ex_vec)
```

## Example: Generating Indices

There are _many_ ways to create the positional indices for each vector.
```{r sample-index-creation}
# Construct an example
# vector
ex_vec = c(5, 3, -2, 42)

# Create indices
1:length(ex_vec)
seq(1, length(ex_vec))
seq_len(length(ex_vec))
seq_along(ex_vec)
```

## Exercise: Positional Index Methods

Using all sequence methods, create sequences for the following vectors. Are all approaches the same?

```{r}
int_vec = c(8L, -2L, 5L, 0L)
empty_vec = numeric(0)
```

```{r}
# Filled vectors
1:length(int_vec)
```

```{r}
# An empty vector
empty_vec
1:length(empty_vec)
length(empty_vec)
empty_vec[0]
empty_vec[1]
```

```{r}
1:length(empty_vec)
length(empty_vec)
1:0
c(1, 0)
seq_len(length(empty_vec))
seq_along(empty_vec)
```
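
The question of whether all approaches are the same can be checked directly. The chunk below is a minimal sketch rather than an official solution: it redefines `int_vec` and `empty_vec` from the exercise, and the chunk label is arbitrary. It compares the index sequences with `identical()` and `length()`.

```{r index-methods-compared}
int_vec = c(8L, -2L, 5L, 0L)
empty_vec = numeric(0)

# Filled vector: the colon and seq_along() approaches agree
identical(1:length(int_vec), seq_along(int_vec))

# Empty vector: 1:length() counts down from 1 to 0, giving two indices
length(1:length(empty_vec))

# seq_len() and seq_along() give zero indices instead
length(seq_len(length(empty_vec)))
length(seq_along(empty_vec))
```

On the filled vector the approaches agree. On the empty vector, `1:length()` yields two indices while `seq_len()` and `seq_along()` yield none, which is why the latter two are generally the safer choice when a vector might be empty.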