--- title: "Relational Data and Joins" author: "JJB + Course" date: "03/27/2019" output: html_document: toc: true toc_float: collapsed: false --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Split Data ## Example: Student Data ```{r student-data-init} Students = data.frame(id = 1:3, firstname = c("Billy", "Theodore", "Keeya"), lastname = c("Joe", "Squirrel", "Nod"), age = c(23, 25, 21), instate = c(FALSE, TRUE, TRUE)) Grades = data.frame(student_id = c(1, 2, 1, 3), course_id = c("STAT385", "STAT 432", "HIST100", "STAT385"), grade = c("A+", "A-", "A", "B+")) Courses = data.frame(course_id = c("STAT 385", "STAT 432", "HIST100"), acronym = c("SPM", "BSL", "GH")) ``` ```{r student-data-left-join} library("dplyr") dplyr::left_join(Students, Grades, by = c("id" = "student_id") ) ``` # Naive Joins ## Example: Exact Order ```{r exact-order-join} # Same number of rows, # exact ordering, and # no repeated columns. first_df = data.frame( UID = c(1, 2, 3, 4), Grade = c("A", "B", "C", "A")) sec_df = data.frame( # UID = c(1, 2, 3, 4), # Data already ordered Exams = c(38.4, 39.9, 40, 20.5) ) # Merge the data together merged_df = data.frame( first_df, sec_df ) # Or, bind by column merged_df_cols = cbind(first_df, sec_df) # Retrieve specific columns # with the same order selected_df = data.frame(first_df$UID, sec_df$Exams) ``` ## Example: Unordered Data Join ```{r unordered-data-join-issue} # Same number of rows, # unordered data, # and repeated columns. bad_first_df = data.frame( UID = c(4, 3, 2, 1), Grade = c("A", "C", "B", "A") ) bad_sec_df = data.frame( UID = c(2, 1, 4, 3), Exams = c(39.9, 38.4, 20.5, 40) ) # Merge data bad_merged_df = data.frame( bad_first_df, bad_sec_df ) # That doesn't look right... bad_merged_df ``` ### Example: Ordering for Naive Joins ```{r naive-join-ordering} # Same number of rows, exact ordering, # no repeated columns. bad_first_df = data.frame( UID = c(4, 3, 2, 1), Grade = c("A", "C", "B", "A") ) bad_sec_df = data.frame( UID = c(2, 1, 4, 3), Exams = c(39.9, 38.4, 20.5, 40) ) # Order data frames ordered_first_df = bad_first_df[order(bad_first_df$A), ] ordered_sec_df = bad_sec_df[order(bad_sec_df$A), ] # Combine the ordered data frames ordered_merged_df = data.frame( ordered_first_df$A, ordered_first_df$B, ordered_sec_df$D ) ``` ### Exercise: Ordering Response IDs Why might we want to use `order()` over `sort()`? ```{r order-vs-sort} ``` In psychometrics, often we use randomization to generate a problem bank of items and present them to a student. As a result, students answer each item in different orders. When the researcher decided to run the analysis they needed to standardize it. How can we standardize each attempt? ```{r psychometric-items} item_id = matrix(c(4, 2, 3, 1, 1, 2, 4, 3, 2, 4, 3, 1, 1, 2, 3, 4), nrow = 4, byrow = T) response_id = matrix(c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1), nrow = 4, byrow = T) standardized_items = response_id ``` # Mutating Joins ## Example: Point/Grade Data ```{r x-and-y-data} X = data.frame(Key = 1:3, Points = c(90, 84, 75)) Y = data.frame(Key = c(1,2,4), Letters = c("A", "B", "D")) ``` ## Example: Inner Join ```{r inner-join} dplyr::inner_join(X, Y, by = "Key") merge(X, Y, by = "Key") ``` ## Example: Full ```{r full-join} dplyr::full_join(X, Y, by = "Key") merge(X, Y, by = "Key", all.x = TRUE, all.y = TRUE) ``` ## Example: Left ```{r left-join} dplyr::left_join(X, Y, by = "Key") merge(X, Y, by = "Key", all.x = TRUE) ``` ## Example: Right ```{r right-join} dplyr::right_join(X, Y, by = "Key") merge(X, Y, by = "Key", all.y = TRUE) ``` ### Exercise: Join Equivalence Would the following joins be equivalent? If so, why? ```{r join-left-right-equivalence, eval = FALSE} dplyr::left_join(X, Y, by = "Key") dplyr::right_join(Y, X, by = "Key") ``` ### Exercise: Student Data Join together the different tables in the student database. ```{r} Students Grades Courses ``` # Example: Filtering Joins ```{r} dplyr::anti_join(X, Y, by = "Key") ``` ```{r} dplyr::semi_join(X, Y, by = "Key") ``` ### Exercise: Antijoins Install the `fueleconomy` package. Determine the appropriate keys between `common` and `vehicles` tables in `fueleconomy`. Perform a semi join ```{r fuel-econ} # install.packages("fueleconomy") ``` # Set Operations ## Example: Set Operations ```{r set-operations} x = c(-8, 0, 2, 1, 23, NA) y = c(-8, 3, 1, NA, 2, 10) union(x, y) # X or Y (Full) intersect(x, y) # X and Y (Intersect) setdiff(x, y) # Y - X (Anti-join) setdiff(y, x) # X - Y (Anti-join) setequal(x, y) # X = Y is.element(x, y) # X in Y (Intersect) x %in% y # equivalent ```