---
title: "Unstructured Data"
author: "JJB + Course"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Unstructured Data

## Example: Characters

```{r individual-characters}
'a'
'b'
'c'
'D'
'E'
'F'
'1'
'2'
'3'
'4'
' '
'*'
','
' " '
```

## Example: Strings

```{r many-characters}
'UIUC'
'STAT'
'Chambana'
'Chicago'
'Illinois'
```

## Example: Character vs. String Type

```{r}
class("J")
class("James")
```

## Example: Escape Characters

```{r}
double_quote = "Hello World!"
single_quote = 'Hello World!'
complex_string = "It's happening!"
escape_string = 'It\'s happening!'
white_space = " "
empty_string = ""
```

## Example: Escape characters in action!

```{r}
message("Hello World\nMy name is Ted.")
message("Here is a quote: \"The World is watching!\"")
message('Here is a quote: \'The World is watching!\'')
```

## Exercise: Writing a string

> Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness.
>
> -- Patrick Burns, R-help (2005)

```{r}
# Using single quotes to denote the statement requires
# every apostrophe to be escaped with a backslash.
# Notice the backslash doesn't appear in the output.

# Same message with double quotations.
# Notice the backslash doesn't appear in the output.
```

# String Ops

## Example: Length and Characters

```{r}
length("toad")
nchar("toad")

ex_string = c("toad", "eoh", "r")
length(ex_string)
nchar(ex_string)
```

## Example: Modifying Case

```{r modifying-case}
# Same string, different capitalization
# Not equivalent.
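# A quick sanity check first: `==` on strings is an exact,
# character-by-character, case-sensitive comparison.
"abc" == "abc"
"abc" == "ABC"
# Back to the example above: different capitalization, so not equivalent.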
"sTaT 385 at UiUc" == "stat 385 at uiuc"

# Same string, same capitalization
# Viewed as equivalent.
"sTaT 385 at UiUc" == "sTaT 385 at UiUc"

# Move to lowercase
tolower("sTaT 385 at UiUc")

# Capitalize each letter
toupper("sTaT 385 at UiUc")
```

```{r difference-low-to-high}
x = "caps is the highway to coolsville 8)"
toupper(x)
```

## Example: Input Cases

Consider the different capitalizations that users may type when they respond:

```{r bad-input-control, eval = FALSE}
repeat {
  x = readline("Do you wish to stop the loop? ")
  if (x == "yes") {
    break
  }
}
```

This will fail if a user types `YES`. To handle this, let's convert all input into a single case with `tolower()` or `toupper()`.

```{r good-input-control, eval = FALSE}
repeat {
  x = readline("Do you wish to stop the loop? ")
  if (tolower(x) == "yes") {
    break
  }
}
```

## Example: Concatenating Strings

```{r concatentation}
your_name = "James"

paste("Hello World to you", your_name, "!")   # Adds a space between each piece
paste0("Hello World to you", your_name, "!")  # No spaces added

# Equivalent to paste0()
paste("Hello World to you", your_name, "!", sep = "")
```

Another way to see how `sep=` works:

```{r concatentation-sep-modifiers}
paste("STAT 385", "UIUC", "IL", sep = " @ ")
```

Lastly, the vector can be collapsed into _one_ value using `collapse=`.

```{r concatentation-collapsing-values}
x = 1:10
y = 2:11
paste(x, "+", y, collapse = " - ")
```

## Exercise: Making a remainder statement

```{r remainder-statement}
x = seq_len(5)
# x

# Modulus
mod = 2
remainder = x %% mod
```

```{r pasting-values}
# `paste0()` is useful if you want to control how the string is
# merged together w.r.t. spacing and punctuation at the end.

# `paste()` is great if you do not need to add ending punctuation,
# as it will automatically add spaces between strings and variables.
```

## Exercise: Counting words in text

```{r and-we-count}
my_text = "Well thank you very much, everybody. I am honored to be here with our incredible steel and aluminum workers.
And you are truly the backbone of America. You know that. Very special people. I've known you and people that are very closely related to you for a long time. You know that. I think it's probably the reason I'm here. So I want to thank you."
```

```{r}
my_text
length(my_text)

original_number_of_chars = nchar(my_text)

tokenization_words = gsub("[[:space:]]", "", my_text)
words_without_spaces = nchar(tokenization_words)

original_number_of_chars - words_without_spaces
```

## Example: Concatenating Strings (Vectorized)

```{r}
subject_ids = seq_len(5)

paste0("S", subject_ids)

# Note: paste0() has no sep argument, so "-" here is treated as
# just another string to paste onto the end of each element.
paste0("S", subject_ids, sep = "-")

paste0("S", subject_ids, collapse = "")
```

## Example: Substring

```{r}
substr("stat", 1, 2)
substr("Illinois", 4, 8)
substr("coding", 7, 10)
substr(c("stat", "Illinois"), 1:2, 3:4)
```

### Exercise: Transform the first letter in every string to a capital

```{r}
x = c("mumford", "female", "male", "joe", "pete")

# This obtains the first position
substr("mumford", 1, 1)

# Modify the case
first_letter = substr("mumford", 1, 1)

single_word = "mumford"
substr(single_word, 1, 1) = toupper(first_letter)
single_word
```

```{r}
substr(x, 1, 1) = toupper(substr(x, 1, 1))

# Show updated elements
x

paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
```

## Example: Split String

```{r}
dishes = c("Spaghetti and Meatballs", "French Onion Soup")

movies = "Star Wars, Up!, Monsters Inc., Black Panther"
movies

# Splitting on ", " gives clean elements.
strsplit(movies, split = ", ")

# Note: splitting on "," alone leaves a leading space
# on every element after the first.
strsplit(movies, split = ",")
```

```{r}
dishes_extended = c("Spaghetti and Meatballs", "French Onion Soup",
                    "Cabbage Soup", "Corn Beef and Cabbage",
                    "Pizza", "Fried Rice", "chicken noodle soup")
```

# Text Mining

## Example: What's your Favorite Color?

```{r gh-web-api-favorite-color, eval = FALSE}
# GitHub Web API v3
library("gh")

# Functional programming
library("purrr")

# Sys.setenv("GITHUB_PAT" = "not-telling")

# What are we obtaining?
owner = "stat385-sp2019"
repo = "disc"
number = 71

# Download issue text
submitters = gh(
  "GET /repos/:owner/:repo/issues/:number/comments",
  owner = owner, repo = repo, number = number,
  .limit = 1000)

# Extract responses
comment_txt = submitters %>% map_chr("body")

# Write to a text file
writeLines(comment_txt, con = "color_data.txt")
```

## Example: Tokenization and Counting

```{r make-tokens-txt}
# Text to analyze
txt = "What's your name? Mine is James."

# Break into tokens
# Notice: Output is a list!
tokens = strsplit(txt, split = " ")
tokens
```

```{r count-words}
# Count words
table(tokens)
```

## Example: Read in Text Data

The data inside of the `color_data.txt` file is given in _line_ form:

```md
Blue
Blue
Blue
blue
Orange
```

This data can be read in using `readLines()`.

```{r text-read-example}
color_responses = readLines("color_data.txt")
color_responses
```

The data here is given in line form with one word per line. Thus, we can immediately count the values. Or can we?

```{r see-count-color}
table(color_responses)
```

## Example: Pre-processing

```{r pre-process-recipes-lower}
# Convert responses to lowercase
lower_resp = tolower(color_responses)
lower_resp
```

```{r pre-process-recipes-empty}
# Remove empty lines
nonempty_resp = lower_resp[
  lower_resp != ""  # Empty string
]
```

```{r pre-process-recipes-punct}
# Remove punctuation
no_punct = gsub(
  "[[:punct:]]",  # regex character class for punctuation
  "",             # Replace with an empty string
  nonempty_resp
)
```

## Example: Monolithic string

Create one large string:

```{r}
combined_str = paste(no_punct, collapse = " ")
combined_str
```

Now, let's break the string apart.
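Recall that `strsplit()` always returns a list, with one element per input string, so we need `[[1]]` to pull out a plain character vector. A minimal sketch of that behavior, using a toy string:

```{r}
# strsplit() wraps its result in a list, even for a single input string
parts = strsplit("red green blue", split = " ")
class(parts)
parts[[1]]
```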
```{r}
combined_str_split = strsplit(
  combined_str,
  split = " "
)[[1]]

combined_str_split
```

And count again:

```{r}
table(combined_str_split)
```

## Example: Term-Frequency

```{r term-matrix-example}
# install.packages("tm")
library("tm")

corpus = Corpus(VectorSource(combined_str))
tdm = TermDocumentMatrix(corpus)
inspect(tdm)
```

## Example: Word Cloud

```{r example-word-cloud}
library("wordcloud")

# Obtain counts without going into a TDM.
counts = sort(table(combined_str_split), decreasing = TRUE)

# Construct a word cloud
wordcloud(names(counts), freq = counts, min.freq = 0.01,
          colors = names(counts), ordered.colors = TRUE)
```

### Exercise: Term-Frequency

What happens to the TDM if we change `combined_str` to `combined_str_split`?

```{r term-matrix-exercise}
# Starter code: modify it to use combined_str_split instead.
corpus = Corpus(VectorSource(combined_str))
tdm = TermDocumentMatrix(corpus)
inspect(tdm)
```
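As a hint for the exercise: `VectorSource()` treats each element of a character vector as its own document, so splitting before building the matrix changes how many documents (columns) the TDM has. A minimal sketch with a toy vector (the `toy` data below is made up for illustration, and assumes the `tm` package is installed):

```{r}
library("tm")

# Two elements in the vector, so two documents (columns) in the TDM
toy = c("blue green blue", "blue")
toy_tdm = TermDocumentMatrix(Corpus(VectorSource(toy)))
inspect(toy_tdm)
```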