---
title: "Regular Expressions"
author: "JJB + Course"
date: "04/10/2019"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

# Regular Expressions

## Useful [regex tester](https://regex101.com/)

```bash
Study  # case-sensitive: the pattern uses a capital S
study  # case-sensitive: matched 1-to-1 with the literal string
```

## Example: Demoing text structuring

Consider a list of variable information:

1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female

Inside of _R_, the variables of a data set are documented using the `roxygen2` inline documentation features, e.g. by creating comments with `#'`. The sample skeleton would be:

```{r}
#' - `variable_name`
```

By hand, this would look like:

```{r}
#' - `Class`
#' - `AGE`
#' - `SEX`
```

But what happens when we have many variables?

1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female
4. STEROID: no, yes
5. ANTIVIRALS: no, yes
6. FATIGUE: no, yes
7. MALAISE: no, yes
8. ANOREXIA: no, yes
9. LIVER BIG: no, yes
10. LIVER FIRM: no, yes
11. SPLEEN PALPABLE: no, yes
12. SPIDERS: no, yes
13. ASCITES: no, yes
14. VARICES: no, yes
15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00 -- see the note below
16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
17. SGOT: 13, 100, 200, 300, 400, 500,
18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
20. HISTOLOGY: no, yes

Solution: **Use a regex search and replace.**

Regex search pattern:

```bash
[0-9]{1,2}\. (.*):.*
```

Regex replace pattern:

```bash
#' - `\1`
```

## Example: Log Search

Example on Regex101: https://regex101.com/r/DfqZ7x/1

Regular Expression for retrieving IP Addresses:

```
^([[:digit:]]{1,3}\.?)+
```

Sample data taken from a log on the RStudio Server subdomain run by statistics.
```
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /shiny/stat430ag/ HTTP/1.1" 403 994 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /favicon.ico HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
```

### Example: Finding a String

```{r}
library("stringr")

pattern = "s"
area = "cats and dogs"

# Find all instances of s
str_view_all(area, pattern = pattern)

# Find the first instance of s
str_view(area, pattern = pattern)
```

### Example: One literal character

```{r}
x = c("did you lie to me?", "all lies", "are you lying?", "lying on the couch")

library("stringr")

str_detect(x, pattern = "lie")
str_detect(x, pattern = "you")
```

### Example: Viewing Matches

```{r}
x = c("did you lie to me?", "all lies", "are you lying?", "lying on the couch")

str_view_all(x, pattern = "lie")
```

```{r}
str_view_all(x, pattern = "you")
```

### Example: Multiple Literal Characters

```{r}
x = c("did you lie to me?", "all lies", "are you lying?", "lying on the couch")

library("stringr")

str_detect(x, pattern = "lie|you")
str_detect(x, pattern = "(lie)|(you)")
```

### Exercises

1. Find instances of `UIUC` or `UofI`

```{r}
y = c("UNR", "UNC", "UofI", "UIUC", "UI")

# Detect the matches
str_detect(y, pattern = "UIUC|UofI")

# View the matching patterns
str_view(y, pattern = "UIUC|UofI")

# Does the case of the pattern matter?
# regular expressions are case _________
# that is UPPER CASE _____ match lower case.
```

2. Determine if a city is in a state (e.g. `IL`):

```{r}
library("stringr")

x = c("Chicago, IL", "San Fran, CA", "Iowa City, IA",
      "Urbana, IL", "Wheaton, IL", "Myrtle Beach, SC")

str_detect(x, pattern = "il")

# Capitals
str_detect(x, pattern = "IL")
```

## Example: Dealing with Special Patterns

```{r, eval = FALSE}
# Sample String Data
x = c("did you lie to me?", "all lies!", "are you lying?", "lying on my couch")

# Error: "\?" is an unrecognized escape in an R string
str_detect(x, pattern = "\?")

# Works: the backslash itself is escaped, so the regex engine sees \?
str_detect(x, pattern = "\\?")
```

### Recall: Escape Characters

```{r, eval = FALSE}
"my string \" quote "
```

```{r, eval = FALSE}
"my string \\" quote "
```

```{r, eval = FALSE}
"my string \\\" quote "
cat("my string \\\\ quote ")
```

### Exercise: Special Patterns

```{r}
library("stringr")

x = c("3 + 4 = 7", "1 / 4 = 0.25", "2 * 4 = 8", "3 * 4",
      "Algebra is fun?", "Green Eggs and\\or Ham")

# Detecting a plus sign requires escaping it
str_detect(x, pattern = "\\+")

# A forward slash needs no escaping
str_detect(x, pattern = "/")

# Detecting one literal backslash requires four backslashes in the R string
str_detect(x, pattern = "\\\\")
```

```{r}
# Detection on a +
```

```{r}
# Detection on a \ (backslash)
```

```{r}
# Detection on either a + sign or \ (backslash)
```

```{r}
# Shows the escaped representation of the strings
x
print(x)

# Shows the actual value, with escape sequences resolved.
cat(x)
```

### Example: Character Classes

```{r}
my_text = "Hello World! How are you? I'm hungry. "

# str_view() by itself shows only the FIRST match.
str_view(my_text, pattern = "[Hh]")

# str_view_all() shows _all_ matches
str_view_all(my_text, pattern = "[Hh]")

# Negating the values inside of a character class
str_view_all(my_text, pattern = "[^Hh]")

# Generally, we want to detect the pattern.
str_detect(my_text, pattern = "[Hh]")
```

### Example: Metacharacters

```{r}
# Sample String Data
x = c("lower case values", "UPPER CASE VALUES", "MiXtUrE oF vAlUeS")

# Lower case values for a b c
str_detect(x, pattern = "[abc]")

# Upper case values for A B C
str_detect(x, pattern = "[ABC]")

# Range of lower case values
str_detect(x, pattern = "[a-z]")
```

### Example: Metacharacters - Part II

```{r}
x = "Does the wolf have gray or grey hair?"

# Inside a character class, | is a literal character, so list only a and e
str_detect(x, pattern = "gr[ae]y")

y = c("Do we have a toad?", "He's an author.")

# An "a" followed by any character other than "n"
str_detect(y, pattern = "a[^n]")
```

### Exercise: Matching Phone Numbers

Write a regex that matches a phone number with: **###-###-####**

```{r}
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")

str_view_all(phone_nums,
             pattern = "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

str_view_all(phone_nums,
             pattern = "[[:digit:]][[:digit:]][[:digit:]]-[[:digit:]][[:digit:]][[:digit:]]-[[:digit:]][[:digit:]][[:digit:]][[:digit:]]")
```

What if we wanted to match a different pattern?

## Exercise: Extra Practice

A well-formed subject ID is given as `Sxx`, where `S` is the identifier and `xx` is the number. For example, `S01` would indicate the first subject.
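The `Sxx` format described above can be checked with a single anchored pattern. The sketch below is one possible reading of that format, assuming a capital `S` followed by exactly two digits; the `ids` vector and the result names are made up for illustration. It borrows the `{2}` quantifier and the `^` / `$` anchors that are covered later in these notes.

```{r}
library("stringr")

# Hypothetical IDs, for illustration only
ids = c("S01", "s5", "8", "S12", "S123")

# Capital S followed by exactly two digits, using a pre-defined class
ok_class = str_detect(ids, pattern = "^S[[:digit:]]{2}$")

# The same check written with a range in a character class
ok_range = str_detect(ids, pattern = "^S[0-9]{2}$")

ok_class
ok_range
```

Without the anchors, `"S123"` would also be accepted, since the pattern would be free to match just the `S12` portion of it.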
```{r}
# Consider the malformed input of subject IDs
subject_ids = c("S01", # good
                "s5",  # bad
                "8",   # worse
                "S12"  # good
)

# Using pre-defined classes
str_detect(subject_ids, pattern = "mmm")

# Use ranges in character classes
str_detect(subject_ids, pattern = "mmmm")
```

## Example: Replacing Values

```{r}
# Sample String Data
x = c("lower case values", "UPPER CASE VALUES", "MiXtUrE oF vAlUeS")

# Replace the FIRST lower case a, b, or c in each string
str_replace(x, pattern = "[abc]", replacement = "!")

# Replace ALL lower case a, b, or c values
str_replace_all(x, pattern = "[abc]", replacement = "!")

# Replace UPPER case values for A B C
str_replace_all(x, pattern = "[ABC]", replacement = "!")

# Replace all lower case values
str_replace_all(x, pattern = "[a-z]", replacement = "!")
```

### Example: Comparing single vs. multiple instances

```{r}
# Sample String Data
x = c("I dislike cake. Any cake really.",
      "Cake is okay...",
      "I love cake... Cake... Cake...",
      "I prefer to have pie over cake",
      "Mmmm... Pie.")

# Replacing first instance of cake per string
str_replace(x, pattern = "[Cc]ake", replacement = "Pizza")

# Replacing ALL instances of cake
str_replace_all(x, pattern = "[Cc]ake", replacement = "Pizza")
```

### Exercise: Replacements

1. Find all matches of the word "i" / "I".
2. Remove the word "not".
3. Change the word "Green" to be "Blue".

```{r}
green_eggs = c("I do not like them", "Sam-I-am.", "I do not like", "Green eggs and ham.")

# Detecting the match inside of the string
str_view_all(green_eggs, pattern="[Ii]")

# No matches because we are looking for an upper case I followed by a lower case i.
str_view_all(green_eggs, pattern="[I][i]")

# Case-insensitive matching
str_view_all(green_eggs, pattern = regex("i", ignore_case = TRUE))
```

```{r}
# Visualize the match
```

```{r}
# Replace not with an empty string
str_replace(green_eggs,
            pattern="not",     # Detect just the word
            replacement = "")  # insert empty string
# Result: two spaces are left back to back.

str_replace(green_eggs,
            pattern=" not",    # Detect the leading space together with not
            replacement = "")
# Result: only a single space remains.
```

### Example: Quantifiers

```{r}
# Sample String Data
x = c("Teddy", "Hey", "Heyy", "Heyyy", "Heyyyy")

# Find between 1 and 3 y's together
str_extract(x, pattern = "y{1,3}")

# Find one or more "yy" groups
str_extract(x, pattern = "(yy)+")

# Find zero or more x's (always TRUE, since an empty match counts)
str_detect(x, pattern = "x*")

# Find 1 or more x's
str_detect(x, pattern = "x+")
```

## Example: Redux phone numbers

```{r}
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")

str_detect(phone_nums, "[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}")
```

Compare `"217-333-2167"` with `"217 244-7190"`: they differ at the fourth character, which is a hyphen in the first and a space in the second. Allowing either character at that position matches both:

```{r}
str_detect(phone_nums, "[[:digit:]]{3}[\\- ][[:digit:]]{3}-[[:digit:]]{4}")
```

### Exercises: Quantifiers

Require two consecutive digits

```{r exercise-quantifiers}
two_nums = c("T-800 Model 101", "Sky Diving", "Coffee&Tea", "STAT 385")

# View all matches in the character vector
str_view_all(two_nums, pattern = "[[:digit:]]{2}")
```

```{r quantifiers-with-lists}
# All matches in list form.
```

Require an upper case letter followed by a lower case letter

```{r case-change}
upper_v_lower = c("Up", "i gotta feeling", "skyfall", "R2D2", "down2Night")

# Notice two character classes
str_view_all(upper_v_lower, pattern = "[[:upper:]]{1}[[:lower:]]{1}")
```

```{r single-character-class}
# What would happen if we only used one character class?
```

In short, many students forget to double-bracket `[[` the predefined character classes, e.g. they write `[:digit:]` instead of `[[:digit:]]`. This is problematic for compatibility with base R's regular expressions, where the single-bracket form is read as an ordinary character class (a colon plus the letters of the class name) and therefore matches the wrong characters. The difference in syntax is largely due to stringr using a different regular expression library (ICU, via stringi).
Details can be found here: https://github.com/tidyverse/stringr/issues/236

```{r base-r-equiv}
## Base R regular expression functions

# Here the correct pattern is recovered
grep(pattern = "[[:upper:]]", upper_v_lower)

# We are picking up either colon (:), u, p, e, r
grep(pattern = "[:upper:]", upper_v_lower, value = TRUE)
```

## Example: Greediness vs. Laziness on Text

```{r}
# Greedy
str_extract("stackoverflow", pattern = "s(.*)o")

# Lazy
str_extract("stackoverflow", pattern = "s(.*?)o")
```

## Example: Greediness vs. Laziness on Semi-structured Data

```{r greedy-vs-lazy}
html_txt = " Hi "

# What pattern is this?
str_extract(html_txt, pattern = "(.*)")
str_extract(html_txt, pattern = "(.*?)")
```

## Example: Extracting and Replacing Capture Group

```{r capture-group-replacement}
# Sample String Data
x = c("00:00:00 - 00:00:05 (5 sec)",
      "00:00:05 - 00:00:35 (30 sec)",
      "00:00:35 - 00:00:51 (16 sec)")
```

```{r}
# Extract end time stamp and replace string with it.
# \\1 in the replacement is taken from the first capture group, (.*)
str_replace(x,
            pattern = ".*-[[:space:]](.*)[[:space:]]\\(.*",
            replacement = "\\1")
```

```{r}
# Extract time in seconds and replace string with it.
# \\1 in the replacement is taken from the capture group ([0-9]+)
str_replace(x,
            pattern = ".*\\(([0-9]+).*",
            replacement = "\\1")
```

### Example: Grouped Patterns

```{r}
# Sample String Data
x = c("pineapple", "apple", "eggplant", "blackberry", "apricot", "nectarine")

# Find a character that is immediately repeated
str_extract(x, pattern = "(.)\\1")

# Find a two-character sequence that occurs again later in the string
str_extract(x, pattern = "(..).*\\1")
```

### Example: Replacement using Grouped Pattern Values

```{r}
# Sample String Data
x = c("STAT 400", "MATH 461", "CS 225", "525")

# Change all courses to STAT
str_replace(x,
            pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})",
            replacement = "STAT \\2")

# Change all course numbers to 410
str_replace(x,
            pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})",
            replacement = "\\1 410")
```

### Example: Extracting Matched Patterns

```{r}
x = c("STAT 400", "MATH 461", "CS 225", "525")

# Extract the full match and the single group
str_match(x, pattern = "([[:upper:]]{2,4}) [[:digit:]]{3}")

# Extract matching patterns and groups
str_match(x, pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})")
```

### Exercise: Retrieving phone digit numbers

Make sure _all_ phone numbers can be found: `(217) 333-2167`, `217-333-2167`, and `217 244-7190`.

```{r}
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")
```

```{r}
# String matching
```

```{r}
# String replacement
```

## Example: Bounded

```{r}
# Sample String Data
x = c("1 second to 12 AM", "15300", "19,000", "Time to go home", "home on the range")

# Must start with a number
str_detect(x, pattern = "^[0-9]")

# Must end with lower case
str_detect(x, pattern = "[a-z]$")

# Only alphabetic and space characters
str_detect(x, pattern = "^[a-zA-Z[:space:]]+$")

# Only numbers
str_detect(x, pattern = "^[0-9]+$")
```

### Exercise: Dealing with boundaries

1. Find punctuation at the end of a string
2. Find a capital letter at the start of the string
3. Combine both 1. and 2.

```{r}
x = c("Today is a good day", "Tomorrow is better!", "Call me!",
      "When can we talk?", "Fly Robbin fly", "not really.")
```
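
The three parts of the boundary exercise can be approached by combining the `^` and `$` anchors with pre-defined character classes. The chunk below is only a sketch of one possible answer, not the course's official solution; the result names `ends_punct`, `starts_upper`, and `both` are made up for illustration.

```{r}
library("stringr")

x = c("Today is a good day", "Tomorrow is better!", "Call me!",
      "When can we talk?", "Fly Robbin fly", "not really.")

# 1. Punctuation at the end of the string
ends_punct = str_detect(x, pattern = "[[:punct:]]$")

# 2. Capital letter at the start of the string
starts_upper = str_detect(x, pattern = "^[[:upper:]]")

# 3. Both: capital first, anything in between, punctuation last
both = str_detect(x, pattern = "^[[:upper:]].*[[:punct:]]$")

ends_punct
starts_upper
both
```

The combined pattern needs `.*` in the middle so that the start and end conditions can be separated by any number of characters.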