---
title: "Regular Expressions"
author: "JJB + Course"
date: "04/10/2019"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---
# Regular Expressions
## Useful
[regex tester](https://regex101.com/)
```bash
Study # case-sensitive: only matches text written with a capital S
study # case-sensitive: matches the literal string character for character
```
## Example: Demoing text structuring
Consider a list of variable information:
1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female
Inside of _R_, the variables in a data set are documented using the `roxygen2`
inline documentation features, e.g. by creating comments with `#'`. The
sample skeleton would be:
```{r}
#' - `variable_name`
```
By hand, this would look like:
```{r}
#' - `Class`
#' - `AGE`
#' - `SEX`
```
But what happens when we have many more variables?
1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female
4. STEROID: no, yes
5. ANTIVIRALS: no, yes
6. FATIGUE: no, yes
7. MALAISE: no, yes
8. ANOREXIA: no, yes
9. LIVER BIG: no, yes
10. LIVER FIRM: no, yes
11. SPLEEN PALPABLE: no, yes
12. SPIDERS: no, yes
13. ASCITES: no, yes
14. VARICES: no, yes
15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
-- see the note below
16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
17. SGOT: 13, 100, 200, 300, 400, 500,
18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
20. HISTOLOGY: no, yes
Solution: **Use a regex search and replace.**
Regex search pattern:
```bash
[0-9]{1,2}\. (.*):.*
```
Regex replace pattern:
```bash
#' - `\1`
```
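The same search and replace can be sketched from within _R_. This is a minimal sketch, assuming the `stringr` package is installed; `var_info` and `roxygen_items` are illustrative names, and only a few of the variable lines are reproduced. The backslashes are doubled because the pattern is written inside an R string, and the replacement uses the bare back-reference `\\1` so that no literal parentheses end up in the output.

```{r}
library("stringr")
# A few lines copied from the variable list above
var_info = c("1. Class: DIE, LIVE",
             "2. AGE: 10, 20, 30, 40, 50, 60, 70, 80",
             "9. LIVER BIG: no, yes",
             "16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250")
# Keep only the variable name (capture group 1) and wrap it in roxygen2 markup
roxygen_items = str_replace(var_info,
                            pattern = "[0-9]{1,2}\\. (.*):.*",
                            replacement = "#' - `\\1`")
# Print one documentation line per variable
cat(roxygen_items, sep = "\n")
```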
## Example: Log Search
Example on Regex101: https://regex101.com/r/DfqZ7x/1
Regular Expression for retrieving IP Addresses:
```
^([[:digit:]]{1,3}\.?)+
```
Sample data taken from a log on the RStudio Server subdomain run by Statistics.
```
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /shiny/stat430ag/ HTTP/1.1" 403 994 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /favicon.ico HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-120x120.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon.png HTTP/1.1" 404 152 "-" "MobileSafari/604.1 CFNetwork/975.0.3 Darwin/18.2.0"
```
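The IP address pattern can also be tried directly in _R_. The chunk below is a sketch, assuming `stringr` is installed; `log_lines` holds two entries shortened from the sample above (the user agent is dropped), and `ip_addresses` is an illustrative name.

```{r}
library("stringr")
# Two entries shortened from the sample log above
log_lines = c(
  '128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /favicon.ico HTTP/1.1" 404 152',
  '128.84.124.206 - - [09/Nov/2018:09:12:38 -0600] "GET /apple-touch-icon.png HTTP/1.1" 404 152'
)
# Pull the leading IP address from each entry
ip_addresses = str_extract(log_lines, pattern = "^([[:digit:]]{1,3}\\.?)+")
ip_addresses
# Distinct addresses seen in the log
unique(ip_addresses)
```

Note that the pattern is permissive: it accepts any run of one-to-three digit groups at the start of a line and does not check that each part is at most 255.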
### Example: Finding a String
```{r}
library("stringr")
pattern = "s"
area = "cats and dogs"
# Find all instances of s
str_view_all(area, pattern = pattern)
# Find the first instance of s
str_view(area, pattern = pattern)
```
### Example: One literal character
```{r}
x = c("did you lie to me?",
"all lies",
"are you lying?",
"lying on the couch")
library("stringr")
str_detect(x, pattern = "lie")
str_detect(x, pattern = "you")
```
### Example: Viewing Matches
```{r}
x = c("did you lie to me?",
"all lies",
"are you lying?",
"lying on the couch")
str_view_all(x, pattern = "lie")
```
```{r}
str_view_all(x, pattern = "you")
```
### Example: Multiple Literal Characters
```{r}
x = c("did you lie to me?",
"all lies",
"are you lying?",
"lying on the couch")
library("stringr")
str_detect(x, pattern = "lie|you")
str_detect(x, pattern = "(lie)|(you)")
```
### Exercises
1. Find instances of `UIUC` or `UofI`
```{r}
y = c("UNR", "UNC", "UofI", "UIUC", "UI")
# Detect the matches
str_detect(y, pattern = "UIUC|UofI")
# View the matching patterns
str_view(y, pattern = "UIUC|UofI")
# Does the case of the pattern matter?
# regular expressions are case _________
# that is UPPER CASE _____ match lower case.
```
2. Determine if a city is in a state (e.g. `IL`):
```{r}
library("stringr")
x = c("Chicago, IL", "San Fran, CA", "Iowa City, IA", "Urbana, IL",
"Wheaton, IL", "Myrtle Beach, SC")
str_detect(x, pattern = "il")
# Capitals
str_detect(x, pattern = "IL")
```
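When the case of the input cannot be trusted, one option is to ask `stringr` to ignore case through the `regex()` helper. A small sketch, assuming `stringr` is installed; `cities` and `in_illinois` are illustrative names.

```{r}
library("stringr")
cities = c("Chicago, IL", "San Fran, CA", "Iowa City, IA", "Urbana, IL",
           "Wheaton, IL", "Myrtle Beach, SC")
# Ignore case so that the lower case pattern "il" also matches "IL"
in_illinois = str_detect(cities, pattern = regex("il", ignore_case = TRUE))
in_illinois
```

Ignoring case comes at a cost: the pattern would now also match an "il" inside a city name, e.g. "Mobile, AL".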
## Example: Dealing with Special Patterns
```{r, eval = FALSE}
# Sample String Data
x = c("did you lie to me?",
"all lies!",
"are you lying?",
"lying on my couch")
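# Error: "\?" is not a valid escape sequence inside an R string
str_detect(x, pattern = "\?")
# Works: the string "\\?" hands the regex \? (a literal question mark) to the engine
str_detect(x, pattern = "\\?")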
```
### Recall: Escape Characters
```{r, eval = FALSE}
"my string \" quote "
```
```{r, eval = FALSE}
"my string \\" quote "
```
```{r, eval = FALSE}
"my string \\\" quote "
cat("my string \\\\ quote ")
```
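One way to see why regex patterns need doubled backslashes is to count the characters that R actually stores. The chunk below is a sketch using only base _R_; `pattern` is an illustrative name.

```{r}
# The R string "\\?" holds two characters: a backslash and a question mark
pattern = "\\?"
nchar(pattern)
# writeLines() prints the characters that the regex engine receives
writeLines(pattern)
# Four backslashes in source code leave two in the string,
# which the regex engine then reads as one literal backslash
nchar("\\\\")
```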
### Exercise: Special Patterns
```{r}
library("stringr")
x = c("3 + 4 = 7", "1 / 4 = 0.25", "2 * 4 = 8", "3 * 4",
"Algebra is fun?", "Green Eggs and\\or Ham")
# Detecting a plus sign: + is a metacharacter, so it must be escaped
str_detect(x, pattern = "\\+")
# No escape is required for a forward slash
str_detect(x, pattern = "/")
# Detecting one literal backslash takes four backslashes in the pattern string
str_detect(x, pattern = "\\\\")
```
```{r}
# Detection on a +
```
```{r}
# Detection on a \ (backwards slash)
```
```{r}
# Detection on either a + sign or \ (backwards slash)
```
```{r}
# print() shows the _escaped_ representation of each string
x
print(x)
# cat() shows the characters actually stored in each string
cat(x)
```
### Example: Character Classes
```{r}
my_text = "Hello World! How are you? I'm hungry. "
# str_view() on its own shows only the FIRST match.
str_view(my_text, pattern = "[Hh]")
# str_view_all() shows _every_ match.
str_view_all(my_text, pattern = "[Hh]")
# A leading ^ negates the values inside of a character class
str_view_all(my_text, pattern = "[^Hh]")
# Generally, we only want to detect whether the pattern occurs.
str_detect(my_text, pattern = "[Hh]")
```
### Example: Metacharacters
```{r}
# Sample String Data
x = c("lower case values",
"UPPER CASE VALUES",
"MiXtUrE oF vAlUeS")
# Lower case values for a b c
str_detect(x, pattern = "[abc]")
# Upper case values for A B C
str_detect(x, pattern = "[ABC]")
# Range of lower case values
str_detect(x, pattern = "[a-z]")
```
### Example: Metacharacters - Part II
```{r}
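x = "Does the wolf have gray or grey hair?"
# Inside a character class | is a literal character, so use [ae] rather than [a|e]
str_detect(x, pattern = "gr[ae]y")
y = c("Do we have a toad?", "He's an author.")
# Match an "a" followed by any character other than "n"
str_detect(y, pattern = "a[^n]")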
```
### Exercise: Matching Phone Numbers
Write a regex that matches a phone number with:
**###-###-####**
```{r}
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")
str_view_all(phone_nums,
pattern = "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
)
str_view_all(phone_nums,
pattern = "[[:digit:]][[:digit:]][[:digit:]]-[[:digit:]][[:digit:]][[:digit:]]-[[:digit:]][[:digit:]][[:digit:]][[:digit:]]"
)
```
What if we wanted to match a different pattern?
## Exercise: Extra Practice
A good subject ID is given as: `Sxx`, where `S` is the identifier and `xx` is the number.
For example, `S01` would indicate the first subject.
```{r}
# Consider the malformed input of subject IDs
subject_ids = c("S01", # good
"s5", # bad
"8", # worse
"S12" # good
)
# Using pre-defined classes
str_detect(subject_ids,
pattern = "mmm")
# Use ranges in character classes
str_detect(subject_ids,
pattern = "mmmm")
```
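One possible answer to the exercise is sketched below, assuming `stringr` is installed. It treats a good ID as an upper case `S` followed by two digits; `with_classes` and `with_ranges` are illustrative names.

```{r}
library("stringr")
subject_ids = c("S01", "s5", "8", "S12")
# Using pre-defined classes: an upper case S followed by two digits
with_classes = str_detect(subject_ids, pattern = "S[[:digit:]][[:digit:]]")
with_classes
# Using ranges in character classes
with_ranges = str_detect(subject_ids, pattern = "S[0-9][0-9]")
with_ranges
```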
## Example: Replacing Values
```{r}
# Sample String Data
x = c("lower case values",
"UPPER CASE VALUES",
"MiXtUrE oF vAlUeS")
# Replace the first lower case a, b, or c in each string
str_replace(x, pattern = "[abc]", replacement = "!")
# Replace every lower case a, b, or c
str_replace_all(x, pattern = "[abc]",
replacement = "!")
# Replace UPPER case values for A B C
str_replace_all(x, pattern = "[ABC]",
replacement = "!")
# Replace all lower case values
str_replace_all(x, pattern = "[a-z]",
replacement = "!")
```
### Example: Comparing single vs. multiple instance
```{r}
# Sample String Data
x = c("I dislike cake. Any cake really.",
"Cake is okay...",
"I love cake... Cake... Cake...",
"I prefer to have pie over cake",
"Mmmm... Pie.")
# Replacing first instance of cake per string
str_replace(x, pattern = "[Cc]ake", replacement = "Pizza")
# Replacing ALL instances of cake
str_replace_all(x, pattern = "[Cc]ake",
replacement = "Pizza")
```
### Exercise: Replacements
1. Find all matches of the word "i" / "I".
2. Remove the word "not".
3. Change the word "Green" to be "Blue".
```{r}
green_eggs = c("I do not like them",
"Sam-I-am.",
"I do not like",
"Green eggs and ham.")
# Detecting the match inside of the string
str_view_all(green_eggs, pattern="[Ii]")
# No matches because we are looking for an Upper I followed by a lowercase i.
str_view_all(green_eggs, pattern="[I][i]")
str_view_all(green_eggs, pattern = regex("i", ignore_case = TRUE))
```
```{r}
# Visualize the match
```
```{r}
# Replace "not" with an empty string
str_replace(green_eggs,
            pattern = "not",   # Match only the word itself
            replacement = "")  # Insert an empty string
# Result: two spaces are left back to back.
str_replace(green_eggs,
            pattern = " not",  # Match the leading space together with "not"
            replacement = "")
# Result: only a single space remains.
```
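The third task, changing "Green" to "Blue", follows the same pattern. A minimal sketch, assuming `stringr` is installed; `blue_eggs` is an illustrative name.

```{r}
library("stringr")
green_eggs = c("I do not like them",
               "Sam-I-am.",
               "I do not like",
               "Green eggs and ham.")
# Swap the literal word "Green" for "Blue"; strings without a match are unchanged
blue_eggs = str_replace(green_eggs, pattern = "Green", replacement = "Blue")
blue_eggs
```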
### Example: Quantifiers
```{r}
# Sample String Data
x = c("Teddy",
"Hey",
"Heyy",
"Heyyy",
"Heyyyy")
# Find a run of 1 to 3 y's together
str_extract(x, pattern = "y{1,3}")
# Find one or more "yy" groups
str_extract(x, pattern = "(yy)+")
# Find zero or more x's (always TRUE, since an empty match is allowed)
str_detect(x, pattern = "x*")
# Find one or more x's
str_detect(x, pattern = "x+")
```
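Alongside `{m,n}`, `*`, and `+`, the `?` quantifier marks the preceding item as optional (zero or one occurrence). The chunk below is a small sketch, assuming `stringr` is installed; the `spellings` vector is made up for illustration.

```{r}
library("stringr")
# Hypothetical sample strings
spellings = c("color", "colour", "colr")
# The u may appear once or not at all; the o before it is still required
optional_u = str_detect(spellings, pattern = "colou?r")
optional_u
```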
## Example: Redux phone numbers
```{r}
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")
str_detect(phone_nums, "[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}")
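```
Compare `"217-333-2167"` with `"217 244-7190"`: the separator after the area code is a hyphen in the first number and a space in the second, so the pattern must accept either character at that position.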
```{r}
str_detect(phone_nums, "[[:digit:]]{3}[\\- ][[:digit:]]{3}-[[:digit:]]{4}")
```
### Exercises: Quantifiers
Require two consecutive digits
```{r exercise-quantifiers}
two_nums = c("T-800 Model 101", "Sky Diving", "Coffee&Tea", "STAT 385")
# View the matches in each element of the character vector
str_view_all(two_nums,
pattern = "[[:digit:]]{2}")
```
```{r quantifiers-with-lists}
# All matches in list form.
```
Require an upper case followed by a lower case
```{r case-change}
upper_v_lower = c("Up", "i gotta feeling", "skyfall", "R2D2", "down2Night")
# Notice two character classes
str_view_all(upper_v_lower,
pattern = "[[:upper:]]{1}[[:lower:]]{1}")
```
```{r single-character-class}
# What would happen if we only used one character class?
```
In short, many students forget to double bracket `[[` the predefined character classes, e.g. they write `[:digit:]` instead of `[[:digit:]]`. This is a problem
under Base R's regular expressions, where the single-bracket form is read as an
ordinary character class made of the listed characters, so the wrong values are matched.
The difference in syntax is largely due to stringr using a different regular
expressions library. Details can be found here: https://github.com/tidyverse/stringr/issues/236
```{r base-r-equiv}
## Base R regular expression functions
# Here the correct pattern is recovered
grep(pattern = "[[:upper:]]", upper_v_lower)
# We are picking up either colon (:), u, p, e, r
grep(pattern = "[:upper:]", upper_v_lower, value = TRUE)
```
## Example: Greediness vs. Laziness on Text
```{r}
# Greedy
str_extract("stackoverflow", pattern = "s(.*)o")
# Lazy
str_extract("stackoverflow", pattern = "s(.*?)o")
```
## Example: Greediness vs. Laziness on Semi-structured Data
```{r greedy-vs-lazy}
html_txt = " Hi "
# What pattern is this?
str_extract(html_txt, pattern = "(.*)")
str_extract(html_txt, pattern = "(.*?)")
```
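The same contrast can be sketched on a small HTML fragment. The chunk below assumes `stringr` is installed; `html_snippet` and its contents are hypothetical and not taken from the original notes.

```{r}
library("stringr")
# Hypothetical HTML fragment with two bold elements
html_snippet = "<b>Hi</b> and <b>Bye</b>"
# Greedy: .* runs to the LAST closing tag
greedy = str_extract(html_snippet, pattern = "<b>(.*)</b>")
greedy
# Lazy: .*? stops at the FIRST closing tag
lazy = str_extract(html_snippet, pattern = "<b>(.*?)</b>")
lazy
```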
## Example: Extracting and Replacing Capture Group
```{r capture-group-replacement}
# Sample String Data
x = c("00:00:00 - 00:00:05 (5 sec)",
"00:00:05 - 00:00:35 (30 sec)",
"00:00:35 - 00:00:51 (16 sec)")
```
```{r}
# Extract end time stamp and replace string with it.
str_replace(x,
pattern = ".*-[[:space:]](.*)[[:space:]]\\(.*",
replacement = "\\1") # \\1 inserts the first capture group, (.*)
```
```{r}
# Extract time in seconds and replace string with it.
str_replace(x,
pattern = ".*\\(([0-9]+).*",
replacement = "\\1") # \\1 inserts the first capture group, ([0-9]+)
```
### Example: Grouped Patterns
```{r}
# Sample String Data
x = c("pineapple",
"apple",
"eggplant",
"blackberry",
"apricot",
"nectarine")
# Find a doubled letter (the same character twice in a row)
str_extract(x,
pattern = "(.)\\1"
)
# Find a pair of characters that repeats later in the string
str_extract(x,
pattern = "(..).*\\1"
)
```
### Example: Replacement using Grouped Pattern Values
```{r}
# Sample String Data
x = c("STAT 400",
"MATH 461",
"CS 225",
"525")
# Change all courses to STAT
str_replace(x,
pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})",
replacement = "STAT \\2"
)
# Change all course numbers to 410
str_replace(x,
pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})",
replacement = "\\1 410"
)
```
### Example: Extracting Matched Patterns
```{r}
x = c("STAT 400",
"MATH 461",
"CS 225",
"525")
str_match(x,
pattern = "([[:upper:]]{2,4}) [[:digit:]]{3}"
)
# Extract matching patterns and groups
str_match(x,
pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})"
)
```
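`str_match()` returns a character matrix: the first column holds the full match and each later column holds one capture group, with `NA` where nothing matched. The chunk below is a sketch, assuming `stringr` is installed, of pulling the groups out by column; `courses`, `departments`, and `numbers` are illustrative names.

```{r}
library("stringr")
courses = c("STAT 400", "MATH 461", "CS 225", "525")
matches = str_match(courses,
                    pattern = "([[:upper:]]{2,4}) ([[:digit:]]{3})")
# Column 2 is the first capture group (department)
departments = matches[, 2]
departments
# Column 3 is the second capture group (course number)
numbers = matches[, 3]
numbers
```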
### Exercise: Retrieving phone digit numbers
Make sure _all_ phone numbers can be found.
(217) 333-2167
217-333-2167
217 244-7190
```{r}
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")
```
```{r}
# String matching
```
```{r}
# string replacement
```
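One possible answer is sketched below, assuming `stringr` is installed. It makes the parentheses around the area code optional, accepts a hyphen or a space as the first separator, and captures the three digit groups; `phone_pattern`, `found`, `parts`, and `normalized` are illustrative names.

```{r}
library("stringr")
phone_nums = c("(217) 333-2167", "217-333-2167", "217 244-7190")
# Optional "(" and ")", then 3 digits, a hyphen or space, 3 digits, a hyphen, 4 digits
phone_pattern = "\\(?([[:digit:]]{3})\\)?[\\- ]([[:digit:]]{3})-([[:digit:]]{4})"
# String matching
found = str_detect(phone_nums, pattern = phone_pattern)
found
parts = str_match(phone_nums, pattern = phone_pattern)
parts
# String replacement: rewrite every number as ###-###-####
normalized = str_replace(phone_nums,
                         pattern = phone_pattern,
                         replacement = "\\1-\\2-\\3")
normalized
```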
## Example: Bounded
```{r}
# Sample String Data
x = c("1 second to 12 AM",
"15300",
"19,000",
"Time to go home",
"home on the range")
# Must start with a number
str_detect(x, pattern = "^[0-9]")
# Must end with lower case
str_detect(x, pattern = "[a-z]$")
# Only alphabetic and space characters
str_detect(x, pattern = "^[a-zA-Z[:space:]]+$")
# Only numbers
str_detect(x, pattern = "^[0-9]+$")
```
### Exercise: Dealing with boundaries
1. Find punctuation at the end of a string
2. Find a capital letter at the start of the string
3. Combine both 1. and 2.
```{r}
x = c("Today is a good day", "Tomorrow is better!", "Call me!",
"When can we talk?", "Fly Robbin fly",
"not really.")
```
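One possible set of answers is sketched below, assuming `stringr` is installed. It uses the `^` and `$` anchors together with the predefined `[[:punct:]]` and `[[:upper:]]` classes; `sentences` and the result names are illustrative.

```{r}
library("stringr")
sentences = c("Today is a good day", "Tomorrow is better!", "Call me!",
              "When can we talk?", "Fly Robbin fly",
              "not really.")
# 1. Punctuation at the end of the string
ends_with_punct = str_detect(sentences, pattern = "[[:punct:]]$")
ends_with_punct
# 2. A capital letter at the start of the string
starts_with_capital = str_detect(sentences, pattern = "^[[:upper:]]")
starts_with_capital
# 3. Both conditions at once
both = str_detect(sentences, pattern = "^[[:upper:]].*[[:punct:]]$")
both
```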