---
title: "Unstructured Data"
author: "JJB + Course"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Unstructured Data

## Example: Characters

```{r individual-characters}
'a'
'b'
'c'
'D'
'E'
'F'
'1'
'2'
'3'
'4'
' '
'*'
','
' " '
```

## Example: Strings

```{r many-characters}
'UIUC'
'STAT'
'Chambana'
'Chicago'
'Illinois'
```

## Example: Character vs. String Type

```{r}
class("J")
class("James")
```

## Example: Escape Characters

```{r}
double_quote = "Hello World!"
single_quote = 'Hello World!'
complex_string = "It's happening!"
escape_string = 'It\'s happening!'
white_space = " "
empty_string = ""
```

## Example: Escape characters in action!

```{r}
message("Hello World\nMy name is Ted.")
message("Here is a quote: \"The World is watching!\"")
message('Here is a quote: \'The World is watching!\'')
```

## Exercise: Writing a string

> Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness.
>
> -- Patrick Burns, R-help (2005)

```{r}
# Using single quotes to denote the statement requires
# every apostrophe to be escaped with a backslash.
# Notice the backslash doesn't appear in the output.

# Same message with double quotations.
# Notice the backslash doesn't appear in the output.
```

# String Ops

## Example: Length and Characters

```{r}
length("toad")
nchar("toad")

ex_string = c("toad", "eoh", "r")
length(ex_string)
nchar(ex_string)
```

## Example: Modifying Case

```{r modifying-case}
# Same string, different capitalization
# Not equivalent.
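# A quick sanity check first: `==` on strings is an exact,
# character-by-character, case-sensitive comparison.
"abc" == "abc"
"abc" == "ABC"
# Back to the example above: different capitalization, so not equivalent.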
"sTaT 385 at UiUc" == "stat 385 at uiuc"

# Same string, same capitalization
# Viewed as equivalent.
"sTaT 385 at UiUc" == "sTaT 385 at UiUc"

# Move to lowercase
tolower("sTaT 385 at UiUc")

# Capitalize each letter
toupper("sTaT 385 at UiUc")
```

```{r difference-low-to-high}
x = "caps is the highway to coolsville 8)"
toupper(x)
```

## Example: Input Cases

Consider the different capitalizations that users may type when they respond:

```{r bad-input-control, eval = FALSE}
repeat {
  x = readline("Do you wish to stop the loop? ")
  if (x == "yes") {
    break
  }
}
```

This will fail if a user types `YES`. To handle this, let's convert all input into a single case with `tolower()` or `toupper()`.

```{r good-input-control, eval = FALSE}
repeat {
  x = readline("Do you wish to stop the loop? ")
  if (tolower(x) == "yes") {
    break
  }
}
```

## Example: Concatenating Strings

```{r concatentation}
your_name = "James"

paste("Hello World to you", your_name, "!")   # Adds a space between each piece
paste0("Hello World to you", your_name, "!")  # No spaces added

# Equivalent to paste0()
paste("Hello World to you", your_name, "!", sep = "")
```

Another way to see how `sep=` works:

```{r concatentation-sep-modifiers}
paste("STAT 385", "UIUC", "IL", sep = " @ ")
```

Lastly, the vector can be collapsed into _one_ value using `collapse=`.

```{r concatentation-collapsing-values}
x = 1:10
y = 2:11
paste(x, "+", y, collapse = " - ")
```

## Exercise: Making a remainder statement

```{r remainder-statement}
x = seq_len(5)
# x

# Modulus
mod = 2
remainder = x %% mod
```

```{r pasting-values}
# `paste0()` is useful if you want to control how the string is
# merged together w.r.t. spacing and punctuation at the end.

# `paste()` is great if you do not need to add ending punctuation,
# as it will automatically add spaces between strings and variables.
```

## Exercise: Counting words in text

```{r and-we-count}
my_text = "Well thank you very much, everybody. I am honored to be here with our incredible steel and aluminum workers.
And you are truly the backbone of America. You know that. Very special people. I've known you and people that are very closely related to you for a long time. You know that. I think it's probably the reason I'm here. So I want to thank you."
```

```{r}
my_text
length(my_text)

original_number_of_chars = nchar(my_text)

tokenization_words = gsub("[[:space:]]", "", my_text)
words_without_spaces = nchar(tokenization_words)

original_number_of_chars - words_without_spaces
```

## Example: Concatenating Strings (Vectorized)

```{r}
subject_ids = seq_len(5)

paste0("S", subject_ids)

# Note: paste0() has no sep argument, so "-" here is treated as
# just another string to paste onto the end of each element.
paste0("S", subject_ids, sep = "-")

paste0("S", subject_ids, collapse = "")
```

## Example: Substring

```{r}
substr("stat", 1, 2)
substr("Illinois", 4, 8)
substr("coding", 7, 10)
substr(c("stat", "Illinois"), 1:2, 3:4)
```

### Exercise: Transform the first letter in every string to a capital

```{r}
x = c("mumford", "female", "male", "joe", "pete")

# This obtains the first position
substr("mumford", 1, 1)

# Modify the case
first_letter = substr("mumford", 1, 1)

single_word = "mumford"
substr(single_word, 1, 1) = toupper(first_letter)
single_word
```

```{r}
substr(x, 1, 1) = toupper(substr(x, 1, 1))

# Show updated elements
x

paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
```

## Example: Split String

```{r}
dishes = c("Spaghetti and Meatballs", "French Onion Soup")

movies = "Star Wars, Up!, Monsters Inc., Black Panther"
movies

# Splitting on ", " gives clean elements.
strsplit(movies, split = ", ")

# Note: splitting on "," alone leaves a leading space
# on every element after the first.
strsplit(movies, split = ",")
```

```{r}
dishes_extended = c("Spaghetti and Meatballs", "French Onion Soup",
                    "Cabbage Soup", "Corn Beef and Cabbage",
                    "Pizza", "Fried Rice", "chicken noodle soup")
```

# Text Mining

## Example: What's your Favorite Color?

```{r gh-web-api-favorite-color, eval = FALSE}
# GitHub Web API v3
library("gh")

# Functional programming
library("purrr")

# Sys.setenv("GITHUB_PAT" = "not-telling")

# What are we obtaining?
owner = "stat385-sp2019"
repo = "disc"
number = 71

# Download issue text
submitters = gh(
  "GET /repos/:owner/:repo/issues/:number/comments",
  owner = owner, repo = repo, number = number,
  .limit = 1000)

# Extract responses
comment_txt = submitters %>% map_chr("body")

# Write to a text file
writeLines(comment_txt, con = "color_data.txt")
```

## Example: Tokenization and Counting

```{r make-tokens-txt}
# Text to analyze
txt = "What's your name? Mine is James."

# Break into tokens
# Notice: Output is a list!
tokens = strsplit(txt, split = " ")
tokens
```

```{r count-words}
# Count words
table(tokens)
```

## Example: Read in Text Data

The data inside of the `color_data.txt` file is given in _line_ form:

```md
Blue
Blue
Blue
blue
Orange
```

This data can be read in using `readLines()`.

```{r text-read-example}
color_responses = readLines("color_data.txt")
color_responses
```

The data here is given in line form with one word per line. Thus, we can immediately count the values. Or can we?

```{r see-count-color}
table(color_responses)
```

## Example: Pre-processing

```{r pre-process-recipes-lower}
# Convert responses to lowercase
lower_resp = tolower(color_responses)
lower_resp
```

```{r pre-process-recipes-empty}
# Remove empty lines
nonempty_resp = lower_resp[
  lower_resp != ""  # Empty string
]
```

```{r pre-process-recipes-punct}
# Remove punctuation
no_punct = gsub(
  "[[:punct:]]",  # regex character class for punctuation
  "",             # Replace with an empty string
  nonempty_resp
)
```

## Example: Monolithic string

Create one large string:

```{r}
combined_str = paste(no_punct, collapse = " ")
combined_str
```

Now, let's break the string apart.
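Recall that `strsplit()` always returns a list, with one element per input string, so we need `[[1]]` to pull out a plain character vector. A minimal sketch of that behavior, using a toy string:

```{r}
# strsplit() wraps its result in a list, even for a single input string
parts = strsplit("red green blue", split = " ")
class(parts)
parts[[1]]
```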
```{r}
combined_str_split = strsplit(
  combined_str,
  split = " "
)[[1]]

combined_str_split
```

And count again:

```{r}
table(combined_str_split)
```

## Example: Term-Frequency

```{r term-matrix-example}
# install.packages("tm")
library("tm")

corpus = Corpus(VectorSource(combined_str))
tdm = TermDocumentMatrix(corpus)
inspect(tdm)
```

## Example: Word Cloud

```{r example-word-cloud}
library("wordcloud")

# Obtain counts without going into a TDM.
counts = sort(table(combined_str_split), decreasing = TRUE)

# Construct a word cloud
wordcloud(names(counts), freq = counts, min.freq = 0.01,
          colors = names(counts), ordered.colors = TRUE)
```

### Exercise: Term-Frequency

What happens to the TDM if we change `combined_str` to `combined_str_split`?

```{r term-matrix-exercise}
# Starter code: modify it to use combined_str_split instead.
corpus = Corpus(VectorSource(combined_str))
tdm = TermDocumentMatrix(corpus)
inspect(tdm)
```
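As a hint for the exercise: `VectorSource()` treats each element of a character vector as its own document, so splitting before building the matrix changes how many documents (columns) the TDM has. A minimal sketch with a toy vector (the `toy` data below is made up for illustration, and assumes the `tm` package is installed):

```{r}
library("tm")

# Two elements in the vector, so two documents (columns) in the TDM
toy = c("blue green blue", "blue")
toy_tdm = TermDocumentMatrix(Corpus(VectorSource(toy)))
inspect(toy_tdm)
```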