---
title: "Web Scraping"
author: "JJB + Course"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# HTML
### Exercise: Understanding HTML
1. Identify all of the tags and what can be extracted from each.
2. Determine which tags have attributes and what their properties are.
```html
<html>
<head>
  <title>Title of Page</title>
</head>
<body>
  <h1>First order heading (large)</h1>
  <p>Paragraph for text with a
    <a href="...">link!</a></p>
  <b>Top Beverages</b>
  <ul>
    <li>Tea</li>
    <li>Coffee</li>
    <li>Milkshakes</li>
  </ul>
</body>
</html>
```
# Pipe Operator
## Example: Piping Operator
```{r}
# install.packages("magrittr")
library("magrittr")
4 %>%    # Take the number four and, then
  sqrt() # find the square root
# Same as
# sqrt(4)

c(7, 42, 1, 25) %>% # Combine four elements and, then
  log() %>%         # take the natural log and, then
  round(2) %>%      # round to the second decimal and, then
  diff()            # take the difference between consecutive elements
# Same as
# diff(round(log(c(7, 42, 1, 25)), 2))
```
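By default, `%>%` inserts the piped value as the *first* argument of the next call. When the value should land elsewhere, `magrittr` provides the `.` placeholder; a small sketch:

```{r}
library("magrittr")

# The piped value is substituted wherever `.` appears,
# here as the *second* argument of paste().
"world" %>%
  paste("hello", .)
# Same as
# paste("hello", "world")
```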
# Scraping Information
Prior to continuing, note that you will need the `rvest` package:
```{r}
# install.packages("rvest")
library("rvest")
```
## Example: Reading in a Web Page to R
```{r}
sample_webpage = '
<html>
<head>
  <title>Title of Page</title>
</head>
<body>
  <h1>First order heading (large)</h1>
  <p>Paragraph for text with a
    <a href="...">link!</a></p>
  <b>Top Beverages</b>
  <ul>
    <li>Tea</li>
    <li>Coffee</li>
    <li>Milkshakes</li>
  </ul>
  <table>
    <tr> <th>Name</th>         <th>Salary</th>    </tr>
    <tr> <td>Joshua Tree</td>  <td>66,666</td>    </tr>
    <tr> <td>Aaron Thomas</td> <td>78,921.40</td> </tr>
  </table>
</body>
</html>
'
my_webpage = read_html(sample_webpage)
# Or, grab it from online!
# my_webpage = read_html("http://domain.com/path/to/sample_webpage.html")
# Or, use a local copy.
# my_webpage = read_html("~/Documents/path/to/sample_webpage.html")
my_webpage
```
## Example: Extract Node or Nodes
Process the webpage into _R_.
```{r}
my_webpage = read_html(sample_webpage)
```
Retrieve **all** instances of the `li` element.
```{r}
my_webpage %>%
  html_nodes("li")
```
Retrieve only the first instance of the `li` element.
```{r}
my_webpage %>%
  html_node("li")
```
Extract the text contained in the elements.
```{r}
my_webpage %>%
  html_nodes("li") %>%
  html_text()
```
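A side note: since `rvest` 1.0, `html_nodes()` and `html_node()` have newer equivalents named `html_elements()` and `html_element()`, and `html_text2()` performs the same extraction with tidier whitespace handling. The older names still work; with a recent `rvest`, the previous chunk could be written as:

```{r, eval = FALSE}
my_webpage %>%
  html_elements("li") %>%
  html_text2()
```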
## Example: Retrieve Attributes
Retrieve only the `href` attribute information.
```{r}
my_webpage %>%
  html_nodes("a") %>%
  html_attr("href")
```
Retrieve **all** attributes.
```{r}
my_webpage %>%
  html_nodes("a") %>%
  html_attrs()
```
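Text and attribute extraction combine naturally. As a small sketch, each link's label can be paired with the target it points to in a `data.frame`:

```{r}
links = my_webpage %>%
  html_nodes("a")

# One row per link: the visible text and its href target.
data.frame(
  label = links %>% html_text(),
  url   = links %>% html_attr("href")
)
```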
### Exercise: SelectorGadget
```{r}
library("rvest")
selectorgadget_webpage = read_html("https://selectorgadget.com/")
selectorgadget_webpage
```
```{r}
selectorgadget_webpage %>%
  html_nodes("li:nth-child(1)")
```
```{r}
selectorgadget_webpage %>%
  html_nodes("li:nth-child(1)") %>%
  html_text()
```
## Example: IL State Profile
```{r}
# Read the web page into R
# without saving a local copy.
il_profile = read_html("https://illinoiselectiondata.com/elections/ILprofile.php")
# Select the elements that are tables.
# Extract the contents as a data.frame.
il_profile %>%
  html_nodes(".myTable-blue") %>%
  html_table()
```
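Note that `html_table()` applied to a set of nodes returns a *list* containing one `data.frame` per matched `<table>`. A sketch for pulling out and inspecting the first one:

```{r, eval = FALSE}
il_tables = il_profile %>%
  html_nodes(".myTable-blue") %>%
  html_table()

length(il_tables)    # number of tables matched
head(il_tables[[1]]) # first few rows of the first table
```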
## Example: PBS News Hour
```{r}
pbs_url = "https://www.pbs.org/newshour/"
## Note the selector we found is:
# .card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a
pbs_webpage = read_html(pbs_url)
pbs_webpage %>%
  html_nodes(".card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a") %>%
  html_text()
```
### Exercise: Thomas Crown Affair
Find the top-listed stars of [The Thomas Crown Affair](https://www.imdb.com/title/tt0155267/).
```{r, eval = FALSE}
library("rvest")
# 1. Load in the HTML page into R
imdb_page = read_html("http://www.imdb.com/title/tt2294629")
# Thomas Crown Affair
# read_html("http://www.imdb.com/title/tt0155267/")
# Frozen movie
# read_html("http://www.imdb.com/title/tt2294629")
# 2. Determine the selectorgadget values
# td:nth-child(2) a
nodes_imdb_page = imdb_page %>%
  html_nodes("td:nth-child(2) a")
nodes_imdb_page
```
```{r, eval = FALSE}
# 3. Extract the contents
actor_names = nodes_imdb_page %>%
  html_text()
actor_names
```
```{r, eval = FALSE}
library("stringr")
str_replace(actor_names, pattern = "^[[:space:]]", replacement = "") %>%
  str_replace(pattern = "\\n$", replacement = "")
```
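The two `str_replace()` calls above strip a single leading whitespace character and a trailing newline. If the goal is simply to remove all leading and trailing whitespace, `stringr` provides a one-step helper:

```{r}
library("stringr")

# Trims spaces, tabs, and newlines from both ends of each string.
str_trim("  Some Actor Name\n")
```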
### Exercise: Extract Financial Disclosures
Extract the financial information from the Open Secrets campaign initiative.
Hint: You will need to use `html_table(x)`
```{r}
library("rvest")
# Step 1: Identify Content
#
# Table information containing the money spent in the house election
#
# td , .number, .no-sort
# Step 2: Read in the HTML
house_il_web = read_html("https://www.opensecrets.org/races/election?id=IL")
# Step 3: Extract out the element/node on the page:
finance_info = house_il_web %>%
  html_nodes("div.table-wrap.u-mt2 > table")
# Step 4: Extract the content
finance_info %>%
  html_table(fill = TRUE)
```
## Example: Scraping E-Mail Addresses
Over the last year, Faculty and Staff on the UIUC campus have
been targeted with phishing e-mails that have resulted in security lapses.
These e-mails spoof an official message to trick the victim into
providing their credentials. One question that arises: how are the UIUC
e-mail addresses being obtained in the first place?
Well... let's take a look!
```{r}
library("rvest")
# Download and read into R the grad student directory
stat_grads = read_html("https://stat.illinois.edu/directory/grad-students")
# Extract the nodes with URLs to student profiles.
# Retrieve the href property, take the last portion of the URI,
# and append @illinois.edu to form the e-mail address.
stat_grads %>%
  html_nodes("div.directory__name-link > a") %>%
  html_attr("href") %>%
  basename() %>%
  paste0("@illinois.edu")
```
### Exercise: Generalizing a Scraping Routine
Write a function that generalizes the web scraping procedure.
Apply the function to the History and Math departments:
- https://history.illinois.edu/directory/faculty
- https://math.illinois.edu/directory/faculty
```{r}
scrap_las = function(uri) {
  # Download and read into R the directory page
  directory = read_html(uri)
  # Extract the nodes with URLs to individual profiles.
  # Retrieve the href property, take the last portion of the URI,
  # and append @illinois.edu to form the e-mail address.
  directory %>%
    html_nodes("div.directory__name-link > a") %>%
    html_attr("href") %>%
    basename() %>%
    paste0("@illinois.edu")
}
scrap_las("https://history.illinois.edu/directory/faculty")
scrap_las("https://math.illinois.edu/directory/faculty")
scrap_las("https://sociology.illinois.edu/directory/faculty")
```
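Once the routine is wrapped in a function, it scales to any number of directories. A sketch using `lapply()` over a named vector of directory URLs:

```{r, eval = FALSE}
departments = c(
  history   = "https://history.illinois.edu/directory/faculty",
  math      = "https://math.illinois.edu/directory/faculty",
  sociology = "https://sociology.illinois.edu/directory/faculty"
)

# One character vector of e-mail addresses per department.
emails_by_dept = lapply(departments, scrap_las)
```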
# Advanced Web Scraping
## Example: Generalizing Output
Grabbed HTML output for two cast entries:
```html
<td itemprop="actor">
  <a href="..."><span itemprop="name">Kristen Bell</span></a>
</td>
```
```html
<td itemprop="actor">
  <a href="..."><span itemprop="name">Alan Tudyk</span></a>
</td>
```
To generalize, we're aiming to find some attribute on the HTML tag that appears
multiple times. If we can find such an attribute, then we can construct
a CSS selector of `tag[attribute=value]`.
```{r}
# Read in the Movie
imdb_movie = read_html("https://www.imdb.com/title/tt0155267/")
# Create a CSS selector based on two or more HTML attributes.
imdb_movie %>%
  html_nodes("td[itemprop=\"actor\"] span[itemprop=\"name\"]") %>%
  html_text()
```
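The same attribute-keyed selection can also be written in XPath, which `html_nodes()` accepts via its `xpath` argument. An equivalent sketch:

```{r, eval = FALSE}
imdb_movie %>%
  html_nodes(xpath = '//td[@itemprop="actor"]//span[@itemprop="name"]') %>%
  html_text()
```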
## Example: Google News
Consider the news aggregation service provided by Google at <https://news.google.com>.
```{r gnews-get-data}
gnews = read_html("https://news.google.com")
gnews
```
Unfortunately, if we use the SelectorGadget approach, we run into
an issue: the news stories are "refreshed" periodically, which gives each
entry a unique identifier that disappears once new stories are added.
As a result, the following yields no results because the selector IDs are
stale.
```{r gnews-id-keyed}
gnews %>%
  html_nodes(".kWyHVd .ME7ew") %>%
  html_text()
```
Instead, one can observe the structure of a single news entry:
```html
<article>
  <h3><a href="...">Story body text</a></h3>
</article>
```
```{r}
gnews %>%
  html_nodes("article > h3 > a") %>%
  html_text()
```