---
title: "Web Scraping"
author: "JJB + Course"
date: "10/29/2018"
output:
html_document:
toc: true
toc_float:
collapse: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# HTML
### Exercise: Understanding HTML
1. Identify all tags and what can be extracted
2. Determine what tags have attributes and what their properties are
```html
Title of Page
First order heading (large)
Paragraph for text with a
link!
Top Beverages
- Tea
- Coffee
- Milkshakes
```
# Pipe Operator
## Example: Piping Operator
```{r}
# install.packages("magrittr")
library("magrittr")
4 %>% # Take the number four and, then
sqrt() # find the square root
# Same as
# sqrt(4)
c(7, 42, 1, 25) %>% # Combine four elements and, then
log() %>% # take the natural log and, then
round(2) %>% # round to the second decimal and, then
diff() # take the difference between consecutive elements
# Same as
# diff(round(log(c(7,42,1,25)), 2))
```
# Scraping Information
Prior to continuing, note that you will need:
```{r}
# install.packages("rvest")
library("rvest")
```
## Example: Reading in a Web Page to R
```{r}
sample_webpage = '
Title of Page
First order heading (large)
Paragraph for text with a
link!
Top Beverages
- Tea
- Coffee
- Milkshakes
Name |
Salary |
Joshua Tree |
66,666 |
Aaron Thomas |
78,921.40 |
'
my_webpage = read_html(sample_webpage)
# Or, grab it from online!
# my_webpage = read_html("http://domain.com/path/to/sample_webpage.html")
# Or, use a local copy.
# my_webpage = read_html("~/Documents/path/to/sample_webpage.html")
my_webpage
```
## Example: Extract Node or Nodes
Process the webpage into _R_.
```{r}
my_webpage = read_html(sample_html)
```
Retrieve **all** instances of the `li` element.
```{r}
my_webpage %>%
html_nodes("li")
```
Retrieve only the first instance of the `li` element.
```{r}
my_webpage %>%
html_node("li")
```
Extract Text from Elements
```{r}
my_webpage %>%
html_nodes("li") %>%
html_text()
```
## Example: Retrieve Attributes
Retrieve only the `httr` attribute information.
```{r}
my_webpage %>%
html_nodes("a") %>%
html_attr("href")
```
Retrieve **all** attributes.
```{r}
my_webpage %>%
html_nodes("a") %>%
html_attrs()
```
## Example: IL State Profile
```{r}
# Read in the Web Page to R
# without saving into the server.
il_profile = read_html("https://illinoiselectiondata.com/elections/ILprofile.php")
il_profile %>%
html_nodes(".myTable-blue") %>%
html_table()
# Select the elements that are tables.
# Extract the contents as a data.frame.
il_tables = il_profile %>%
html_nodes(".myTable-blue") %>%
html_table()
str(il_tables)
il_tables[[2]]
```
## Example: PBS News Hour
```{r}
pbs_url = "https://www.pbs.org/newshour/"
## Note the selector we found is:
# .card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a
pbs_webpage = read_html(pbs_url)
pbs_webpage %>%
html_nodes(".card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a") %>%
html_text()
```
## Example: Google News
```{r}
gnews = read_html("https://news.google.com")
gnews
gnews %>%
html_nodes(".kWyHVd .ME7ew") %>%
html_text()
```
### Exercise: Thomas Crown Affair
Find the top listed stars of [The Thomas Crown Affair](www.imdb.com/title/tt0155267/)
```{r, eval = FALSE}
library("rvest")
# 1. Load in the HTML page into R
imdb_page = # read_html("http://www.imdb.com/title/tt2294629")
# Thomas Crown Affair
# read_html("http://www.imdb.com/title/tt0155267/")
# Frozen movie
# read_html("http://www.imdb.com/title/tt2294629")
# 2. Determine the selectorgadget values
# ...
# 3. Extract the contents
# ...
```
### Exercise: Extract Financial Disclosures
Extract the financial information from the Open Secrets campaign initiative.
Hint: You will need to use `html_table(x)`
```{r}
# Selector
## td , .number, .no-sort
library(rvest)
opensecrets = read_html("https://www.opensecrets.org/races/election?id=IL")
opensecrets
opensecrets %>%
html_nodes("table")%>%
html_table(fill = TRUE)
```
# Advanced Web Scraping
## Example: Generalizing Output
Grabbed HTML Output
```html
Kristen Bell
|
```
```html
Alan Tudyk
|
```
To generalize, we're aiming to find some attribute on the HTML tag that appears
multiple times. If we can find such an attribute, then we can construct
a CSS selector of `tag[attribute=value]`.
```{r}
# Read in the Movie
imdb_movie = read_html("https://www.imdb.com/title/tt0155267/")
# Create a CSS selector based on two or more HTML attributes.
imdb_movie %>%
html_nodes("td[itemprop=\"actor\"] span[itemprop=\"name\"]") %>%
html_text()
```