--- title: "Web Scraping" author: "JJB + Course" output: html_document: toc: true toc_float: collapsed: false --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # HTML ### Exercise: Understanding HTML 1. Identify all tags and what can be extracted 2. Determine what tags have attributes and what their properties are ```html Title of Page

First order heading (large)

Paragraph for text with a link!

Top Beverages

  1. Tea
  2. Coffee
  3. Milkshakes

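One way to check your answers is to let R list the tags for you. Below is a minimal sketch using the `rvest` functions introduced later in this document; it assumes the snippet above has been saved to a local file named `sample.html` (a hypothetical path).

```{r, eval = FALSE}
# Sketch: list every tag in the page (assumes the snippet above was
# saved to a hypothetical local file named "sample.html").
library("rvest")

page = read_html("sample.html")

page %>%
  html_nodes("*") %>% # Select every element in the document and, then
  html_name()         # report each element's tag name
```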

# Pipe Operator

## Example: Piping Operator

```{r}
# install.packages("magrittr")
library("magrittr")

4 %>%    # Take the number four and, then
  sqrt() # find the square root

# Same as
# sqrt(4)

c(7, 42, 1, 25) %>% # Combine four elements and, then
  log() %>%         # take the natural log and, then
  round(2) %>%      # round to the second decimal and, then
  diff()            # take the difference between consecutive elements

# Same as
# diff(round(log(c(7, 42, 1, 25)), 2))
```

# Scraping Information

Prior to continuing, note that you will need:

```{r}
# install.packages("rvest")
library("rvest")
```

## Example: Reading in a Web Page to R

```{r}
sample_webpage = '<html>
<head>
<title>Title of Page</title>
</head>
<body>
<h1>First order heading (large)</h1>
<p>Paragraph for text with a <a href="https://example.com">link</a>!</p>
<h2>Top Beverages</h2>
<ol>
  <li>Tea</li>
  <li>Coffee</li>
  <li>Milkshakes</li>
</ol>
<table>
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Joshua Tree</td><td>66,666</td></tr>
  <tr><td>Aaron Thomas</td><td>78,921.40</td></tr>
</table>
</body>
</html>'

my_webpage = read_html(sample_webpage)

# Or, grab it from online!
# my_webpage = read_html("http://domain.com/path/to/sample_webpage.html")

# Or, use a local copy.
# my_webpage = read_html("~/Documents/path/to/sample_webpage.html")

my_webpage
```
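Under the hood, `read_html()` comes from the xml2 package, which rvest builds on. As a sketch, xml2's `xml_structure()` prints the parsed tag tree, which is handy for verifying what was read in:

```{r, eval = FALSE}
# Sketch: print the parsed tag tree. The xml2 package is installed
# alongside rvest, since rvest builds on top of it.
library("xml2")

xml_structure(my_webpage)
```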

## Example: Extract Node or Nodes

Process the webpage into _R_.

```{r}
my_webpage = read_html(sample_webpage)
```

Retrieve **all** instances of the `li` element.

```{r}
my_webpage %>%
  html_nodes("li")
```

Retrieve only the first instance of the `li` element.

```{r}
my_webpage %>%
  html_node("li")
```

Extract the text from the elements.

```{r}
my_webpage %>%
  html_nodes("li") %>%
  html_text()
```

## Example: Retrieve Attributes

Retrieve only the `href` attribute information.

```{r}
my_webpage %>%
  html_nodes("a") %>%
  html_attr("href")
```

Retrieve **all** attributes.

```{r}
my_webpage %>%
  html_nodes("a") %>%
  html_attrs()
```

### Exercise: SelectorGadget

```{r}
library("rvest")

selectorgadget_webpage = read_html("https://selectorgadget.com/")
selectorgadget_webpage
```

```{r}
selectorgadget_webpage %>%
  html_nodes("li:nth-child(1)")
```

```{r}
selectorgadget_webpage %>%
  html_nodes("li:nth-child(1)") %>%
  html_text()
```

## Example: IL State Profile

```{r}
# Read the web page into R without saving a local copy.
il_profile = read_html("https://illinoiselectiondata.com/elections/ILprofile.php")

# Select the elements that are tables and
# extract the contents as a data.frame.
il_profile %>%
  html_nodes(".myTable-blue") %>%
  html_table()
```

## Example: PBS News Hour

```{r}
pbs_url = "https://www.pbs.org/newshour/"

## Note the selector we found is:
# .card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a

pbs_webpage = read_html(pbs_url)

pbs_webpage %>%
  html_nodes(".card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a") %>%
  html_text()
```

### Exercise: Thomas Crown Affair

Find the top listed stars of [The Thomas Crown Affair](https://www.imdb.com/title/tt0155267/).

```{r, eval = FALSE}
library("rvest")

# 1. Load the HTML page into R
imdb_page = read_html("http://www.imdb.com/title/tt0155267/") # Thomas Crown Affair

# Frozen movie
# read_html("http://www.imdb.com/title/tt2294629")

# 2. Determine the SelectorGadget values
# td:nth-child(2) a
nodes_imdb_page = imdb_page %>%
  html_nodes("td:nth-child(2) a")

nodes_imdb_page
```

```{r, eval = FALSE}
# 3. Extract the contents
actor_names = nodes_imdb_page %>%
  html_text()

actor_names
```

```{r, eval = FALSE}
library("stringr")

# Strip the leading whitespace and the trailing newline from each name.
str_replace(actor_names, pattern = "^[[:space:]]", replacement = "") %>%
  str_replace(pattern = "\\n$", replacement = "")
```

### Exercise: Extract Financial Disclosures

Extract the campaign finance information from the OpenSecrets page for the Illinois races. Hint: You will need to use `html_table(x)`.

```{r}
library("rvest")

# Step 1: Identify content
#
# Table information containing the money spent in the house election
#
# td , .number, .no-sort

# Step 2: Read in the HTML
house_il_web = read_html("https://www.opensecrets.org/races/election?id=IL")

# Step 3: Extract the element/node on the page:
finance_info = house_il_web %>%
  html_nodes("div.table-wrap.u-mt2 > table")

# Step 4: Extract the content
finance_info %>%
  html_table(fill = TRUE)
```
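Note that the scraped table stores dollar amounts as character strings (e.g. `"$1,234,567"`). A possible follow-up, sketched under the assumption that the first extracted table is the one of interest and that its second column holds the amounts (hypothetical positions), is to strip the formatting and convert to numeric:

```{r, eval = FALSE}
# Sketch: convert currency text such as "$1,234,567" into numbers.
# Assumes the first scraped table is the one of interest and that its
# second column holds the dollar amounts (hypothetical positions).
money_table = finance_info %>%
  html_table(fill = TRUE) %>%
  .[[1]]

as.numeric(gsub("[$,]", "", money_table[[2]]))
```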
## Example: Scraping E-Mail Addresses

Over the last year, Faculty and Staff on the UIUC campus have been targeted with phishing e-mails that have resulted in security lapses. These e-mails spoof an official message to trick the victim into providing their credentials.

One question that arises is: how are the UIUC e-mail addresses being obtained in the first place? Well... let's take a look!

```{r}
library("rvest")

# Download and read into R the grad student directory
stat_grads = read_html("https://stat.illinois.edu/directory/grad-students")

# Extract the nodes with URLs to student profiles.
# Retrieve the href property, take the last portion of the URI,
# and append @illinois.edu to form the e-mail address.
stat_grads %>%
  html_nodes("div.directory__name-link > a") %>%
  html_attr("href") %>%
  basename() %>%
  paste0("@illinois.edu")
```

### Exercise: Generalizing a Scraping Routine

Write a function that generalizes the web scraping procedure. Apply the function to the History and Math departments:

- https://history.illinois.edu/directory/faculty
- https://math.illinois.edu/directory/faculty

```{r}
scrap_las = function(uri) {
  # Download and read into R the directory page
  directory = read_html(uri)

  # Extract the nodes with URLs to profiles.
  # Retrieve the href property, take the last portion of the URI,
  # and append @illinois.edu to form the e-mail address.
  directory %>%
    html_nodes("div.directory__name-link > a") %>%
    html_attr("href") %>%
    basename() %>%
    paste0("@illinois.edu")
}

scrap_las("https://history.illinois.edu/directory/faculty")
scrap_las("https://math.illinois.edu/directory/faculty")
scrap_las("https://sociology.illinois.edu/directory/faculty")
```

# Advanced Web Scraping

## Example: Generalizing Output

The grabbed HTML output for two different cast members shares a common structure (`href` values elided):

```html
<td itemprop="actor">
  <a href="..."><span itemprop="name">Pierce Brosnan</span></a>
</td>
```

```html
<td itemprop="actor">
  <a href="..."><span itemprop="name">Rene Russo</span></a>
</td>
```

To generalize, we're aiming to find some attribute on the HTML tag that appears multiple times. If we can find such an attribute, then we can construct a CSS selector of the form `tag[attribute=value]`.

```{r}
# Read in the movie page
imdb_movie = read_html("https://www.imdb.com/title/tt0155267/")

# Create a CSS selector based on two or more HTML attributes.
imdb_movie %>%
  html_nodes("td[itemprop=\"actor\"] span[itemprop=\"name\"]") %>%
  html_text()
```

## Example: Google News

Consider the news aggregation service provided by Google at <https://news.google.com>.

```{r gnews-get-data}
gnews = read_html("https://news.google.com")
gnews
```

Unfortunately, if we use the SelectorGadget approach, we'll run into an issue: the news stories are "refreshed", so each entry carries a machine-generated class that disappears once new stories are added. As a result, the following yields no results because the selector classes are stale.

```{r gnews-id-keyed}
gnews %>%
  html_nodes(".kWyHVd .ME7ew") %>%
  html_text()
```

Instead, one can observe the structure of a single news entry, which looks roughly like:

```html
<article>
  <h3>
    <a href="...">Headline text of the story</a>
  </h3>
</article>
```

```{r}
gnews %>%
  html_nodes("article > h3 > a") %>%
  html_text()
```
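The same structural selector can also recover each story's link. A sketch, assuming the `href` values are relative URLs (as Google News has used in the past) that need resolving against the site root:

```{r, eval = FALSE}
# Sketch: pull each story's link. If the href values are relative
# URLs (an assumption), resolve them against the site root.
gnews %>%
  html_nodes("article > h3 > a") %>%
  html_attr("href") %>%
  xml2::url_absolute(base = "https://news.google.com")
```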