• Stars: 1,481
• Rank: 31,129 (Top 0.7 %)
• Language: R
• License: Other
• Created about 10 years ago
• Updated 5 months ago

Repository Details

Simple web scraping for R

rvest

Overview

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like Beautiful Soup and RoboBrowser.

If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.
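
As a rough illustration of that pairing, the sketch below wraps a polite session around an rvest query. The helper name scrape_politely() and the user agent string are hypothetical; bow() and scrape() are functions from the polite package, and the default delay between requests is left unchanged.

```r
library(rvest)

# Hypothetical helper: fetch a page through polite, then query it with rvest.
# Requires the polite package to be installed; nothing is requested until
# the function is called.
scrape_politely <- function(url, selector) {
  # bow() introduces the client to the host and checks robots.txt
  session <- polite::bow(url, user_agent = "my-scraper (me@example.com)")
  # scrape() retrieves the page, honouring the session's crawl delay
  page <- polite::scrape(session)
  page %>% html_elements(selector)
}

# Example call (needs network access):
# scrape_politely("https://rvest.tidyverse.org/articles/starwars.html", "section")
```

Because scrape() returns a parsed document for HTML pages, the usual rvest functions can be applied to its result.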

Installation

# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")

Usage

library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a CSS selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films %>% 
  html_element("h2") %>% 
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"   
#> [3] "Revenge of the Sith"     "A New Hope"             
#> [5] "The Empire Strikes Back" "Return of the Jedi"     
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>% 
  html_element("h2") %>% 
  html_attr("data-id") %>% 
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7
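
The steps above can also be tried without network access. The following is a minimal sketch that assumes a small inline document built with minimal_html() as a stand-in for the real page, and collects the extracted pieces into one data frame; the name film_df is only illustrative.

```r
library(rvest)

# Inline stand-in for a real page (illustrative content only)
page <- minimal_html('
  <section><h2 data-id="1">The Phantom Menace</h2></section>
  <section><h2 data-id="2">Attack of the Clones</h2></section>
')

# One node per film
films <- page %>% html_elements("section")

# html_element() returns exactly one match per film, so the columns line up
film_df <- data.frame(
  episode = films %>% html_element("h2") %>% html_attr("data-id") %>% as.integer(),
  title = films %>% html_element("h2") %>% html_text2(),
  stringsAsFactors = FALSE
)
film_df
```

Using html_element() rather than html_elements() inside each film keeps the vectors the same length as films, since a missing match yields NA instead of being dropped.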

If the page contains tabular data you can convert it directly to a data frame with html_table():

html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>% 
  html_element(".tracklist") %>% 
  html_table()
#> # A tibble: 29 × 4
#>    No.   Title                       `Performer(s)`                       Length
#>    <chr> <chr>                       <chr>                                <chr> 
#>  1 1.    "\"Everything Is Awesome\"" "Tegan and Sara featuring The Lonel… 2:43  
#>  2 2.    "\"Prologue\""              ""                                   2:28  
#>  3 3.    "\"Emmett's Morning\""      ""                                   2:00  
#>  4 4.    "\"Emmett Falls in Love\""  ""                                   1:11  
#>  5 5.    "\"Escape\""                ""                                   3:26  
#>  6 6.    "\"Into the Old West\""     ""                                   1:00  
#>  7 7.    "\"Wyldstyle Explains\""    ""                                   1:21  
#>  8 8.    "\"Emmett's Mind\""         ""                                   2:17  
#>  9 9.    "\"The Transformation\""    ""                                   1:46  
#> 10 10.   "\"Saloons and Wagons\""    ""                                   3:38  
#> # ℹ 19 more rows
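
The same call can be tried on a small inline table. This sketch assumes a made-up three-column table built with minimal_html(), and passes convert = FALSE so that html_table() leaves every column as character rather than guessing types.

```r
library(rvest)

# Inline stand-in for a page containing a table (illustrative content only)
page <- minimal_html('
  <table class="tracklist">
    <tr><th>No.</th><th>Title</th><th>Length</th></tr>
    <tr><td>1.</td><td>Everything Is Awesome</td><td>2:43</td></tr>
    <tr><td>2.</td><td>Prologue</td><td>2:28</td></tr>
  </table>
')

# Select the table by class, then parse it into a tibble
tracks <- page %>%
  html_element(".tracklist") %>%
  html_table(convert = FALSE)
tracks
```

Leaving convert at its default asks html_table() to convert columns automatically, which is convenient for numeric data but may not be wanted for values such as track numbers or durations.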
