  • Stars: 241
  • Rank 167,643 (Top 4%)
  • Language: Jupyter Notebook
  • Created over 7 years ago
  • Updated almost 6 years ago

Tutorial: Web scraping in Python with Beautiful Soup

Web scraping the President's lies in 16 lines of Python

This repository contains the Jupyter notebook and dataset from Data School's introductory web scraping tutorial. All that is required to follow along is a basic understanding of the Python programming language.

By the end of the tutorial, you will be able to scrape data from a static web page using the requests and Beautiful Soup libraries, and export that data into a structured text file using the pandas library.

You can also watch the tutorial on YouTube.

Motivation

On July 21, 2017, the New York Times updated an opinion article called Trump's Lies, detailing every public lie the President has told since taking office. Because this is a newspaper, the information was (of course) published as a block of text:

Screenshot of the article

This is a great format for human consumption, but it can't easily be understood by a computer. In this tutorial, we'll extract the President's lies from the New York Times article and store them in a structured dataset.

Screenshot of the DataFrame

Outline of the tutorial

  • What is web scraping?
  • Examining the New York Times article
    • Examining the HTML
    • Fact 1: HTML consists of tags
    • Fact 2: Tags can have attributes
    • Fact 3: Tags can be nested
  • Reading the web page into Python
  • Parsing the HTML using Beautiful Soup
    • Collecting all of the records
    • Extracting the date
    • Extracting the lie
    • Extracting the explanation
    • Extracting the URL
    • Recap: Beautiful Soup methods and attributes
  • Building the dataset
    • Applying a tabular data structure
    • Exporting the dataset to a CSV file
  • Summary: 16 lines of Python code
    • Appendix A: Web scraping advice
    • Appendix B: Web scraping resources
    • Appendix C: Alternative syntax for Beautiful Soup

16 lines of Python code

Just want to see the code? Here it is:

# Fetch the article's HTML
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

# Parse the HTML and collect every <span class="short-desc"> tag (one per record)
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

# Extract the four fields from each record; the slices trim surrounding characters
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

# Build a DataFrame, convert the date column to datetime, and export to CSV
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
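
To see what the slices inside the loop are doing without fetching the live page, here is a minimal offline sketch that runs the same four extraction steps on a single record. The HTML string is hypothetical: it is hand-written to mimic the structure the code above appears to expect (a date in a strong tag followed by a non-breaking space, a quoted statement, and a parenthesized explanation inside a link), and the text and URL are placeholders rather than content from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical record: placeholder text and URL, structure assumed from the
# extraction code above rather than copied from the real page.
sample_html = (
    '<span class="short-desc">'
    '<strong>Jan. 21&nbsp;</strong>'
    '\u201cExample statement.\u201d '
    '<span class="short-truth">'
    '<a href="https://example.com/source">(Example explanation.)</a>'
    '</span></span>'
)

soup = BeautifulSoup(sample_html, 'html.parser')
result = soup.find('span', attrs={'class': 'short-desc'})

# The date text ends with a non-breaking space, so drop the last character
date = result.find('strong').text[0:-1] + ', 2017'

# The second child is the quoted statement: drop the opening quotation mark,
# and the closing quotation mark plus the trailing space
lie = result.contents[1][1:-2]

# The link text is wrapped in parentheses, so drop the first and last character
explanation = result.find('a').text[1:-1]

# Attributes of a tag are read with dictionary-style indexing
url = result.find('a')['href']

print((date, lie, explanation, url))
```

If the real page's markup differs from this assumed structure, the slice boundaries would need to be adjusted accordingly.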

Want to understand the code? Read the tutorial!
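
Beautiful Soup offers several equivalent ways to write the searches used above. The sketch below compares a few of them on a small, hypothetical HTML snippet; it illustrates the library's general options and is not necessarily the set of alternatives covered in the tutorial's Appendix C.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to compare search syntaxes
html = (
    '<p><span class="short-desc"><strong>A</strong></span>'
    '<span class="other">B</span></p>'
)
soup = BeautifulSoup(html, 'html.parser')

# Three ways to find every <span> tag with the class "short-desc"
by_attrs = soup.find_all('span', attrs={'class': 'short-desc'})
by_keyword = soup.find_all('span', class_='short-desc')  # class_ avoids the reserved word
by_css = soup.select('span.short-desc')                  # CSS selector syntax

# Two ways to get the first <strong> tag nested inside a result
first = by_attrs[0]
strong_via_find = first.find('strong')
strong_via_attr = first.strong  # attribute-style shortcut for find()

print(len(by_attrs), len(by_keyword), len(by_css))
print(strong_via_find is strong_via_attr)
```

The explicit `attrs` dictionary is the most verbose form, but it makes it obvious which HTML attribute is being matched.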
