  • Stars: 1,208
  • Rank: 38,795 (Top 0.8%)
  • Language: Python
  • License: MIT License
  • Created over 7 years ago
  • Updated 4 months ago

Repository Details

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)

pybaseball

Baseball data scraping and analysis tools in Python

Overview

pybaseball is a Python package for baseball data analysis. This package scrapes Baseball Reference, Baseball Savant, and FanGraphs so you don't have to. The package retrieves Statcast data, pitching stats, batting stats, division standings/team records, awards data, and more. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods. See the docs for a comprehensive list of data acquisition functions.

Installation

Pybaseball can be installed via pip:

pip install pybaseball

or from the repo (which may at times be more up to date):

git clone https://github.com/jldbc/pybaseball
cd pybaseball
pip install -e .

We will try to publish periodic updates through the 'releases' and PyPI CI, but these may lag behind the repo at times.

Community

Discussion about pybaseball use and development is hosted on our group Discord (sign-up link here). Issues with the codebase should still be raised and addressed on GitHub.

Documentation

Full documentation on available functions and their arguments, along with examples, is located in the docs folder. This section contains a brief overview of the main functionality of this library.

Statcast: Pull advanced metrics from Major League Baseball's Statcast system

Statcast data include pitch-level information, pulled from baseballsavant.com.

>>> from pybaseball import statcast
>>> statcast(start_dt="2019-06-24", end_dt="2019-06-25").columns
Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',
       'woba_value', 'woba_denom', 'babip_value', 'iso_value',
       'launch_speed_angle', 'at_bat_number', 'pitch_number', 'pitch_name',
       'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
       'post_home_score', 'post_bat_score', 'post_fld_score',
       'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',
       'delta_home_win_exp', 'delta_run_exp'],
      dtype='object')

For documentation on the definitions of these columns, see the Statcast Search CSV Documentation.

If start_dt and end_dt are supplied, statcast returns all Statcast data between those two dates. If not, it returns yesterday's data. The optional verbose argument controls whether the library reports its progress while it pulls the data.
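
The default-date behaviour described above can be sketched as follows. This is an illustration only, not pybaseball's internal code: resolve_dates is a hypothetical helper, and treating a single supplied date as a one-day query is an assumption made for the example.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_dates(start_dt: Optional[str] = None,
                  end_dt: Optional[str] = None,
                  today: Optional[date] = None) -> Tuple[str, str]:
    """Hypothetical helper: return (start, end) as YYYY-MM-DD strings."""
    today = today or date.today()
    if start_dt is None and end_dt is None:
        # No dates supplied: fall back to yesterday, as the text describes.
        yesterday = (today - timedelta(days=1)).isoformat()
        return yesterday, yesterday
    # Assumption for this sketch: a single supplied date means a one-day query.
    if start_dt is None:
        start_dt = end_dt
    if end_dt is None:
        end_dt = start_dt
    return start_dt, end_dt


start, end = resolve_dates(today=date(2019, 6, 26))
print(start, end)  # 2019-06-25 2019-06-25

# The resolved strings could then be passed on, for example:
# from pybaseball import statcast
# df = statcast(start_dt=start, end_dt=end)
```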

Player-Specific Queries

For a player-specific statcast query, pull pitching or batting data using the statcast_pitcher and statcast_batter functions. These take the same start_dt and end_dt arguments as the statcast function, as well as a player_id argument. This ID comes from MLB Advanced Media, and can be obtained using the function playerid_lookup. The returned columns match the set above, but filtered to rows for that specific pitcher or batter. A complete example:

# Find Clayton Kershaw's player id
from pybaseball import playerid_lookup
from pybaseball import statcast_pitcher
playerid_lookup('kershaw', 'clayton')
  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs  mlb_played_first  mlb_played_last
0   kershaw    clayton     477132  kersc001  kershcl01           2036            2008.0           2022.0

# His MLBAM ID is 477132, so we feed that as the player_id argument to the following function 
kershaw_stats = statcast_pitcher('2017-06-01', '2017-07-01', 477132)
kershaw_stats.groupby("pitch_type").release_speed.agg("mean")
pitch_type
CH    86.725000
CU    73.133333
FF    92.844622
SI    94.515385
SL    87.962381
Name: release_speed, dtype: float64
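
Rather than copying the ID by hand, it can be read from the lookup result. The sketch below rebuilds part of the lookup row shown above as a small DataFrame so that it runs without network access; in practice lookup would be the value returned by playerid_lookup('kershaw', 'clayton'). The single-match check is an added precaution, not something the library requires.

```python
import pandas as pd

# Stand-in for playerid_lookup('kershaw', 'clayton'), copied from the output above.
lookup = pd.DataFrame({
    "name_last": ["kershaw"],
    "name_first": ["clayton"],
    "key_mlbam": [477132],
    "key_retro": ["kersc001"],
    "key_bbref": ["kershcl01"],
    "key_fangraphs": [2036],
})

if len(lookup) != 1:
    # A common name may match several players; pick one explicitly in that case.
    raise ValueError(f"expected exactly one match, got {len(lookup)}")

# key_mlbam is the MLB Advanced Media ID used as the player_id argument.
player_id = int(lookup["key_mlbam"].iloc[0])
print(player_id)  # 477132
```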

A note on Statcast data

Statcast data is subject to change (even for prior seasons):

Each season has 700,000+ pitches, and is subject to update. You should code accordingly.

- Tangotiger (@tangotiger), February 17, 2021
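
One way to allow for revisions is to make pulls repeatable: split a long date range into short windows so that any single window can be fetched again later. A minimal sketch, assuming a hypothetical helper named date_windows; it is independent of however pybaseball batches its own requests.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start_dt: str, end_dt: str,
                 days: int = 7) -> Iterator[Tuple[str, str]]:
    """Yield consecutive inclusive (start, end) windows as YYYY-MM-DD strings."""
    if days < 1:
        raise ValueError("days must be at least 1")
    current = date.fromisoformat(start_dt)
    last = date.fromisoformat(end_dt)
    while current <= last:
        # Each window covers `days` calendar days, clipped to the overall end.
        window_end = min(current + timedelta(days=days - 1), last)
        yield current.isoformat(), window_end.isoformat()
        current = window_end + timedelta(days=1)


windows = list(date_windows("2019-06-24", "2019-07-05"))
print(windows)
# [('2019-06-24', '2019-06-30'), ('2019-07-01', '2019-07-05')]
```

Each (start, end) pair can be passed to statcast(start_dt=..., end_dt=...) and the result stored per window, so a revised window can be replaced without re-pulling the whole season.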

Aggregate Statistics

For league-wide season-level pitching data, use the function pitching_stats(start_season, end_season). This will return one row per player per season, and provide all metrics made available by FanGraphs.

For a fixed date range, pitching_stats_range(start_dt, end_dt) pulls data for a specific time interval from Baseball Reference. Note that all dates should be in YYYY-MM-DD format.

from pybaseball import pitching_stats
data = pitching_stats(2014,2016)
data.columns
Index(['IDfg', 'Season', 'Name', 'Team', 'Age', 'W', 'L', 'WAR', 'ERA', 'G',
       ...
       'LA', 'Barrels', 'Barrel%', 'maxEV', 'HardHit', 'HardHit%', 'Events',
       'CStr%', 'CSW%', 'xERA'],
      dtype='object', length=334)
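
Because the date-range functions expect YYYY-MM-DD strings, it can help to validate input before making a request. A small sketch using only the standard library; check_date is a hypothetical helper, not part of pybaseball.

```python
from datetime import datetime


def check_date(value: str) -> str:
    """Return value unchanged if it is a zero-padded YYYY-MM-DD date."""
    # strptime raises ValueError for strings that do not match the format.
    parsed = datetime.strptime(value, "%Y-%m-%d")
    # strptime accepts unpadded input such as "2017-6-1", so compare the round trip.
    if parsed.strftime("%Y-%m-%d") != value:
        raise ValueError(f"date must be zero-padded YYYY-MM-DD, got {value!r}")
    return value


start = check_date("2017-06-01")
end = check_date("2017-07-01")

# The validated strings could then be used as, for example:
# from pybaseball import pitching_stats_range
# data = pitching_stats_range(start, end)
```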

Batting stats are obtained similarly. The function call for getting season-level stats is batting_stats(start_season, end_season), and for a particular time range it is batting_stats_range(start_dt, end_dt). The Baseball Reference equivalent for season-level data is batting_stats_bref(season).

(For season level queries, if you prefer Baseball Reference to FanGraphs, there is a third option, pitching_stats_bref(season). This works the same as pitching_stats, but retrieves its data from Baseball Reference instead. This is not recommended, however, because the Baseball Reference query currently can only retrieve one season's worth of data per request.)
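
If Baseball Reference data is needed for several seasons anyway, the one-season-per-request limit can be worked around by looping and concatenating. The sketch below takes the fetch function as a parameter so that it runs offline with a stand-in; in practice fetch would be pitching_stats_bref. The helper name stats_for_seasons, the query_season column, and the stand-in data are all assumptions made for illustration.

```python
from typing import Callable, Iterable

import pandas as pd


def stats_for_seasons(fetch: Callable[[int], pd.DataFrame],
                      seasons: Iterable[int]) -> pd.DataFrame:
    """Call fetch(season) once per season and stack the results."""
    frames = []
    for season in seasons:
        frame = fetch(season).copy()
        # Tag each row with the season requested; the column name is our own choice.
        frame["query_season"] = season
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def fake_fetch(season: int) -> pd.DataFrame:
    """Stand-in for pitching_stats_bref(season) with made-up rows."""
    return pd.DataFrame({"Name": ["Pitcher A", "Pitcher B"], "IP": [200.0, 180.0]})


combined = stats_for_seasons(fake_fetch, range(2014, 2017))
print(combined.shape)  # (6, 3)
```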

Game-by-Game Results and Schedule

The schedule_and_record function returns a team's game-by-game results for a given season. The function's only two arguments are season and team, where team is the team's abbreviation (e.g. NYY for the New York Yankees).

# Example: say we want to know the 1927 Yankees' record on May 16
from pybaseball import schedule_and_record
data = schedule_and_record(1927, 'NYY')
data.loc[data.Date.str.contains("May 16"), :]
              Date   Tm Home_Away  Opp W/L    R   RA  Inn   W-L  Rank      GB      Win      Loss   Save  Time D/N  Attendance   cLI  Streak Orig. Scheduled
28  Monday, May 16  NYY         @  DET   W  6.0  2.0  9.0  19-8   1.0  up 3.0  Ruether  Holloway  Moore  2:28   D      4000.0  5.15       5            None
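
The W-L column in the output above holds the running record as a string such as 19-8. A minimal sketch of turning that string into numbers; parse_record is a hypothetical helper, not part of pybaseball.

```python
from typing import Tuple


def parse_record(record: str) -> Tuple[int, int]:
    """Split a 'wins-losses' string such as '19-8' into two integers."""
    wins, losses = record.split("-")
    return int(wins), int(losses)


wins, losses = parse_record("19-8")
win_pct = wins / (wins + losses)
print(wins, losses, round(win_pct, 3))  # 19 8 0.704
```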

Standings: up-to-date or historical division standings, W/L records

The standings(season) function gives division standings for a given season. If the current season is chosen, it will give the most current set of standings. Otherwise, it will give the end-of-season standings for each division for the chosen season. This function returns a list of dataframes. Each dataframe is the standings for one of MLB's six divisions.

>>> from pybaseball import standings
>>> data = standings(2016)[4]
>>> print(data)
                    Tm    W   L  W-L%    GB
1         Chicago Cubs  103  58  .640    --
2  St. Louis Cardinals   86  76  .531  17.5
3   Pittsburgh Pirates   78  83  .484  25.0
4    Milwaukee Brewers   73  89  .451  30.5
5      Cincinnati Reds   68  94  .420  35.5
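
Indexing the list by position, as in standings(2016)[4], depends on the order in which divisions are returned. Looking a division up by team name avoids that dependency. The sketch below assumes each table has a Tm column, as in the output above, and uses two small stand-in tables so that it runs offline; division_for is a hypothetical helper.

```python
from typing import List

import pandas as pd


def division_for(team: str, divisions: List[pd.DataFrame]) -> pd.DataFrame:
    """Return the first standings table whose 'Tm' column contains team."""
    for table in divisions:
        if (table["Tm"] == team).any():
            return table
    raise KeyError(f"{team!r} not found in any division")


# Stand-in tables; real input would be the list returned by standings(2016).
divisions = [
    pd.DataFrame({"Tm": ["Team A", "Team B"], "W": [90, 80], "L": [72, 82]}),
    pd.DataFrame({"Tm": ["Chicago Cubs", "St. Louis Cardinals"],
                  "W": [103, 86], "L": [58, 76]}),
]

nl_central = division_for("Chicago Cubs", divisions)
print(nl_central["Tm"].tolist())  # ['Chicago Cubs', 'St. Louis Cardinals']
```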

Caching

To speed up repeated calls, pybaseball can save a local copy of the requested data in a cache. The cache is disabled by default so that the library does not use disk space without the user's permission. However, enabling the cache is simple.

The cache can be turned on by importing the pybaseball.cache module and enabling it like so:

from pybaseball import cache

cache.enable()

FAQ

Stale Cache

If you call a statcast method for a future date, the cache will log empty datasets for those dates. If you're not getting the results you expect for a given date, first try clearing your cache:

from pybaseball import cache
cache.purge()
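
To avoid caching empty results in the first place, the end date can be capped at today before a query is made. A minimal sketch, assuming a hypothetical helper named clamp_to_today; it only adjusts the date string and does not touch the cache itself.

```python
from datetime import date
from typing import Optional


def clamp_to_today(end_dt: str, today: Optional[date] = None) -> str:
    """Return end_dt, or today's date if end_dt lies in the future."""
    today = today or date.today()
    requested = date.fromisoformat(end_dt)
    return min(requested, today).isoformat()


safe_end = clamp_to_today("2999-01-01", today=date(2021, 2, 17))
print(safe_end)  # 2021-02-17
```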

Multiprocessing

If you're getting an error involving concurrent.futures.process.BrokenProcessPool, wrap your call in a main guard, e.g.

if __name__ == '__main__':
    stats = statcast()

This may be necessary on systems that use spawn-based processes (often Windows and macOS).

For other problems, please submit an issue.

Contributing

See contributing.md for a guide to contributing to this library.


Credit

This package was developed by James LeDoux and is maintained by Moshe Schorr.

This package was inspired by Bill Petti's excellent R package baseballr, which at the time of this package's development had no Python equivalent. Our hope is to fill that void with this package.

The Lahman data comes from Sean Lahman's baseball database.

All other data comes from FanGraphs, Baseball Reference, the Chadwick Bureau, Retrosheet, and Baseball Savant.
