Baseball Databank
Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
This work is licensed by Chadwick Baseball Bureau under the Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see http://creativecommons.org/licenses/by-sa/3.0/
About this data
- This is a legacy resource. Data in this format has been circulated by various people for many years, and many applications and users rely on tools that read it. It is maintained by Chadwick Baseball Bureau to support compatibility with those tools and programs. As such, the schema is not open to amendments, either in terms of the scope of coverage or in terms of the data categories available.
- This is a free resource. Statistical data will be updated once at some point during the MLB offseason. To borrow the slogan used by ProMods, "It's ready when it's ready." New releases will be announced via our Twitter account at @chadwickbureau. We regret that we are unable to respond to enquiries about when new versions of the data will be released.
- These data are maintained wholly by Chadwick Baseball Bureau, for the benefit of the community. Users who require data of a different scope, in a different format, and/or with more specific schedules for updates are encouraged to enquire about our various licensing options.
Using or citing this data
We repeat, this is a legacy resource intended for backwards compatibility only. It is suitable for casual or exploratory use, as a convenient dataset for students to practice their data skills, and so forth.
It is not suitable for use as the basis for any kind of publication. The legacy parts of this data are not maintained, most likely contain errors, and definitely do not reflect many of the latest revisions to the historical record.
Researchers who need a dataset suitable for research or publication purposes should contact Chadwick Baseball Bureau.
Organisation of the files
There are three directories in the repository.
- core/ contains the databank itself. These files are automatically produced from our larger dataset.
- contrib/ contains files which are manually maintained by others using the same identifier system as the core. We bundle these for the convenience of the community.
- upstream/ contains files used to construct the databank.
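As a quick way to explore this layout, the following is a minimal sketch that lists each CSV file in one of the directories together with its column names. It is not part of the Databank tooling. The function name `read_headers` is hypothetical, and the sketch assumes that the files are UTF-8 text and that the first row of each file is a header row.

```python
import csv
from pathlib import Path


def read_headers(repo_root, subdir="core"):
    """Return {file name: list of column names} for each CSV file in subdir."""
    headers = {}
    for path in sorted(Path(repo_root, subdir).glob("*.csv")):
        # Assumption: files are UTF-8; "utf-8-sig" also tolerates a leading BOM.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            # Assumption: the first row is the header; an empty file gives [].
            headers[path.name] = next(csv.reader(handle), [])
    return headers


if __name__ == "__main__":
    # "." assumes the script is run from the root of a local copy of the repository.
    for name, columns in read_headers(".").items():
        print(name, columns)
```

Passing `subdir="contrib"` or `subdir="upstream"` inspects the other two directories in the same way.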
Maintenance and sources
Most of the data in the Databank is provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com). These data differ from the data the Bureau provides to its clients in that they contain less detail, are updated less frequently, and are provided on an as-is basis.
The Databank is historically based in part on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.
The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from, and is copyrighted by, Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.
Enquiries and suggested revisions
Enquiries and suggested revisions to the data can be posted in the issue tracker at https://github.com/chadwickbureau/baseballdatabank/issues.
Files in core/ are all generated by scripts. As such they are not edited manually (and therefore pull requests should not be submitted against these files).
Files in upstream/ are manually-maintained files which contain information specific to constructing the Databank. As they are maintained manually, it is valid to submit pull requests containing corrections or additions to these files.