
I wanted all of plaintext Project Gutenberg in an easy-to-use format, so I made this

Gutenberg, dammit

By Allison Parrish

Gutenberg, dammit is a corpus of every plaintext file in Project Gutenberg (up until June 2016), organized in a consistent fashion, with (mostly?) consistent metadata. The intended purpose of the corpus is to make it really easy to do creative things with this wonderful and amazing body of freely-available text.

Download the corpus here.

The name of the corpus was inspired by Leonard Richardson's Unicode, dammit.

Code in this repository relies on the data prepared by the GutenTag project (Brooke 2015) and the code is partially based on the GutenTag source code.

NOTE: Not all of the works in Project Gutenberg are in the public domain. Check the Copyright Status field in the metadata for each work you plan on using to be sure. I believe that all of the files in the corpus are redistributable, but it might not be okay for you to "reuse" any works in the corpus that are not in the public domain.

Working with the corpus

The gutenbergdammit.ziputils module has some functions for working with the corpus file in situ using Python's zipfile library, so you don't even have to decompress the file and make a big mess on your hard drive. You can copy/paste these functions, use them as a reference in your own implementation, or use them directly by installing this package from the repo:

pip install https://github.com/aparrish/gutenberg-dammit/archive/master.zip

First, download the ZIP archive and put it in the same directory as your Python code. Then, to (e.g.) retrieve the text of one particular file from the corpus:

>>> from gutenbergdammit.ziputils import retrieve_one
>>> text = retrieve_one("gutenberg-dammit-files-v002.zip", "123/12345.txt")
>>> text[:50]
'[Illustration: "I saw there something missing from'

To retrieve the metadata file:

>>> from gutenbergdammit.ziputils import loadmetadata
>>> metadata = loadmetadata("gutenberg-dammit-files-v002.zip")
>>> metadata[456]['Title']
['Essays in the Art of Writing']

To search for and retrieve files whose metadata contains particular strings:

>>> from gutenbergdammit.ziputils import searchandretrieve
>>> for info, text in searchandretrieve("gutenberg-dammit-files-v002.zip", {'Title': 'Made Easy'}):
...     print(info['Title'][0], len(text))
...
Entertaining Made Easy 108314
Reading Made Easy for Foreigners - Third Reader 209964
The Art of Cookery Made Easy and Refined 262990
Shaving Made Easy	What the Man Who Shaves Ought to Know 44982
Writing and Drawing Made Easy, Amusing and Instructive	Containing The Whole Alphabet in all the Characters now	us'd, Both in Printing and Penmanship 10036
Etiquette Made Easy 119770

Details

The corpus is arranged as multiple subdirectories, each with the first three digits of the number identifying the Gutenberg book. Plain text files for each book whose ID begins with those digits are located in that directory. For example, the book with Gutenberg ID 12345 has the relative path 123/12345.txt. This path fragment is present in the metadata for each file as the gd-path attribute; see below for more details. (Splitting up the files like this is intended to be a compromise that makes accessing each file easy while making life a little bit easier if you're poking around with your file browsing application or ls.)

The files themselves have had Project Gutenberg boilerplate headers and footers stripped away for your convenience. (The code used to strip the boilerplate is copied from GutenTag.) You may want to do your own sanity check on individual files of importance to guarantee that they have the contents you think they should have.

Metadata

The gutenberg-metadata.json file in the zip is a big JSON file with metadata on each book. It is a list of JSON objects with the following format:

{
    "Author": [ "Robert Carlton Brown" ],
    "Author Birth": [ 1886 ],
    "Author Death": [ 1959 ],
    "Author Given": [ "Robert Carlton" ],
    "Author Surname": [ "Brown" ],
    "Copyright Status": [ "Not copyrighted in the United States." ],
    "Language": [ "English" ],
    "LoC Class": [ "SF: Agriculture: Animal culture" ],
    "Num": "14293",
    "Subject": [ "Cookery (Cheese)", "Cheese" ],
    "Title": [ "The Complete Book of Cheese" ],
    "charset": "iso-8859-1",
    "gd-num-padded": "14293",
    "gd-path": "142/14293.txt",
    "href": "/1/4/2/9/14293/14293_8.zip"
}

The capitalized fields correspond to the fields in the official Project Gutenberg metadata, with information about the author broken out into the birth/death/given/surname fields when possible. Fields are presented as lists to accommodate books that (e.g.) have more than one author or title.

The lower-case fields are metadata specific to this corpus, explained below:

  • charset: The character set of the original file. All of the files in the ZIP are in UTF-8 encoding, so this is only helpful if (e.g.) you're using the metadata to refer back to the original file on the Gutenberg website.
  • gd-num-padded: The book number ("Gutenberg ID") left-padded to five digits with zeros.
  • gd-path: The path to the file inside the Gutenberg Dammit zip file, to be appended to the gutenberg-dammit-files/ directory present in the zip file itself.
  • href: The path to the file in the original GutenTag corpus.
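As a minimal sketch of how gd-path maps onto the archive, the helper below joins the gutenberg-dammit-files/ prefix mentioned above to a record's gd-path and reads that member with the standard zipfile module. The name read_by_gd_path is hypothetical and is not part of gutenbergdammit.ziputils; the ziputils functions shown earlier remain the simpler route.

```python
import zipfile

# Top-level directory inside the archive, per the gd-path description above.
PREFIX = "gutenberg-dammit-files/"

def read_by_gd_path(zip_path, record):
    """Hypothetical helper: return the text for one metadata record."""
    member = PREFIX + record["gd-path"]
    with zipfile.ZipFile(zip_path) as zf:
        # Every text file in the corpus is stored as UTF-8.
        return zf.read(member).decode("utf-8")
```

With the metadata loaded, something like read_by_gd_path("gutenberg-dammit-files-v002.zip", metadata[456]) should return the full text of that record, assuming the record has a gd-path field.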

NOTE: Not all records have every field, and not every field is guaranteed to be non-empty.
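Because fields are lists and may be missing or empty, it can help to read them defensively. The helper below is a small sketch of one way to do that; first_value is a hypothetical name, not something provided by this package.

```python
def first_value(record, field, default=None):
    """Hypothetical helper: first entry of a list-valued field, or default."""
    value = record.get(field)
    if isinstance(value, list):
        # Empty lists fall back to the default.
        return value[0] if value else default
    # Scalar fields such as "Num" or "gd-path" are returned unchanged.
    return default if value is None else value

record = {"Title": ["The Complete Book of Cheese"], "Subject": [], "Num": "14293"}
print(first_value(record, "Title"))              # The Complete Book of Cheese
print(first_value(record, "Subject", "n/a"))     # n/a
print(first_value(record, "Author", "Unknown"))  # Unknown
```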

What was included, what was left out

First off, Gutenberg, dammit is based on files from Project Gutenberg, and doesn't include files from any of the related international projects (e.g. Project Gutenberg Canada, Project Gutenberg Australia).

Only Gutenberg items with plaintext files are included in this corpus. It doesn't include audiobooks, and it doesn't include any books only available in text formats other than plaintext (e.g., PDF or HTML).

In some cases, documents that are primarily available in some non-plaintext format will include a "stub" text file that just tells the reader to look at the other file. No attempt has been made to systematically exclude these from the present corpus.

Project Gutenberg includes a number of documents with content that is offensive. Given their possible academic and historical value, no effort has been made to systematically exclude these documents from this corpus. Please take care when including such documents (and portions thereof) in any analysis or creative reinterpretations. Just because a book is in the public domain doesn't mean you always have a right to use its words.

Character encodings

The included text files are all encoded as UTF-8. Each source file from Project Gutenberg is first decoded using the encoding declared in the file's metadata; if that fails, chardet's detect function is used to guess the most likely encoding, and that encoding is used instead. If Python still raises an error when decoding with chardet's guess, ISO-8859-1 is tried as a last resort. If none of these attempts works, the file is left out of the archive.
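The fallback chain described above can be sketched roughly as follows. This is an illustration, not the code in build.py: the function name decode_with_fallback and the injectable detect parameter are assumptions, added so the sketch can run without chardet installed. When detect is omitted, it falls back to chardet.detect, which returns a dict with an "encoding" key.

```python
def decode_with_fallback(raw, declared_charset, detect=None):
    """Sketch: decode bytes using declared charset, then a guess, then ISO-8859-1.

    Returns the decoded text, or None if every attempt fails.
    """
    # 1. Try the encoding declared in the metadata.
    try:
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError, TypeError):
        pass  # wrong, unknown, or missing charset

    # 2. Ask a detector for its best guess (chardet by default).
    if detect is None:
        import chardet  # third-party; only needed when step 1 fails
        detect = chardet.detect
    guess = detect(raw).get("encoding")
    if guess:
        try:
            return raw.decode(guess)
        except (UnicodeDecodeError, LookupError):
            pass

    # 3. Last resort. ISO-8859-1 maps every byte value, so in practice
    #    this step does not fail; the except is kept to mirror the prose.
    try:
        return raw.decode("iso-8859-1")
    except UnicodeDecodeError:
        return None
```

The result would then be written back out with text.encode("utf-8") to match the encoding used throughout the archive.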

How to Gutenberg, dammit from scratch

If you just want to use the corpus, don't bother with any of the content that follows. If you want to be able to recreate the process of how I made the corpus, read on.

The scripts in this repository work on the files prepared by GutenTag. In order to use the scripts, you'll need to download their corpus ("Our (full) Project Gutenberg Corpus", ~7 GB ZIP file) and unzip it into a directory on your system.

The included module gutenbergdammit/build.py is designed to be used as a command-line script. Run it on the command line like so:

python -m gutenbergdammit.build --src-path=<path to your gutentag download> \
    --dest-path=output --metadata-file=output/gutenberg-metadata.json

Help on the options:

Usage: build.py [options]

Options:
  -h, --help            show this help message and exit
  -s SRC_PATH, --src-path=SRC_PATH
                        path to GutenTag dump
  -d DEST_PATH, --dest-path=DEST_PATH
                        path to output (will be created if it doesn't exist)
  -m METADATA_FILE, --metadata-file=METADATA_FILE
                        path to metadata file for output (will be overwritten)
  -l LIMIT, --limit=LIMIT
                        limit to n entries (good for testing)
  -o OFFSET, --offset=OFFSET
                        start at index n (good for testing)

The --limit and --offset options are not required, and, if omitted, the tool will default to processing the entire archive.

Notes on implosion

Python's zipfile module doesn't support the compression algorithm used on some of the files in the Gutenberg archive ("implosion"). Whoops. Included in the repository is a script that unzips and re-zips these files using a modern compression algorithm. To run it:

python -m gutenbergdammit.findbadzips --src-path=<gutentag_dump> --fix

This will modify the ~100 files in your GutenTag dump with broken ZIP compression, and save copies of the originals (with -orig at the end of the filename). Leave off --fix to do a dry run (i.e., just show which files are bad, don't fix them).

To use this script, you'll need to have the zip and unzip binaries on your system and in your path. It also probably assumes UNIX-ey paths (i.e., separated with slashes), but a lot of stuff in here does. Pull requests welcome.
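For a sense of how such files can be spotted, here is a minimal sketch that lists members of a ZIP file whose compression method Python's zipfile module cannot read. It is not the implementation of findbadzips; the names SUPPORTED, is_supported and unsupported_members are hypothetical. It relies on the fact that zipfile can still read the archive's directory of members even when it cannot decompress them, and that "implode" is method 6 in the ZIP format.

```python
import zipfile

# Compression methods that zipfile can read on typical Python 3 installs.
SUPPORTED = {
    zipfile.ZIP_STORED,
    zipfile.ZIP_DEFLATED,
    zipfile.ZIP_BZIP2,
    zipfile.ZIP_LZMA,
}

def is_supported(info):
    """True if this ZipInfo uses a method listed in SUPPORTED."""
    return info.compress_type in SUPPORTED

def unsupported_members(zip_file):
    """Return the names of members that zipfile would refuse to extract."""
    with zipfile.ZipFile(zip_file) as zf:
        return [info.filename for info in zf.infolist()
                if not is_supported(info)]
```

An empty list means every member can be read directly; anything else would need to be re-zipped, for example with the --fix option described above.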

Next steps

  • Rework this process so it can construct a similarly-organized archive starting with a straight-up mirror of Project Gutenberg (rather than the GutenTag corpus, which is a combination of the 2010 DVD ISO and I think more recent entries collected via web scraping?)
  • Implement a process for adding newer files to the corpus (by looking at the RSS feed?)
  • Make the corpus zip file into a torrent or something so I'm not paying for every download

Works cited

Brooke, Julian, et al. “GutenTag: An NLP-Driven Tool for Digital Humanities Research in the Project Gutenberg Corpus.” CLfL@ NAACL-HLT, 2015, pp. 42–47.

Version history

  • v0.0.2 (2018-08-11): Fixed character encoding problems and released new version of the archive with resulting encodings.
  • v0.0.1 (2018-08-10): Initial release.

License

In accordance with GutenTag's license:

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.
