pycorpora

A simple Python interface for Darius Kazemi's Corpora Project, "a collection of static corpora (plural of 'corpus') that are potentially useful in the creation of weird internet stuff." The pycorpora interface makes it easy to use data from the Corpora Project in your program. Here's an example of how it works:

import pycorpora
import random

# print a random flower name
print(random.choice(pycorpora.plants.flowers['flowers']))

# print a random word coined by Shakespeare
print(random.choice(pycorpora.words.literature.shakespeare_words['words']))

Allison Parrish created the pycorpora interface. The source code for the package is on GitHub. Contributions are welcome!

Installation

Installation by hand:

python setup.py install

Installation with pip:

pip install --no-cache-dir pycorpora

The package does not include data from the Corpora Project; instead, the data is downloaded when the package is installed (using either of the methods above). By default, the "master" branch of the Corpora Project GitHub repository is used as the source for the data. You can specify an alternative URL to download the data from using the argument --corpora-zip-url on the command line with either of the two methods above:

python setup.py install --corpora-zip-url=https://github.com/dariusk/corpora/archive/master.zip

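... or, with pip (this form relies on --install-option, which was removed in pip 23.1, so it needs an older pip release):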

pip install pycorpora --install-option="--corpora-zip-url=https://github.com/dariusk/corpora/archive/master.zip"

Alternatively, the CORPORA_ZIP_URL environment variable can be used for the same purpose (if both are set, the command line option will take precedence):

env CORPORA_ZIP_URL=https://github.com/dariusk/corpora/archive/master.zip pip install pycorpora
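The precedence rule can be pictured with a minimal sketch. The function name resolve_zip_url and its arguments are hypothetical and are not part of pycorpora; only the option name, the environment variable name and the default "master" archive URL come from the text above.

```python
import os

# Default archive: the "master" branch of the Corpora Project repository.
DEFAULT_ZIP_URL = "https://github.com/dariusk/corpora/archive/master.zip"


def resolve_zip_url(cli_value=None, environ=None):
    """Pick the data URL: command line option, then environment, then default.

    Hypothetical helper for illustration; not pycorpora's actual code.
    """
    if environ is None:
        environ = os.environ
    if cli_value:
        # --corpora-zip-url wins whenever it is given.
        return cli_value
    # Otherwise fall back to CORPORA_ZIP_URL, then to the default archive.
    return environ.get("CORPORA_ZIP_URL") or DEFAULT_ZIP_URL
```

Passing the environment in as a plain dictionary keeps the helper easy to try out without changing real environment variables.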

(The intention of --corpora-zip-url is to let you install Corpora Project data from a particular branch, commit or fork, so that changes to the bleeding edge of the project don't break your code. Also, a file:// URL can be used for a local/vendored zip file.)
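As a rough illustration of building such URLs, the sketch below assumes the same GitHub archive URL pattern as the master.zip URL above, with the branch name swapped for another branch, tag or commit hash. The helper names github_archive_url and local_zip_url, and the path vendor/corpora.zip, are made up for this example and are not part of pycorpora.

```python
from pathlib import Path


def github_archive_url(ref, owner="dariusk", repo="corpora"):
    """Build a GitHub zip archive URL for a branch, tag, or commit hash."""
    return "https://github.com/{0}/{1}/archive/{2}.zip".format(owner, repo, ref)


def local_zip_url(path):
    """Turn a path to a local/vendored zip file into a file:// URL."""
    # as_uri() requires an absolute path, so resolve the path first.
    return Path(path).resolve().as_uri()


print(github_archive_url("master"))
print(local_zip_url("vendor/corpora.zip"))
```

Either result can then be passed as the value of --corpora-zip-url or CORPORA_ZIP_URL.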

Update

Update Corpora Project data by reinstalling with pip:

pip install --upgrade --force-reinstall pycorpora

Usage

Getting the data from a particular Corpora Project file is easy. Here's an example:

import pycorpora
crayola_data = pycorpora.colors.crayola
print(crayola_data["colors"][0]["color"]) # prints "Almond"

The expression pycorpora.colors.crayola returns data deserialized from the JSON file located at data/colors/crayola.json in the Corpora Project. You can use this syntax even with more deeply nested subdirectories:

import pycorpora
mr_men_little_miss_data = pycorpora.words.literature.mr_men_little_miss
print(mr_men_little_miss_data["little_miss"][-1]) # prints "Wise"
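To make the mapping from attribute names to JSON files more concrete, here is a minimal sketch of how such an interface could be built. It is not pycorpora's actual implementation: the Category class is hypothetical, and the data tree is a tiny stand-in created in a temporary directory so that the example runs without any download.

```python
import json
import os
import tempfile


class Category(object):
    """Maps attribute access onto a directory of JSON files.

    Illustrative only; pycorpora's real implementation may differ.
    """

    def __init__(self, path):
        self._path = path

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        subdir = os.path.join(self._path, name)
        if os.path.isdir(subdir):
            # A directory becomes a nested category.
            return Category(subdir)
        json_path = subdir + ".json"
        if os.path.isfile(json_path):
            # A JSON file becomes deserialized data.
            with open(json_path) as handle:
                return json.load(handle)
        raise AttributeError(name)


# Build a stand-in data tree (made-up contents) for the demonstration.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "colors"))
with open(os.path.join(root, "colors", "crayola.json"), "w") as handle:
    json.dump({"colors": [{"color": "Almond"}]}, handle)

data = Category(root)
print(data.colors.crayola["colors"][0]["color"])  # prints "Almond"
```

Because each directory level returns another Category, the same lookup works for nested paths of any depth.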

You can use from pycorpora import ... to import a particular Corpora Project category:

from pycorpora import governments
print(governments.nsa_projects["codenames"][0]) # prints "ARTIFICE"

from pycorpora import humans
print(humans.occupations["occupations"][0]) # prints "accountant"

You can also use square bracket indexing instead of attributes for accessing subcategories and individual corpora (just in case the Corpora Project ever adds files with names that aren't valid Python identifiers):

import pycorpora
import random
fruits = pycorpora.foods["fruits"]
print(random.choice(fruits["fruits"])) # prints "pomelo" maybe
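A quick way to tell whether a given name would need square brackets is to check it against Python's identifier rules. The sketch below assumes Python 3 (for str.isidentifier); the helper name needs_brackets and the sample names are invented for illustration and are not taken from the Corpora Project.

```python
import keyword


def needs_brackets(name):
    """Return True if `name` cannot be used with attribute (dot) access."""
    # Names with hyphens, leading digits, etc. are not identifiers,
    # and reserved words such as "class" cannot follow a dot either.
    return not name.isidentifier() or keyword.iskeyword(name)


for name in ["fruits", "us_cities", "birds-uk", "2015_data", "class"]:
    style = 'category["{0}"]' if needs_brackets(name) else "category.{0}"
    print(style.format(name))
```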

Additionally, pycorpora supports an API similar to that provided by the Corpora Project node package:

import pycorpora

# get a list of all categories
pycorpora.get_categories() # ["animals", "archetypes"...]

# get a list of subcategories for a particular category
pycorpora.get_categories("words") # ["literature", "word_clues"...]

# get a list of all files in a particular category
pycorpora.get_files("animals") # ["birds_antarctica", "birds_uk", ...]

# get data deserialized from the JSON data in a particular file
pycorpora.get_file("animals", "birds_antarctica") # returns dict w/data

# get file in a subcategory
pycorpora.get_file("words/literature", "shakespeare_words")
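For readers curious how functions like these might work over a directory of JSON files, here is a self-contained sketch. It is an assumption-laden illustration, not pycorpora's source: unlike the real functions, these take an explicit root argument, and the data tree (including the file name demo) is a made-up stand-in created in a temporary directory.

```python
import json
import os
import tempfile


def get_categories(root, category=None):
    """List subdirectory names under root (or under root/category)."""
    base = os.path.join(root, category) if category else root
    return sorted(name for name in os.listdir(base)
                  if os.path.isdir(os.path.join(base, name)))


def get_files(root, category):
    """List JSON file names in a category, without the .json extension."""
    base = os.path.join(root, category)
    return sorted(name[:-len(".json")] for name in os.listdir(base)
                  if name.endswith(".json"))


def get_file(root, category, name):
    """Load and deserialize one JSON file from a category."""
    with open(os.path.join(root, category, name + ".json")) as handle:
        return json.load(handle)


# Stand-in data tree (made-up contents) so the sketch runs offline.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "words", "literature"))
with open(os.path.join(root, "words", "literature", "demo.json"), "w") as f:
    json.dump({"words": ["example"]}, f)

print(get_categories(root))
print(get_categories(root, "words"))
print(get_files(root, "words/literature"))
print(get_file(root, "words/literature", "demo"))
```

Subcategories are addressed with a slash-separated path, mirroring the "words/literature" form used above.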

As an extension of this interface, you can also use the get_categories, get_files and get_file methods on individual categories:

import pycorpora

# get a list of files in the "archetypes" category
pycorpora.archetypes.get_files() # ['artifact', 'character', 'event', ...]

# get an individual file from the "archetypes" category
pycorpora.archetypes.get_file("character") # returns dictionary w/data

# get subcategories of a category
pycorpora.words.get_categories() # ['literature', 'word_clues']
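These per-category methods make it possible to walk a whole category recursively. The sketch below relies only on get_files, get_categories and square bracket access to subcategories, as described above. The function name iter_corpora is hypothetical, and StubCategory (with its made-up contents) exists only so that the example runs without pycorpora installed.

```python
def iter_corpora(category, prefix=""):
    """Yield "path/name" strings for every file under a category object.

    Works with any object that offers get_files(), get_categories(), and
    square bracket access to subcategories.
    """
    for name in category.get_files():
        yield prefix + name
    for sub in category.get_categories():
        for item in iter_corpora(category[sub], prefix + sub + "/"):
            yield item


class StubCategory(object):
    """Hypothetical stand-in for a category object, for demonstration."""

    def __init__(self, files, subcategories=None):
        self._files = files
        self._subs = subcategories or {}

    def get_files(self):
        return list(self._files)

    def get_categories(self):
        return sorted(self._subs)

    def __getitem__(self, key):
        return self._subs[key]


words = StubCategory(
    ["nouns"], {"literature": StubCategory(["shakespeare_words"])})
print(list(iter_corpora(words, "words/")))
```

With pycorpora installed, passing a real category such as pycorpora.words in place of the stub should behave similarly, assuming its subcategories support the same square bracket access.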

Examples

Here are a few quick examples of using data from the Corpora Project to do weird and fun stuff.

Create a list of whimsically colored flowers:

from pycorpora import plants, colors
import random

random_flowers = random.sample(plants.flowers["flowers"], 10)
random_colors = random.sample(
    [item['color'] for item in colors.crayola["colors"]], 10)
for pair in zip(random_colors, random_flowers):
    print(" ".join(pair).title())

# outputs (e.g.):
#   Maroon Bergamot
#   Blue Bell Zinnia
#   Pink Flamingo Camellias
#   Tickle Me Pink Begonia
#   Burnt Orange Clover
#   Fuzzy Wuzzy Hibiscus
#   Outer Space Forget Me Not
#   Almond Petunia
#   Pine Green Ladys Slipper
#   Shadow Jasmine

Create random biographies:

from pycorpora import humans, geography
import random

def a_biography():
    return "{0} is a(n) {1} who lives in {2}.".format(
        random.choice(humans.firstNames["firstNames"]),
        random.choice(humans.occupations["occupations"]),
        random.choice(geography.us_cities["cities"])["city"])

for i in range(5):
    print(a_biography())

# outputs (e.g.):
#   Jessica is a(n) ceiling tile installer who lives in Grand Forks.
#   Kayla is a(n) substance abuse social worker who lives in Torrance.
#   Luis is a(n) hydrologist who lives in Saginaw.
#   Leah is a(n) heating installer who lives in Danville.
#   Grant is a(n) building inspector who lives in Vineland.

Automated pizza topping-related boasts about your inebriation:

from pycorpora import words, foods
import random

# "I'm so smashed I could eat a pizza with spinach, cheese, *and* hot sauce."
print("I'm so {0} I could eat a pizza with {1}, {2}, *and* {3}.".format(
    random.choice(words.states_of_drunkenness["states_of_drunkenness"]),
    *random.sample(foods.pizzaToppings["pizzaToppings"], 3)))

The possibilities... are endless.

History

  • 0.1.2: Python 3 compatibility (contributed by Sam Raker); vastly improved build process (contributed by Hugo van Kemenade).

License

The pycorpora package is MIT licensed (see LICENSE.txt). The data in the Corpora Project is itself in the public domain (CC0).

Acknowledgements

Thanks to Darius Kazemi and all of the Corpora Project contributors!

This package was developed as part of my Spring 2015 research fellowship at ITP. Thank you to the program and its students for their interest and support!
