Olipy
Olipy is a Python library for artistic text generation. Unlike most software packages, which have a single, unifying purpose. Olipy is more like a set of art supplies. Each module is designed to help you achieve a different aesthetic effect.
Setup
Olipy is distributed as the olipy
package on PyPI. Here's how to
quickly get started from a command line:
# Create a virtual environment.
virtualenv env
# Activate the virtual environment.
source env/bin/activate
# Install Olipy within the virtual envirionment.
pip install olipy
# Run an example script.
olipy.apollo
Olipy uses the TextBlob
library
to parse text. Installing Olipy through pip
will install
TextBlob as a dependency, but TextBlob
has extra dependencies (text corpora) which
are not installed by pip
. Instructions for installing the extra
dependencies are on the TextBlob
site, but they boil down to running
this Python
script.
Example scripts
Olipy is packaged with a number of scripts which do fun things with
the data and algorithms. You can run any of these scripts from a
virtual environment that has the olipy
package installed.
olipy.apollo
: Generates dialogue between astronauts and Mission Control. Demonstrates Queneau assembly on dialogue.olipy.board_games
: Generates board game names and descriptions. Demonstrates complex Queneau assemblies.olipy.corrupt
"Corrupts" whatever text is typed in by adding increasing numbers of diacritical marks. Demonstrates thegibberish.Corruptor
class.olipy.dinosaurs
: Generates dinosaur names. Demonstrates Queneau assembly on parts of a word.olipy.ebooks
: Selects some lines from a public domain text using the *_ebooks algorithm. Demonstrates theolipy.gutenberg.ProjectGutenbergText
andolipy.ebooks.EbooksQuotes
classes.olipy.gibberish
: Prints out 140-character string of aesthetically pleasing(?) gibberish. Demonstrates thegibberish.Gibberish
class.olipy.mashteroids
: Generates names and IAU citations for minor planets. Demonstrates Queneau assembly on sentences.olipy.sonnet
: Generates Shakespearean sonnets using Queneau assembly.olipy.typewriter
: Retypes whatever you type into it, with added typoes.olipy.words
: Generates common-looking and obscure-looking English words.
Module guide
alphabet.py
A list of interesting groups of Unicode characters -- alphabets, shapes, and so on.
from olipy.alphabet import Alphabet
print(Alphabet.default().random_choice())
# ๐๐
โญ๐๐๐๐โโ๐๐๐๐๐๐๐๐โ๐๐๐๐๐๐๐โจ๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท
print(Alphabet.default().random_choice())
# โโโโโโคโฌโดโผโโโโโโโโโโโโโโโโโ โกโขโฃโคโฅโฆโงโจโฉโชโซโฌโดโตโถโท
This module is used heavily by gibberish.py.
corpora.py
This module makes it easy to load datasets from Darius Kazemi's Corpora Project, as well as additional datasets specific to Olipy -- mostly large word lists which the Corpora Project considers out of scope. (These new datasets are discussed at the end of this document.)
Olipy is packaged with a complete copy of the data from the Corpora Project, so you don't have to install anything extra. However, installing the Corpora Project data some other way can give you datasets created since the Olipy package was updated.
The interface of the corpora
module is that used by Allison Parrish's
pycorpora
project. The
datasets show up as Python modules which contain Python data
structures:
from olipy import corpora
for city in corpora.geography.large_cities['cities']:
print(city)
# Akron
# Albequerque
# Anchorage
# ...
You can use from corpora import
... to import a particular Corpora
Project category:
from olipy.corpora import governments
print(governments.nsa_projects["codenames"][0] # prints "ARTIFICE")
from olipy.pycorpora import humans
print(humans.occupations["occupations"][0] # prints "accountant")
Additionally, corpora supports an API similar to that provided by the Corpora Project node package:
from olipy import corpora
# get a list of all categories
corpora.get_categories() # ["animals", "archetypes"...]
# get a list of subcategories for a particular category
corpora.get_categories("words") # ["literature", "word_clues"...]
# get a list of all files in a particular category
corpora.get_files("animals") # ["birds_antarctica", "birds_uk", ...]
# get data deserialized from the JSON data in a particular file
corpora.get_file("animals", "birds_antarctica") # returns dict w/data
# get file in a subcategory
corpora.get_file("words/literature", "shakespeare_words")
ebooks.py
A module for incongruously sampling texts in the style of the infamous https://twitter.com/horse_ebooks. Based on the https://twitter.com/zzt_ebooks algorithm by Allison Parrish.
from olipy.ebooks import EbooksQuotes
from olipy import corpora
data = corpora.words.literature.fiction.pride_and_prejudice
for quote in EbooksQuotes().quotes_in(data['text']):
print(quote)
# They attacked him in various ways--with barefaced
# An invitation to dinner
# Mrs. Bennet
# ...
Example scripts for ebooks.py:
- example.ebooks.py: Selects some lines from a Project Gutenberg text, with a bias towards the keywords you give it as command-line arguments.
gibberish.py
A module for those interested in the appearance of Unicode glyphs. Its main use is generating aesthetically pleasing gibberish using selected combinations of Unicode code charts.
from olipy.gibberish import Gibberish
print(Gibberish.random().tweet().encode("utf8"))
# เง ๐ง๐เฆฆ๐๐เง๐๐๐เฆเงญเงญเฆ๐เงถเงฆเฆงเฆช๐คเงฏเงฐเงชเฆกเฆผเฆเฆฌเฆจเฆจเฆคเงฒเฆซเฆ๐เงดเงเงเงฆเงเฆเฆ เงฐ๐๐ฅเฆเฆจเฆฟเงถเฆ๐เฆเฆ๐คเฆเฆเฆคเฆพเงเงเฆซเงฎเงฌเงธเฆเฆเฆซ๐เฆเฆฎเฆขเงญเงเฆฃเฆเฆ๐๐เงเฆเฆฟเง๐๐เงบ๐คเงบเฆญ๐เงญ๐คเงกเงฐเฆฒ๐เฆขเฆผเง๐
เฆฏเฆฅเฆเงฑเฆ
# เฆเฆเงซเฆฝ๐เงฉเฆผเฆฆ๐เง เฆธเงเฆฏเฆผเฆเฆถ๐๐๐เฆเงฐเฆธเฆ๐เฆ
๐๐๐จเฆผเฆฆเงฏเงเงซ ๐
gutenberg.py
A module for dealing with texts from Project Gutenberg. Strips headers and footers, and parses the text.
from olipy import corpora
from olipy.gutenberg import ProjectGutenbergText
text = corpora.words.literature.nonfiction.literary_shrines['text']
text = ProjectGutenbergText(text)
print(len(text.paragraphs))
# 1258
ia.py
A module for dealing with texts from Internet Archive.
import random
from olipy.ia import Text
# Print a URL to the web reader for a specific title in the IA collection.
item = Text("yorkchronicle1946poqu")
print(item.reader_url(10))
# https://archive.org/details/yorkchronicle1946poqu/page/n10
# Pick a random page from a specific title, and print a URL to a
# reusable image of that page.
identifier = "TNM_Radio_equipment_catalog_fall__winter_1963_-_H_20180117_0150"
item = Text(identifier)
page = random.randint(0, item.pages-1)
print(item.image_url(page, scale=8))
# https://ia600106.us.archive.org/BookReader/BookReaderImages.php?zip=/30/items/TNM_Radio_equipment_catalog_fall__winter_1963_-_H_20180117_0150/TNM_Radio_equipment_catalog_fall__winter_1963_-_H_20180117_0150_jp2.zip&file=TNM_Radio_equipment_catalog_fall__winter_1963_-_H_20180117_0150_jp2/TNM_Radio_equipment_catalog_fall__winter_1963_-_H_20180117_0150_0007.jp2&scale=8
letterforms.py
A module that knows things about the shapes of Unicode glyphs.
alternate_spelling
translates from letters of the English alphabet
to similar-looking characters.
from olipy.letterforms import alternate_spelling
print(alternate_spelling("I love alternate letterforms."))
# ใฑ ๐ณ๐ฎโ๐ ๐๐ตโฏโ โ๐๏ฝโซช๐ ๐๐พ฿๐แฅฑ๐ง฿๐ โแ ๐.
markov.py
A module for generating new token lists from old token lists using a Markov chain.
Olipy's primary purpose is to promote alternatives to Markov chains (such as Queneau assembly and the *_ebooks algorithm), but sometimes you really do want a Markov chain. Queneau assembly is usually better than a Markov chain above the word level (constructing paragraphs from sentences) and below the word level (constructing words from phonemes), but Markov chains are usually better when assembling sequences of words.
markov.py was originally written by Allison "A. A." Parrish.
from olipy.markov import MarkovGenerator
from olipy import corpora
text = corpora.words.literature.nonfiction.literary_shrines['text']
g = MarkovGenerator(order=1, max=100)
g.add(text)
print(" ".join(g.assemble()))
# The Project Gutenberg-tm trademark. Canst thou, e'en thus, thy own savings, went as the gardens, the club. The quarrel occurred between
# him and his essay on the tea-table. In these that, in Lamb's day, for a stray
# relic or four years ago, taken with only Adam and _The
# Corsair_. Writing to his home on his new purple and the young man you might
# mean nothing on Christmas sports and art seriously instead of references to
# the heart'--allowed--yet I got out and more convenient.... Mr.
mosaic.py
Tiles Unicode characters together to create symmetrical mosaics. gibberish.py uses this module as one of its techniques. Includes information on Unicode characters whose glyphs appear to be mirror images.
from olipy.mosaic import MirroredMosaicGibberish
mosaic = MirroredMosaicGibberish()
print(mosaic.tweet())
# โโโโโโโโโโโโ
# โโโโโโโโโโโโ
# โโโโโโโโโโโโ
# โโโโโโโโโโโโ
# โโโโโโโโโโโโ
print(gibberish.tweet())
# ๐๐๐ฏ๐ถ๐๐๐๐๐ถ๐ฏ๐๐
# โ๐ถ๐๐ฏ๐๐ ๐ ๐๐ฏ๐๐ถโ
# ๐๐๐๐๐ฒ๐๐๐ฒ๐๐๐๐
# โ๐ถ๐๐ฏ๐๐ ๐ ๐๐ฏ๐๐ถโ
# ๐๐๐ฏ๐ถ๐๐๐๐๐ถ๐ฏ๐๐
queneau.py
A module for Queneau assembly, a technique pioneered by Raymond Queneau in his 1961 book "Cent mille milliards de poรจmes" ("One hundred million million poems"). Queneau assembly randomly creates new texts from a collection of existing texts with identical structure.
from olipy.queneau import WordAssembler
from olipy.corpus import Corpus
assembler = WordAssembler(Corpus.load("dinosaurs"))
print(assembler.assemble_word())
# Trilusmiasunaus
randomness.py
Techniques for generating random patterns that are more sophisticated
than random.choice
.
Gradient
The Gradient
class generates a string of random choices that are
weighted towards one set of options near the start, and weighted
towards another set of options near the end.
Here's a gradient from lowercase letters to uppercase letters:
from olipy.randomness import Gradient
import string
print("".join(Gradient.gradient(string.lowercase, string.uppercase, 40)))
# rkwyobijqQOzKfdcSHIhYINGrQkBRddEWPHYtORB
WanderingMonsterTable
The WanderingMonsterTable
class lets you make a weighted random selection from
one of four buckets. A random selection from the "common" bucket will show up 65% of the time, a
selection from the "uncommon" bucket 20% of the time, "rare" 11% of the time, and "very rare" 4% of
the time. (It uses the same probabilities as the first edition of Advanced Dungeons & Dragons.)
from olipy.randomness import WanderingMonsterTable
monsters = WanderingMonsterTable(
common=["Giant rat", "Alligator"],
uncommon=["Orc", "Hobgoblin"],
rare=["Mind flayer", "Neo-otyugh"],
very_rare=["Flumph", "Ygorl, Lord of Entropy"],
)
for i in range(5):
print monsters.choice()
# Giant rat
# Alligator
# Alligator
# Orc
# Giant rat
tokenizer.py
A word tokenizer that performs better than NLTK's default tokenizers on some common types of English.
from nltk.tokenize.treebank import TreebankWordTokenizer
s = '''Good muffins cost $3.88\\nin New York. Email: [email protected]'''
TreebankWordTokenizer().tokenize(s)
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'example.com']
WordTokenizer().tokenize(s)
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email:', '[email protected]']
typewriter.py
Simulates the Adler Universal 39 typewriter used in The Shining and the sorts of typos that would be made on that typewriter. Originally written for @a_dull_bot.
from olipy.typewriter import Typewriter
typewriter = Typewriter()
typewriter.type("All work and no play makes Jack a dull boy.")
# 'All work and no play makes Jack a dull bo6.'
Extra corpora
Olipy makes available several word lists and datasets that aren't in
the Corpora Project. These datasets (as well as the standard Corpora
Project datasets) can be accessed through the corpora
module. Just
write code like this:
from olipy import corpora
nouns = corpora.words.common_nouns['abstract_nouns']
corpora.geography.large_cities
Names of large U.S. and world cities.
corpora.geography.us_states
The fifty U.S. states.
corpora.language.languages
Names of languages defined in ISO-639-1
corpora.language.unicode_code_sheets
The name of every Unicode code sheet, each with the characters found on that sheet.
corpora.science.minor_planet_details
'name', 'number' and IAU 'citation' for named minor planets (e.g. asteroids) as of July 2013. The 'discovery' field contains discovery circumstances. The 'suggested_by' field, when present, has been split out from the end of the original IAU citation with a simple heuristic. The 'citation' field has then been tokenized into sentences using NLTK's Punkt tokenizer and a set of custom abbreviations.
Data sources: http://www.minorplanetcenter.net/iau/lists/NumberedMPs.html http://ssd.jpl.nasa.gov/sbdb.cgi
This is more complete than the Corpora Project's minor_planets
,
which only lists the names of the first 1000 minor planets.
corpora.words.adjectives
About 5000 English adjectives, sorted roughly by frequency of occurrence.
corpora.words.common_nouns
Lists of English nouns, sorted roughly by frequency of occurrence.
Includes:
abstract_nouns
like "work" and "love".concrete_nouns
like "face" and "house".adjectival_nouns
-- nouns that can also act as adjectives -- like "chance" and "light".
corpora.words.common_verbs
Lists of English verbs, sorted roughly by frequency of occurrence.
present_tense
verbs like "get" and "want".past_tense
verbs like "said" and "found".gerund
forms like "holding" and "leaving".
corpora.words.english_words
A consolidated list of about 73,000 English words from the FRELI project. (http://www.nkuitse.com/freli/)
corpora.words.scribblenauts
The top 4000 nouns that were 'concrete' enough to be summonable in the 2009 game Scribblenauts. As always, this list is ordered with more common words towards the front.
corpora.words.literature.board_games
Information about board games, collected from BoardGameGeek in July 2013. One JSON object per line.
Data source: http://boardgamegeek.com/wiki/page/BGG_XML_API2
corpora.words.literature.fiction.pride_and_prejudice
The complete text of a public domain novel ("Pride and Prejudice" by Jane Austen).
corpora.words.literature.nonfiction.apollo_11
Transcripts of the Apollo 11 mission, presented as dialogue, tokenized into sentences using NLTK's Punkt tokenizer. One JSON object per line.
Data sources: The Apollo 11 Flight Journal: http://history.nasa.gov/ap11fj/ The Apollo 11 Surface Journal: http://history.nasa.gov/alsj/ "Intended to be a resource for all those interested in the Apollo program, whether in a passing or scholarly capacity."
corpora.words.literature.nonfiction.literary_shrines
The complete text of a public domain nonfiction book ("Famous Houses and Literary Shrines of London" by A. St. John Adcock).
corpora.words.literature.gutenberg_id_mapping
Maps old-style (pre-2007) Project Gutenberg filenames to the new-style ebook IDs. For example, "/etext95/3boat10.zip" is mapped to the number 308 (see http://www.gutenberg.org/ebooks/308). Pretty much nobody needs this.