Markovify

Markovify is a simple, extensible Markov chain generator. Right now, its primary use is for building Markov models of large corpora of text and generating random sentences from that. However, in theory, it could be used for other applications.

Why Markovify?

Some reasons:

  • Simplicity. "Batteries included," but it is easy to override key methods.

  • Models can be stored as JSON, allowing you to cache your results and save them for later.

  • Text parsing and sentence generation methods are highly extensible, allowing you to set your own rules.

  • Relies only on pure-Python libraries, and very few of them.

  • Tested on Python 3.7, 3.8, 3.9, and 3.10.

Installation

pip install markovify

Basic Usage

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 280 characters
for i in range(3):
    print(text_model.make_short_sentence(280))

Notes:

  • The usage examples here assume you are trying to markovify text. If you would like to use the underlying markovify.Chain class, which is not text-specific, check out the (annotated) source code.

  • Markovify works best with large, well-punctuated texts. If your text does not use .s to delineate sentences, put each sentence on a newline and use the markovify.NewlineText class instead of the markovify.Text class.

  • If you have accidentally read the input text as one long sentence, markovify will be unable to generate new sentences from it, due to a lack of beginning and ending delimiters. This can happen if you have read a newline-delimited file using markovify.Text instead of markovify.NewlineText. To check for this, the expression [key for key in text_model.chain.model.keys() if "___BEGIN__" in key] returns all of the possible sentence-starting words; it should yield more than one result.

  • By default, the make_sentence method tries a maximum of 10 times per invocation to make a sentence that does not overlap too much with the original text. If it succeeds, the method returns the sentence as a string; if not, it returns None. To increase or decrease the number of attempts, use the tries keyword argument, e.g., .make_sentence(tries=100).

  • By default, markovify.Text tries to generate sentences that do not simply regurgitate chunks of the original text. The default rule is to suppress any generated sentence that exactly overlaps the original text by 15 words or by 70% of the sentence's word count. You can change this rule by passing max_overlap_ratio and/or max_overlap_total to the make_sentence method. Alternatively, you can disable this check entirely by passing test_output=False.

Advanced Usage

Specifying the model's state size

The state size is the number of preceding words that the probability of the next word depends on.

By default, markovify.Text uses a state size of 2, but you can instantiate a model with a different state size. E.g.:

text_model = markovify.Text(text, state_size=3)

Combining models

With markovify.combine(...), you can combine two or more Markov chains. The function accepts two arguments:

  • models: A list of markovify objects to combine. Can be instances of markovify.Chain or markovify.Text (or their subclasses), but all must be of the same type.
  • weights: Optional. A list — the exact length of models — of ints or floats indicating how much relative emphasis to place on each source. Default: [ 1, 1, ... ].

For instance:

model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)

model_combo = markovify.combine([ model_a, model_b ], [ 1.5, 1 ])

This code snippet would combine model_a and model_b, placing 50% more weight on the connections from model_a.

Compiling a model

Once a model has been generated, it may also be compiled for improved text generation speed and reduced size.

text_model = markovify.Text(text)
text_model = text_model.compile()

Models may also be compiled in-place:

text_model = markovify.Text(text)
text_model.compile(inplace = True)

Currently, compiled models may not be combined with other models using markovify.combine(...). If you wish to combine models, do that first and then compile the result.

Working with messy texts

Starting with v0.7.2, markovify.Text accepts two additional parameters: well_formed and reject_reg.

  • Setting well_formed = False skips the step in which input sentences are rejected if they contain one of the 'bad characters' (i.e., ()[]'").

  • Setting reject_reg to a regular expression of your choice lets you change the input-sentence rejection pattern. This applies only if well_formed is True and the expression is non-empty.

Extending markovify.Text

The markovify.Text class is highly extensible; most methods can be overridden. For example, the following POSifiedText class uses NLTK's part-of-speech tagger to generate a Markov model that obeys sentence structure better than a naive model. (It works; however, be warned: pos_tag is very slow.)

import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

Alternatively, you can use spaCy, which is much faster:

import markovify
import re
import spacy

nlp = spacy.load("en_core_web_sm")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

The most useful markovify.Text methods to override are:

  • sentence_split
  • sentence_join
  • word_split
  • word_join
  • test_sentence_input
  • test_sentence_output

For details on what they do, see the (annotated) source code.

Exporting

It can take a while to generate a Markov model from a large corpus. Sometimes you'll want to generate once and reuse it later. To export a generated markovify.Text model, use my_text_model.to_json(). For example:

corpus = open("sherlock.txt").read()

text_model = markovify.Text(corpus, state_size=3)
model_json = text_model.to_json()
# In theory, here you'd save the JSON to disk, and then read it back later.

reconstituted_model = markovify.Text.from_json(model_json)
reconstituted_model.make_short_sentence(280)

>>> 'It cost me something in foolscap, and I had no idea that he was a man of evil reputation among women.'

You can also export the underlying Markov chain on its own — i.e., excluding the original corpus and the state_size metadata — via my_text_model.chain.to_json().

Generating markovify.Text models from very large corpora

By default, the markovify.Text class loads, and retains, your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences). However, with very large corpora, loading the entire text at once (and retaining it) can be memory-intensive. To overcome this, you can (a) tell Markovify not to retain the original:

with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())

And (b) read in the corpus line-by-line or file-by-file and combine them into one model at each step:

import os

combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())

Markovify In The Wild

Have other examples? Pull requests welcome.

Thanks

Many thanks to the following GitHub users for contributing code and/or ideas:

Initially developed at BuzzFeed.
