  • Stars: 106
  • Rank: 323,925 (Top 7 %)
  • Language: Python
  • License: MIT License
  • Created about 10 years ago
  • Updated about 9 years ago

Repository Details

(Mental) maps of texts with kernel density estimation and force-directed networks.

Textplot

[Figure: War and Peace]

Textplot is a little program that converts a document into a network of terms, with the goal of teasing out information about the high-level topic structure of the text. For each term:

  1. Get the set of offsets in the document where the term appears.

  2. Using kernel density estimation, compute a probability density function (PDF) that represents the word's distribution across the document. For example, from War and Peace:

[Figure: War and Peace]

  3. Compute the Bray-Curtis dissimilarity between the term's PDF and the PDF of every other term in the document, and convert it to a similarity (1 minus the dissimilarity), so that a term scores 1.0 against itself. This measures the extent to which two words appear in the same locations.

  4. Sort this list in descending order to get a custom "topic" for the term. Skim off the top N words (usually 10-20) to get the strongest links. Here's "napoleon":

[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]
  5. Shovel all of these links into a network and export a GML file.
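
The five steps above can be sketched on a toy document. This is a minimal illustration, not Textplot's own code: the helper names (term_offsets, term_pdf, build_term_graph), the toy token list, and the small bandwidth and sample counts are all assumptions chosen so the example runs quickly. It uses scikit-learn's KernelDensity, SciPy's braycurtis, and NetworkX.

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import braycurtis
from sklearn.neighbors import KernelDensity


def term_offsets(tokens):
    """Step 1: map each term to the positions where it occurs."""
    offsets = {}
    for position, term in enumerate(tokens):
        offsets.setdefault(term, []).append(position)
    return offsets


def term_pdf(positions, doc_length, bandwidth=5.0, samples=50):
    """Step 2: estimate the term's density at evenly spaced sample points."""
    data = np.asarray(positions, dtype=float).reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(data)
    x_axis = np.linspace(0, doc_length, samples).reshape(-1, 1)
    density = np.exp(kde.score_samples(x_axis))  # score_samples returns log-density
    return density / density.sum()  # normalize so the sampled values sum to 1


def build_term_graph(tokens, skim_depth=2):
    """Steps 3-5: score term pairs, keep the strongest links, build a graph."""
    offsets = term_offsets(tokens)
    pdfs = {term: term_pdf(pos, len(tokens)) for term, pos in offsets.items()}
    graph = nx.Graph()
    for term, pdf in pdfs.items():
        # Similarity = 1 - Bray-Curtis dissimilarity (1.0 means identical PDFs).
        scores = {
            other: 1 - braycurtis(pdf, pdfs[other])
            for other in pdfs
            if other != term
        }
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:skim_depth]
        for other, weight in top:
            graph.add_edge(term, other, weight=float(weight))
    return graph


# Toy document (made up): one "topic" in the first half, another in the second.
tokens = ("war army troops " * 10 + "ball dance music " * 10).split()
graph = build_term_graph(tokens)

# Serialize to GML text; nx.write_gml(graph, path) would write a file instead.
gml_text = "\n".join(nx.generate_gml(graph))
```

In this toy case the terms from each half of the document end up linked only to each other, which is the clustering effect the network is meant to expose.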

Generating graphs

There are two ways to create graphs - you can use the textplot executable from the command line, or, if you want to tinker around with the underlying NetworkX graph instance, you can fire up a Python shell and use the build_graph() helper directly.

Either way, first install Textplot. With PyPI:

pip install textplot

Or, clone the repo and install the package manually:

python3 -m venv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install

From the command line

Then, from the command line, generate graphs with:

textplot generate [IN_PATH] [OUT_PATH] [OPTIONS]

Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with War and Peace:

textplot generate war-and-peace.txt war-and-peace.gml
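
A GML file can be loaded back into NetworkX for inspection. The sketch below writes a small stand-in file first so that it is self-contained; the terms and weights are made up, and the assumption that edges carry a "weight" attribute should be checked against a real Textplot export.

```python
import os
import tempfile

import networkx as nx

# Stand-in for a file produced by the generate command (illustrative data only).
toy = nx.Graph()
toy.add_edge("napoleon", "war", weight=0.65)
toy.add_edge("napoleon", "army", weight=0.60)

path = os.path.join(tempfile.mkdtemp(), "toy.gml")
nx.write_gml(toy, path)

# Read the file back; node names come from the GML "label" field.
graph = nx.read_gml(path)
for source, target, data in graph.edges(data=True):
    print(source, target, data["weight"])
```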

The generate command takes these options:

  • --term_depth=1000 (int) - The number of terms to include in the network. For now, Textplot takes the top N most frequent terms, after stopwords are removed.

  • --skim_depth=10 (int) - The number of connections (edges) to skim off the top of the "topics" computed for each word.

  • --d_weights (flag) - By default, terms that appear in similar locations in the document will be connected by edges with "heavy" weights, the semantics expected by force-directed layout algorithms like Force Atlas 2 in Gephi. If this flag is passed, the weights will be inverted - use this if you want to do any kind of pathfinding analysis on the graph, where it's generally assumed that edge weights represent distance or cost.

  • --bandwidth=2000 (int) - The bandwidth for the kernel density estimation. This controls the "smoothness" of the curve. 2000 is a sensible default for long novels, but bump it down if you're working with shorter texts.

  • --samples=1000 (int) - The number of equally-spaced points on the X-axis where the kernel density is sampled. 1000 is almost always enough, unless you're working with a huge document.

  • --kernel=gaussian (str) - The kernel function. The scikit-learn implementation also supports tophat, epanechnikov, exponential, linear, and cosine.
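
To see why inverted weights matter for pathfinding, here is a small sketch on a made-up graph. The inversion rule used here (distance = 1 - similarity) is an assumption for illustration; Textplot's own inversion may differ.

```python
import networkx as nx

# Toy similarity graph: heavier weight means the terms co-occur more closely.
similarity = nx.Graph()
similarity.add_weighted_edges_from([
    ("napoleon", "war", 0.9),
    ("war", "peace", 0.8),
    ("napoleon", "peace", 0.2),
])

# Assumed inversion: treat (1 - similarity) as a distance or cost.
distance = nx.Graph()
for u, v, w in similarity.edges(data="weight"):
    distance.add_edge(u, v, weight=1 - w)

# On similarity weights, the weak direct link looks "cheapest".
naive = nx.shortest_path(similarity, "napoleon", "peace", weight="weight")

# On distance weights, the path follows the strong links instead.
path = nx.shortest_path(distance, "napoleon", "peace", weight="weight")
print(naive, path)
```

With similarity weights, the shortest-path routine prefers the weakest association because it reads weight as cost. After inversion, it routes through the strongly associated intermediate term.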

From a Python shell

Or, fire up a Python shell and import build_graph() directly:

In [1]: from textplot.helpers import build_graph

In [2]: g = build_graph('war-and-peace.txt')

Tokenizing text...
Extracted 573064 tokens

Indexing terms:
[################################] 124750/124750 - 00:00:06

Generating graph:
[################################] 500/500 - 00:00:03

build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an instance of networkx.Graph. For example, to get degree centralities:

In [3]: import networkx as nx
In [4]: nx.degree_centrality(g.graph)
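
nx.degree_centrality() returns a dict mapping each node to its centrality, so the most connected terms can be ranked with a sort. The sketch below uses a small made-up graph as a stand-in for g.graph.

```python
import networkx as nx

# Stand-in for g.graph; the terms and edges are illustrative only.
graph = nx.Graph()
graph.add_edges_from([
    ("napoleon", "war"),
    ("napoleon", "army"),
    ("napoleon", "russia"),
    ("war", "army"),
])

# Degree centrality = node degree divided by (number of nodes - 1).
centrality = nx.degree_centrality(graph)
ranked = sorted(centrality.items(), key=lambda item: item[1], reverse=True)

for term, score in ranked:
    print(term, round(score, 3))
```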

Textplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.
