transorthogonal-linguistics
If the Heroku app is running, check out the live demo (it may take 30 seconds to warm up):
https://transorthogonal-linguistics.herokuapp.com/
Introduction
Words rarely exist in a vacuum. To understand the meaning of the word cat, it's useful to know that it is (hypernym) an animal, that it is the same as (synonym) a feline, that a Tabby is a type of (hyponym) cat, and that in some reasonable sense it is the opposite (antonym) of a dog. Since words are connected in a rich network of linguistic information, why not (literally) follow that path and see where it takes us?
Instead of looking at a single word in isolation, this project tries to elucidate which words lie between a start word and an end word.
Grouping words together is a classic problem in computational linguistics. Typical approaches use LSA, LSI, LDA, or Pachinko allocation. Personally, I prefer Word2Vec, which was developed by some lovely engineers from Google; partly because there is an excellent Python port via gensim, but mostly because it's awesome.
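For context, querying a trained model with gensim looks roughly like this (a minimal sketch using the gensim 4 API; the model file name is a placeholder, not something shipped with this repo):

```python
# Nearest neighbors on the unit hypersphere, ranked by cosine similarity.
# "wikipedia_word2vec.kv" is a hypothetical trained-model file.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("wikipedia_word2vec.kv")
print(vectors.most_similar("cat", topn=5))
```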
Word2Vec maps each word to a point on a unit hypersphere. Words that are "close" on this sphere often share some kind of semantic relation. If we pick two words, say "boy" and "man", we can trace the shortest path that connects them. We parameterize this curve with a "time" t, where t=0 at boy and t=1 at man. Words that are close to this timeline are selected and ordered by their t value (i.e. the t at which they are closest to the connecting curve). In theory, this timeline should be a semantic map from one word to another -- smoothly varying across meaning.
In practice, however, it turns out that computing the true curve across the hypersphere is rather tricky, and it's even harder to find the nearest points efficiently. If we cheat a little, though, we can draw a straight line connecting the two points as an approximation to the curve. The problem then reduces to a fast linear-algebra solution. Since we are moving across (trans) the orthogonal space spanned by the word2vec construction, we call this method transorthogonal linguistics.
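A minimal sketch of that straight-line approximation, assuming a trained gensim model (gensim 4 API; the model file name, the word_path helper, and the n_words parameter are illustrative, not this repo's actual implementation):

```python
# Project every vocabulary word onto the chord between two words,
# keep the words nearest the line, and order them by their t value.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("wikipedia_word2vec.kv")  # hypothetical file

def word_path(start, end, n_words=50):
    # Unit vectors for the two endpoint words
    u = vectors[start] / np.linalg.norm(vectors[start])
    v = vectors[end] / np.linalg.norm(vectors[end])

    # All word vectors, normalized onto the unit hypersphere
    W = vectors.vectors / np.linalg.norm(vectors.vectors, axis=1, keepdims=True)

    # "Time" parameter t: where each word projects onto the line u -> v
    d = v - u
    t = (W - u) @ d / (d @ d)

    # Distance from each word to its closest point on the line
    closest = u + np.outer(t, d)
    dist = np.linalg.norm(W - closest, axis=1)

    # Keep the nearest words that fall between the endpoints (0 <= t <= 1),
    # then order them by t
    mask = (t >= 0) & (t <= 1)
    idx = [i for i in np.argsort(dist) if mask[i]][:n_words]
    idx.sort(key=lambda i: t[i])
    return [(vectors.index_to_key[i], float(t[i])) for i in idx]

print(word_path("boy", "man"))
```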
Data construction
The database contained within this repo was constructed from a full English dump of Wikipedia, sentence- and word-tokenized with NLTK. Word2Vec training used a single pass, 300 dimensions, and a minimum vocabulary count of 800. These choices gave good results while keeping the model small enough to query online reasonably quickly.
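The training step might look something like the following (a rough sketch with the gensim 4 API, assuming the dump has already been flattened to one plain-text file; the file names are placeholders):

```python
# Tokenize the corpus with NLTK, then train Word2Vec as described above.
# In practice the corpus should be streamed rather than read into memory.
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

with open("enwiki_plaintext.txt") as f:
    sentences = [word_tokenize(s) for s in sent_tokenize(f.read())]

model = Word2Vec(
    sentences,
    vector_size=300,  # 300 dimensions
    min_count=800,    # 800 minimum vocabulary count
    epochs=1,         # a single pass
)
model.wv.save("wikipedia_word2vec.kv")
```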
Command-line interface
python transorthogonal_linguistics/word_path.py boy man
Examples
With the input of boy and man we get:

boy to man
boy
-
sixteen-year-old, orphan
teenager, girl, schoolgirl
youngster, shepherd, lad, kid
kitten, lonely, maid
beggar, policeman
prostitute, thug, villager, handsome, loner, thief, cop
gentleman, stranger, lady, Englishman, guy
-
woman
person
man
sun to moon
sun
sunlight, mist
glow, shine, clouds
skies, shines, shining, glare, moonlight, sky, darkness
shadows, heavens
horizon, crescent
earth, eclipses
constellations, comet, planets, orbits, orbiting, Earth, Io
Jupiter, planet, Venus, Pluto, Uranus, orbit
-
moons, lunar
moon
Other interesting examples:
girl woman
lover sinner
fate destiny
god demon
good bad
mind body
heaven hell
American Soviet
idea action
socialism capitalism
Marxism Stalinism
man machine
sustenance starvation
war peace
predictable idiosyncratic
acceptance uproar