  • Stars: 101
  • Rank 338,166 (Top 7 %)
  • Language: Python
  • Created over 9 years ago
  • Updated over 4 years ago

Repository Details

Uses a distributed word representation to find words along the hyperchord of two input words.

transorthogonal-linguistics

Travis Hoppe

If Heroku is running, check out the live demo (it may take 30 seconds to warm up):

https://transorthogonal-linguistics.herokuapp.com/

Introduction

Words rarely exist in a vacuum. To understand the meaning of the word cat, it's useful to know that it is (hypernym) an animal, that it is the same as (synonym) a feline, that a Tabby is a type of (hyponym) cat, and that in some reasonable sense it is the opposite (antonym) of a dog. Since words are connected in a rich network of linguistic information, why not (literally) follow that path and see where it takes us?

Instead of looking at a single word in isolation, this project tries to elucidate which words should lie between a start word and an end word.

Grouping words together is a classic problem in computational linguistics. Typical approaches use LSA, LSI, LDA or Pachinko allocation. Personally, I prefer Word2Vec, which was developed by some lovely engineers from Google. Partly because there exists an excellent port to Python via gensim, but mostly because it's awesome.

Word2Vec maps each word to a vector which, once normalized, is a point on a unit hypersphere. Words that are "close" on this sphere often share some kind of semantic relation. If we pick two words, say "boy" and "man", we can trace the shortest path that connects them. We parameterize this curve with a "time" t, where t=0 is at boy and t=1 is at man. Words that are close to this timeline are selected and ordered by their t value (i.e. the t at which they are closest to the connecting curve). In theory, this timeline should be a semantic map from one word to another, smoothly varying across meaning.
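
As an illustration of the idealized curve described above, the shortest path between two points on a unit hypersphere is a great-circle arc, which can be parameterized by spherical linear interpolation. The following is only a minimal sketch under that assumption; the helper name `slerp` and the toy 2-D vectors are hypothetical and are not taken from this repo.

```python
import numpy as np


def slerp(a, b, t):
    """Point at 'time' t on the great-circle arc from a (t=0) to b (t=1).

    Both inputs are normalized first, so the result lies on the unit sphere.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)

    # Angle between the two unit vectors; clip guards against rounding error.
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

    if np.isclose(omega, 0.0):
        return a  # same direction: the path is a single point
    if np.isclose(omega, np.pi):
        raise ValueError("antipodal vectors: the shortest path is not unique")

    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


# Toy stand-ins for the "boy" and "man" vectors (real ones would have 300 dimensions).
boy = np.array([1.0, 0.0])
man = np.array([0.0, 1.0])
midpoint = slerp(boy, man, 0.5)
```

Evaluating the curve at a given t is the easy part. Finding, for every word in the vocabulary, the t at which that word is nearest to the arc is the step that is harder to do efficiently.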

In practice, however, computing the true curve across the hypersphere is rather tricky, and numerically finding the nearest points efficiently is even harder. If we cheat a little, we can draw a straight line connecting the two points as an approximation to the curve. The problem then reduces to a fast linear algebra solution. Since we are moving across (trans) the orthogonal space spanned by word2vec's construction, we call this method transorthogonal linguistics.
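
The straight-line approximation can be sketched with a few NumPy operations: project every word vector onto the line through the start and end words, keep the words whose projection falls between the endpoints and whose distance to the line is small, then sort by t. This is a hedged illustration and not the code in this repo; the function name `word_path`, the `cutoff` parameter and its default, and the toy vocabulary are all assumptions made for the example.

```python
import numpy as np


def word_path(vectors, words, start, end, cutoff=0.5):
    """Return (word, t, distance) for words near the chord from start to end.

    vectors: (n_words, dim) array, one row per entry of `words`.
    cutoff:  hypothetical maximum distance from the chord for a word to be kept.
    """
    vectors = np.asarray(vectors, dtype=float)
    index = {w: i for i, w in enumerate(words)}
    a = vectors[index[start]]
    b = vectors[index[end]]
    d = b - a
    dd = d @ d
    if dd == 0.0:
        raise ValueError("start and end words have identical vectors")

    # t of the closest point on the line a + t * d, for every word at once.
    t = ((vectors - a) @ d) / dd
    closest = a + np.outer(t, d)
    dist = np.linalg.norm(vectors - closest, axis=1)

    # Keep words that project between the endpoints and lie near the chord.
    keep = np.flatnonzero((t >= 0.0) & (t <= 1.0) & (dist <= cutoff))
    keep = keep[np.argsort(t[keep], kind="stable")]
    return [(words[i], float(t[i]), float(dist[i])) for i in keep]


# Toy 2-D vocabulary; a real model would supply 300-dimensional vectors.
s = 0.5 ** 0.5
toy_words = ["boy", "teen", "man", "far"]
toy_vectors = np.array([[1.0, 0.0], [s, s], [0.0, 1.0], [-1.0, 0.0]])
path = word_path(toy_vectors, toy_words, "boy", "man")
```

In this toy setup "teen" projects to the middle of the chord, while "far" is dropped because it lies too far from the line.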

Data construction

The database contained within this repo was constructed from a full English dump of Wikipedia that was sentence- and word-tokenized with NLTK. Word2Vec training was done with a single pass, 300 dimensions, and a minimum vocabulary count of 800. These choices were found to give the best results while keeping the model small enough to query online reasonably quickly.
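
For readers who want to reproduce a similar model, the settings above might map onto gensim and NLTK roughly as follows. This is a sketch under assumptions, not the author's training script: the keyword names follow gensim 4.x (older releases used `size` and `iter`), reading "minimum vocabulary count" as gensim's `min_count` is an interpretation, and the helper names `tokenize` and `train` are hypothetical. The third-party imports are placed inside the functions so the parameter block can be read on its own.

```python
# Assumed mapping of the settings described above onto gensim 4.x keyword names.
WORD2VEC_PARAMS = {
    "vector_size": 300,  # 300 dimensions
    "min_count": 800,    # ignore words that occur fewer than 800 times
    "epochs": 1,         # a single pass over the corpus
}


def tokenize(text):
    """Split raw text into sentences, then each sentence into word tokens."""
    import nltk  # needs NLTK's sentence tokenizer data to be downloaded

    return [nltk.word_tokenize(sentence) for sentence in nltk.sent_tokenize(text)]


def train(sentences):
    """Train a Word2Vec model on an iterable of token lists."""
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **WORD2VEC_PARAMS)
```

With a minimum count of 800, a small test corpus would leave the vocabulary empty, so a sketch like this is only meaningful on a corpus of roughly Wikipedia scale.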

Command-line interface

python transorthogonal_linguistics/word_path.py boy man

Examples

With the input of boy and man we get:

boy to man

boy
- 
sixteen-year-old, orphan
teenager, girl, schoolgirl
youngster, shepherd, lad, kid
kitten, lonely, maid
beggar, policeman
prostitute, thug, villager, handsome, loner, thief, cop
gentleman, stranger, lady, Englishman, guy
-
woman
person
man

sun to moon

sun
sunlight, mist
glow, shine, clouds
skies, shines, shining, glare, moonlight, sky, darkness
shadows, heavens
horizon, crescent
earth, eclipses
constellations, comet, planets, orbits, orbiting, Earth, Io
Jupiter, planet, Venus, Pluto, Uranus, orbit
-
moons, lunar
moon

Other interesting examples:

girl woman
lover sinner
fate destiny
god demon
good bad
mind body
heaven hell
American Soviet
idea action
socialism capitalism
Marxism Stalinism
man machine
sustenance starvation
war peace
predictable idiosyncratic
acceptance uproar
