stanfordnlp/GloVe

Stars: 6,867 (Rank 5,728, Top 0.2%)
Language: C
License: Apache License 2.0
Created about 9 years ago; updated about 1 year ago

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

GloVe: Global Vectors for Word Representation

Nearest neighbors of frog: Litoria, Leptodactylidae, Rana, Eleutherodactylus (pictures omitted)
GloVe geometry comparisons: man -> woman, city -> zip, comparative -> superlative (figures omitted)

We provide an implementation of the GloVe model for learning word representations, and describe how to download web-dataset vectors or train your own. See the project page or the paper for more information on GloVe vectors.

Download pre-trained word vectors

The links below contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files! Pre-trained word vectors are made available under the Public Domain Dedication and License.

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip [mirror]
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip [mirror]
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download): glove.6B.zip [mirror]
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d vectors, 1.42 GB download): glove.twitter.27B.zip [mirror]
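
Each unzipped file is plain text, with one token per line followed by its vector components separated by spaces. The sketch below shows one way to read that format in Python and compare words by cosine similarity. The helper names load_vectors and cosine and the 3-dimensional sample lines are made up for illustration; real files carry far more dimensions per word, and tokens that themselves contain spaces would need extra handling.

```python
import io
import math

def load_vectors(lines):
    """Parse GloVe-style text lines: a token followed by space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-in for a real vector file; the values are invented for the example.
sample = io.StringIO(
    "frog 0.9 0.1 0.0\n"
    "toad 0.8 0.2 0.1\n"
    "city 0.0 0.1 0.9\n"
)
vectors = load_vectors(sample)
sim_frog_toad = cosine(vectors["frog"], vectors["toad"])
sim_frog_city = cosine(vectors["frog"], vectors["city"])
print(sim_frog_toad > sim_frog_city)
```

To use a downloaded file instead of the sample, pass an open file handle, for example load_vectors(open(path, encoding="utf-8")), where path points at one of the extracted text files.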

Train word vectors on a new corpus

If the web datasets above don't match the semantics of your end use case, you can train word vectors on your own corpus.

$ git clone https://github.com/stanfordnlp/glove
$ cd glove && make
$ ./demo.sh

Make sure you have the following prerequisites installed when running the steps above:

GNU Make
GCC (Clang pretending to be GCC is fine)
Python and NumPy

The demo.sh script downloads a small corpus consisting of the first 100M characters of Wikipedia. It collects unigram counts, constructs and shuffles co-occurrence data, and trains a simple version of the GloVe model. It also runs a word analogy evaluation script in Python to verify word vector quality. For more details about training on your own corpus, read demo.sh or src/README.md.
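
To make the first two stages concrete, here is a toy Python sketch of collecting unigram counts and windowed co-occurrence counts. It is an illustration only, not the C implementation shipped in the repository: the function names, the window size of 2, and the 1/distance weighting are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def unigram_counts(tokens):
    """Count how often each token appears in the corpus."""
    return Counter(tokens)

def cooccurrence_counts(tokens, window=2):
    """Symmetric co-occurrence counts within a left context window.

    Each pair is weighted by 1/distance (an assumption for this sketch),
    so adjacent words contribute more than words two positions apart.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        for j in range(start, i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)

# Tiny made-up corpus standing in for the Wikipedia sample.
tokens = "the frog sat on the log".split()
vocab = unigram_counts(tokens)
cooc = cooccurrence_counts(tokens, window=2)
print(vocab["the"])
print(cooc[("frog", "the")])
```

Training then fits word vectors to these counts; that step is omitted here because it depends on details documented in demo.sh and src/README.md.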

License

All work contained in this package is licensed under the Apache License, Version 2.0. See the included LICENSE file.
