Approach based upon language model in Bengio et al ICML 09 "Curriculum Learning".

You will need my common python library:
    http://github.com/turian/common
and my textSNE wrapper for t-SNE:
    http://github.com/turian/textSNE

You will need Murmur for hashing.
    easy_install Murmur

To train a monolingual language model, probably you should run:
    [edit hyperparameters.language-model.yaml]
    ./build-vocabulary.py
    ./train.py

To train a word-to-word multilingual model, probably you should run:
    cd scripts; ln -s hyperparameters.language-model.sample.yaml hyperparameters.language-model.yaml

    # Create validation data:
    ./preprocess-validation.pl > ~/data/SemEval-2-2010/Task\ 3\ -\ Cross-Lingual\ Word\ Sense\ Disambiguation/validation.txt

    Tokenizer v3

    # [optional: Lemmatize]
    Tadpole --skip=tmp -t ~/dev/python/mt-language-model/neural-language-model/data/filtered-full-bilingual/en-nl/filtered-training.nl | perl -ne 's/\t/ /g; print lc($_);' | chop 3 | from-one-line-per-word-to-one-line-per-sentence.py > ~/dev/python/mt-language-model/neural-language-model/data/filtered-full-bilingual-lemmas/en-nl/filtered-training-lemmas.nl

    # [TODO:
    #   * Initialize using monolingual language model in source language.
    #   * Loss = logistic, not margin.
    # ]

    # [optional: Run the following if your alignment for language pair l1-l2
    # is in form l2-l1]
    ./scripts/preprocess/reverse-alignment.pl

    ./w2w/build-vocabulary.py
    # Then see the output with ./w2w/dump-vocabulary.py, to see if you want
    # to adjust the w2w minfreq hyperparameter
    ./w2w/build-target-vocabulary.py
    # Then see the output with ./w2w/dump-target-vocabulary.py

    ./w2w/build-initial-embeddings.py

    # [optional: Filter the corpora only to include sentences with certain
    # focus words.]
    # You want to make sure this happens AFTER
    # ./w2w/build-initial-embeddings.py, so you have good embeddings for words
    # that aren't as common in the filtered corpora.
    ./scripts/preprocess/filter-sentences-by-lemma.py
    # You should then move the filtered corpora to a new data directory.

    # [optional: This will cache all the training examples onto disk. This will
    # happen automatically during training anyhow.]
    ./scripts/w2w/build-example-cache.py

    ./w2w/train.py

TODO:
    * sqrt scaling of SGD updates
    * Use normalization of embeddings?
    * How do we initialize embeddings?
    * Use tanh, not softsign?
    * When doing SGD on embeddings, use sqrt scaling of embedding size?
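
The TODO items contrast a margin loss with a logistic loss, and softsign with tanh. The following is a minimal sketch of those alternatives, assuming a ranking setup in which the model assigns one score to an observed text window and another to a corrupted ("noise") window. The function names and the default margin of 1.0 are illustrative assumptions and are not taken from this repository's code.

```python
import math


def softsign(x):
    """Softsign activation: x / (1 + |x|), bounded in (-1, 1) like tanh."""
    return x / (1.0 + abs(x))


def margin_loss(score_correct, score_noise, margin=1.0):
    """Hinge ranking loss: zero once the correct window outscores
    the noise window by at least `margin` (assumed default: 1.0)."""
    return max(0.0, margin - score_correct + score_noise)


def logistic_loss(score_correct, score_noise):
    """Smooth alternative: log(1 + exp(-(score_correct - score_noise))).
    Written as a numerically stable softplus to avoid overflow."""
    diff = score_correct - score_noise
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # Compare both losses and both activations on a few hypothetical values.
    for correct, noise in [(2.0, 0.5), (0.0, 0.0), (-1.0, 1.0)]:
        print(correct, noise,
              margin_loss(correct, noise),
              logistic_loss(correct, noise))
    for x in [-3.0, 0.0, 1.0]:
        print(x, softsign(x), math.tanh(x))
```

The margin loss stops producing gradient once the score gap exceeds the margin, whereas the logistic loss keeps a small, nonzero penalty at every gap. Softsign saturates polynomially and tanh exponentially, so softsign approaches its bounds more slowly.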
textSNE
2-d visualization of high-dimensional input: Python code for rendering t-SNE code with text labels for each point
topia.termextract
Updates to Zope's keyphrase extractor (forked from 1.1.0)
crfchunking-with-wordrepresentations
Train a CRF for syntactic chunking (CoNLL2000), and use word representations
common
Common Python library, especially for text processing and controlling experimental runs
kea-service
KEA 5.0 (keyphrase extraction software), modified to be an XML-RPC service
pytextpreprocess
Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)
random-indexing-wordrepresentations
Induce word representations using random indexing (RI)
save-my-browser-tabs
Extension for Mozilla Firefox and Google Chrome to save all of your open tabs to a text file (window/tab index, URL and title of each tab)
stanford-pos-tagger-service
XML-RPC version of the Stanford POS tagger
common-scripts
Common scripts, mainly for text processing and experimental control
pyrandomprojection
Random projection library for Python, converting a dictionary to low-dimensional numpy matrix
donatefaces
Extract faces from video clips; generate training data for pose-invariant face features
py80legsformat
In Python, read the .80 file format, for 80legs web crawl results.
fatfreecrm-ec2
Deploy FatFree CRM on EC2
scikits.learn.recipes
Recipes for scikits.learn
batchtrain
Find the best model, using random hyperparameter optimization, using scikit-learn
parser-model
A neural network with a sparse input, for predicting decisions of a natural language syntax parser.
django-instantmessage
IM-like application for Pinax social networks (Django) that allows you to see which friends are online and chat with them
simple-twitter-similarity
Didactic example of information retrieval, computing the similarity of two twitter users
pytc-example
Example code for pytc (Python TokyoCabinet API)
osqa
OSQA branch, with some fixes
flickorpus
flickorpus collects an image and tag corpus from flickr.
biased-text-sample
Perform a biased sample of text data
pycrowdflower
Python code for accessing the CrowdFlower API
wikiprep-postprocess
Postprocess XML output from wikiprep (Wikipedia preprocessor) into JSON
query-classification-with-word-representations
KDDCup 2005 query classification with word representations
flann-1.2
Fork of FLANN 1.2, Fast Library for Approximate Nearest Neighbors
osqa-install-webfaction
Install OSQA on webfaction
wordrepresentations-hmm
HMM model for word representations, using the method of Huang + Yates (2009).
fabricrecipes
fabric recipes, primarily for deploying Ubuntu and EC2 instances.
doubleblind
Django project to do blind testing and figure out which of your friends post things you actually like
renderman-dexed-linux
Instructions for using the RenderMan Python API for controlling the Dexed FM synthesizer on Linux
sounder
Tinder for discovering music
search-autocomplete
JavaScript autocomplete, with MySQL/PHP backend
pyshortstringcompression
Compress short strings, using the Huffman algorithm.
audio-discrimination-crowdsource-batch
Batch processing for audio-discrimination-crowdsource
inverse-audio-synthesis
Inverse audio synthesis
language-model-linear
A neural language model, intended to produce embeddings for a linear classifier
pitch-detection-echonest
Pitch detection, for an audio file, using the Echonest remix API
soundcloudsampler
A widget to help you quickly sample soundcloud tracks.
python-SimpleXMLRPCServer-permissive
A permissive version of the Python SimpleXMLRPCServer, which can correct errant XML input from the client.
vworker-select-all-workers-firefox-extension
Firefox extension to select all workers in vWorker search results page
osqa-jsmath
jsMath support for OSQA
pycrunchbase
Python methods to interact with the Crunchbase API v1.
openl3_numpy_weights
OpenL3 audio model weights, in numpy format
transformer-fsd50k
HuBERT or wav2vec2 pretrained on FSD50K
lisadiary
A bliki (blog+wiki) compiler, inspired by ikiwiki
grab-wikipedia-abstracts
Grab all Wikipedia abstracts, in all languages
aucoder
writing-collaboration
An article about scientific collaboration
audio-discrimination-crowdsource
Web service to crowd-source audio discrimination data
datasciencepatterns
audiojnd
Audio pair JND
kinda-deep
Technical blog
sherlock-rest
A Django JSON REST API for Sherlock
embeddingcache
Retrieve text embeddings, but cache them locally if we have already computed them.
query-categorization-with-word-representations
KDDCup 2005 query classification with word representations
dx7render-docker
Render dx7 patches, dockerized
archivebox-render
ArchiveBox blueprint for Render
batch-elki-cluster
grokmusic
Grok your music collection, and save it into a persistent format.