  • Stars
    183
  • Rank 208,936 (Top 5 %)
  • Language
    Python
  • Created about 12 years ago
  • Updated over 4 years ago

Repository Details

Worked examples from the NLTK Book

nltk-examples

src/book/

Worked examples from the NLTK Book

src/cener/

A Consumer Electronics Named Entity Recognizer - uses an NLTK Maximum Entropy Classifier and IOB tags to train and predict Consumer Electronics named entities in text.
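The IOB encoding mentioned above can be illustrated with a minimal sketch. It only shows how entity spans become per-token tags that a classifier could then be trained on; the `to_iob` helper, the `PRODUCT` label, and the example sentence are assumptions made for illustration, not code or data from this repository.

```python
def to_iob(tokens, spans):
    """Convert entity spans into IOB tags.

    spans holds (start, end, label) token offsets, with end exclusive.
    """
    tags = ["O"] * len(tokens)              # O = outside any entity
    for start, end, label in spans:
        tags[start] = "B-" + label          # B = first token of an entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # I = continuation of the entity
    return list(zip(tokens, tags))


tokens = "I bought a Canon PowerShot camera yesterday".split()
# Hypothetical annotation: tokens 3 and 4 form one PRODUCT entity.
tagged = to_iob(tokens, [(3, 5, "PRODUCT")])
for token, tag in tagged:
    print(token, tag)
```

Each (token, tag) pair, together with features of the surrounding tokens, would then serve as one training instance for the classifier.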

src/sameword/

A simple tool to detect word equivalences using WordNet. Reads a TSV file of word pairs and returns the original (LHS) word if the two words don't have the same meaning. Useful (at least in my case) for checking the results of a set of regular expressions that convert words from British to American spellings and Greek/Latin plurals to their singular form (both based on patterns).

src/genetagger/

A Named Entity Recognizer for Genes - uses NLTK's HMM package to build an HMM tagger to recognize Gene names from within English text.

src/langmodel/

A trigram backoff language model trained on medical XML documents, and used to estimate the normalized log probability of an unknown sentence.
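A rough sketch of the idea follows, under loud assumptions: it uses a fixed backoff factor (in the style of "stupid backoff") instead of a properly discounted scheme such as Katz backoff, so the scores are not true probabilities. The class name, the `alpha` value, and the toy sentences are all hypothetical; the repository's own model and its XML handling are not reproduced here.

```python
import math
from collections import Counter


class BackoffTrigramLM:
    """Toy trigram model that backs off to bigrams, then unigrams."""

    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha  # assumed fixed backoff penalty
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.bi_ctx, self.tri_ctx = Counter(), Counter()
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            for i in range(2, len(toks)):
                h2, h1, w = toks[i - 2], toks[i - 1], toks[i]
                self.tri[(h2, h1, w)] += 1
                self.tri_ctx[(h2, h1)] += 1
                self.bi[(h1, w)] += 1
                self.bi_ctx[h1] += 1
                self.uni[w] += 1
        self.total = sum(self.uni.values())
        self.vocab = len(self.uni) + 1  # one extra slot for unseen words

    def score(self, h2, h1, w):
        if self.tri[(h2, h1, w)]:
            return self.tri[(h2, h1, w)] / self.tri_ctx[(h2, h1)]
        if self.bi[(h1, w)]:
            return self.alpha * self.bi[(h1, w)] / self.bi_ctx[h1]
        # Add-one smoothed unigram, so unseen words never score zero.
        return self.alpha ** 2 * (self.uni[w] + 1) / (self.total + self.vocab)

    def normalized_logprob(self, sent):
        toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
        log_sum = sum(math.log(self.score(toks[i - 2], toks[i - 1], toks[i]))
                      for i in range(2, len(toks)))
        return log_sum / (len(toks) - 2)  # average per predicted token


train = [["the", "patient", "has", "fever"],
         ["the", "patient", "has", "cough"]]
lm = BackoffTrigramLM(train)
seen = lm.normalized_logprob(["the", "patient", "has", "fever"])
unseen = lm.normalized_logprob(["fever", "the", "has", "patient"])
print(seen, unseen)
```

Dividing by the number of predicted tokens is what makes scores comparable across sentences of different lengths.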

src/docsim/

A proof of concept for calculating inter-document similarities for a collection of text documents for a cheating detection system. Contains an implementation of SCAM (Standard Copy Analysis Mechanism) to detect possible near-duplicate documents.
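For orientation, here is a minimal sketch of a SCAM-style similarity measure as I understand the relative frequency model: only words whose counts are "close" in both documents contribute to the overlap, and the score is the larger of the two asymmetric subset measures. Uniform word weights, the `epsilon` value, and the function name are assumptions; this is not the implementation in this directory.

```python
from collections import Counter


def scam_similarity(doc1, doc2, epsilon=2.5):
    """SCAM-style similarity between two token lists, in [0, 1]."""
    f1, f2 = Counter(doc1), Counter(doc2)
    # Closeness set: shared words with similar frequencies in both documents.
    close = [w for w in f1.keys() & f2.keys()
             if epsilon - (f1[w] / f2[w] + f2[w] / f1[w]) > 0]
    overlap = sum(f1[w] * f2[w] for w in close)

    def subset(freqs):
        norm = sum(c * c for c in freqs.values())
        return overlap / norm if norm else 0.0

    # Taking the max lets the measure flag one document contained in another.
    return min(1.0, max(subset(f1), subset(f2)))


a = "the cat sat on the mat".split()
b = "the cat sat".split()
print(scam_similarity(a, a))               # identical documents
print(scam_similarity(a, b))               # partial overlap
print(scam_similarity(a, "dogs bark".split()))  # no shared words
```

Unlike cosine similarity, the asymmetric subset measure can score a short document copied into a much longer one as highly similar.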

src/phrases/

A proof of concept to identify significant word collocations as phrases from about an hour's worth of messages from the Twitter 1% feed, scored by the log-likelihood ratio of the probability that the words are dependent vs. the probability that they are independent. Based on the approach described in "Building Search Applications: Lucene, LingPipe and GATE" by Manu Konchady, but extended to handle N-grams of any size.
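A minimal sketch of the scoring step, restricted to bigrams: it computes the log-likelihood ratio (G²) statistic from a 2x2 contingency table of observed vs. expected counts. The function names and the toy sentence are assumptions, and the extension to larger N-grams described above is not shown.

```python
import math
from collections import Counter


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / n  # count under independence
                total += o * math.log(o / expected)
    return 2 * total


def bigram_llr(tokens):
    """Score every bigram in tokens; higher means stronger association."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), c in bigrams.items():
        first[a] += c
        second[b] += c
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11     # a followed by something other than b
        k21 = second[b] - k11    # b preceded by something other than a
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = g2(k11, k12, k21, k22)
    return scores


scores = bigram_llr("new york is big and new york is old".split())
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Note that G² measures deviation from independence in either direction, so a real pipeline would typically also filter by frequency or check that the observed count exceeds the expected one.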

src/medorleg

A trigram interpolated model trained on medical and legal sentences, and used to classify a sentence as one of the two genres.
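The classification idea can be sketched as follows: train one interpolated trigram model per genre, then assign a sentence to the genre whose model gives it the higher log probability. The interpolation weights, class and function names, and the toy training sentences are all assumptions for illustration; the actual weights and data used here are not shown.

```python
import math
from collections import Counter


class InterpolatedTrigramLM:
    """Linear interpolation of unigram, bigram and trigram estimates."""

    def __init__(self, sentences, lambdas=(0.1, 0.3, 0.6)):
        self.l1, self.l2, self.l3 = lambdas  # assumed weights, sum to 1
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.bi_ctx, self.tri_ctx = Counter(), Counter()
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            for i in range(2, len(toks)):
                h2, h1, w = toks[i - 2], toks[i - 1], toks[i]
                self.tri[(h2, h1, w)] += 1
                self.tri_ctx[(h2, h1)] += 1
                self.bi[(h1, w)] += 1
                self.bi_ctx[h1] += 1
                self.uni[w] += 1
        self.total = sum(self.uni.values())
        self.vocab = len(self.uni) + 1  # one extra slot for unseen words

    def prob(self, h2, h1, w):
        # Add-one smoothed unigram keeps the mixture strictly positive.
        p1 = (self.uni[w] + 1) / (self.total + self.vocab)
        p2 = self.bi[(h1, w)] / self.bi_ctx[h1] if self.bi_ctx[h1] else 0.0
        ctx = self.tri_ctx[(h2, h1)]
        p3 = self.tri[(h2, h1, w)] / ctx if ctx else 0.0
        return self.l1 * p1 + self.l2 * p2 + self.l3 * p3

    def logprob(self, sent):
        toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
        return sum(math.log(self.prob(toks[i - 2], toks[i - 1], toks[i]))
                   for i in range(2, len(toks)))


def classify(sent, models):
    """Return the label of the model that scores the sentence highest."""
    return max(models, key=lambda label: models[label].logprob(sent))


models = {
    "medical": InterpolatedTrigramLM([
        ["the", "patient", "was", "given", "aspirin"],
        ["the", "patient", "has", "a", "fever"]]),
    "legal": InterpolatedTrigramLM([
        ["the", "court", "denied", "the", "motion"],
        ["the", "defendant", "filed", "a", "motion"]]),
}
print(classify(["the", "patient", "has", "aspirin"], models))
print(classify(["the", "court", "denied", "a", "motion"], models))
```

In practice the interpolation weights would be tuned on held-out data, not fixed by hand as they are in this sketch.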

src/medorleg2

Uses the same training data as medorleg, but uses Scikit-Learn's text API and LinearSVC implementations to build a classifier that predicts the genre of an unseen sentence.

Also contains an ARFF writer to convert the X and y matrices to ARFF format for consumption by WEKA. This was done so we could reuse Scikit-Learn's text processing pipeline to build a WEKA model, which could then be used directly from within a Java based data pipeline.
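To give a sense of what such a writer produces, here is a minimal sketch that serializes a small dense feature matrix and its labels in ARFF. It assumes plain Python lists in place of Scikit-Learn's sparse matrices, and the function name, relation name, and feature names are hypothetical.

```python
def to_arff(X, y, feature_names, class_labels, relation="genre"):
    """Serialize a dense feature matrix X and label vector y as ARFF text."""
    lines = ["@RELATION " + relation, ""]
    for name in feature_names:
        # Names containing spaces would need quoting; not handled here.
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines.append("@ATTRIBUTE class {%s}" % ",".join(class_labels))
    lines += ["", "@DATA"]
    for row, label in zip(X, y):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines) + "\n"


X = [[2, 0], [0, 2]]                  # toy term counts per sentence
y = ["medical", "legal"]
arff_text = to_arff(X, y, ["patient", "court"], ["medical", "legal"])
print(arff_text)
```

For a large vocabulary, ARFF's sparse row syntax would likely be a better fit than the dense rows written here.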

src/brown_dict

Uses a POS-tagged (Brown) corpus to build a dictionary of words and their sense frequencies, and a chunked (Penn Treebank subset) corpus to build a reference set of POS sequences and POS state transitions. Together these allow context-free POS tagging of standalone words and phrase type detection of standalone phrases.

src/topicmodel

Topic modeling the PHR corpus with gensim. More information in these posts:

src/stlclust

Using DBSCAN to cluster section titles in clinical notes.

src/semantic

Python/NLTK implementation of the algorithm described in the paper "Sentence Similarity Based on Semantic Nets and Corpus Statistics" by Li et al.

src/drug_ner

Drug name NER using a one-class classification approach. Only a positive training set (drug name n-grams) is provided, along with an unlabelled dataset and an estimate of the proportion of positive data. More information on my blog post: Classification with Positive Examples only.

src/similar-tweets-nmslib

More information on my blog post: Finding Similar Tweets with BERT and NMSLib.

src/entity_graph

More information on my blog post: Entity Co-occurrence graphs as Mind Map.

More Repositories

1. statlearning-notebooks (376 stars): Python notebooks for exercises covered in Stanford statlearning class (where exercises were in R).
2. eeap-examples (Jupyter Notebook, 259 stars): Code for Document Similarity on Reuters dataset using Encode, Embed, Attend, Predict recipe
3. dl-models-for-qa (Python, 237 stars): Keras DL models to answer 8th grade science multiple choice questions (Kaggle AllenAI competition).
4. holiday-similarity (Jupyter Notebook, 103 stars): Finding similar images in the Holidays dataset
5. fttl-with-keras (Jupyter Notebook, 83 stars): Transfer Learning and Fine Tuning for Cross Domain Image Classification with Keras
6. ner-re-with-transformers-odsc2022 (Jupyter Notebook, 47 stars): Building NER and RE components using HuggingFace Transformers
7. hia-examples (Java, 39 stars): Hadoop In Action Examples
8. mlia-examples (Python, 39 stars): Python and R Examples
9. pytorch-gnn-tutorial-odsc2021 (Jupyter Notebook, 36 stars): Repository for GNN tutorial using Pytorch and Pytorch Geometric (PyG) for ODSC 2021
10. reuters-docsim (Python, 28 stars): Different approaches to computing document similarity
11. mia-scala-examples (Scala, 26 stars): Mahout Examples
12. keras-tutorial-odsc2020 (Jupyter Notebook, 26 stars): Notebooks for Keras Tutorial presented at ODSC West 2020
13. ltr-examples (Jupyter Notebook, 16 stars): Supporting code for Learning to Rank (LTR) presentation
14. polydlot (Jupyter Notebook, 16 stars): My attempt to learn more than one Deep Learning framework
15. intro-dl-talk-code (Jupyter Notebook, 14 stars): Jupyter notebooks and code for Intro to DL talk at Genesys
16. scalcium (Scala, 10 stars): Scala NLP Algorithms
17. solr4-extras (Scala, 10 stars): Random solr4 customizations
18. delsym (Scala, 10 stars): An actor based content ingestion pipeline
19. nlp-graph-examples (Jupyter Notebook, 10 stars): Examples for Graphorum 2019 presentation -- Graph Techniques for Natural Language Processing
20. esc (Scala, 9 stars): Scala client for ElasticSearch
21. deeplearning-ai-examples (Jupyter Notebook, 8 stars)
22. bpwj (Java, 8 stars): Java Parser Development Framework from Steven Metsker's book "Building Parsers With Java"
23. thinkstats-examples (Jupyter Notebook, 8 stars): Worked examples for exercises in Think Stats using the Scientific Python stack.
24. saturn-scispacy (Jupyter Notebook, 8 stars): SaturnCloud notebooks to extract annotations from CORD-19 dataset using SciSpacy pretrained models
25. llm-rag-eval (Python, 8 stars): Large Language Model (LLM) powered evaluator for Retrieval Augmented Generation (RAG) pipelines.
26. content-engineering-tutorial (Jupyter Notebook, 7 stars)
27. vespa-poc (Python, 7 stars): Small Proof of Concept to familiarize myself with Vespa.ai functionality
28. neural-re-experiments (Jupyter Notebook, 5 stars)
29. kg-aligned-entity-linker (Jupyter Notebook, 5 stars): Knowledge Graph Aligned Entity Linker using BERT and Sentence Transformers
30. bayesian-stats-examples (Jupyter Notebook, 4 stars): Python versions of things taught in the Bayesian Statistics courses on Coursera
31. neurips-papers-node2vec (Jupyter Notebook, 3 stars)
32. tgni (Java, 3 stars): Experimental NER techniques to address common (for me) text analysis problems.
33. snorkel-pytorch-lstm-gpu (Python, 2 stars): Code for my GPU port of Snorkel's Pytorch discriminative model (LSTM)
34. compmethods-notebooks (2 stars): Python Notebooks for the Computational Methods for Data Analysis course on Coursera.
35. claimintel (Scala, 1 star): Descriptive Stats on Claims Data
36. misc-docs (1 star): Account for storing miscellaneous text files for sharing
37. spark-data-algorithms (1 star): Implementations of common data algorithms in Spark
38. sherpa (Python, 1 star): Django based web application to help with organizing a conference (summit)
39. pytorch-drl-examples (1 star): Reimplementation of Deep Reinforcement Learning examples from "Deep Reinforcement Learning with Python" by Sudharsan Ravichandran