Natural Language Processing
Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.
Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.
Check out the companion blog post or the Go documentation page for full usage and examples.
Features
- LSA (Latent Semantic Analysis, also known as Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
- Fast comparison and retrieval of semantically similar documents using the SimHash (random hyperplanes/sign random projection) algorithm, with multi-index and Forest schemes for LSH (Locality Sensitive Hashing). These support fast, approximate cosine similarity/angular distance comparisons and approximate nearest neighbour search using significantly less memory and processing time.
- Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) over large, web-scale corpora.
- Latent Dirichlet Allocation (LDA) using a parallelised implementation of the fast SCVB0 (Stochastic Collapsed Variational Bayesian inference) algorithm for unsupervised topic extraction.
- PCA (Principal Component Analysis)
- TF-IDF weighting to discount frequently occurring words
- Sparse matrix implementations used for more efficient memory usage and processing over large document corpora.
- Stop word removal to filter out frequently occurring English words, e.g. "the", "and"
- Feature hashing ('the hashing trick') implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
- Similarity/distance measures to calculate the similarity/distance between feature vectors.
Planned
- Expanded persistence support
- Stemming to treat words with common root as the same e.g. "go" and "going"
- Clustering algorithms e.g. hierarchical, K-means, etc.
- Classification algorithms e.g. SVM, KNN, random forest, etc.
References
- Rosario, Barbara. "Latent Semantic Indexing: An Overview." INFOSYS 240, Spring 2000.
- Landauer, Thomas K. "Latent Semantic Analysis." Scholarpedia article on LSA, written by one of the creators of LSA.
- Thomo, Alex. "Latent Semantic Analysis (Tutorial)."
- "Latent Semantic Indexing." Stanford NLP course.
- Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms." In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC '02), 2002, p. 380.
- Bawa, M., Condie, T. and Ganesan, P. "LSH Forest: Self-Tuning Indexes for Similarity Search." In Proceedings of the 14th International Conference on World Wide Web (WWW '05), 2005, p. 651.
- Gionis, A., Indyk, P. and Motwani, R. "Similarity Search in High Dimensions via Hashing." In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), vol. 99, no. 1, 1999, pp. 518–529.
- Kanerva, Pentti, Kristoferson, Jan and Holst, Anders. "Random Indexing of Text Samples for Latent Semantic Analysis." 2000.
- Rangan, Venkat. "Discovery of Related Terms in a Corpus Using Reflective Random Indexing."
- Vasuki, Vidya and Cohen, Trevor. "Reflective Random Indexing for Semi-Automatic Indexing of the Biomedical Literature."
- QasemiZadeh, Behrang and Handschuh, Siegfried. "Random Indexing Explained with High Probability."
- Foulds, James, Boyles, Levi, Dubois, Christopher, Smyth, Padhraic and Welling, Max. "Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation." 2013.