A word2vec negative sampling implementation with correct CBOW update.

... the Zen attitude is that words and truth are incompatible, or at least that no words can capture truth.

Douglas R. Hofstadter

kōan depends only on Eigen.

Authors: Ozan İrsoy, Adrian Benton, Karl Stratos

Thanks to Cyril Khazan for helping kōan better scale to many threads.

Rationale

Although continuous bag-of-words (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is commonly believed that SG embeddings perform better in practice. This was observed by the original authors of Word2Vec [1] and in subsequent work [2]. However, we found that popular implementations of word2vec with negative sampling, such as word2vec and gensim, do not implement the CBOW update correctly. This may have led to misconceptions about how well CBOW embeddings perform when trained correctly.
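As a rough illustration of the kind of correction involved, the sketch below performs one SGD step of CBOW with negative sampling on plain std::vector data. It assumes, based on our reading of the technical report cited below, that the fix concerns scaling the gradient sent to each context vector by 1/n when the hidden vector is the average of n context vectors. The function name cbow_step, the Vec alias, and the normalize flag are hypothetical and chosen for this example only; this is not kōan's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;  // hypothetical alias, for illustration only

// Logistic sigmoid.
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One SGD step of CBOW with negative sampling for a single training example.
//   ctx       : input (context) embeddings inside the window
//   out       : output embeddings; out[0] is the true center word,
//               out[1..] are the negative samples
//   lr        : learning rate
//   normalize : true  -> scale the context gradient by 1/n (assumed correction)
//               false -> leave it unscaled (assumed faulty behaviour)
void cbow_step(std::vector<Vec*>& ctx, std::vector<Vec*>& out,
               float lr, bool normalize) {
  assert(!ctx.empty() && !out.empty());
  const std::size_t dim = ctx[0]->size();
  const float inv_n = 1.0f / static_cast<float>(ctx.size());

  // Hidden vector h is the average of the context embeddings.
  Vec h(dim, 0.0f);
  for (const Vec* c : ctx)
    for (std::size_t d = 0; d < dim; ++d) h[d] += (*c)[d] * inv_n;

  // Accumulate dL/dh while updating the output embeddings.
  Vec grad_h(dim, 0.0f);
  for (std::size_t k = 0; k < out.size(); ++k) {
    Vec& o = *out[k];
    float dot = 0.0f;
    for (std::size_t d = 0; d < dim; ++d) dot += h[d] * o[d];
    const float label = (k == 0) ? 1.0f : 0.0f;
    const float g = sigmoid(dot) - label;  // dL/d(dot)
    for (std::size_t d = 0; d < dim; ++d) {
      grad_h[d] += g * o[d];   // uses o[d] before it is updated
      o[d] -= lr * g * h[d];
    }
  }

  // Since h = (1/n) * sum_i c_i, the chain rule gives dL/dc_i = (1/n) * dL/dh.
  const float scale = normalize ? inv_n : 1.0f;
  for (Vec* c : ctx)
    for (std::size_t d = 0; d < dim; ++d) (*c)[d] -= lr * scale * grad_h[d];
}
```

With normalize set to false, each context vector moves n times further than the chain rule prescribes, so the effective learning rate for context embeddings grows with the window size.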

We release kōan so that others can efficiently train CBOW embeddings using the corrected weight update. See this technical report for benchmarks of kōan vs. gensim word2vec negative sampling implementations. If you use kōan to learn word embeddings for your own work, please cite:

Ozan İrsoy, Adrian Benton, and Karl Stratos. "Corrected CBOW Performs as well as Skip-gram." The 2nd Workshop on Insights from Negative Results in NLP (2021).

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[2] Karl Stratos, Michael Collins, and Daniel Hsu. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1282–1291, 2015.

See here for kōan embeddings trained on the English cleaned common crawl corpus (C4).

Building

You need a compiler that supports C++17 to build koan (tested with g++ 7.5.0, 8.4.0, 9.3.0, and clang 11.0.3).

To build koan and all tests:

mkdir build
cd build
cmake ..
cmake --build ./

Run the tests (assuming you are still in the build directory):

./test_gradcheck
./test_utils

Installation

Installation is as simple as placing the koan binary on your PATH (you might need sudo):

cmake --install ./

Quick Start

To train word embeddings on Wikitext-2, first clone and build koan:

git clone --recursive git@github.com:bloomberg/koan.git
cd koan
mkdir build
cd build
cmake .. && cmake --build ./
cd ..

Download and unzip the Wikitext-2 corpus:

curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
unzip wikitext-2-v1.zip
head -n 5 ./wikitext-2/wiki.train.tokens

And learn CBOW embeddings on the training fold with:

./build/koan -V 2000000 \
             --epochs 10 \
             --dim 300 \
             --negatives 5 \
             --context-size 5 \
             -l 0.075 \
             --threads 16 \
             --cbow true \
             --min-count 2 \
             --file ./wikitext-2/wiki.train.tokens

To learn skipgram embeddings instead, run with --cbow false. Run ./build/koan --help for a full list of command-line arguments and their descriptions. Learned embeddings will be saved to embeddings_${CURRENT_TIMESTAMP}.txt in the present working directory.
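To use the saved embeddings from C++, a small loader along these lines may be a starting point. This README does not document the file layout, so the sketch assumes each line holds a word followed by whitespace-separated vector components; check the actual output file before relying on it. The function name load_embeddings is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// ASSUMPTION: every non-empty line looks like "<word> <v1> <v2> ... <vD>".
// The real layout of kōan's output file is not specified here and may differ.
std::unordered_map<std::string, std::vector<float>>
load_embeddings(std::istream& in) {
  std::unordered_map<std::string, std::vector<float>> table;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string word;
    if (!(fields >> word)) continue;  // skip blank lines
    std::vector<float> vec;
    float x;
    while (fields >> x) vec.push_back(x);  // read components until end of line
    table[word] = std::move(vec);
  }
  return table;
}
```

Passing a std::ifstream opened on the embeddings file to load_embeddings yields a word-to-vector map. Taking a std::istream rather than a file name keeps the function easy to exercise with an in-memory std::istringstream.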

License

Please read the LICENSE file.

Benchmarks

See the report for more details.
