  • Stars: 925
  • Rank 49,060 (Top 1.0%)
  • Language: Jupyter Notebook
  • Created over 1 year ago
  • Updated 5 months ago

Repository Details

A deep dive into embeddings starting from fundamentals

What are embeddings?

This repository contains the generated LaTeX document, website, and complementary notebook code for "What are Embeddings".

Abstract

Over the past decade, embeddings, numerical representations of non-tabular machine learning features used as input to deep learning models, have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, these traditional approaches were limited in the amount of context they could reason about as the amount of data increased. As the volume, velocity, and variety of data captured by modern applications have exploded, creating approaches specifically tailored to scale has become increasingly important.
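
To make the "traditional approaches" mentioned above concrete, here is a minimal sketch of one-hot encoding and TF-IDF in plain Python. It is an illustration only and is not taken from the repository's notebooks: the toy corpus and the helper names `one_hot` and `tf_idf` are assumptions, and the TF-IDF shown is the simplest unsmoothed variant (term frequency times `log(N / df)`).

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration (not from the paper).
docs = [
    "the bird sings",
    "the bird flies",
    "the dog barks",
]
tokenized = [doc.split() for doc in docs]

# Fixed, sorted vocabulary: one vector dimension per distinct term.
vocab = sorted({token for doc in tokenized for token in doc})


def one_hot(token):
    """Return a vector with a single 1 at the token's vocabulary index."""
    return [1 if token == term else 0 for term in vocab]


def tf_idf(doc_tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc_tokens)
    n_docs = len(tokenized)
    vector = []
    for term in vocab:
        tf = counts[term] / len(doc_tokens)
        df = sum(1 for doc in tokenized if term in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector


bird = one_hot("bird")
dog = one_hot("dog")
# Any two distinct one-hot vectors are orthogonal, so they carry
# no notion of similarity between the words they stand for.
dot = sum(a * b for a, b in zip(bird, dog))

weights = tf_idf(tokenized[0])  # weights for "the bird sings"
```

In this sketch, "the" appears in every document and gets a weight of zero, while "sings" outweighs "bird" because it appears in fewer documents. Neither representation records word order or meaning, which is the limitation in context that the abstract refers to.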

Google's Word2Vec paper was an important step in moving from simple statistical representations to the semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.

Running

The LaTeX document is written in Overleaf and deployed to GitHub, where it's compiled via Actions. The site is likewise generated via Actions from the site branch. The notebooks are flying fast and free and not under any kind of CI whatsoever.

Contributing

If you have any changes that you'd like to make to the document, including clarifications or typo fixes, you'll need to build the LaTeX artifact. I use GitHub to track issues and feature requests, as well as to accept pull requests. Pull requests are the best way to propose changes to the codebase:

  1. Fork the repo and create your branch from main.
  2. Make your changes in your fork.
  3. Make sure that your LaTeX document compiles. The GitHub Action that builds the PDF is set to run on pull requests into main.
  4. Ensure that the document compiles to a PDF correctly and inspect the output.
  5. Make sure your code lints.
  6. Issue that pull request!

Citing

@software{Boykis_What_are_embeddings_2023,
author = {Boykis, Vicki},
doi = {10.5281/zenodo.8015029},
month = jun,
title = {{What are embeddings?}},
url = {https://github.com/veekaybee/what_are_embeddings},
version = {1.0.1},
year = {2023}
}

More Repositories

  1. viberary: Good books, good vibes (Jupyter Notebook, 411 stars)
  2. textedit: A super-mini Python text editor (Python, 78 stars)
  3. soviet-art-bot: A bot that finds tweets socialist realism paintings. v. 0.20 (Python, 71 stars)
  4. favorite_essays: Updating list of favorite internet essays (47 stars)
  5. til: Today I Learned Some Computer Stuff (39 stars)
  6. hustlr: A web app for HN hustlers (HTML, 29 stars)
  7. boringml: Boring ML Generated Site (19 stars)
  8. data: Scripts to manipulate data (Python, 15 stars)
  9. datascientistwiki: Wiki of links and data science resources started in datascientists.slack.com (14 stars)
  10. markovhn: Creating Markov chain-generated Hacker News headlines with Python (Python, 12 stars)
  11. venti-pytorch: Model for serving venti (Python, 10 stars)
  12. caffeine: A tiny, simple Java static site generator (HTML, 8 stars)
  13. venti (Python, 8 stars)
  14. gandinsky: Fooling around with neural nets and art (Jupyter Notebook, 7 stars)
  15. intro-to-sql: Girl Develop It Intro to SQL (JavaScript, 7 stars)
  16. swedish-house-ml: A project examining the relationship between nudity in cover art and social media response to music (Jupyter Notebook, 6 stars)
  17. data-lake-talk: Slides and code for Data Philly Data Lake Talk (JavaScript, 6 stars)
  18. ml-garden: Personal Learning Mind Map (HTML, 5 stars)
  19. veekaybee.github.io: Tech blog (HTML, 4 stars)
  20. data-lake-code: Code for the Data Lake Talk (Python, 4 stars)
  21. recsys-bracket: Recsys March Madness bracket (CSS, 4 stars)
  22. wordcloud: Generating wordclouds from Strata conference talks for a blog post (HTML, 4 stars)
  23. slatin: A simple transliterator from the Roman alphabet to Cyrillic. (HTML, 3 stars)
  24. strata_schedule: Playing with ics in Python (Python, 3 stars)
  25. whoshiring: Who's Hiring February 2016 (Python, 3 stars)
  26. cumtotal: Cumulative totals a couple different ways: R, Python, SQL, etc. (R, 3 stars)
  27. viberary_model: ONNX Model for Viberary (3 stars)
  28. nisaba: Telegram Bookmark bot (Python, 2 stars)
  29. normcoretech: Website (CSS, 2 stars)
  30. hadoop: Anything and everything related to Hadoop (Python, 2 stars)
  31. dailyprogrammer: Reddit daily programmer challenge solutions https://www.reddit.com/r/dailyprogrammer/ (Python, 2 stars)
  32. algorithms: Grokking Algorithms (Python, 2 stars)
  33. wired: Wired data for veekaybee.github.io (Python, 2 stars)
  34. data-jawn: Data Jawn Keynote 2018 (2 stars)
  35. sparkr-examples: Spark R post code (R, 1 star)
  36. pythondatastructures: Python Data Structures on Coursera (Python, 1 star)
  37. latex_resources: Latex resources (HTML, 1 star)
  38. senior-dev-day-talk: Senior Dev Day Talk Slides (JavaScript, 1 star)
  39. javahard: Learn Java the Hard Way (Java, 1 star)
  40. testymctest (1 star)
  41. jumbotron: Main webpage static site (HTML, 1 star)
  42. dijkstra: Quick graph traversal (Java, 1 star)
  43. wlb: Porting blog from Wordpress (HTML, 1 star)
  44. hugo-test: blog migration (HTML, 1 star)
  45. veekaybee: About (1 star)
  46. spark-calc: Spark memory settings calculator (HTML, 1 star)
  47. priorityqueue: Priority Queue Reference Implementation (Java, 1 star)
  48. venvcheat: Venv activation cheatsheet (HTML, 1 star)
  49. scalaBlog: bootstrapped Scala blog for Scala Learnings (HTML, 1 star)
  50. tualerts: Pulling 3 years' worth of emails and analyzing from Temple University Campus Alerts (Python, 1 star)
  51. cis111b_project (Java, 1 star)