  • Stars: 411
  • Rank 105,247 (Top 3 %)
  • Language: Jupyter Notebook
  • Created about 2 years ago
  • Updated 11 months ago

Repository Details

Good books, good vibes

Viberary

Since the project is actively in exploration and development, there are a lot of winding codepaths, experiments, and dead ends in the codebase. It is not production-grade yet for ANY definition of production.

🚧 🚧

Viberary is a search engine that will eventually recommend you books based not on genre or title, but on vibe, by performing semantic search across a set of learned embeddings on a dataset of books from Goodreads and their metadata.

The idea is to return book recommendations based on the vibe of the book that you put in. So you don't put in "I want science fiction"; you'd put in "atmospheric, female lead, worldbuilding, funny" as a prompt and get back a list of books. This project came out of experiences I had where recommendations for movies, TV, and music have been fairly good, but book recommendations are always a problem.
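
To make the idea concrete, here is a minimal sketch of ranking books by how close their embedding is to the embedding of a prompt. It is not the project's code: the vectors are made-up three-dimensional toys, the names `cosine_similarity`, `rank_books`, `TOY_BOOKS`, and `TOY_QUERY` are hypothetical, and cosine similarity is assumed as the distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_books(query_vector, book_vectors, top_k=3):
    """Return the top_k (title, score) pairs, most similar first."""
    scored = [
        (title, cosine_similarity(query_vector, vector))
        for title, vector in book_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
TOY_BOOKS = {
    "Book A": [0.9, 0.1, 0.0],
    "Book B": [0.1, 0.9, 0.2],
    "Book C": [0.7, 0.3, 0.1],
}
# Stand-in for the embedding of a prompt such as "atmospheric, funny".
TOY_QUERY = [1.0, 0.2, 0.0]

if __name__ == "__main__":
    for title, score in rank_books(TOY_QUERY, TOY_BOOKS, top_k=2):
        print(f"{title}: {score:.3f}")
```

In a real setup the query vector would come from the same model that produced the book embeddings, so that prompts and books live in the same vector space.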

Architecture:

Work so far:

word2vec_viberary.mov
  • Build a model using BERT, deploy it as well, and evaluate the models against each other. In progress on the main branch.
results_2.mov

Running the project

  1. Fork/clone the repo
  2. Go to the project root
  3. You'll need the embeddings file at the root of the /app/src/ repo.
  4. make build - builds the Docker image
  5. make up - runs Docker Compose in the background
  6. make embed - indexes the embeddings once the web server is running
  7. localhost:5000 - the web server
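
After the steps above, a quick way to confirm the web server is answering is a small check like the following. This is a sketch, not part of the repo: the helper name `server_is_up` is hypothetical, and it only assumes that something speaks HTTP on localhost:5000 as listed in the last step.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:5000", timeout=2.0):
    """Return True if anything answers HTTP at `url`, False otherwise."""
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # which is what we want when talking to a local container.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with a non-2xx status; it is still running.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


if __name__ == "__main__":
    print("web server reachable:", server_is_up())
```

If this prints False right after make up, the container may still be starting; make logs shows what it is doing.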

Monitoring the project

  1. make logs for logs

TODO: Redis monitoring

Repo Structure

  • src - where all the code is
    • api - Flask server that calls the model, includes a search endpoint. Eventually will be rewritten in Go (for performance reasons)
    • training_data - generated training data
    • model - The actual BERT model. Includes the data generated for creating embeddings and the code used to generate the embeddings on an EC2 instance. Right now, in production, only BERT gets called from the API.
    • index - Includes an indexer that indexes the embeddings generated in model into a Redis instance. Redis and the Flask app talk to each other through an app running via docker-compose and the Dockerfile for the main app instance.
    • search - performs the search calls from api
    • inout - Utilities such as data directory access, IO operations, and a separate indexer that indexes titles into Redis for easy retrieval by the application
    • notebooks - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there is this notebook, which covers the end-to-end workflow of starting with raw data, processing in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. (Annotated output is here.) This is the solution I eventually turned into the application directory structure.
  • docs - This serves and rebuilds viberary.pizza
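
As an illustration of what the index step has to do before anything reaches Redis, here is a minimal sketch of serializing an embedding into bytes. It makes several assumptions that are not confirmed by the repo: that vectors are stored as little-endian 32-bit floats, that each book is stored under a key like `book:<id>`, and that the fields are called `title` and `embedding`. All function names are hypothetical, and no Redis connection is made.

```python
import struct


def pack_embedding(vector):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_embedding(blob):
    """Inverse of pack_embedding: bytes back to a list of floats."""
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", blob))


def build_record(book_id, title, vector):
    """Build the (key, fields) pair an indexer could write to a Redis hash."""
    key = f"book:{book_id}"  # hypothetical key scheme
    fields = {"title": title, "embedding": pack_embedding(vector)}
    return key, fields


# The vector is made up; real embeddings have hundreds of dimensions.
key, fields = build_record(89371, "The Te Of Piglet", [0.5, -1.25, 2.0])

if __name__ == "__main__":
    print(key, len(fields["embedding"]), "bytes")
    print(unpack_embedding(fields["embedding"]))
```

Packing to a fixed-width binary format keeps every record the same size for a given embedding dimension, which is what a vector index generally expects.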

CONTRIBUTING

Not organized enough for meaningful contributions yet but should be soon.

Relevant Literature and Bibliography

Input Data

UCSD Book Graph, with the critical part being the user-generated shelf labels. Sample row below. Note that these are all encoded as strings!

{
  "isbn": "0413675106",
  "text_reviews_count": "2",
  "series": [
    "1070125"
  ],
  "country_code": "US",
  "language_code": "",
  "popular_shelves": [
    {
      "count": "2979",
      "name": "to-read"
    },
    {
      "count": "291",
      "name": "philosophy"
    },
    {
      "count": "187",
      "name": "non-fiction"
    },
    {
      "count": "80",
      "name": "religion"
    },
    {
      "count": "76",
      "name": "spirituality"
    },
    {
      "count": "76",
      "name": "nonfiction"
    }
  ],
  "asin": "",
  "is_ebook": "false",
  "average_rating": "3.81",
  "kindle_asin": "",
  "similar_books": [
    "888460",
    "734023",
    "147311",
    "219106",
    "313972",
    "238866",
    "196325",
    "200137",
    "588008",
    "112774",
    "2355135",
    "336248",
    "520437",
    "421044",
    "870160",
    "534289",
    "64794",
    "276697"
  ],
  "description": "Taoist philosophy explained using examples from A A Milne's Winnie-the-Pooh.",
  "format": "",
  "link": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
  "authors": [
    {
      "author_id": "27397",
      "role": ""
    }
  ],
  "publisher": "",
  "num_pages": "",
  "publication_day": "",
  "isbn13": "9780413675101",
  "publication_month": "",
  "edition_information": "",
  "publication_year": "",
  "url": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
  "image_url": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png",
  "book_id": "89371",
  "ratings_count": "11",
  "work_id": "41333541",
  "title": "The Te Of Piglet",
  "title_without_series": "The Te Of Piglet"
}
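
Because every value arrives as a string, some casting is needed before the shelf counts can be used numerically. The following is a minimal sketch of that step, not the project's actual preprocessing: `SAMPLE_ROW` is a trimmed copy of the row above, and the helper names `to_int`, `to_float`, and `parse_book` are hypothetical. Empty strings are assumed to mean "missing" and are mapped to None.

```python
import json

# Trimmed copy of the sample row; only the fields used here are kept.
SAMPLE_ROW = """
{
  "book_id": "89371",
  "title": "The Te Of Piglet",
  "average_rating": "3.81",
  "ratings_count": "11",
  "num_pages": "",
  "is_ebook": "false",
  "popular_shelves": [
    {"count": "2979", "name": "to-read"},
    {"count": "291", "name": "philosophy"},
    {"count": "187", "name": "non-fiction"}
  ]
}
"""


def to_int(value):
    """Cast a string-encoded integer; an empty string becomes None."""
    return int(value) if value else None


def to_float(value):
    """Cast a string-encoded float; an empty string becomes None."""
    return float(value) if value else None


def parse_book(raw):
    """Parse one JSON row and convert string fields to native types."""
    row = json.loads(raw)
    return {
        "book_id": to_int(row["book_id"]),
        "title": row["title"],
        "average_rating": to_float(row["average_rating"]),
        "ratings_count": to_int(row["ratings_count"]),
        "num_pages": to_int(row["num_pages"]),
        "is_ebook": row["is_ebook"] == "true",
        # Map each shelf label to its integer count.
        "shelves": {s["name"]: int(s["count"]) for s in row["popular_shelves"]},
    }


book = parse_book(SAMPLE_ROW)

if __name__ == "__main__":
    print(book["title"], book["shelves"])
```

The resulting `shelves` mapping keeps the user-generated labels together with their counts, which is the part of the row described above as critical.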

Embeddings Sample

Screen Shot 2023-02-18 at 2 10 15 PM
