Viberary
Since the project is actively in exploration and development, there are a lot of winding codepaths, experiments, and dead ends in the codebase. It is not production-grade yet for ANY definition of production.
Viberary is a search engine that will eventually recommend books based not on genre or title, but on vibe, by performing semantic search across a set of learned embeddings on a dataset of books from Goodreads and their metadata.
The idea is to return book recommendations based on the vibe of the book that you put in. So you don't put in "I want science fiction"; you'd put in "atmospheric, female lead, worldbuilding, funny" as a prompt and get back a list of books. This project came out of experiences I had where recommendations for movies, TV, and music have been fairly good, but book recommendations are always a problem.
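To make the mechanics concrete, here is a minimal sketch (not Viberary's actual code) of what embedding-based search looks like: encode the query and each book's text with the same model, then rank books by cosine similarity. The model name and the use of book descriptions as the text field are assumptions for illustration.

```python
# Minimal sketch of embedding-based "vibe" search; not Viberary's actual pipeline.
# The model name and the choice of book descriptions as input text are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

books = {
    "The Te Of Piglet": "Taoist philosophy explained using examples from A A Milne's Winnie-the-Pooh.",
    # ... more title: description pairs from the Goodreads metadata
}

# Embed every book's text once, and the free-form "vibe" query at search time.
book_embeddings = model.encode(list(books.values()), convert_to_tensor=True)
query_embedding = model.encode("atmospheric, female lead, worldbuilding, funny", convert_to_tensor=True)

# Rank books by cosine similarity between the query vector and each book vector.
scores = util.cos_sim(query_embedding, book_embeddings)[0]
for title, score in sorted(zip(books, scores.tolist()), key=lambda t: -t[1]):
    print(f"{score:.3f}  {title}")
```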
Architecture:
Work so far:
- Explore the data
- Post 0: Working with the data in BigQuery
- Post 1: Working with the data in Pandas
- Post 2: Doing research with ChatGPT
- Build a baseline model in Word2Vec. Released here
- Deep dive on embeddings and LaTeX Resource
- Deploy the baseline model to "prod" (aka a single server) and test it out. Word2Vec Demo:
word2vec_viberary.mov
- Build a model using BERT and also deploy that and evaluate them against each other. In progress on the main branch.
results_2.mov
Running the project
- Fork/clone the repo
- go to the project root
- You'll need the embeddings file at the root of the `/app/src/` repo.
- `make build` - Builds the docker image
- `make up` - Docker compose running in background
- `make embed` - indexes the embeddings once the web server is running
- `localhost:5000` - the web server
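Once the stack is up, you can smoke-test the search endpoint from Python. This is a hypothetical example: only the port (5000) comes from the steps above; the route and query parameter name are assumptions, not documented here.

```python
import requests

# Hypothetical smoke test against the local Flask server on localhost:5000.
# The /search route and the "query" parameter name are assumptions.
resp = requests.get(
    "http://localhost:5000/search",
    params={"query": "atmospheric, female lead, worldbuilding, funny"},
)
print(resp.status_code)
print(resp.text)
```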
Monitoring the project
- `make logs` for logs
- TODO: Redis monitoring
Repo Structure
- `src` - where all the code is
  - `api` - Flask server that calls the model; includes a search endpoint. Eventually will be rewritten in Go (for performance reasons)
  - `training_data` - generated training data
  - `model` - The actual BERT model. Includes the data generated for creating embeddings and also the code used to generate the embeddings on an EC2 instance. Right now in production only BERT gets called from the API.
  - `index` - includes an indexer which indexes embeddings generated in `model` into a Redis instance. Redis and the Flask app talk to each other through an app running via `docker-compose` and the `Dockerfile` for the main app instance. (A rough sketch of this indexing and query flow follows this list.)
  - `search` - performs the search calls from `api`
  - `inout` - utilities such as data directory access, IO operations, and a separate indexer that indexes titles into Redis for easy retrieval by the application
  - `notebooks` - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there is this notebook, which covers the end-to-end workflow of starting with raw data, processing it in DuckDB, learning a Word2Vec embeddings model (annotated output is here), and storing and querying those embeddings in Redis Search. This is the solution I eventually turned into the application directory structure.
- `docs` - serves and rebuilds viberary.pizza
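For a sense of how the `index` and `search` pieces fit together, here is a minimal, hypothetical redis-py sketch of indexing embeddings into Redis Search and running a KNN query. The index name, field names, and vector dimension are assumptions, not the project's actual schema.

```python
# Hypothetical sketch of a Redis Search vector index; not Viberary's actual indexer.
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Assumed schema: a title plus a 384-dim float32 embedding per book.
schema = (
    TextField("title"),
    VectorField("embedding", "HNSW", {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"}),
)
r.ft("books").create_index(
    schema,
    definition=IndexDefinition(prefix=["book:"], index_type=IndexType.HASH),
)

# Index one book: the embedding is stored as raw float32 bytes in a Redis hash.
vec = np.random.rand(384).astype(np.float32)
r.hset("book:89371", mapping={"title": "The Te Of Piglet", "embedding": vec.tobytes()})

# Query: find the 10 nearest book vectors to a query embedding.
query_vec = np.random.rand(384).astype(np.float32)
q = (
    Query("*=>[KNN 10 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "score")
    .dialect(2)
)
results = r.ft("books").search(q, query_params={"vec": query_vec.tobytes()})
for doc in results.docs:
    print(doc.title, doc.score)
```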
CONTRIBUTING
Not organized enough for meaningful contributions yet but should be soon.
Relevant Literature and Bibliography
- My paper on embeddings and its bibliography
- Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning
- [PinnerSage"(https://arxiv.org/abs/2007.03634)
- My Research Rabbit Collection
Input Data
UCSD Book Graph, with the critical part being the user-generated shelf labels. Sample row below; note that all values are encoded as strings (a short parsing sketch follows the sample):
{
"isbn": "0413675106",
"text_reviews_count": "2",
"series": [
"1070125"
],
"country_code": "US",
"language_code": "",
"popular_shelves": [
{
"count": "2979",
"name": "to-read"
},
{
"count": "291",
"name": "philosophy"
},
{
"count": "187",
"name": "non-fiction"
},
{
"count": "80",
"name": "religion"
},
{
"count": "76",
"name": "spirituality"
},
{
"count": "76",
"name": "nonfiction"
}
],
"asin": "",
"is_ebook": "false",
"average_rating": "3.81",
"kindle_asin": "",
"similar_books": [
"888460",
"734023",
"147311",
"219106",
"313972",
"238866",
"196325",
"200137",
"588008",
"112774",
"2355135",
"336248",
"520437",
"421044",
"870160",
"534289",
"64794",
"276697"
],
"description": "Taoist philosophy explained using examples from A A Milne's Winnie-the-Pooh.",
"format": "",
"link": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
"authors": [
{
"author_id": "27397",
"role": ""
}
],
"publisher": "",
"num_pages": "",
"publication_day": "",
"isbn13": "9780413675101",
"publication_month": "",
"edition_information": "",
"publication_year": "",
"url": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
"image_url": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png",
"book_id": "89371",
"ratings_count": "11",
"work_id": "41333541",
"title": "The Te Of Piglet",
"title_without_series": "The Te Of Piglet"
}
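Because every value in the raw dump is string-encoded, the data needs a light parsing pass before the shelf labels are usable. A minimal sketch (the file path is an assumption; the dump is one JSON object per line):

```python
import json

# Hypothetical path; the UCSD Book Graph dump is one JSON object per line.
with open("goodreads_books.json") as f:
    row = json.loads(next(f))

# Everything is string-encoded, so counts, ratings, and booleans need conversion.
shelves = [(s["name"], int(s["count"])) for s in row["popular_shelves"]]
is_ebook = row["is_ebook"] == "true"
average_rating = float(row["average_rating"]) if row["average_rating"] else None

# Keep the most popular user-generated shelf labels as "vibe" signals.
top_shelves = [name for name, count in sorted(shelves, key=lambda s: -s[1])[:5]]
print(row["title"], top_shelves)
```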