  • Stars: 273
  • Rank: 150,295 (Top 3%)
  • Language: JavaScript
  • License: MIT License
  • Created about 3 years ago
  • Updated 7 months ago

Repository Details

A personal semantic search engine capable of surfacing relevant bookmarks, journal entries, notes, blogs, contacts, and more, built on an efficient document embedding algorithm and Monocle's personal search index.

Revery 🦅

Revery is a semantic search engine that operates on my Monocle search index. While Revery lets me search through the same database of tens of thousands of notes, bookmarks, journal entries, Tweets, contacts, and blog posts as Monocle, Revery's focus is not the keyword-based search that Monocle performs, but semantic search -- finding results that are topically similar to some given web page or query, even if they don't share the same words. It's available as a browser extension that can surface results relevant to the current page, as well as a more standard web app resembling Monocle's search page.

[Screenshot: Revery's browser extension and web interface running on an iPad and a laptop]

Because of the size of the data and the amount of computational work Revery requires, its backend is written in Go, unlike most of my side projects. Both clients -- the web app and the browser extension -- are built with Torus.

Although it works well enough for me to use it every day, Revery is more of a proof-of-concept prototype than a finished product. I wanted to demonstrate that a tool like this could be built for personal use on top of personal productivity tools like notes and bookmarks, and experience what it would feel like to browse the web and write with such a tool.

Features

Revery, at its core, is just a single API. The API takes in some text, and crawls through my collection of personal documents and notes to find the top ones that seem most topically related to the given text. To make this interesting to use, I've wrapped it up in two different interfaces: a browser extension, and a more standard web-based search interface.

Browser extension

The Revery browser extension lives inside ./extension in this repository, and does exactly one thing: when I hit Ctrl-Shift-L on any webpage I'm viewing, it'll scrape the main body of text from the page (or some selected part of it, if I've highlighted something) and talk to the Revery API to find the documents that are most related to what I'm reading.

[Screenshot: Revery's browser extension showing a list of related results]

Where Monocle, with its keyword-based search algorithm, is good for recollection, I've found the Revery extension great for explorations on a specific topic. If I'm reading about natural language processing, for example, I can hit a few keystrokes to bring up other articles I've read, or notes I've taken in the past, that I can mentally reference as I read and learn about new ideas in NLP.

We learn new ideas best when we can find existing reference points in our memory onto which we can attach new information. Revery's extension partly automates and speeds up that task. For example, while reading an article about South Korea's unique cultural and economic position in the world, Revery surfaced a few related newsletters and articles from completely different authors and sources on Korean pop culture and its population decline, which helped me frame what I was reading in a much broader, better-informed context.

Web interface

The web search interface, to me, is a bit secondary to the extension. It exists primarily as a demonstration of Revery's underlying technology, and also incidentally as a way for me to use Revery when the extension isn't available (like on a mobile browser).

[Screenshot: Revery's web interface showing a list of results]

The search bar in the web interface can take either a URL or some key phrase. Given a URL (as in the screenshot above), Revery will download and read the web page itself to find related documents in the search index. Given a key phrase, Revery will try to suggest documents that contain similar words and speak on similar topics.
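One way the search bar's two input modes could be told apart is sketched below. This is only an illustration under my own assumptions, not Revery's actual code: the function name looksLikeURL is hypothetical, and the rule (an absolute http or https URL with a host counts as a URL, everything else is treated as a key phrase) is a guess at a reasonable heuristic.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// looksLikeURL reports whether query parses as an absolute http(s) URL.
// Anything else (including bare domains like "example.com") is treated
// as a key phrase in this sketch.
func looksLikeURL(query string) bool {
	u, err := url.Parse(strings.TrimSpace(query))
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
	// Example queries, made up for illustration.
	for _, q := range []string{
		"https://example.com/some-article",
		"natural language processing",
	} {
		if looksLikeURL(q) {
			fmt.Printf("%q: download the page, then search with its text\n", q)
		} else {
			fmt.Printf("%q: search with the phrase itself\n", q)
		}
	}
}
```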

This kind of search interface (as opposed to the extension) is useful to me when I start thinking about something new: I can type a list of related words into the search box and immediately get a list of related ideas and documents I'm familiar with, without having to fashion the specific and well-crafted search queries that keyword-based search engines like Monocle require.

How it works

As mentioned above, Revery's core is a single API endpoint that takes in some document and returns a list of the most related documents from my search index. What makes Revery special is that this API performs a semantic search, not simply a scan for matching keywords. This means that the top results may not even contain the same words as the query, as long as their contents are topically relevant.

This kind of semantic search is enabled by a search algorithm that uses cosine similarity to compare document embeddings of the indexed documents. If that sounds like a bunch of random words to you (as it did to me when I started this project), let me break it down:

First, we'll need to understand word embeddings. A word embedding is a way of mapping a vocabulary of natural language words to some points in space (usually a high-dimensional mathematical space), such that words that are similar in meaning are close together in this space. For example, the word "science" in a word embedding would be very close to the word "scientist", reasonably close to "research", and likely very far from "circus". When we talk about "distance" in the context of word embeddings, we usually use cosine similarity rather than Euclidean distance, for both empirical and theoretical reasons I won't cover here.
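To make the distance measure concrete, here is a minimal Go sketch of cosine similarity between two vectors. It is an illustration, not Revery's actual code: the function name and the toy three-dimensional vectors are made up, and real word embeddings would have hundreds of dimensions.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b:
// close to 1 for vectors pointing the same way, 0 for orthogonal vectors.
// It returns 0 when the lengths differ or either vector has zero magnitude.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Toy vectors, made up for illustration only.
	science := []float64{0.9, 0.1, 0.0}
	scientist := []float64{0.8, 0.2, 0.1}
	circus := []float64{0.0, 0.1, 0.9}

	fmt.Printf("science vs scientist: %.3f\n", cosineSimilarity(science, scientist))
	fmt.Printf("science vs circus:    %.3f\n", cosineSimilarity(science, circus))
}
```

Because the measure depends only on the angle between vectors, scaling a vector (for example, by document length) does not change the score.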

Although the concept of word embeddings is not very new, there is still active research producing new methods for generating more and more accurate and useful word embeddings from the same corpus of data. My personal deployment of Revery uses the Creative Commons-licensed word embedding dataset produced by Facebook's FastText tool, specifically a 50,000-word dataset with 300 dimensions trained on the Common Crawl corpus.

Word embeddings let us draw inferences about which words are related, but for Revery, we want to draw the same kind of inference about documents, each of which is a list of words. Thankfully, there's ample literature to suggest that simply taking a weighted average of word vectors for every word in a document can get us a good approximation of a "document vector" that represents the document as a whole. Though there are more advanced methods we can use, like paragraph vectors or models that take word order into account like BERT, averaging word vectors works well enough for Revery's use cases, and is simple to implement and test, so Revery sticks with this approach.
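The averaging step can be sketched roughly as follows. This is an assumption-laden illustration rather than Revery's actual code: the text above does not say which weights are used, so the weights table here is hypothetical (words missing from it get weight 1), tokenization is a plain whitespace split with no punctuation handling, and words without an embedding are simply skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// documentVector returns the weighted average of the embedding vectors of
// the words in text. weights is a hypothetical per-word weight table; words
// missing from it get weight 1, and words missing from embeddings are skipped.
// If no word is known, the zero vector is returned.
func documentVector(text string, embeddings map[string][]float64, weights map[string]float64, dims int) []float64 {
	sum := make([]float64, dims)
	var total float64
	for _, word := range strings.Fields(strings.ToLower(text)) {
		vec, ok := embeddings[word]
		if !ok || len(vec) != dims {
			continue
		}
		w, ok := weights[word]
		if !ok {
			w = 1
		}
		for i, x := range vec {
			sum[i] += w * x
		}
		total += w
	}
	if total == 0 {
		return sum
	}
	for i := range sum {
		sum[i] /= total
	}
	return sum
}

func main() {
	// Toy 2-dimensional embeddings, made up for illustration only.
	embeddings := map[string][]float64{
		"science":  {0.9, 0.1},
		"research": {0.8, 0.3},
		"circus":   {0.1, 0.9},
	}
	weights := map[string]float64{"science": 2} // hypothetical weights
	fmt.Println(documentVector("Science and research", embeddings, weights, 2))
}
```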

Once we can generate document vectors out of documents using our word embedding, the rest of the algorithm falls into place. On startup, Revery's API server indexes and generates document vectors for all of the documents it can find in my dataset (which isn't too large -- around 25,000 at time of writing). Then, on every request, the algorithm computes a document vector for the requested document and sorts every document in the search index by its cosine distance to the query document, returning the top n results.
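The per-request ranking step might look something like the sketch below. It is a simplified illustration, not Revery's actual code: the Doc type, its fields, and the function names are hypothetical, and it sorts by descending cosine similarity, which gives the same order as ascending cosine distance.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Doc is a hypothetical indexed document with a precomputed document vector.
type Doc struct {
	Title  string
	Vector []float64
}

// cosineSimilarity returns 0 for mismatched lengths or zero vectors.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// topN scores every doc against query and returns up to n docs,
// ordered from most to least similar.
func topN(query []float64, docs []Doc, n int) []Doc {
	type scored struct {
		doc   Doc
		score float64
	}
	all := make([]scored, 0, len(docs))
	for _, d := range docs {
		all = append(all, scored{doc: d, score: cosineSimilarity(query, d.Vector)})
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].score > all[j].score })
	if n > len(all) {
		n = len(all)
	}
	if n < 0 {
		n = 0
	}
	results := make([]Doc, 0, n)
	for _, s := range all[:n] {
		results = append(results, s.doc)
	}
	return results
}

func main() {
	// Toy index, made up for illustration only.
	docs := []Doc{
		{Title: "Circus history", Vector: []float64{0.0, 1.0}},
		{Title: "Notes on science", Vector: []float64{1.0, 0.0}},
		{Title: "Research methods", Vector: []float64{0.9, 0.3}},
	}
	for _, d := range topN([]float64{1.0, 0.1}, docs, 2) {
		fmt.Println(d.Title)
	}
}
```

A full sort over every document is linear-logarithmic in the index size, which is likely fine at the scale of tens of thousands of documents mentioned above; a much larger index would probably call for a partial sort or an approximate nearest-neighbor structure.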

Within Revery, every part of this algorithm is hand-written in Go. This is for a few reasons:

  1. I wanted to encourage myself to understand these basic algorithms of the trade fully, by writing the code myself.
  2. Most open-source libraries to do this kind of computation are made available in Python packages, and I don't have great personal infrastructure for deploying and maintaining a Python application.
  3. Go is fast enough, anecdotally, for this task.

Both of the clients of Revery -- the extension and the web app -- talk to this single API endpoint. The clients themselves are quite ordinary, so I won't go into detail describing how they work here.

Development and deploy

Here, the same disclaimer that I shared with Monocle also applies:

⚠️ Note: If you're reading this section to try to set up and run your own Revery instance, I applaud your audacity, but it might not be super easy or fruitful -- Revery's setup (especially on the data and indexing side) is pretty specific not only to my data sources, but also the way I structure those files. I won't stop you from trying to build your own search index, but be warned: it might not work, and I'm probably not going to do tech support. For this reason, this section is also written in first-person, mostly for my future reference.

Revery depends on the search index produced by Monocle's indexer, so I usually make sure Revery has a recent copy of Monocle's search index available before running.

Revery has two independent codebases in the same repository. The first is the Chrome extension, which lives entirely inside the ./extension folder. Here's how I set it up:

  1. The extension needs an API authentication token to talk to the Revery API. I usually just choose an arbitrarily long random string. Then, I place a file in ./extension called token.js with the content:

    const REVERY_TOKEN = '<some API key here>';
  2. I go to chrome://extensions and click "Load unpacked" to load the ./extension folder as an "unpacked extension" into my browser, which will make the extension available in every tab.

That's it for the extension setup. Next, I set up the server:

  1. Take the same authentication token from above, and place just the token string itself inside tokens.txt in the root of the project folder. The Revery server will grab the whitespace-trimmed content of this file and use it as the API key.
  2. Simply running make will build the revery binary executable into the project folder.
  3. Revery needs two extra sets of data to work: the word embedding model, and Monocle's document dataset.
    • Download a word embedding file (for example, from FastText) and trim it to some reasonable size (the top 50-100k words seem to work well). Remove the first line, which usually indicates the total word count and number of dimensions. Revery's code assumes 300 dimensions, so if this is not the case, revise the code.
    • Copy Monocle's docs.json document dataset generated by the indexer to ./corpus/docs.json.
  4. Running the revery executable now should correctly pre-process the model and search index, and start the web application server.
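For reference, the word embedding pre-processing in step 3 can be sketched in Go as below. This is a hedged illustration and not Revery's loader: it assumes the common FastText text format (one word per line followed by its space-separated vector components, with more frequent words listed first), and the names parseLine and loadEmbeddings are hypothetical. Rather than requiring the header to be removed by hand, this sketch skips any line that does not have exactly one word plus dims numbers, which also drops the "word count, dimensions" header whenever dims is greater than 1.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseLine parses "word x1 x2 ... xdims". ok is false for lines with the
// wrong number of fields (such as the header) or non-numeric components.
func parseLine(line string, dims int) (word string, vec []float64, ok bool) {
	fields := strings.Fields(line)
	if len(fields) != dims+1 {
		return "", nil, false
	}
	vec = make([]float64, dims)
	for i, f := range fields[1:] {
		x, err := strconv.ParseFloat(f, 64)
		if err != nil {
			return "", nil, false
		}
		vec[i] = x
	}
	return fields[0], vec, true
}

// loadEmbeddings reads at most maxWords valid entries from r.
func loadEmbeddings(r io.Reader, maxWords, dims int) (map[string][]float64, error) {
	scanner := bufio.NewScanner(r)
	// Allow long lines; 300 float components can exceed small buffers.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	embeddings := make(map[string][]float64)
	for len(embeddings) < maxWords && scanner.Scan() {
		if word, vec, ok := parseLine(scanner.Text(), dims); ok {
			embeddings[word] = vec
		}
	}
	return embeddings, scanner.Err()
}

func main() {
	// Tiny stand-in for a real embedding file: a header, then three words.
	sample := "3 3\nthe 0.1 0.2 0.3\nscience 0.9 0.1 0.0\ncircus 0.0 0.1 0.9\n"
	embeddings, err := loadEmbeddings(strings.NewReader(sample), 2, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(embeddings), "words loaded") // keeps only the first 2 words
}
```

With a real file, the reader would come from os.Open and dims would be 300 to match the assumption described in step 3.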

Prior art and future work

Although Revery is useful enough for me to use day to day, there's a lot of active research in the general natural language search space, and Revery itself has a lot of room for improvement.

On the data side:

  • Experimenting with other word embeddings which may provide better performance. I've tried FastText and LexVec, but there are many other open models available.
  • Generating a custom word embedding optimized for my dataset and for use in forming document vectors

On the code side:

  • Optimizing the algorithms that touch data to scale better, using some amount of caching and good old-fashioned hand optimization of the code
  • Better ways to surface documents contextually in the browser. Right now, searching Revery within a browser requires an explicit user action. Perhaps we can surface them completely automatically, or even detect when a user has scrolled to the end of a page or highlighted an interesting section of the document to automatically suggest related documents.
  • Better ways to balance the benefits of keyword-based and semantic search. Right now, Monocle and Revery are two completely separate applications, but having both kinds of search collaborating with each other or even simply displaying side by side on screen may be more useful.

There is also plenty of great prior art in this space. Though I can't list them all here, there are a few that stand out as inspirations for Revery.

  • Monocle, the direct predecessor to Revery that uses the same dataset for keyword search
  • same.energy, which enables searching for tweets or photos of the same "style" using a transformer model
  • Semantica, which uses word embeddings to provide a lower-level tool to explore relationships between individual words and concepts
  • Tyler Angert's Information forest, an imaginative note about web browsers of the future
  • Document embedding techniques, which served as a useful overview of the field when I began this project
