  • Stars: 1,472
  • Rank: 31,940 (Top 0.7 %)
  • Language: JavaScript
  • License: MIT License
  • Created over 3 years ago
  • Updated about 2 years ago

Repository Details

Universal personal search engine, powered by a full text search algorithm written in pure Ink, indexing Linus's blogs and private note archives, contacts, tweets, and over a decade of journals.

Monocle 👓

Monocle is my universal, personal search engine. It can query across tens of thousands of documents from my blog posts, journal entries, notes, Tweets, contacts, and more to act as my extended memory spanning my entire life. Monocle is designed with a focus on speed, privacy, and hackability. It's built to be very specific to the particulars of my personal workflow around data, so it probably won't work for anyone else. But I might build something similar aimed at the public later. I've written more in depth about this project on my blog, and you can try a demo indexed on the public subset of my full data at monocle.surge.sh.

Screenshots of Monocle running on iPads

Like most of my side projects, Monocle is built entirely with the Ink programming language. The ingest and indexing pipeline is built with Ink; the application is served by an Ink web server to get some specific data compression Monocle needs, and the client web app is written in Ink with the Torus UI library.

Though it's a useful tool, Monocle was also a way for me to learn about the basics of full text search systems by writing one from scratch in Ink. As a result, its search capabilities aren't cutting edge -- they're merely good enough for my use.

Features

I first had the idea for Monocle a long time ago, when I tweeted about potentially building a search engine that only indexed my historical data. Since then, I had come back to the idea a few times but only really started building it after writing and thinking about incremental note-taking, where fast and effective recall is as important as quick entry into a good note-taking system.

Motivated by those ideas, Monocle is designed with a focus on speed and effective recall. To me, the following features were important:

  1. Quick time-to-first-result. When I open Monocle to perform a search, I either already know what I'm looking for and need to find it again, or don't know what exactly I'm looking for and need to explore the list of results. In both cases, the critical variable of speed is what I call "time to first result": the time between me having the thought to search for something, and the first search result popping up on screen.

    Optimizing time-to-first-result involves a few different kinds of design decisions and technical performance metrics. The most significant among these is that I wanted results to appear live, as you type. Ideally, results would appear on every keystroke, even on slower connections. This one constraint had many consequences. It meant, for example, that the search index had to live on the client, and that search and ranking had to be performed in the browser, in JavaScript. It also meant I needed to optimize the index for download size, and minimize the compute required on every keystroke to perform a new search.

    The result is worth it: on a modern smartphone or laptop, Monocle can search through tens of thousands of documents as you type, on every keystroke, with minimal lag, using an index a few megabytes in size.

  2. Reasonably accurate English stemming. Monocle includes a simple, hand-written stemming algorithm for the English language. It's not smart or sophisticated, and is optimized for code size and for running on queries to expand each keyword into its variations (in that way, it's more of a reverse-stemmer) since it runs on the client rather than during indexing.

  3. Static deploy with pre-compiled index. In Monocle, the search index is an inverted index (also called a posting list) generated from the source document set at build time. This index is reused until a new index is generated, which can be done periodically. Because I didn't really have a need for ingesting documents or new notes any more frequently than a few times a week, this worked for me, and minimized work, especially on the backend where I pay for my own compute! This lets me host Monocle as a static site deployment.

  4. Keyword match highlights in both results list and preview. Although not perfect (mostly for performance reasons), the search results listing and document previews include highlights for matches to search keywords, which makes it much easier to visually parse and understand individual search results.

  5. Pluggable architecture for easily adding new data sources. One of the special things about Monocle is that it's designed to index any arbitrary data source in my life, from structured data like people in my contacts to free-form data like my journal entries. Monocle lets me write a small bit of Ink code to represent and ingest data from each data source, to add another data source to the search index.
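As an illustration of the keyword match highlighting described in feature 4, here is a minimal TypeScript sketch. It is not Monocle's actual Ink code: the `highlight`, `escapeRegExp`, and `escapeHTML` helpers and the use of `<mark>` tags are assumptions made for this example.

```typescript
// Hypothetical helpers for illustration; not taken from the Monocle source.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s: string): string {
	return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Escape text so it is safe to insert into HTML.
function escapeHTML(s: string): string {
	return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Wrap every case-insensitive occurrence of any keyword in <mark> tags.
function highlight(text: string, keywords: string[]): string {
	const words = keywords.filter(k => k.length > 0);
	if (words.length === 0) return escapeHTML(text);
	const pattern = new RegExp(`(${words.map(escapeRegExp).join('|')})`, 'gi');
	// Splitting on a capturing group keeps the matches at odd indexes.
	return text
		.split(pattern)
		.map((part, i) => (i % 2 === 1 ? `<mark>${escapeHTML(part)}</mark>` : escapeHTML(part)))
		.join('');
}

console.log(highlight('Tools for thought', ['tool', 'thought']));
// prints "<mark>Tool</mark>s for <mark>thought</mark>"
```

A single regular expression pass like this is cheap enough to run on every keystroke, at the cost of highlighting substrings rather than whole words.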

Architecture

Monocle is a static, single page web app that runs from a pre-built index of documents. Both the indexing system and the web app itself are written in Ink. Here's a high-level architecture diagram, which will look familiar if you've built a full text search engine before.

A system diagram of the Monocle search engine

Dataset ingestion and modules

Every data source, from my Tweet archive to my notes, has a specific Monocle "module" that corresponds to it, in ./modules. A module's job is to take the source dataset of documents and transform the documents into a list of Doc objects with a specific schema that the indexer can understand. Each module exports a getDocs(callback) function that calls the callback with a list of Docs. The schema, here conveniently represented in TypeScript's notation, is:

type Doc = {
	// A globally unique identifier for this document across all Monocle
	// documents. It's usually a 2-3 letter prefix for the module (like "tw"
	// for Tweets) followed by a number.
	id: string
	// A map of each token in the document to the number of times it appears
	// in the document.
	tokens: Map<string, number>
	// The document's text content that will be displayed in the results page
	content: string
	// Optionally, the doc's title
	title?: string
	// Optionally a link to this document on the web that Monocle can use to
	// "link out" to the original document from the search result.
	href?: string
}

Each module also invokes the tokenizer from the search library to generate that list of tokens. This is performed by the module rather than later in the indexing pipeline, because some modules may want to index metadata that isn't a part of the doc's "content". For example, I may want each of my Tweets to index the names of the people mentioned in it, even though those names don't appear in the Tweet's textual content itself.
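To make the module contract concrete, here is a small TypeScript sketch of a module that indexes metadata alongside content. The real modules are written in Ink; the `tokenize` stand-in, the `countTokens` helper, the `mentionedNames` field, and the sample data are all assumptions made for illustration.

```typescript
type Doc = {
	id: string;
	tokens: Map<string, number>;
	content: string;
	title?: string;
	href?: string;
};

// Stand-in tokenizer (assumption): lowercase, split on non-alphanumerics.
function tokenize(text: string): string[] {
	return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0);
}

// Count how many times each token appears.
function countTokens(tokens: string[]): Map<string, number> {
	const counts = new Map<string, number>();
	for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
	return counts;
}

// Hypothetical source record with metadata that is not part of the text.
type SourceTweet = { content: string; mentionedNames: string[] };

const sourceTweets: SourceTweet[] = [
	{ content: 'Great chat about tools for thought today', mentionedNames: ['Ada Lovelace'] },
];

// The module entry point: hand a list of Docs to the callback.
function getDocs(callback: (docs: Doc[]) => void): void {
	const docs: Doc[] = sourceTweets.map((tweet, i) => ({
		id: `tw${i}`,
		// Tokens cover the content plus the metadata, so a search for "ada"
		// finds this doc even though the displayed content never says so.
		tokens: countTokens(tokenize(tweet.content + ' ' + tweet.mentionedNames.join(' '))),
		content: tweet.content,
	}));
	callback(docs);
}
```

Because tokenization happens inside the module, the indexer never needs to know which fields of the source record were searchable.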

Indexing and the backend

Once all the modules have tokenized their documents and generated their output, the documents are handed to the indexer. The indexer's job is to produce the index, a big JSON file that represents an inverted index mapping each word that ever appeared in any document to all documents that contained that word. This index is cached on disk alongside the generated dataset of documents, in ./static/indexes/{doc, index}.json. Both files are also gzipped at build time for storage efficiency and faster downloads.
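The indexing step can be sketched in a few lines of TypeScript. This is an illustration under assumptions, not the Ink indexer: `buildIndex` and `serializeIndex` are hypothetical names, and the real JSON layout may differ.

```typescript
type Doc = { id: string; tokens: Map<string, number>; content: string };

// Inverted index: each token maps to the IDs of the docs that contain it.
type InvertedIndex = Map<string, string[]>;

function buildIndex(docs: Doc[]): InvertedIndex {
	const index: InvertedIndex = new Map();
	for (const doc of docs) {
		for (const token of doc.tokens.keys()) {
			const postings = index.get(token) ?? [];
			postings.push(doc.id);
			index.set(token, postings);
		}
	}
	return index;
}

// Serialize to a JSON object of the form { token: [docID, ...], ... }.
function serializeIndex(index: InvertedIndex): string {
	return JSON.stringify(Object.fromEntries(index));
}

const sampleDocs: Doc[] = [
	{ id: 'tw0', tokens: new Map([['tools', 1], ['thought', 1]]), content: 'tools for thought' },
	{ id: 'nt0', tokens: new Map([['tools', 2]]), content: 'tools, more tools' },
];

console.log(serializeIndex(buildIndex(sampleDocs)));
// prints {"tools":["tw0","nt0"],"thought":["tw0"]}
```

Building the map once at build time is what lets the client answer each keyword lookup with a single key access.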

Once the index has been generated and saved, we are ready to search!

Query expansion, searching, ranking and the frontend

The Monocle web app is a single page web app driven by the Torus UI library, which Ink wraps. Once the UI code accepts a search query (on every keystroke into the search box), here's what happens:

First, the search query string (for example tools for thought) is tokenized into individual words using the same tokenizer used in the indexing pipeline. This sharing of code guarantees some consistency over how words and phrases are handled. This step gives us the list ['tools', 'thought'], since during tokenization stopwords like "for" are omitted.
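A minimal TypeScript sketch of this tokenization step follows. The stopword list and the splitting rule are assumptions chosen for the example; Monocle's own tokenizer is written in Ink and may differ in both.

```typescript
// A tiny, illustrative stopword list (assumption, not Monocle's list).
const STOPWORDS = new Set(['a', 'an', 'and', 'for', 'in', 'of', 'on', 'the', 'to']);

// Lowercase the text, split on anything that is not a letter or digit,
// and drop empty strings and stopwords.
function tokenize(text: string): string[] {
	return text
		.toLowerCase()
		.split(/[^a-z0-9]+/)
		.filter(word => word.length > 0 && !STOPWORDS.has(word));
}

console.log(tokenize('tools for thought'));
// prints [ 'tools', 'thought' ]
```

Using the same function at index time and at query time is what keeps a query token comparable to the tokens stored in the index.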

Next, the tokens are sent to the stemmer. The stemmer's job is query expansion: taking a query and generating variations on each keyword to ensure we do not miss documents that may be relevant, even if they don't contain the exact keywords we typed. For example, ['tools', 'thought'] gets expanded to something like

(tools OR tool OR tooling OR toolment OR tooled OR tooler OR ...)
AND
(thought OR thoughts OR thoughted OR thoughtting OR thoughter OR ...)

This query expansion is naive and focused on English queries, but seems to work well enough for me not to miss too many results the vast majority of the time.
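The expansion can be sketched as a "reverse stemmer" in TypeScript. The suffix list and the plural-stripping rule below are simplifying assumptions for illustration; the real Ink stemmer has its own rules (for example, it appears to double consonants in some variants, which this sketch does not).

```typescript
// Illustrative suffix list (assumption, not the real stemmer's rules).
const SUFFIXES = ['s', 'ed', 'ing', 'er', 'ment'];

// Expand one keyword into itself plus naive variations.
function expandKeyword(keyword: string): string[] {
	// Rough root: drop a trailing plural "s", but leave words like "glass" alone.
	const root =
		keyword.length > 3 && keyword.endsWith('s') && !keyword.endsWith('ss')
			? keyword.slice(0, -1)
			: keyword;
	const variants = new Set<string>([keyword, root]);
	for (const suffix of SUFFIXES) variants.add(root + suffix);
	return Array.from(variants);
}

// Outer array: keywords joined by AND. Inner arrays: variants joined by OR.
function expandQuery(tokens: string[]): string[][] {
	return tokens.map(expandKeyword);
}

console.log(expandQuery(['tools', 'thought']));
```

Many of the generated variants are not real words, which is harmless: a variant that never appears in the index simply matches no documents.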

The expanded search query is then sent to the searcher. The searcher is the heart of the searching algorithm, but its job is rather simple: for each keyword in our query, use the index to find all documents that contain that keyword. Depending on the query's structure, the searcher either gathers the union or the intersection of these sets of documents.
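A hedged TypeScript sketch of the searcher: it takes the union of matches within each group of keyword variants, then the intersection across groups. The `search` function and the nested-array query shape are assumptions for this example, not the Ink implementation.

```typescript
// Inverted index: each token maps to the IDs of the docs that contain it.
type InvertedIndex = Map<string, string[]>;

// query is a list of OR-groups that are ANDed together, for example
// [['tools', 'tool'], ['thought', 'thoughts']].
function search(index: InvertedIndex, query: string[][]): Set<string> {
	// Union: every doc that contains any variant in the group.
	const groups = query.map(variants => {
		const matches = new Set<string>();
		for (const variant of variants) {
			for (const id of index.get(variant) ?? []) matches.add(id);
		}
		return matches;
	});
	if (groups.length === 0) return new Set<string>();
	// Intersection: keep only docs that matched every group.
	return groups.reduce((acc, group) => new Set([...acc].filter(id => group.has(id))));
}

const sampleIndex: InvertedIndex = new Map([
	['tools', ['a', 'b']],
	['tool', ['c']],
	['thought', ['b', 'c']],
]);

console.log([...search(sampleIndex, [['tools', 'tool'], ['thought']])].sort());
// prints [ 'b', 'c' ]
```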

Finally, the results need to be ordered by some measure of relevance before they're displayed to the user. For this, we turn to the ranker. The ranker takes all the documents returned by the searcher and ranks them according to the popular TF-IDF (term frequency-inverse document frequency) metric of keyword relevance. This ranked, ordered list of results is what you see in the results section of Monocle.
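For reference, here is one common formulation of TF-IDF ranking as a TypeScript sketch. The exact variant Monocle uses is not stated here, so the formula (term frequency normalized by document length, multiplied by the natural log of corpus size over document frequency) and the `tfidf` and `rank` names are assumptions.

```typescript
type Doc = { id: string; tokens: Map<string, number> };

// Score one keyword for one doc: (count / doc length) * ln(N / docFreq).
function tfidf(doc: Doc, keyword: string, corpusSize: number, docFreq: number): number {
	const count = doc.tokens.get(keyword) ?? 0;
	if (count === 0 || docFreq === 0) return 0;
	let length = 0;
	doc.tokens.forEach(n => { length += n; });
	return (count / length) * Math.log(corpusSize / docFreq);
}

// Order matching docs by their summed TF-IDF score, highest first.
function rank(matches: Doc[], keywords: string[], corpus: Doc[]): Doc[] {
	// Document frequency: how many docs in the corpus contain each keyword.
	const docFreq = new Map<string, number>();
	for (const k of keywords) {
		docFreq.set(k, corpus.filter(d => d.tokens.has(k)).length);
	}
	const score = (doc: Doc): number =>
		keywords.reduce((sum, k) => sum + tfidf(doc, k, corpus.length, docFreq.get(k) ?? 0), 0);
	return [...matches].sort((a, b) => score(b) - score(a));
}

const docA: Doc = { id: 'a', tokens: new Map([['tools', 3], ['thought', 1]]) };
const docB: Doc = { id: 'b', tokens: new Map([['tools', 1], ['other', 3]]) };
const docC: Doc = { id: 'c', tokens: new Map([['misc', 1]]) };

console.log(rank([docB, docA], ['tools'], [docA, docB, docC]).map(d => d.id));
// prints [ 'a', 'b' ]
```

Here doc `a` wins because "tools" makes up a larger share of its tokens; rarer keywords would also weigh more through the logarithmic term.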

All of this action happens for every search, at every keystroke into the search field in the web app. Pretty cool!

Development and deploy

⚠️ Note: If you're reading this section to try to set up and run your own Monocle instance, I applaud your audacity, but it might not be super easy or fruitful -- Monocle's modules are pretty specific not only to my data sources, but also the way I structure those files. I won't stop you from trying to build your own search index, but be warned: it might not work, and I'm probably not going to do tech support. For this reason, this section is also written in first-person, mostly for my future reference.

Monocle, like most Ink projects, (ab)uses a Makefile for development tasks. Here are the ones I currently have:

  • Just make, or make run, will run the static file server that serves the Monocle web app, exactly as in my private production (not public-facing) deployment. This server serves docs and indexes from the gzipped versions rather than plain JSON versions on disk.
  • make index will (re-)index documents from all modules. Each module will skip work if there's a cached JSON for it in ./static/indexes. If I want to force a module to re-generate its docs list, I can just delete that cached file.
  • make check or make t will run the unit test suite for the Monocle full text search library.
  • make format or make f formats all source files (outside of ./vendor) with inkfmt, assuming I have it installed.
  • make build commands (re-)build the frontend JavaScript bundle from Ink sources:
    • make build-libs builds the vendor bundle of dependencies and libraries. This rarely needs to be re-run.
    • make build-monocle builds the monocle full text search library into ./static/ink/monocle.js.
    • make build builds the main app code and links up the whole bundle to ink/bundle.js. This is usually what I re-run every change.
    • make build-all runs the whole build from top to bottom.
    • make watch or make w runs make build on every Ink file change.

Generating and rebuilding indexes

Nearly all the modules, except third-party data sources mentioned below, pull data out of disk from an installation of Noct, my custom file storage and sync layer shared across many of my productivity tools. Each module depends on its data being stored at a specific path within the user's Noct directory root. You can usually find this path by reading the module's source code. Running make index will generally do the right thing here.

For the third-party data modules (tweets, pocket, and ideaflow), I need to pre-process the data into a JSON file, and then point these modules to those files.

  • For Tweets, I export my Tweets from Twitter using the archive / export feature, and save a JSON array of my Tweets with schema

     type Tweets = {
     	id: string // Tweet Snowflake ID
     	content: string // Tweet full_text, with expanded entities
     }[]
  • For bookmarks I've saved on Pocket, I click on "Export" under "Manage your account" in the web interface to get an HTML archive of all my bookmarked notes. I do this instead of going through the API because this is much faster than waiting for Pocket's API rate limits if I simply want to get a list of URLs, which is all I need. This functionality is also accessible at https://getpocket.com/export.

    After I have that list of URLs, I open it in the browser and run the little JavaScript snippet in modules/pocket.ink to produce a JSON of titles and links.

    Then, I run that through node modules/pocket-full-text/index.js which optionally downloads, parses, and re-saves a full-text archive of all of those pages using Mozilla's excellent Readability.js library. This is to make the full text of all of those bookmarked pages indexable in Monocle. This produces a JSON array in the following format, saved to the specified destination file:

     type Page = {
     	title: string // document title + site name
     	content: string // parsed full text of the bookmarked page
     	href: string // link to the bookmarked page
     }[]

    Finally, I run the indexer in modules/pocket.ink.

  • For Ideaflow notes, I serialize my notes out to text and similarly save them into a JSON array of notes with schema

     type Notes = {
     	id: string // Note ID for deep links
     	content: string // Text-serialized note content
     }[]
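As an example of this kind of pre-processing, here is a TypeScript sketch that converts Twitter archive records into the Tweets schema above. It assumes each archive record has the shape `{ tweet: { id_str, full_text, entities } }`, with shortened links listed under `entities.urls`; that shape, the `ArchiveTweet` type, and the `convertArchive` function are assumptions for illustration, and the export format may vary.

```typescript
// Assumed shape of one record in the exported archive (not verified here).
type ArchiveTweet = {
	tweet: {
		id_str: string;
		full_text: string;
		entities?: { urls?: { url: string; expanded_url: string }[] };
	};
};

type Tweets = {
	id: string; // Tweet Snowflake ID
	content: string; // Tweet full_text, with expanded entities
}[];

function convertArchive(archive: ArchiveTweet[]): Tweets {
	return archive.map(({ tweet }) => {
		let content = tweet.full_text;
		// Replace each shortened link with the URL it points to.
		for (const link of tweet.entities?.urls ?? []) {
			content = content.split(link.url).join(link.expanded_url);
		}
		return { id: tweet.id_str, content };
	});
}

const sampleArchive: ArchiveTweet[] = [
	{
		tweet: {
			id_str: '123',
			full_text: 'Read https://t.co/abc',
			entities: { urls: [{ url: 'https://t.co/abc', expanded_url: 'https://example.com/post' }] },
		},
	},
];

console.log(JSON.stringify(convertArchive(sampleArchive)));
// prints [{"id":"123","content":"Read https://example.com/post"}]
```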

Future work

Monocle is reasonably feature-complete, enough for me to use it day to day without any problems or pain points. But there are some lingering ideas I'd like to try.

  • Richer previews in the right pane, rendering basic Markdown and images
  • Support for literal match queries (e.g. "note-taking") and NOT / OR in the query

I also want to try adding these data sources as modules.

  • YouTube watch history
  • The other smaller blogs I have: dotink.co, linus.coffee
