  • Stars: 208
  • Rank: 188,150 (Top 4 %)
  • Language: JavaScript
  • License: MIT License
  • Created over 1 year ago
  • Updated 21 days ago

SemanticFinder

Frontend-only live semantic search with transformers.js

Try the web app, install the Chrome extension or read the introduction blog post.

Semantic search right in your browser! Calculates the embeddings and cosine similarity client-side without server-side inferencing, using transformers.js and a quantized version of sentence-transformers/all-MiniLM-L6-v2.
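
As a rough illustration of the similarity measure mentioned above, here is a minimal cosine similarity sketch in plain JavaScript. The function name `cosineSimilarity` and the toy vectors are assumptions made for this example; it is not necessarily how SemanticFinder implements the calculation.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1], or 0 if either vector has zero length.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors; embeddings from all-MiniLM-L6-v2 have 384 dimensions.
console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // 1
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0
```

A score close to 1 means the two texts point in nearly the same direction in embedding space, which is what the search treats as "semantically similar".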

Catalogue

You can use pre-indexed examples for really large books such as the Bible or Les Misérables, with hundreds of pages, and search their content in less than 2 seconds 🚀. Try one of these and see for yourself:

filesize | textTitle | textAuthor | textYear | textLanguage | URL | modelName | quantized | splitParam | splitType | characters | chunks | wordsToAvoidAll | wordsToCheckAll | wordsToAvoidAny | wordsToCheckAny | exportDecimals | lines | textNotes | textSourceURL | filename
4.78 | Das Kapital | Karl Marx | 1867 | de | https://do-me.github.io/SemanticFinder/?hf=Das_Kapital_c1a84fba | Xenova/multilingual-e5-small | True | 80 | Words | 2003807 | 3164 | | | | | 5 | 28673 | | https://ia601605.us.archive.org/13/items/KarlMarxDasKapitalpdf/KAPITAL1.pdf | Das_Kapital_c1a84fba.json.gz
2.58 | Divina Commedia | Dante | 1321 | it | https://do-me.github.io/SemanticFinder/?hf=Divina_Commedia_d5a0fa67 | Xenova/multilingual-e5-base | True | 50 | Words | 383782 | 1179 | | | | | 5 | 6225 | | http://www.letteratura-italiana.com/pdf/divina%20commedia/08%20Inferno%20in%20versione%20italiana.pdf | Divina_Commedia_d5a0fa67.json.gz
11.92 | Don Quijote | Miguel de Cervantes | 1605 | es | https://do-me.github.io/SemanticFinder/?hf=Don_Quijote_14a0b44 | Xenova/multilingual-e5-base | True | 25 | Words | 1047150 | 7186 | | | | | 4 | 12005 | | https://parnaseo.uv.es/lemir/revista/revista19/textos/quijote_1.pdf | Don_Quijote_14a0b44.json.gz
0.06 | Hansel and Gretel | Brothers Grimm | 1812 | en | https://do-me.github.io/SemanticFinder/?hf=Hansel_and_Gretel_4de079eb | TaylorAI/gte-tiny | True | 100 | Chars | 5304 | 55 | | | | | 5 | 9 | | https://www.grimmstories.com/en/grimm_fairy-tales/hansel_and_gretel | Hansel_and_Gretel_4de079eb.json.gz
1.74 | IPCC Report 2023 | IPCC | 2023 | en | https://do-me.github.io/SemanticFinder/?hf=IPCC_Report_2023_2b260928 | Supabase/bge-small-en | True | 200 | Chars | 307811 | 1566 | | | | | 5 | 3230 | state of knowledge of climate change | https://report.ipcc.ch/ar6syr/pdf/IPCC_AR6_SYR_LongerReport.pdf | IPCC_Report_2023_2b260928.json.gz
25.56 | King James Bible | | None | en | https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_24f6dc4c | TaylorAI/gte-tiny | True | 200 | Chars | 4556163 | 23056 | | | | | 5 | 80496 | | https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf | King_James_Bible_24f6dc4c.json.gz
11.45 | King James Bible | | None | en | https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_6434a78d | TaylorAI/gte-tiny | True | 200 | Chars | 4556163 | 23056 | | | | | 2 | 80496 | | https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf | King_James_Bible_6434a78d.json.gz
39.32 | Les Misérables | Victor Hugo | 1862 | fr | https://do-me.github.io/SemanticFinder/?hf=Les_Misérables_2239df51 | Xenova/multilingual-e5-base | True | 25 | Words | 3236941 | 19463 | | | | | 5 | 74491 | All five acts included | https://beq.ebooksgratuits.com/vents/Hugo-miserables-1.pdf | Les_Misérables_2239df51.json.gz
0.46 | REGULATION (EU) 2023/138 | European Commission | 2022 | en | https://do-me.github.io/SemanticFinder/?hf=REGULATION_(EU)_2023_138_c00e7ff6 | Supabase/bge-small-en | True | 25 | Words | 76809 | 424 | | | | | 5 | 1323 | | https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32023R0138&qid=1704492501351 | REGULATION_(EU)_2023_138_c00e7ff6.json.gz
0.07 | Universal Declaration of Human Rights | United Nations | 1948 | en | https://do-me.github.io/SemanticFinder/?hf=Universal_Declaration_of_Human_Rights_0a7da79a | TaylorAI/gte-tiny | True | \nArticle | Regex | 8623 | 63 | | | | | 5 | 109 | 30 articles | https://www.un.org/en/about-us/universal-declaration-of-human-rights | Universal_Declaration_of_Human_Rights_0a7da79a.json.gz
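
In the catalogue above, each link carries the index identifier in its `hf` query parameter, and the matching file name appears to be that identifier plus `.json.gz`. The sketch below only illustrates this observed pattern; the helper name `catalogueEntry` is hypothetical and the pattern is read off the table, not taken from a documented API.

```javascript
// Extract the index identifier from a catalogue link and derive the file name.
// Assumption: the "<id>.json.gz" naming is inferred from the catalogue rows.
function catalogueEntry(link) {
  const id = new URL(link).searchParams.get("hf");
  if (!id) throw new Error("Link has no hf parameter");
  return { id, filename: `${id}.json.gz` };
}

const entry = catalogueEntry(
  "https://do-me.github.io/SemanticFinder/?hf=Das_Kapital_c1a84fba"
);
console.log(entry.filename); // Das_Kapital_c1a84fba.json.gz
```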

Import & Export

You can create indices yourself with one or two clicks and save them. If the text is private, keep the index to yourself; if it's a classic book or something you think others might be interested in, consider a PR on the Hugging Face repo or get in touch with us. Book requests are happily met if you provide a good source link from which we can copy and paste the text. Simply open an issue here with [Book Request] or similar, or contact us.

It goes without saying that no discriminatory content will be tolerated.

Installation

Clone the repository and install dependencies with

npm install

Then run with

npm run start

If you want to build instead, run

npm run build

Afterwards, you'll find the index.html, main.css and bundle.js in dist.

Browser extension

Download the Chrome extension from the Chrome Web Store and pin it. Right-click the extension icon for options:

  • choose distiluse-base-multilingual-cased-v2 for multilingual usage (default is English-only)
  • set a higher number for min characters to split by for larger texts

Local build

If you want to build the browser extension locally, clone the repo, cd into the extension directory and then:

  • npm install
  • npm run build for a static build or
  • npm run dev for the auto-refreshing development version
  • go to Chrome extension settings with chrome://extensions
  • select Load Unpacked and choose the build folder
  • pin the extension in Chrome so you can access it easily. If it doesn't work for you, feel free to open an issue.

Speed

Tested on the entire book of Moby Dick with 660,000 characters (~13,000 lines or ~111,000 words). Initial embedding generation takes 1-2 minutes on my old i7-8550U CPU with a segment size of 1000 characters. Subsequent queries take only ~2 seconds! If you want to query larger texts or keep an entire library of books indexed, use a proper vector database instead.
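
To put these figures in perspective, the back-of-the-envelope calculation below estimates the per-segment embedding time. It uses only the numbers quoted above and assumes segments of exactly 1000 characters, which is a simplification because words are never split.

```javascript
// Figures from the Moby Dick test described above.
const characters = 660000;
const segmentSize = 1000; // characters per segment (approximate in practice)
const indexingSeconds = [60, 120]; // the reported 1-2 minutes

const segments = Math.ceil(characters / segmentSize);
const perSegmentMs = indexingSeconds.map((s) =>
  Math.round((s * 1000) / segments)
);

console.log(segments); // 660
console.log(perSegmentMs); // [ 91, 182 ]
```

So on that CPU each segment takes roughly 0.1-0.2 seconds to embed, while a query over the already cached embeddings only needs one new embedding plus 660 similarity calculations.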

Features

You can customize everything!

  • Input text & search term(s)
  • Hybrid search (semantic search & full-text search)
  • Segment length (larger segments are faster, smaller segments slower)
  • Highlight colors (currently hard-coded)
  • Number of highlights, based on the threshold value: the lower the threshold, the more results
  • Live updates
  • Easy integration of other ML-models thanks to transformers.js
  • Data privacy-friendly - your input text data is not sent to a server, it stays in your browser!

Usage ideas

  • Basic search through anything, like your personal notes (my initial motivation by the way, a huge notes.txt file I couldn't handle anymore)
  • Remember poem analysis in school? Often you look for possible leitmotifs or recurring categories, like food in Hänsel & Gretel.

Future ideas

  • One could package everything nicely and use it e.g. instead of JavaScript search engines such as Lunr.js (also being used in mkdocs-material).
  • Integration in mkdocs (mkdocs-material) experimental:
    • when building the docs, slice all .md files into chunks (length defined in mkdocs.yaml). Chunks should be fairly large (>800 characters) for lower response time. It's also possible to build n indices, starting with a coarse index (maybe per document/.md file, if the model used supports the length) followed by a refined one for the document chunks
    • build the index by calculating the embeddings for all docs/chunks
    • when a user queries the docs, a switch toggles between (fast) standard full-text search (currently with lunr.js) and experimental semantic search
    • if the latter is selected, the client loads the model (all-MiniLM-L6-v2 is ~30 MB)
    • as in SemanticFinder, the embedding is created client-side and the cosine similarity is calculated
    • the highest-scoring results are returned just like with lunr.js, so the user shouldn't even notice a difference in the UI
  • Electron- or browser-based apps could be augmented with semantic search, e.g. VS Code, Atom or mobile apps.
  • Integration in personal wikis such as Obsidian, tiddlywiki etc. would save you the tedious tagging/keywords/categorisation work or could at least improve your structure further
  • Search your own browser history (thanks @Snapdeus)
  • Integration in chat apps
  • Allow PDF-uploads (conversion from PDF to text)
  • Integrate with Speech-to-Text whisper model from transformers.js to allow audio uploads.
  • Thanks to CodeMirror one could even use syntax highlighting for programming languages such as Python, JavaScript etc.
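
The coarse-then-refined index idea from the mkdocs item above could look roughly like the following sketch. Everything here is hypothetical: the index layout (`file`, `embedding`, `chunks`), the function name `twoStageSearch` and the toy 2-dimensional vectors are assumptions for illustration, and the embeddings are taken as precomputed at build time.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical index layout: one coarse embedding per document plus one per chunk.
function twoStageSearch(queryEmbedding, docs, topDocs = 2, topChunks = 3) {
  // Stage 1: rank whole documents by their coarse embedding.
  const candidates = docs
    .map((doc) => ({ doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topDocs);

  // Stage 2: rank only the chunks of the shortlisted documents.
  return candidates
    .flatMap(({ doc }) =>
      doc.chunks.map((chunk) => ({
        file: doc.file,
        text: chunk.text,
        score: cosineSimilarity(queryEmbedding, chunk.embedding),
      }))
    )
    .sort((x, y) => y.score - x.score)
    .slice(0, topChunks);
}

// Toy index with made-up 2-dimensional embeddings.
const docsIndex = [
  {
    file: "install.md",
    embedding: [1, 0],
    chunks: [
      { text: "npm install", embedding: [0.9, 0.1] },
      { text: "npm run build", embedding: [0.7, 0.3] },
    ],
  },
  {
    file: "faq.md",
    embedding: [0, 1],
    chunks: [{ text: "Why is it slow?", embedding: [0.1, 0.9] }],
  },
];

console.log(twoStageSearch([1, 0.1], docsIndex, 1).map((r) => r.text));
```

The coarse stage keeps the number of chunk comparisons small, at the cost of possibly missing a relevant chunk inside a document that scored poorly as a whole.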

Logic

Transformers.js is doing all the heavy lifting of tokenizing the input and running the model. Without it, this demo would have been impossible.

Input

  • Text, as much as your browser can handle! The demo uses a part of "Hänsel & Gretel" but it can handle hundreds of PDF pages
  • A search term or phrase
  • The approximate number of characters per segment
  • A similarity threshold value. Results with a similarity score below the threshold won't be displayed.

Output

  • Three highlighted string segments, the darker the higher the similarity score.
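
A minimal sketch of how the threshold and the three highlights could interact is shown below. The function name `selectHighlights`, the `opacity` field used for "darker" shading and the example scores are all assumptions for illustration; the actual colour handling in SemanticFinder may differ.

```javascript
// scores: Map of segment text -> similarity score.
// Keeps segments at or above the threshold and returns the best few, best first.
function selectHighlights(scores, threshold, maxHighlights = 3) {
  return [...scores.entries()]
    .filter(([, score]) => score >= threshold)
    .sort((a, b) => b[1] - a[1])
    .slice(0, maxHighlights)
    .map(([segment, score], rank) => ({
      segment,
      score,
      // Hypothetical shading: rank 0 (best match) gets the most opaque colour.
      opacity: 1 - rank / maxHighlights,
    }));
}

// Made-up scores for a query such as "food".
const exampleScores = new Map([
  ["the house was built of bread", 0.62],
  ["the children walked through the forest", 0.21],
  ["the roof was made of cake", 0.55],
  ["the windows were of clear sugar", 0.48],
  ["their father was a woodcutter", 0.35],
]);

console.log(selectHighlights(exampleScores, 0.3).map((h) => h.segment));
```

Lowering the threshold lets more segments pass the filter, which matches the note in the feature list that a lower value yields more results.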

Pipeline

  1. All scripts are loaded. The model is downloaded from Hugging Face once and cached in the browser afterwards.
  2. The user enters some text and a search term or phrase.
  3. The text is split into segments of approximately the chosen length (in characters). Words are never split, which is why the segment length is only approximate.
  4. The embedding of the search term is created.
  5. An embedding is created for each segment of the text.
  6. Meanwhile, the cosine similarity between each segment embedding and the search term embedding is calculated and written to a dictionary with the segment as key and the score as value.
  7. On every iteration, the progress bar and the highlighted sections are updated in real time based on the highest scores so far.
  8. The embeddings are cached in the dictionary, so subsequent queries are quite fast. Calculating the cosine similarity is fast compared to generating the embeddings.
  9. The embeddings only have to be recalculated if the user changes the segment length.
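
The steps above can be sketched as follows. This is a simplified illustration, not the project's actual code: the names `splitText`, `createSearcher`, `createTransformersEmbedder` and `toyEmbed` are hypothetical, the splitter normalizes whitespace, and the optional real embedder assumes that the `@xenova/transformers` package is installed. The demo at the end uses a toy letter-count embedder so that it runs without downloading a model.

```javascript
// Step 3: split text into segments of roughly `size` characters without cutting words.
function splitText(text, size) {
  const segments = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    if (current && current.length + 1 + word.length > size) {
      segments.push(current);
      current = word;
    } else {
      current = current ? `${current} ${word}` : word;
    }
  }
  if (current) segments.push(current);
  return segments;
}

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Steps 4-8: embed the query, embed each segment once (cached), score and rank.
function createSearcher(embed) {
  const cache = new Map(); // segment text -> embedding
  return async function search(text, query, size) {
    const queryEmbedding = await embed(query);
    const scores = new Map(); // segment text -> similarity score
    for (const segment of splitText(text, size)) {
      // A new segment length yields new segment texts, so they are embedded again (step 9).
      if (!cache.has(segment)) cache.set(segment, await embed(segment));
      scores.set(segment, cosineSimilarity(queryEmbedding, cache.get(segment)));
    }
    return [...scores.entries()].sort((a, b) => b[1] - a[1]);
  };
}

// Optional real embedder (not used in the demo below).
// Assumption: the @xenova/transformers package is available.
async function createTransformersEmbedder(model = "Xenova/all-MiniLM-L6-v2") {
  const { pipeline } = await import("@xenova/transformers");
  const extractor = await pipeline("feature-extraction", model);
  return async (text) => {
    const output = await extractor(text, { pooling: "mean", normalize: true });
    return Array.from(output.data);
  };
}

// Toy embedder for demonstration only: counts the letters a-z.
async function toyEmbed(text) {
  const vector = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) vector[i] += 1;
  }
  return vector;
}

const demoSearch = createSearcher(toyEmbed);
demoSearch("Hansel and Gretel walked into the dark forest", "forest", 20)
  .then((results) => console.log(results[0][0])); // forest
```

To try it with a real model, one would replace `toyEmbed` with the function returned by `await createTransformersEmbedder()`; the cache then avoids recomputing segment embeddings for every new search term.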

Collaboration

PRs welcome!

To Dos (no prioritization)

  • similarity score cutoff/threshold
  • add option for more highlights (e.g. all above certain score)
  • add stop button
  • MaterialUI for input fields or proper labels
  • create a demo without CDNs
  • properly separate the single HTML file into HTML, JS and CSS
  • add npm installation
  • option for loading embeddings from file or generally allow sharing embeddings in some way
  • simplify chunking function so the original text can be loaded without issues
  • improve the color range
  • rewrite the cosine similarity function in Rust, port to WASM and load as a module for possible speedup (experimental)
  • UI overhaul
  • polish code
    • jQuery/vanilla JS mixed
    • clean up functions
    • add more comments
  • add possible use cases
  • package as a standalone application (maybe with custom model choice; to be downloaded once from HF hub, then saved locally)
  • possible integration as example in transformers.js homepage

Star History

Star History Chart

Gource Map

image

Gource image created with:

gource -1280x720 --title "SemanticFinder" --seconds-per-day 0.03 --auto-skip-seconds 0.03 --bloom-intensity 0.5 --max-user-speed 500 --highlight-dirs --multi-sampling --highlight-colour 00FF00  
