  • Stars: 2,092
  • Rank: 22,081 (Top 0.5%)
  • Language: Python
  • License: MIT License
  • Created over 1 year ago
  • Updated over 1 year ago

Repository Details

Multi-tool for semantic search

Semantra

Semantra is a multipurpose tool for semantically searching documents. Query by meaning rather than just by matching text.

The tool, made to run on the command line, analyzes specified text and PDF files on your computer and launches a local web search application for interactively querying them. The purpose of Semantra is to make running a specialized semantic search engine easy, friendly, configurable, and private/secure.

Semantra is built for individuals seeking needles in haystacks: journalists sifting through leaked documents on deadline, researchers seeking insights within papers, students engaging with literature by querying themes, historians connecting events across books, and so forth.

Resources

  • Tutorial: a gentle introduction to getting started with Semantra, covering everything from installing the tool to hands-on examples of analyzing documents with it
  • Guides: practical guides on how to do more with Semantra
  • Concepts: explainers on some concepts to better understand how Semantra works
  • Using the web interface: a reference on how to use the Semantra web app

This page gives a high-level overview of Semantra and a reference of its features. It's also available in other languages: Semantra en español, Semantra 中文说明.

Installation

Ensure you have Python >= 3.9.
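
If you are unsure which interpreter `python3` resolves to, a short check like the following can confirm the version requirement before installing. This is only a convenience sketch; the helper name `meets_minimum` is made up for illustration and is not part of Semantra.

```python
import sys

# Minimum version stated in the installation instructions above.
MIN_VERSION = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if meets_minimum():
    print("Python version is new enough for Semantra.")
else:
    print("Python 3.9 or newer is required; found", sys.version.split()[0])
```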

The easiest way to install Semantra is via pipx. If you do not have pipx installed, run:

python3 -m pip install --user pipx
python3 -m pipx ensurepath

Open a new terminal window so that the path settings made by pipx take effect. Then run:

pipx install semantra

This will install Semantra on your path. You should be able to run semantra in the terminal and see output.

Usage

Semantra operates on collections of documents (text or PDF files) stored on your local computer.

At its simplest, you can run Semantra over a single document by running:

semantra doc.pdf

You can run Semantra over multiple documents, too:

semantra report.pdf book.txt

Semantra will take some time to process the input documents. This is a one-time operation per document (subsequent runs over the same document collection will be near instantaneous).

Once processing is complete, Semantra will launch a local web server, by default at localhost:8080. On this web page, you can interactively and semantically query the documents you passed in.

Quick notes:

When you first run Semantra, it may take several minutes and several hundred megabytes of hard disk space to download a local machine learning model that can process the document you passed in. The model used can be customized, but the default one is a great mix of being fast, lean, and effective.

If you want to process documents quickly without using your own computational resources and don't mind paying or sharing data with external services, you can use OpenAI's embedding model.

Quick tour of the web app

When you first navigate to the Semantra web interface, you will see a screen like this:

Semantra web interface

Type something into the search box to start querying semantically. Hit Enter or click the search icon to execute the query.

Search results will appear in the left pane ordered by most relevant documents:

Semantra search results

The yellow scores show relevance on a scale from 0 to 1.00. Anything in the 0.50 range indicates a strong match. Lighter brown highlights will stream in over the search results, marking the portions most relevant to your query.

Clicking on a search result's text will navigate to the relevant section of the associated document.

Highlighted search result in document

Clicking on the plus/minus buttons associated with a search result will positively/negatively tag those results. Re-running the query will cause these additional query parameters to go into effect.

Positively/negatively tagging search results

Finally, text queries can be added and subtracted with plus/minus signs in the query text to sculpt a precise semantic meaning.

Adding and subtracting text queries
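
One common way to realize this kind of query arithmetic is to embed each query term separately and combine the resulting vectors with positive and negative weights. The sketch below illustrates that idea only; this page does not spell out Semantra's exact computation, and the vectors, the `combine` helper, and the weights are all made up for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def combine(terms: List[Tuple[float, Vector]]) -> Vector:
    """Weighted sum of query vectors: +1.0 adds a meaning, -1.0 subtracts it."""
    combined = [0.0] * len(terms[0][1])
    for weight, vector in terms:
        for i, value in enumerate(vector):
            combined[i] += weight * value
    return combined


# Made-up 3-dimensional "embeddings" (real models produce hundreds of dimensions).
# Pretend the axes mean: programming, reptile, cooking.
toy_embeddings = {
    "python": [0.9, 0.8, 0.0],  # ambiguous: language or snake
    "snake": [0.0, 1.0, 0.0],
}

# Roughly the idea behind a query like "python" minus "snake".
query = combine([
    (1.0, toy_embeddings["python"]),
    (-1.0, toy_embeddings["snake"]),
])
print(query)  # the "reptile" component is pushed below zero
```

The combined vector then takes the place of a single query embedding when comparing against document passages.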

For a more in-depth walkthrough of the web app, check out the tutorial or the web app reference.

Quick concepts

Using a semantic search engine is fundamentally different from using an exact text-matching algorithm.

For starters, a query will always return search results, no matter how irrelevant the query is to the documents. The scores may be really low, but the results will never disappear entirely. This is because semantic searching with query arithmetic often reveals useful results amid very minor score differences. The results are always sorted by relevance, and only the top 10 results per document are shown, so lower-scoring results are cut off automatically.

Another difference is that Semantra will not necessarily find exact text matches if you query something that directly appears in the document. At a high level, this is because words can mean different things in different contexts, e.g. the word "leaves" can refer to the leaves on trees or to someone leaving. The embedding models that Semantra uses convert all the text and queries you enter into long sequences of numbers that can be mathematically compared, and an exact substring match is not always significant in this sense. See the embeddings concept doc for more information on embeddings.
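
To make the ideas in this section concrete, here is a minimal sketch of ranking passages by vector similarity and keeping the top results. It assumes cosine similarity as the comparison, which is a common choice but is not confirmed by this page as Semantra's exact scoring; the two-dimensional vectors and the function names are made up for illustration.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float],
    passages: List[Tuple[str, List[float]]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and keep the k best, highest first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors: both passages contain the word "leaves",
# but they point in different directions in the toy embedding space.
passages = [
    ("she leaves the office at five", [0.1, 1.0]),
    ("autumn leaves on the trees", [1.0, 0.1]),
]

# A query vector that happens to lie close to the "foliage" direction.
results = top_k([0.9, 0.2], passages, k=10)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

Note that every passage receives a score, so a result list is never empty; weak matches simply sink to the bottom and are dropped once more than `k` candidates exist.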

Command-line reference

semantra [OPTIONS] [FILENAME(S)]...

Options

  • --model [openai|minilm|mpnet|sgpt|sgpt-1.3B]: Preset model to use for embedding. See the models guide for more info (default: mpnet)
  • --transformer-model TEXT: Custom Huggingface transformers model name to use for embedding (only one of --model and --transformer-model should be specified). See the models guide for more info
  • --windows TEXT: Embedding windows to extract. A comma-separated list of the format "size[_offset=0][_rewind=0]". A window with size 128, offset 0, and rewind of 16 (128_0_16) will embed the document in chunks of 128 tokens which partially overlap by 16. Only the first window is used for search. See the windows concept doc for more information (default: 128_0_16)
  • --encoding: Encoding to use for reading text files (default: utf-8)
  • --no-server: Do not start the UI server (only process)
  • --port INTEGER: Port to use for embedding server (default: 8080)
  • --host TEXT: Host to use for embedding server (default: 0.0.0.0)
  • --pool-size INTEGER: Max number of embedding tokens to pool together in requests
  • --pool-count INTEGER: Max number of embeddings to pool together in requests
  • --doc-token-pre TEXT: Token to prepend to each document in transformer models (default: None)
  • --doc-token-post TEXT: Token to append to each document in transformer models (default: None)
  • --query-token-pre TEXT: Token to prepend to each query in transformer models (default: None)
  • --query-token-post TEXT: Token to append to each query in transformer models (default: None)
  • --num-results INTEGER: Number of results (neighbors) to retrieve per file for queries (default: 10)
  • --annoy: Use approximate kNN via Annoy for queries (faster querying at a slight cost of accuracy); if false, use exact exhaustive kNN (default: True)
  • --num-annoy-trees INTEGER: Number of trees to use for approximate kNN via Annoy (default: 100)
  • --svm: Use SVM instead of any kind of kNN for queries (slower and only works on symmetric models)
  • --svm-c FLOAT: SVM regularization parameter; higher values penalize mispredictions more (default: 1.0)
  • --explain-split-count INTEGER: Number of splits on a given window to use for explaining a query (default: 9)
  • --explain-split-divide INTEGER: Factor to divide the window size by to get each split length for explaining a query (default: 6)
  • --num-explain-highlights INTEGER: Number of split results to highlight for explaining a query (default: 2)
  • --force: Force process even if cached
  • --silent: Do not print progress information
  • --no-confirm: Do not show cost and ask for confirmation before processing with OpenAI
  • --version: Print version and exit
  • --list-models: List preset models and exit
  • --show-semantra-dir: Print the directory semantra will use to store processed files and exit
  • --semantra-dir PATH: Directory to store semantra files in
  • --help: Show this message and exit
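
As an illustration of the --windows format described above, the following sketch parses a single window spec and computes overlapping chunk boundaries. It is one plausible reading of the description, not Semantra's implementation: the function names are hypothetical, and the offset value is parsed but not applied, because this page does not define how it shifts the chunks.

```python
from typing import List, Tuple


def parse_window(spec: str) -> Tuple[int, int, int]:
    """Parse 'size[_offset][_rewind]' into (size, offset, rewind); omitted parts default to 0."""
    parts = [int(p) for p in spec.split("_")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected size[_offset][_rewind], got {spec!r}")
    size, offset, rewind = parts + [0] * (3 - len(parts))
    return size, offset, rewind


def chunk_spans(num_tokens: int, size: int, rewind: int) -> List[Tuple[int, int]]:
    """Return (start, end) token spans; each chunk starts `rewind` tokens before the previous one ended."""
    if rewind >= size:
        raise ValueError("rewind must be smaller than size")
    spans = []
    start = 0
    while start < num_tokens:
        end = min(start + size, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start = end - rewind  # step back to create the overlap
    return spans


# The default window from the option list, applied to a 300-token document.
size, offset, rewind = parse_window("128_0_16")
spans = chunk_spans(300, size, rewind)
print(spans)  # [(0, 128), (112, 240), (224, 300)]
```

With the default 128_0_16, consecutive chunks share 16 tokens, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is short enough.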

Frequently asked questions

Can it use ChatGPT?

No, and this is by design.

Semantra does not use any generative models like ChatGPT. It is built only to query text semantically without any layers on top to attempt explaining, summarizing, or synthesizing results. Generative language models occasionally produce outwardly plausible but ultimately incorrect information, placing the burden of verification on the user. Semantra treats primary source material as the only source of truth and endeavors to show that a human-in-the-loop search experience on top of simpler embedding models is more serviceable to users.

Development

The Python app is in src/semantra/semantra.py and is managed as a standard Python command-line project with pyproject.toml.

The local web app is written in Svelte and managed as a standard npm application.

To develop for the web app, cd into client and then run npm install.

To build the web app, run npm run build. To build the web app in watch mode and rebuild whenever there are changes, run npm run build:watch.

Contributions

The app is still in its early stages, but contributions are welcome. Please feel free to submit an issue for any bugs or feature requests.
