⍼ Resin.Search
HTTP search engine/embedded library
Launch a Resin HTTP server or use the Resin search library to search through any vector space. With hardware-accelerated vector operations from MathNet, Resin is especially well suited to problem spaces that can be defined as vector spaces.
Vector spaces are configured by implementing IModel.
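To make the idea concrete, here is a minimal Python sketch of what a model does conceptually: it turns a piece of data into vectors in a chosen space. This is not Resin's actual IModel interface, which is a .NET type in Sir.Core whose members are not listed here. The class name BagOfCharsModel, the tokenize method and the choice of one vector per word are assumptions made for illustration only.

```python
from collections import Counter


class BagOfCharsModel:
    """Hypothetical model: maps text to sparse bag-of-characters vectors."""

    def tokenize(self, text):
        """Return one sparse vector (character code -> count) per word."""
        vectors = []
        for word in text.lower().split():
            # Each distinct character is a dimension; its count is the value.
            vectors.append(dict(Counter(ord(ch) for ch in word)))
        return vectors


model = BagOfCharsModel()
vectors = model.tokenize("Hello world")
```

Swapping in a different tokenize implementation (for example, one that emits pixel vectors for images) would define a different vector space without changing anything else.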
Document database
Resin stores data as document collections. It applies your preferred IModel to your data as you write and query it. The write pipeline produces a set of indices (graphs), one for each document field, that you may interact with through the Resin web GUI, the Resin read/write JSON HTTP API, or programmatically.
Vector-based indices
Resin indices are binary search trees that, as you populate them with your data, cluster vectors that are similar to each other. Graph nodes are created in the Tokenize method of your model. When a node is added to the graph, the cosine of its angle to other nodes, i.e. its similarity to them, determines its position (path) within the graph.
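The placement rule described above can be sketched in a few lines of Python. This is a simplified illustration, not Resin's implementation: the two thresholds (IDENTICAL and FOLD), their values, the left/right convention and the Node layout are all assumptions. The sketch merges a vector into an existing node when the two are nearly identical, descends left when they are similar, and descends right otherwise.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Node:
    def __init__(self, vector, doc_id):
        self.vector = vector
        self.doc_ids = [doc_id]  # documents whose vector landed on this node
        self.left = None         # subtree of similar vectors (same cluster)
        self.right = None        # subtree of dissimilar vectors


IDENTICAL = 0.999  # assumed threshold: treat as the same vector
FOLD = 0.5         # assumed threshold: similar enough to go left


def insert(root, vector, doc_id):
    """Walk from the root; similarity at each node decides the path."""
    node = root
    while True:
        sim = cosine(vector, node.vector)
        if sim >= IDENTICAL:
            node.doc_ids.append(doc_id)
            return node
        branch = "left" if sim > FOLD else "right"
        child = getattr(node, branch)
        if child is None:
            child = Node(vector, doc_id)
            setattr(node, branch, child)
            return child
        node = child


def vec(word):
    """Bag-of-characters vector, used here only to have something to index."""
    return {ord(ch): float(n) for ch, n in Counter(word).items()}


root = Node(vec("resin"), 0)
insert(root, vec("rinse"), 1)   # same characters as "resin": merged into root
insert(root, vec("resins"), 2)  # similar: placed in the left subtree
insert(root, vec("quartz"), 3)  # dissimilar: placed in the right subtree
```

Note that under a bag-of-characters model anagrams such as "resin" and "rinse" produce the same vector, which is why they end up on the same node in this sketch.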
Customizable vector spaces
Resin comes pre-loaded with two IModel vector space configurations: one for text and another for MNIST images. The text model has been tested by validating indices generated from Wikipedia search engine backup files, as well as by parsing Common Crawl WAT, WET and WARC files, to determine the scale at which Resin can operate and with what accuracy.
The image model is included mostly as an example of how to implement your own preferred machine-learning algorithm for building custom-made search indices. The error rate of the image classifier is ~5%.
Performance
Currently, Wikipedia-sized data sets produce indices capable of sub-second phrase searching.
You may also
- build, validate and optimize indices using the command-line tool Sir.Cmd
- read efficiently by specifying which fields to return in the JSON result
- implement messaging formats such as XML (or any other, really) if JSON is not suitable for your use case
- construct queries that join between fields and even between collections, and either post them as JSON to the read endpoint or create them programmatically
- construct any type of indexing scheme that produces any type of embeddings, with virtually any dimensionality, using either sparse or dense vectors
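On the last point, sparse and dense vectors are two encodings of the same coordinates, so a similarity measure gives the same answer for both. The short Python sketch below illustrates this; the function names are hypothetical and none of this is Resin's IVector API.

```python
import math


def to_dense(sparse, dimensions):
    """Expand a {index: value} vector into a list of length `dimensions`."""
    dense = [0.0] * dimensions
    for index, value in sparse.items():
        dense[index] = value
    return dense


def cosine_dense(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_sparse(a, b):
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


a = {0: 1.0, 7: 2.0}
b = {7: 3.0, 9: 4.0}
sparse_result = cosine_sparse(a, b)
dense_result = cosine_dense(to_dense(a, 10), to_dense(b, 10))
```

The sparse form only stores and visits non-zero components, which matters when the dimensionality is large and most components are zero.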
Applications
Executables
- Sir.HttpServer: HTTP search service with HTML GUI and HTTP JSON API for reading and writing.
- Sir.Cmd: Command-line tool that executes commands that implement Sir.ICommand. Write, validate, optimize and more via command-line.
Libraries
- Sir.Search: In-process search engine.
- Sir.Core: Core types and shared interfaces, such as IModel, ICommand and IVector.
- Sir.CommonCrawl: Command for downloading and indexing Common Crawl WAT and WET files.
- Sir.Mnist: Command for training and testing the accuracy of an index of MNIST images.
- Sir.Wikipedia: Command for indexing Wikipedia.
Roadmap
- v0.1a - bag-of-characters vector space language model
- v0.2a - HTTP API
- v0.3a - query language
- v0.4 - linear classifier image model
- v0.5 - semantic language model
- v1.0 - voice model
- v2.0 - image-to-voice
- v2.1 - voice-to-text
- v2.2 - text-to-image
- v2.3 - AI
Backlog
Huge
- Distribute data set across many servers (sharding, replication; RPC) or in other ways allow for horizontal scaling
Big
- Memory mapping (to increase speed of querying and perhaps also writing; to increase scalability)
- Update index (allow removal of documents; allow appending to an already persisted index token's postings list)
- Async IO (for scalability)
- Indexing of types other than string
- Enable combining fields with different types in a document/model
- Split application into "crawler" and "search"