  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created almost 13 years ago
  • Updated over 12 years ago

microsearch

A small search library.

Primarily intended to be a learning tool to teach the fundamentals of search.

Useful for embedding into Python apps where you don't want/need something as complex as Lucene.

Part of my (upcoming) 2012 PyCon talk - https://us.pycon.org/2012/schedule/presentation/66/

Requirements

  • Python 2.5+ or Python 3.2+
  • (Optional) simplejson
  • (Optional) unittest2 (Python 2.5 - for running the tests)

Usage

Example:

import microsearch

# Create an instance, pointing it to where the data should be stored.
ms = microsearch.Microsearch('/tmp/microsearch')

# Index some data.
ms.index('email_1', {'text': "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh"})
ms.index('email_2', {'text': 'Everyone,\n\nM-m-m-m-my red stapler has gone missing. H-h-has a-an-anyone seen it?\n\nMilton'})
ms.index('email_3', {'text': "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh"})
ms.index('email_4', {'text': 'How do you feel about becoming Management?\n\nThe Bobs'})

# Search on it.
ms.search('Peter')
ms.search('tps report')
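
To make the idea behind indexing and searching concrete, here is a toy in-memory inverted index. It is only a sketch of the general technique, not how microsearch is implemented: the `TinyIndex` class and its methods are made up for this illustration, and it keeps everything in RAM rather than on disk.

```python
import re
from collections import defaultdict


class TinyIndex(object):
    """Hypothetical toy inverted index; NOT part of microsearch."""

    def __init__(self):
        # Maps each term to the set of document ids that contain it.
        self.postings = defaultdict(set)

    def tokenize(self, text):
        # Lowercase, then split on anything that isn't a letter or digit.
        return [tok for tok in re.split(r'[^a-z0-9]+', text.lower()) if tok]

    def index(self, doc_id, text):
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND search: a document must contain every term in the query.
        terms = self.tokenize(query)
        if not terms:
            return set()
        matches = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            matches &= self.postings.get(term, set())
        return matches


idx = TinyIndex()
idx.index('email_1', "I'm going to need those TPS reports on my desk")
idx.index('email_2', 'My red stapler has gone missing')
print(sorted(idx.search('tps reports')))
```

Because this sketch matches whole terms exactly, a query for 'tps report' finds nothing here: 'report' and 'reports' are different terms. Closing that kind of gap is what stemming or n-grams are for.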

Shortcomings

This library is meant to help others learn. While it has full test coverage, it may not be suitable for production use. Reasons you may not want to use it in Real Code(tm):

  • No concurrency support
    • Tries to work atomically with files
    • But there are no locks
    • So it's possible for writes to overlap between processes
  • Maybe thread-safe?
    • Pretty much everything is on an instance
    • But I haven't tested it extensively with threading
  • No support for deleting documents
    • If an existing document changes or gets deleted, stale data will be left in the index
    • A workaround would be blowing away the index directory, moving the docs out and reindexing them :/
  • Only n-grams are supported
    • Because writing a full Porter or Snowball stemmer is beyond the needs of this library
  • No clue on performance at scale
    • This is a proof-of-concept & learning tool, not Lucene!
    • With a 2011 MBP on the first 1.2K docs of the Enron corpus:
      • Indexing is pretty slow at ~1 document per second
      • Search is pretty fast at ~0.007 sec per query
      • RAM never exceeded 15 MB when indexing, 10 MB when searching
      • Script in the source repo as enron_bench.py.
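
On the n-gram point in the list above: n-grams can stand in for a stemmer because related words tend to share their leading characters. Below is a minimal sketch of one variant, edge (prefix) n-grams. The function name `edge_ngrams` and the gram sizes of 3 to 6 are assumptions chosen for illustration and may not match what microsearch does internally.

```python
def edge_ngrams(token, min_gram=3, max_gram=6):
    """Return the prefixes of ``token`` that are min_gram..max_gram chars long.

    Illustrative helper only; the name and defaults are assumptions.
    """
    token = token.lower()
    # Never ask for a prefix longer than the token itself.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


print(edge_ngrams('reports'))
print(edge_ngrams('report'))
```

With these sizes, 'reports' and 'report' produce the same list of prefixes, so a query for one can match a document containing the other. Tokens shorter than `min_gram` produce no grams at all, which is one trade-off of this approach.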

Running Tests

With a source checkout, run:

In Python 2:

python -m unittest2 tests

In Python 3:

python -m unittest tests

Tests should be passing at all times under both Python 2.7 & Python 3.2.

Contributions

If you wish to contribute to improving microsearch, the code you submit must:

  • Be your own work & BSD-licensed
  • Include a working fix/feature
  • Follow the existing style of the codebase
  • Include passing test coverage of the new code
  • If it's user-facing, must include documentation

Other submissions are welcome, but won't get merged until all of these requirements are met.

author:Daniel Lindsley <[email protected]>
date:2011/02/22
