  • Stars: 521
  • Rank: 84,952 (Top 2 %)
  • Language: Go
  • License: MIT License
  • Created over 5 years ago
  • Updated 10 months ago


Repository Details

command line codespelunker or code search

codespelunker (cs)

A command line search tool. Allows you to search over code or text files in the current directory either on the console, via a TUI or HTTP server, using some boolean queries or regular expressions.

Consider it a similar approach to using ripgrep, silver searcher or grep coupled with fzf but in a single tool.

Dual-licensed under MIT or the UNLICENSE.


Pitch

Why use cs?

  • Reasonably fast
  • Rank results on the fly helping you find things
  • Searches across multiple lines
  • Has a nice TUI
  • Cross-platform (probably needs the new Windows terminal though)

The reason cs exists at all is because I was running into limitations using rg TERM | fzf and decided to solve my own problem.

Install

If you want to create a package to install things, please do. Let me know and I'll ensure I add it here.

Go Get

If you have Go >= 1.20 installed

go install github.com/boyter/cs@v1.3.0

NixOS

nix-shell -p codespelunker

NixOS/nixpkgs#236073

Manual

Binaries for Windows, GNU/Linux and macOS are available from the releases page.

FAQ

Is this as fast as...

No.

You didn't let me finish, I was going to ask if it's as fast as...

The answer is probably no. It's not directly comparable. No other tool I know of works like this outside of full indexing tools such as hound, searchcode, sourcegraph etc... None work on the fly like this does.

While cs does have some overlap with tools like ripgrep, grep, ack or the silver searcher, the reality is it does not work the same way, so any comparison is pointless. It is slower than most of them, but it's also doing something different.

If you do want a comparison, flawed as it is, you can replicate some of what cs does by piping their output into fzf.

On my local machine, which at time of writing is a MacBook Air M1, it can search a recent checkout of the Linux source code in ~2.5 seconds. While absolute performance is not a design goal, I also don't want this to be a slow tool. As such, if any obvious performance gains are on the table I will take them.

Does it work on normal documents?

So long as they are text. I wrote it to search code, but it works just as well on full text documents. The snippet extraction, for example, was tested on Pride and Prejudice. If you had a heap of PDFs you could shell script some use of pdftotext and get something searchable.

Note it was designed for code and as such has full .ignore and .gitignore support.

Where is the index?

There is none. Everything is brute force calculated on the fly. For TUI mode there are some shortcuts taken with caching of results to speed things up.

How does the ranking work then?

Standard BM25, TF/IDF, or the modified TF/IDF used in Lucene (https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), which dampens the impact of term frequency. The --ranker flag picks between them and defaults to bm25.

Technically speaking it's not accurate, because the weights are calculated from the files that matched rather than from every file searched. It works well enough in practice and is calculated on the fly. Try it out and report if something is not working as you expect.
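To make the ranking described above a bit more concrete, here is a minimal sketch of textbook Okapi BM25 where the corpus statistics come only from the matched documents. It is not the code cs uses. The doc type, the bm25 function and the k1 and b constants (1.2 and 0.75, common defaults) are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"math"
)

// doc holds the per-document statistics needed for scoring. The type and its
// field names are hypothetical and exist only for this example.
type doc struct {
	name     string
	length   int            // number of words in the document
	termFreq map[string]int // occurrences of each query term
}

// bm25 scores every document against the query terms. Document count, average
// length and document frequency are computed from the supplied (matched)
// documents only, which is the approximation described in the text.
func bm25(docs []doc, terms []string) map[string]float64 {
	const k1, b = 1.2, 0.75 // assumed textbook defaults

	scores := make(map[string]float64, len(docs))
	if len(docs) == 0 {
		return scores
	}
	total := 0
	for _, d := range docs {
		total += d.length
	}
	avgLen := float64(total) / float64(len(docs))
	if avgLen == 0 {
		avgLen = 1 // avoid dividing by zero when every document is empty
	}
	n := float64(len(docs))

	for _, t := range terms {
		// Count the documents that contain the term.
		df := 0.0
		for _, d := range docs {
			if d.termFreq[t] > 0 {
				df++
			}
		}
		// The +1 inside the log keeps the IDF non-negative.
		idf := math.Log(1 + (n-df+0.5)/(df+0.5))

		for _, d := range docs {
			tf := float64(d.termFreq[t])
			norm := tf + k1*(1-b+b*float64(d.length)/avgLen)
			scores[d.name] += idf * tf * (k1 + 1) / norm
		}
	}
	return scores
}

func main() {
	docs := []doc{
		{"a.txt", 100, map[string]int{"pride": 3, "darcy": 1}},
		{"b.txt", 100, map[string]int{"pride": 1}},
		{"c.txt", 400, map[string]int{"pride": 3, "darcy": 1}},
	}
	for name, score := range bm25(docs, []string{"pride", "darcy"}) {
		fmt.Printf("%s %.3f\n", name, score)
	}
}
```

With these inputs a.txt should outrank both b.txt (fewer matches) and c.txt (same matches in a much longer file), which is the term frequency dampening and length normalisation at work.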

How do you get the snippets?

It's not fun... Have a look at the code: https://github.com/boyter/cs/blob/master/snippet.go

It works by passing the document content to extract the snippet from and all the match locations for each term. It then looks through each location for each word, and checks on either side looking for terms close to it. It then ranks on the term frequency for the term we are checking around and rewards rarer terms. It also rewards more matches, closer matches, exact case matches and matches that are whole words.

For more info read the "Snippet Extraction AKA I am PHP developer" section of this blog post https://boyter.org/posts/abusing-aws-to-make-a-search-engine/
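As a rough illustration of the idea above, here is a heavily simplified sketch. It is not what snippet.go does. It only rewards rarer terms and other terms that sit close to the location being checked, and it ignores the case and whole word bonuses entirely. The bestSnippet and locations names, and the scoring weights, are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// locations returns the byte offsets of every occurrence of term in content.
func locations(content, term string) []int {
	if term == "" {
		return nil
	}
	var locs []int
	for from := 0; ; {
		i := strings.Index(content[from:], term)
		if i < 0 {
			return locs
		}
		locs = append(locs, from+i)
		from += i + len(term)
	}
}

// bestSnippet returns the byte range [start, end) of a window of at most size
// bytes centred on the highest scoring match location. matches maps each term
// to the byte offsets where it occurs in content.
func bestSnippet(content string, matches map[string][]int, size int) (int, int) {
	if size <= 0 {
		return 0, 0
	}
	bestScore, bestPos := -1.0, 0
	for term, locs := range matches {
		// Rarer terms (fewer locations) get a larger base weight.
		weight := 1.0 / float64(len(locs))
		for _, pos := range locs {
			score := weight
			// Reward other terms close to this location; closer is better.
			for other, otherLocs := range matches {
				if other == term {
					continue
				}
				for _, o := range otherLocs {
					dist := o - pos
					if dist < 0 {
						dist = -dist
					}
					if dist <= size/2 {
						score += (1.0 / float64(len(otherLocs))) * (1 - float64(dist)/float64(size))
					}
				}
			}
			// Break ties on position so the result does not depend on map order.
			if score > bestScore || (score == bestScore && pos < bestPos) {
				bestScore, bestPos = score, pos
			}
		}
	}
	// Centre the window on the winning location and clamp it to the content.
	start := bestPos - size/2
	if start < 0 {
		start = 0
	}
	end := start + size
	if end > len(content) {
		end = len(content)
	}
	return start, end
}

func main() {
	content := "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."
	matches := map[string][]int{
		"man":     locations(content, "man"),
		"fortune": locations(content, "fortune"),
	}
	start, end := bestSnippet(content, matches, 60)
	fmt.Println(content[start:end])
}
```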

What does HTTP mode look like?

It's a little brutalist.


You can change its look and feel using --template-display and --template-search. See https://github.com/boyter/cs/tree/master/asset/templates for example templates you can use to modify things.

cs -d --template-display ./asset/templates/display.tmpl --template-search ./asset/templates/search.tmpl

Usage

Command line usage of cs is designed to be as simple as possible. Full details can be found in cs --help or cs -h. Note that the output below reflects the state of master, not a release, so some features listed may be missing from your installation.

$ cs -h
code spelunker (cs) code search.
Version 1.3.0
Ben Boyter <[email protected]>

cs recursively searches the current directory using some boolean logic
optionally combined with regular expressions.

Works via command line where passed in arguments are the search terms
or in a TUI mode with no arguments. Can also run in HTTP mode with
the -d or --http-server flag.

Searches by default use AND boolean syntax for all terms
 - exact match using quotes "find this"
 - fuzzy match within 1 or 2 distance fuzzy~1 fuzzy~2
 - negate using NOT such as pride NOT prejudice
 - regex with toothpick syntax /pr[e-i]de/

Searches can fuzzy match which files are searched by adding
the following syntax

 - test file:test
 - stuff filename:.go

Files that are searched will be limited to those that fuzzy
match test for the first example and .go for the second.
Example search that uses all current functionality
 - darcy NOT collins wickham~1 "ten thousand a year" /pr[e-i]de/ file:test

The default input field in tui mode supports some nano commands
- CTRL+a move to the beginning of the input
- CTRL+e move to the end of the input
- CTRL+k to clear from the cursor location forward

Usage:
  cs [flags]

Flags:
      --address string            address and port to listen to in HTTP mode (default ":8080")
      --binary                    set to disable binary file detection and search binary files
  -c, --case-sensitive            make the search case sensitive
      --dir string                directory to search, if not set defaults to current working directory
      --exclude-dir strings       directories to exclude (default [.git,.hg,.svn])
  -x, --exclude-pattern strings   file and directory locations matching case sensitive patterns will be ignored [comma separated list: e.g. vendor,_test.go]
  -r, --find-root                 attempts to find the root of this repository by traversing in reverse looking for .git or .hg
  -f, --format string             set output format [text, json, vimgrep] (default "text")
  -h, --help                      help for cs
      --hidden                    include hidden files
  -d, --http-server               start http server for search
  -i, --include-ext strings       limit to file extensions (N.B. case sensitive) [comma separated list: e.g. go,java,js,C,cpp]
      --max-read-size-bytes int   number of bytes to read into a file with the remaining content ignored (default 1000000)
      --min                       include minified files
      --min-line-length int       number of bytes per average line for file to be considered minified (default 255)
      --no-gitignore              disables .gitignore file logic
      --no-ignore                 disables .ignore file logic
  -o, --output string             output filename (default stdout)
      --ranker string             set ranking algorithm [simple, tfidf, tfidf2, bm25] (default "bm25")
  -s, --snippet-count int         number of snippets to display (default 1)
  -n, --snippet-length int        size of the snippet to display (default 300)
      --template-display string   path to display template for custom styling
      --template-search string    path to search template for custom styling
  -v, --version                   version for cs
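The --min-line-length flag above describes minified detection as a check on the average number of bytes per line. A minimal sketch of such a check might look like the following. The isMinified name is hypothetical, and whether cs treats the threshold as "greater than" or "greater than or equal" is an assumption here.

```go
package main

import (
	"bytes"
	"fmt"
)

// isMinified reports whether content looks minified, meaning its average line
// length in bytes is greater than minLineLength. The strict comparison is an
// assumption made for this sketch.
func isMinified(content []byte, minLineLength int) bool {
	if len(content) == 0 {
		return false
	}
	// A file with no newline at all still counts as one line.
	lines := bytes.Count(content, []byte("\n")) + 1
	return len(content)/lines > minLineLength
}

func main() {
	normal := []byte("package main\n\nfunc main() {}\n")
	packed := bytes.Repeat([]byte("a=1;"), 500) // 2000 bytes on a single line

	fmt.Println(isMinified(normal, 255))
	fmt.Println(isMinified(packed, 255))
}
```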

Searches work on single or multiple words with a logical AND applied between them. You can negate with NOT before a term. You can do exact match with quotes, and do regular expressions using toothpicks.

Example search that uses all current functionality

cs t NOT something test~1 "ten thousand a year" "/pr[e-i]de/" file:test
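To show how the pieces of a query like the one above break down, here is a small sketch that splits a raw query string into plain, NOT, exact, fuzzy, regex and file terms. It is not the parser cs uses. The term type and the parse and classify functions are hypothetical, and details such as escaped slashes inside a regex are deliberately left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// term is one parsed piece of a query. The type is hypothetical.
type term struct {
	kind  string // "term", "not", "exact", "fuzzy", "regex" or "file"
	value string
	dist  int // edit distance, only set for fuzzy terms
}

// classify handles a single unquoted word.
func classify(word string) term {
	switch {
	case word == "NOT":
		return term{kind: "not"}
	case strings.HasPrefix(word, "file:"):
		return term{kind: "file", value: strings.TrimPrefix(word, "file:")}
	case strings.HasPrefix(word, "filename:"):
		return term{kind: "file", value: strings.TrimPrefix(word, "filename:")}
	}
	// fuzzy~1 or fuzzy~2
	if i := strings.LastIndex(word, "~"); i > 0 {
		if d, err := strconv.Atoi(word[i+1:]); err == nil && (d == 1 || d == 2) {
			return term{kind: "fuzzy", value: word[:i], dist: d}
		}
	}
	return term{kind: "term", value: word}
}

// parse splits a raw query into terms, treating "..." as an exact match and
// /.../ as a regular expression.
func parse(query string) []term {
	var out []term
	for i := 0; i < len(query); {
		switch c := query[i]; {
		case c == ' ':
			i++
		case c == '"' || c == '/':
			// Read up to the matching closing delimiter.
			end := strings.IndexByte(query[i+1:], c)
			if end < 0 {
				end = len(query) - i - 1 // unterminated: take the rest
			}
			kind := "exact"
			if c == '/' {
				kind = "regex"
			}
			out = append(out, term{kind: kind, value: query[i+1 : i+1+end]})
			i += end + 2
		default:
			end := strings.IndexByte(query[i:], ' ')
			if end < 0 {
				end = len(query) - i
			}
			out = append(out, classify(query[i:i+end]))
			i += end
		}
	}
	return out
}

func main() {
	q := `darcy NOT collins wickham~1 "ten thousand a year" /pr[e-i]de/ file:test`
	for _, t := range parse(q) {
		fmt.Printf("%-6s %q %d\n", t.kind, t.value, t.dist)
	}
}
```

Note that when the query is passed on the command line the shell strips the double quotes before cs sees them, so this sketch is closer to what a TUI input field would receive.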

You can use it in a similar manner to fzf in TUI mode if you like, since cs will return the matching document path if you hit the enter key.

cat `cs`
