data_hacks

Command line utilities for data analysis

Installing: pip install data_hacks

Installing from GitHub: pip install -e git://github.com/bitly/data_hacks.git#egg=data_hacks

Installing from source: python setup.py install

data_hacks are friendly. Ask them for usage information with --help

histogram.py

A utility that parses input data points and outputs a text histogram

Example:

$ cat /tmp/data | histogram.py --percentage --max=1000 --min=0
# NumSamples = 60; Min = 0.00; Max = 1000.00
# 1 value outside of min/max
# Mean = 332.666667; Variance = 471056.055556; SD = 686.335236; Median 191.000000
# each ∎ represents a count of 1
    0.0000 -   100.0000 [    28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (46.67%)
  100.0000 -   200.0000 [     2]: ∎∎ (3.33%)
  200.0000 -   300.0000 [     2]: ∎∎ (3.33%)
  300.0000 -   400.0000 [     8]: ∎∎∎∎∎∎∎∎ (13.33%)
  400.0000 -   500.0000 [     8]: ∎∎∎∎∎∎∎∎ (13.33%)
  500.0000 -   600.0000 [     7]: ∎∎∎∎∎∎∎ (11.67%)
  600.0000 -   700.0000 [     3]: ∎∎∎ (5.00%)
  700.0000 -   800.0000 [     0]:  (0.00%)
  800.0000 -   900.0000 [     1]: ∎ (1.67%)
  900.0000 -  1000.0000 [     0]:  (0.00%)
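
The bucketing shown in this example can be sketched in a few lines of Python. This is a minimal illustration assuming ten equal-width buckets between a given min and max, not the actual histogram.py implementation; the name `bucket_counts` and its parameters are hypothetical, and how the real script treats values that fall exactly on a boundary may differ.

```python
def bucket_counts(values, minimum, maximum, buckets=10):
    """Count values into equal-width buckets between minimum and maximum.

    Returns (counts, skipped), where skipped is the number of values
    that fell outside [minimum, maximum].
    """
    width = (maximum - minimum) / float(buckets)
    counts = [0] * buckets
    skipped = 0
    for v in values:
        if v < minimum or v > maximum:
            skipped += 1
            continue
        # A value equal to the maximum is placed in the last bucket.
        index = min(int((v - minimum) / width), buckets - 1)
        counts[index] += 1
    return counts, skipped


# Made-up sample data, only for illustration.
counts, skipped = bucket_counts([5, 50, 150, 999, 1000, 5000], 0, 1000)
for i, c in enumerate(counts):
    low = i * 100.0
    print("%10.4f - %10.4f [%6d]: %s" % (low, low + 100.0, c, "∎" * c))
print("# %d value(s) outside of min/max" % skipped)
```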

With logarithmic scale:

$ printf 'import random\nfor i in range(1000):\n print(random.randint(0,10000))'|\
    python -|./data_hacks/histogram.py -l
# NumSamples = 1000; Min = 2.00; Max = 9993.00
# Mean = 4951.757000; Variance = 8279390.995951; SD = 2877.393090; Median 4828.000000
# each ∎ represents a count of 6
    2.0000 -    11.7664 [     3]:
   11.7664 -    31.2991 [     0]:
   31.2991 -    70.3646 [     5]:
   70.3646 -   148.4956 [    11]: ∎
  148.4956 -   304.7576 [    15]: ∎∎
  304.7576 -   617.2815 [    35]: ∎∎∎∎∎
  617.2815 -  1242.3294 [    51]: ∎∎∎∎∎∎∎∎
 1242.3294 -  2492.4252 [   128]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 2492.4252 -  4992.6168 [   269]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 4992.6168 -  9993.0000 [   483]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
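
The bucket boundaries in this output are consistent with each bucket being twice as wide as the one before it (9.7664, 19.5327, 39.0655, and so on). A minimal sketch of computing such boundaries follows. It is inferred from the sample output rather than taken from the histogram.py source, and `doubling_boundaries` is a hypothetical name.

```python
def doubling_boundaries(minimum, maximum, buckets=10):
    """Return upper bucket boundaries where each bucket is twice as wide
    as the previous one."""
    # Widths are w, 2w, 4w, ...; together they sum to w * (2**buckets - 1).
    first_width = (maximum - minimum) / float(2 ** buckets - 1)
    boundaries = []
    upper = minimum
    for i in range(buckets):
        upper += first_width * 2 ** i
        boundaries.append(upper)
    return boundaries


# Min and max taken from the sample output above.
for boundary in doubling_boundaries(2.0, 9993.0):
    print("%.4f" % boundary)
```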

ninety_five_percent.py

A utility script that takes a stream of decimal values and outputs the 95th-percentile value.

This is useful for finding the 95th-percentile response time from access logs.

Example (assuming response time is the last column in your access log):

$ awk '{print $NF}' /path/to/access.log | ninety_five_percent.py
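
For readers who want to see the idea in code, here is a minimal sketch of a nearest-rank style percentile: sort the values and pick the one 95% of the way through the list. The function name `percentile` and the exact indexing rule are assumptions for illustration; ninety_five_percent.py may compute its result differently.

```python
def percentile(values, percent=95):
    """Return the value at the given percentile using a simple rank rule."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    # Integer arithmetic avoids floating-point surprises in the index.
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]


# Made-up response times in seconds, only for illustration.
response_times = [0.12, 0.10, 0.45, 0.09, 2.30, 0.11, 0.13, 0.10, 0.14, 0.12]
print(percentile(response_times))
```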

sample.py

Filter a stream down to a random sub-sample

Example:

$ cat access.log | sample.py 10% | post_process.py
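
One simple way to sub-sample a stream is to keep each line independently with the requested probability. The sketch below takes that approach. It is an assumption about the technique, not a copy of sample.py, and `parse_rate` and `sample_lines` are hypothetical names.

```python
import random


def parse_rate(text):
    """Turn a percentage string such as '10%' into a probability (0.1)."""
    return float(text.rstrip("%")) / 100


def sample_lines(lines, rate, rng=random):
    """Yield each line independently with probability `rate` (0.0 to 1.0)."""
    for line in lines:
        if rng.random() < rate:
            yield line


# Keep roughly 10% of 1000 made-up lines.
lines = ["line %d" % i for i in range(1000)]
kept = list(sample_lines(lines, parse_rate("10%")))
print("kept %d of %d lines" % (len(kept), len(lines)))
```

Because each line is decided independently, the number of lines kept varies from run to run and is only approximately 10% of the input.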

run_for.py

Pass through data for a specified amount of time

Example:

$ tail -f access.log | run_for.py 10s | post_process.py
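
A time-limited pass-through can be sketched as a generator that stops once a deadline has passed. This is an illustration under stated assumptions rather than the run_for.py source: the names `parse_duration` and `run_for` are hypothetical, and only the `10s` form appears in this README, so the `m` and `h` suffixes handled here are assumptions.

```python
import time

# Only the "s" suffix is shown in the README; "m" and "h" are assumptions.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a string such as '10s' into a number of seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def run_for(lines, seconds, clock=time.time):
    """Yield lines until `seconds` have elapsed, then stop."""
    deadline = clock() + seconds
    for line in lines:
        # The deadline is only checked when a line arrives.
        if clock() > deadline:
            break
        yield line


for line in run_for(["a", "b", "c"], parse_duration("10s")):
    print(line)
```

Note that this sketch only checks the clock when a new line arrives, so a quiet input would keep it waiting past the deadline.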

bar_chart.py

Generate an ASCII bar chart for input data (this is like a visualization of uniq -c)

$ cat data | bar_chart.py
# each ∎ represents a count of 1. total 63
14:40 [    49] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
14:41 [    14] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎
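
The counting and rendering behind a chart like this can be sketched with `collections.Counter`. This is a simplified illustration, not the bar_chart.py implementation: the function name `bar_chart` is hypothetical, rows are sorted by key as an assumption, and the bars are not scaled to fit a terminal width.

```python
from collections import Counter


def bar_chart(lines, char="∎"):
    """Count identical lines and return one rendered text row per distinct value."""
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())
    rows = ["# each %s represents a count of 1. total %d" % (char, total)]
    for key in sorted(counts):
        rows.append("%s [%6d] %s" % (key, counts[key], char * counts[key]))
    return rows


# Made-up input: one timestamp per line.
for row in bar_chart(["14:40", "14:40", "14:41", "14:40"]):
    print(row)
```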

bar_chart.py and histogram.py also support ingesting pre-aggregated values. Simply provide a two-column input of count<whitespace>value for -a or value<whitespace>count for -A:

$ sort /path/to/data | uniq -c | bar_chart.py -a

This is very convenient if you pull already-aggregated data out of, say, Hadoop or MySQL.
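
To make the two column orders concrete, here is a minimal sketch of parsing pre-aggregated lines into (value, count) pairs. The function `parse_aggregated` and its `count_first` parameter are hypothetical and are not part of data_hacks. `count_first=True` corresponds to the count<whitespace>value layout used with -a, and `count_first=False` to the value<whitespace>count layout used with -A.

```python
def parse_aggregated(lines, count_first=True):
    """Yield (value, count) pairs from two-column, whitespace-separated lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if count_first:
            # Layout: count<whitespace>value, as produced by `uniq -c`.
            count, value = line.split(None, 1)
        else:
            # Layout: value<whitespace>count.
            value, count = line.rsplit(None, 1)
        yield value, int(count)


# Lines shaped like the output of `sort | uniq -c`.
for value, count in parse_aggregated(["     49 14:40", "     14 14:41"]):
    print("%s occurred %d times" % (value, count))
```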
