  • Stars: 206
  • Rank: 190,504 (Top 4 %)
  • Language: Python
  • License: BSD 2-Clause "Sim...
  • Created about 11 years ago
  • Updated over 6 years ago

Repository Details

A new version of phraug, which is a set of simple Python scripts for pre-processing large files

phraug2

A new version of phraug (pron. frog) with improved command-line argument parsing, thanks to jofusa.

This is a set of simple Python scripts for pre-processing large files: things like splitting and format conversion. The name phraug comes from a great book, Made to Stick, by Chip and Dan Heath.

See http://fastml.com/processing-large-files-line-by-line/ for the basic idea.

There's always at least one input file and usually one or more output files. An input file always stays unchanged.
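The line-by-line idea can be sketched as follows. This is a minimal illustration and not code from the repository: the function name `process_lines` and its `transform` parameter are hypothetical. It reads one line at a time, so memory use stays small regardless of file size, and it only ever opens the input file for reading.

```python
def process_lines(input_path, output_path, transform):
    """Stream input_path to output_path, applying transform to each line.

    Hypothetical helper for illustration; the input file is only read,
    never modified.
    """
    count = 0
    with open(input_path, "r") as src, open(output_path, "w") as dst:
        for line in src:  # a file object yields one line at a time
            dst.write(transform(line))
            count += 1
    return count  # number of lines processed
```

For example, `process_lines("data.csv", "data.tsv", lambda line: line.replace(",", "\t"))` would do a naive CSV-to-TSV conversion (the file names here are placeholders).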

For documentation, run a script with the -h flag.

Example:

>python split.py
usage: split.py [-h] [-p PROBABILITY] [-r RANDOM_SEED] [-s] [-c]
                input_file output_file1 output_file2
split.py: error: too few arguments

>python split.py -h
usage: split.py [-h] [-p PROBABILITY] [-r RANDOM_SEED] [-s] [-c]
                input_file output_file1 output_file2

split a file into two randomly, line by line.

positional arguments:
  input_file            path to an input file
  output_file1          path to the first output file
  output_file2          path to the second output file

optional arguments:
  -h, --help            show this help message and exit
  -p PROBABILITY, --probability PROBABILITY
                        probability of writing to the first file (default 0.9)
  -r RANDOM_SEED, --random_seed RANDOM_SEED
                        random seed
  -s, --skip_headers    skip the header line
  -c, --copy_headers    copy the header line to both output files
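
The behaviour described in the help text above could be implemented roughly like this. It is a sketch based only on the option descriptions, not the actual source of split.py: the function name `split_file`, its return value, and the exact handling of the header line are assumptions.

```python
import random


def split_file(input_path, output_path1, output_path2, probability=0.9,
               random_seed=None, skip_headers=False, copy_headers=False):
    """Randomly split a file into two, line by line (illustrative sketch)."""
    rng = random.Random(random_seed)  # seeded generator for repeatable splits
    counts = [0, 0]
    with open(input_path) as src, \
            open(output_path1, "w") as out1, \
            open(output_path2, "w") as out2:
        if skip_headers or copy_headers:
            header = src.readline()  # consume the first line
            if copy_headers:
                out1.write(header)
                out2.write(header)
        for line in src:
            # Each line goes to the first file with the given probability,
            # otherwise to the second file.
            if rng.random() < probability:
                out1.write(line)
                counts[0] += 1
            else:
                out2.write(line)
                counts[1] += 1
    return tuple(counts)  # data lines written to each output file
```

With the default probability of 0.9, roughly nine lines in ten end up in the first file, which suits a train/validation split. Passing the same `random_seed` reproduces the same split.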
