• This repository has been archived on 27/Aug/2020
  • Stars: 9
  • Rank: 1,879,107 (Top 39 %)
  • Language: Julia
  • License: Other
  • Created over 6 years ago
  • Updated over 3 years ago


Repository Details

Some helper functions to make some group by operations on DataFrames and IndexedTables faster

This package is deprecated, as the base DataFrames.jl group-by is plenty fast.

FastGroupBy

Faster algorithms for doing vector group-by. This package currently supports faster group-bys where the group-by vector is of type CategoricalVector or Vector{T} for T<:Union{Integer, Bool, String}.

Installation


# install
using Pkg
Pkg.add("FastGroupBy")
# install latest version (Pkg.clone is not available in the Julia 1.x package manager)
Pkg.add(PackageSpec(url="https://github.com/xiaodaigh/FastGroupBy.jl.git"))

fastby and fastby!

The fastby and fastby! functions let the user perform an arbitrary computation on a vector (valvec) grouped by another vector (byvec). They return a Tuple whose first element holds the distinct groups and whose second element holds the results of applying the function fn to valvec within each group of byvec. See below for an explanation of fn, byvec, and valvec.

The difference between fastby and fastby! is that fastby! may change the input vectors byvec and valvec whereas fastby won't.
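The `!` suffix follows the usual Julia convention for functions that may modify their arguments. As a rough illustration only, the sketch below shows one way such a pair could be structured: a sort-based grouping that reorders its inputs in place, plus a non-mutating wrapper that copies first. The names `sorted_by!` and `sorted_by` are hypothetical, and this is not claimed to be how FastGroupBy implements fastby!.

```julia
# Hypothetical illustration of a mutating / non-mutating function pair.
# Not FastGroupBy's implementation.

# Mutating variant: reorders byvec and valvec in place so equal keys are adjacent.
function sorted_by!(fn, byvec::AbstractVector, valvec::AbstractVector)
    length(byvec) == length(valvec) ||
        throw(DimensionMismatch("byvec and valvec must have equal length"))
    perm = sortperm(byvec)
    permute!(byvec, perm)    # byvec is now sorted
    permute!(valvec, perm)   # valvec follows the same ordering
    groups  = eltype(byvec)[]
    results = []
    start = firstindex(valvec)
    for i in eachindex(byvec)
        # a run of equal keys ends at the last element or when the next key differs
        if i == lastindex(byvec) || byvec[i + 1] != byvec[i]
            push!(groups, byvec[i])
            push!(results, fn(view(valvec, start:i)))
            start = i + 1
        end
    end
    return groups, results
end

# Non-mutating variant: works on copies, so the caller's vectors are untouched.
sorted_by(fn, byvec, valvec) = sorted_by!(fn, copy(byvec), copy(valvec))

byvec  = [88, 888, 8, 88, 888, 88]
valvec = [1, 2, 3, 4, 5, 6]

res_copy    = sorted_by(sum, byvec, valvec)   # inputs keep their original order
order_kept  = byvec == [88, 888, 8, 88, 888, 88]
res_inplace = sorted_by!(sum, byvec, valvec)  # inputs are now reordered
```

Both calls return the same groups and results; only the state of the caller's vectors afterwards differs.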

Both functions take the same three main arguments, so we illustrate using fastby only:


fastby(fn, byvec, valvec)
  • fn is the function to apply to each by-group of valvec
  • byvec is the vector to group-by
  • valvec is the vector that fn is applied to
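
To make the roles of the three arguments concrete, here is a minimal Dict-based sketch of the same semantics in plain Julia. `naive_by` is a hypothetical name, the sorted group order is an assumption made for readability, and the code makes no attempt to match the speed or internals of fastby.

```julia
# Reference semantics only: a simple Dict-based group-by, assumed for illustration.
# `naive_by` is hypothetical and is not part of FastGroupBy.
function naive_by(fn, byvec::AbstractVector, valvec::AbstractVector)
    length(byvec) == length(valvec) ||
        throw(DimensionMismatch("byvec and valvec must have equal length"))
    # collect the values of valvec that belong to each distinct key of byvec
    buckets = Dict{eltype(byvec), Vector{eltype(valvec)}}()
    for (k, v) in zip(byvec, valvec)
        push!(get!(() -> eltype(valvec)[], buckets, k), v)
    end
    groups = sort!(collect(keys(buckets)))
    # first element: distinct groups; second element: fn applied to each group
    return groups, [fn(buckets[g]) for g in groups]
end

group_sums = naive_by(sum, [88, 888, 8, 88, 888, 88], [1, 2, 3, 4, 5, 6])
group_max  = naive_by(maximum, [88, 888, 8, 88, 888, 88], [1, 2, 3, 4, 5, 6])
```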

For example, fastby(sum, byvec, valvec) computes the same group totals as StatsBase's countmap(byvec, weights(valvec)). Consider the following:

using FastGroupBy

byvec  = [88, 888, 8, 88, 888, 88]
valvec = [1 , 2  , 3, 4 , 5  , 6]
6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

To compute the sum of valvec within each group of byvec, we do

grpsum = fastby(sum, byvec, valvec)
expected_result = Dict(88 => 11, 8 => 3, 888 => 7)
Dict(zip(grpsum...)) == expected_result # true
true

fastby with an arbitrary fn

You can also compute arbitrary functions for each by-group, e.g. mean:

using Statistics: mean
@time a = fastby(mean, byvec, valvec)
0.000657 seconds (24 allocations: 1.502 MiB)
([8, 88, 888], [3.0, 3.6666666666666665, 3.5])

This generalizes to arbitrary user-defined functions, e.g. the code below computes the sizeof of each element within each by-group:

byvec  = [88   , 888  , 8  , 88  , 888 , 88]
valvec = ["abc", "def", "g", "hi", "jk", "lmop"]
@time a = fastby(yy -> sizeof.(yy), byvec, valvec);
0.290550 seconds (280.04 k allocations: 14.957 MiB)

Julia's do-notation can be used:

@time a = fastby(byvec, valvec) do grouped_y
    # you can perform complex calculations here knowing that grouped_y is valvec grouped by byvec
    grouped_y[end] * grouped_y[1]
end;
0.172302 seconds (194.41 k allocations: 10.657 MiB)
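
Do-notation is ordinary Julia syntax rather than something specific to this package: the block becomes an anonymous function passed as the first positional argument. The sketch below shows the equivalence using Base's `map`, so it runs without FastGroupBy. The `words` data is made up for illustration.

```julia
# The do-block form and the explicit anonymous-function form are the same call.
words = [["abc", "hi", "lmop"], ["def", "jk"], ["g"]]

with_do = map(words) do grouped_y
    # last element concatenated with the first, as in the fastby example above
    grouped_y[end] * grouped_y[1]
end

with_lambda = map(grouped_y -> grouped_y[end] * grouped_y[1], words)

same_result = with_do == with_lambda
```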

fastby is also fast when grouping by a vector of Bools:

using Random
Random.seed!(1)
x = rand(Bool, 100_000_000);
y = rand(100_000_000);

@time fastby(sum, x, y)
3.132733 seconds (37 allocations: 774.866 MiB, 6.21% gc time)
(Bool[1, 0], [2.499741155973099e7, 2.5003502408479996e7])
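
A Bool key has only two possible groups, so a grouped sum can in principle be done in a single pass with two accumulators. The sketch below illustrates that idea. `bool_sum` is a hypothetical name, and whether FastGroupBy takes this route internally is not stated in this README.

```julia
# Hypothetical single-pass grouped sum for a Bool key; not FastGroupBy's code.
function bool_sum(byvec::AbstractVector{Bool}, valvec::AbstractVector{<:Number})
    length(byvec) == length(valvec) ||
        throw(DimensionMismatch("byvec and valvec must have equal length"))
    sum_true  = zero(eltype(valvec))
    sum_false = zero(eltype(valvec))
    for (b, v) in zip(byvec, valvec)
        if b
            sum_true += v     # accumulate values whose key is true
        else
            sum_false += v    # accumulate values whose key is false
        end
    end
    # same Tuple layout as fastby: distinct groups first, results second
    return [true, false], [sum_true, sum_false]
end

bool_result = bool_sum([true, false, true, true], [1.0, 2.0, 3.0, 4.0])
```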

fastby also works on String vectors, but it is still slower than countmap and uses MUCH more RAM, so it is NOT recommended for Strings (at this stage).

using Random
const M=10_000_000; const K=100;
Random.seed!(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M);
y = repeat([1], inner=length(svec1));
@time a = fastby!(sum, svec1, y);
4.704647 seconds (491.16 k allocations: 912.926 MiB, 24.89% gc time)

a_dict = Dict(zip(a...))

using StatsBase
@time b = countmap(svec1, alg = :dict);
1.523348 seconds (48 allocations: 5.670 MiB)
a_dict == b #true
true

fastby on DataFrames

fastby can also be applied to a DataFrame: pass the DataFrame as the second argument, followed by the grouping column (bycol) and the value column (valcol) as Symbols in the third and fourth arguments. For example:

using DataFrames

df1 = DataFrame(grps = rand(1:100, 1_000_000), val = rand(1_000_000))
# compute the difference between the number of rows in that group and the mean of `val` in that group
res = fastby(val_grouped -> length(val_grouped) - mean(val_grouped), df1, :grps, :val)
100×2 DataFrame
│ Row │ grps  │ V1      │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 10062.5 │
│ 2   │ 2     │ 9956.5  │
│ 3   │ 3     │ 10026.5 │
│ 4   │ 4     │ 9953.5  │
│ 5   │ 5     │ 9855.5  │
│ 6   │ 6     │ 10019.5 │
│ 7   │ 7     │ 10065.5 │
⋮
│ 93  │ 93    │ 9968.5  │
│ 94  │ 94    │ 10096.5 │
│ 95  │ 95    │ 10008.5 │
│ 96  │ 96    │ 10037.5 │
│ 97  │ 97    │ 9885.5  │
│ 98  │ 98    │ 10019.5 │
│ 99  │ 99    │ 9937.5  │
│ 100 │ 100   │ 10058.5 │
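
Since this package is deprecated in favour of the group-by built into DataFrames.jl, a roughly comparable computation using DataFrames.jl alone might look like the sketch below. It assumes a reasonably recent DataFrames.jl in which `combine` on a grouped DataFrame accepts `source => function => target` pairs. The small `df1` here is made-up data so the result can be checked by hand.

```julia
using DataFrames
using Statistics: mean

# tiny made-up table: group 1 has values 1, 3, 5 and group 2 has values 2, 4
df1 = DataFrame(grps = [1, 2, 1, 2, 1], val = [1.0, 2.0, 3.0, 4.0, 5.0])

# number of rows in each group minus the mean of `val` in that group, stored as V1
res = combine(groupby(df1, :grps), :val => (v -> length(v) - mean(v)) => :V1)
```

Here group 1 gives 3 - 3.0 and group 2 gives 2 - 3.0, so `res.V1` holds 0.0 and -1.0.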
