• Stars
    star
    130
  • Rank 277,575 (Top 6 %)
  • Language
    Nim
  • License
    MIT License
  • Created over 3 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A dataframe library with a dplyr like API

Datamancer

https://github.com/SciNim/datamancer/workflows/datamancer%20CI/badge.svg https://img.shields.io/static/v1?message=join%20chat&color=blue&label=nim-science&logo=matrix&logoColor=gold&style=flat-square&.svg https://img.shields.io/discord/371759389889003530?color=blue&label=nim-science&logo=discord&logoColor=gold&style=flat-square&.svg

Comparison to other dataframe libraries

Check out the following gist for a comparison of this library with dplyr (R) and pandas (Python):

https://gist.github.com/Vindaar/6908c038707c7d8293049edb3d204f84

Documentation

The documentation is found at:

https://scinim.github.io/Datamancer/

with a short introduction under:

https://scinim.github.io/Datamancer/datamancer.html

and a tutorial for data wrangling with Datamancer available at:

https://scinim.github.io/getting-started/basics/data_wrangling.html

Installation & dependencies

Installation should be just a

nimble install datamancer

away.

Features and formulas

The data frame provides the “5 verbs” of dplyr and more. Main implemented functions:

  • filter
  • mutate, transmute
  • select, rename
  • arrange
  • summarize
  • group_by
  • arrange
  • inner_join
  • set_diff
  • count
  • bind_rows
  • gather
  • unique,

which are all based on the FormulaNode object. Basically they all receive varargs[FormulaNode], which is evaluated in context of the given dataframe.

Supported types

By default the DataFrame type supports these types as builtins:

  • float
  • int
  • string
  • bool
  • Value (a variant object that can store either of the above)

As of version v0.3.0 very experimental support to extend the supported types has landed. See the changelog.org for an explanation or the playground/non_generic_generics.nim example to get an idea.

Essentially any arbitrary type can be used in a DF by extending the Column type using some macro magic.

A few words on the f{} macro to create formulas

Use:

  • no infix symbol and only code, which does not involve a column in the sense defined below in Column access:
    f{1 + 2}
    f{"foo"}
    f{true}
        

    a FormulaNode of kind fkVariable. Stores the values as a Value variant object.

  • <- for assignment
    f{"newName" <- "oldName"}
        

    a FormulaNode of kind fkAssign. This does not involve a closure and is just a simple object storing a LHS as a string and the RHS as a Value (to also support constant columns via f{"constantCol" <- 5}). Typically used for rename or as an argument for transmute and mutate to just rename a column or to assign a constant column.

  • << for reduce operations
    f{"meanHwy" << mean(`hwy`)}
        

    a FormulaNode of kind fkScalar. Used only for summarize and means we reduce a full column to a single Value. This generates a closure, which computes the RHS and assigns it to a result variable of type Value. Type hints are required (for now) if only a single proc call is involved on the RHS to tell the macro as what to read the column “hwy” and what the result variable is.

  • ~ for vector like proc
    f{"xSquared" ~ `x` * `x`}
        

    a FormulaNode of kind fkVector. Used in mutate, transmute to calculate a full column. This also generates a closure as the reduce operations << does, except here we loop over the length of the DF and access each read tensor via [idx].

  • a formula without any infix symbols will be considered:
    • fkVariable if no column involved
    • fkVector else

Column access

To access columns in the context of formula, the biggest change occured. In the old formula system, a literal string was attempted to be resolved as a DF column dynamically. Since the new formulas are compiled to closures, this would involve overhead and is thus avoided for clearer separation between columns and real strings. This also helps readers of a formula.

This means:

  • `columnName`: accented quotes refer to a DF column. Be careful to only use this for simple letters (no non letter characters or spaces).
  • c"columnName"= : call string literals (by convention use a =c before the string) are interpreted as a column in the same way as accented quotes, but allow for column names with spaces / non letter characters.
  • idx("columnName"), idx(`columnName`), idx(nimExpressionReturningString): refers to a specific element of the referred column
  • col("columnName"), col(`columnName`), col(nimExpressionReturningString): refers to a the full tensor of the referred column
  • or directly via: df[nimExpressionReturningString] / df[nimExpressionReturningString][idx]: to access columns / indices using identifiers / symbols / general expressions that return a string quotes, call string literals or just string literals). This is equivalent to idx / col, so the latter are preferred.

The closures take a data frame as an argument, which is named df. The df["columnName"] refers to that argument, although not literally (it is gen’symmed and df["columnName"] refers to a Column). From that column we get the underlying Tensor.

In the context of calling procedures, e.g.:

f{someProc(`columnName`)}

it may not be clear whether the procedure is supposed to take the whole tensor as an argument or hand each element of the tensor in a loop. Internally the macro tries to determine a suitable call for either a scalar or tensor argument. If the called procedure is unique this will likely succeed. In case of heavily overloaded symbols (e.g. max) it also tries to determine a match from (if any) additional arguments given to that procedure (and uses their types if they are not column references).

In case at cannot be resolved, you will get an error at compile time to specify idx (per index access) or col (full column access) of the column.

So for example:

f{"asFloat" ~ parseFloat(idx("colName"))}

where parseFloat acts on each element individually. If there is only a single overload (as in case of parseFloat), the input and output types are inferred automatically to be:

  • read tensor colName as a string
  • result type is float

Type information for columns

When accessing columns sometimes it may be necessary or desired to specify the type with which a column should be read. To achieve this, the idx and col helpers mentioned in the previous section can be given an explicit type:

idx("foo", int) # <- reads "foo" as a `Tensor[int]`
col("bar", float) # <- read "baz" as a `Tensor[float]`)

These can be mixed and matched to read differently typed columns in the same formula:

f{"foo" ~ idx("x", int) + idx("yStr", string).parseInt}

for an example.

Type hints

To avoid specifying the types of every input column (as mentioned above using idx and col) and to force a specific return type, type hints can be used.

Type hints are required if the formula macro cannot determine the type required, either input or output. This is usually the case for ambiguous operations (overloaded procedures, only a single column without any operations, etc.). They are of the form:

  • <type>: <actualFormula>: simple type hint for the type of the underlying tensor of the columns involved in the formula.
  • <type> -> <resDtype>: <actualFormula>: full type for closure. <type> is the dtype used for input tensors, <resDtype> the resulting type.

For example:

f{int -> int: `x` * `y`}
# ^--- type of the tensors involved on the RHS. Will be read as integers
#        ^--- type of the resulting tensor

In this case the type would be determined to be float by the macro, so type hints are required in case we need them to be integers.

In addition to looking at symbols in the scope, there is a step involving some simple heuristic rules, e.g. if *, / is involved, it’s assumed that the input tensors are floats and the output as well. If & or $ is involved, it’s assumed to be strings. Finally if and and other logic keywords are used, the result is assumed to be bool (not the input thought!).

const floatSet = toSet(@["+", "-", "*", "/", "mod"])
const stringSet = toSet(@["&", "$"])
const boolSet = toSet(@["and", "or", "xor", ">", "<", ">=", "<=", "==", "!=",
                        "true", "false", "in", "notin"])

Notes on formula macro internals

For an insight into the implementation details, ideas and development notes, check out the following document:

https://github.com/SciNim/Datamancer/blob/master/notes/formula_dev_notes.org

More Repositories

1

nim-plotly

plotly wrapper for nim-lang
Nim
179
star
2

Unchained

A fully type safe, compile time only units library.
Nim
110
star
3

numericalnim

A collection of numerical methods written in Nim
Nim
87
star
4

flambeau

Nim bindings to libtorch
Nim
84
star
5

getting-started

Getting started with Nim for Scientific Computing
Nim
61
star
6

nimjl

A bridge between Nim-lang and Julia !
Nim
49
star
7

nimcuda

Nim bindings for CUDA
HTML
42
star
8

nimblas

BLAS for Nim
Nim
42
star
9

scinim

The core types and functions of the SciNim ecosystem
Nim
37
star
10

theo

Theo is an optimized bigint and number theory library for Nim
Nim
26
star
11

rnim

A bridge between R and Nim
Nim
24
star
12

freccia

Apache Arrow implementation in Nim
Nim
22
star
13

Measuremancer

A library to handle measurement uncertainties & error propagation
Nim
21
star
14

nimcl

High level wrapper over OpenCL
Nim
19
star
15

nimfftw3

Nim bindings for FFTW3.
Nim
18
star
16

astGrad

Symbolic differentiation based on the Nim AST
Nim
17
star
17

impulse

Impulse will be a collection of primitives for signal processing (FFT, Convolutions, ...)
Nim
16
star
18

nimlapack

LAPACK bindings for Nim
Nim
16
star
19

vectorize

SIMD vectorization backend
Nim
15
star
20

nim-constants

Mathematical numerical named static constants useful for different disciplines
Nim
13
star
21

polynumeric

A nim module to handle polynomials
Nim
12
star
22

xrayAttenuation

A library to compute transmission / absorption of X-rays through matter
Nim
10
star
23

bio

A library to work with biological sequences.
Nim
7
star
24

fmu.nim

create your own multiplatform FMU (Functional Mock-up Interface) models with Nim
Nim
6
star
25

horizonsAPI

API for the JPL Horizons tool to retrieve solar system parameters
Nim
5
star
26

symengine.nim

A Nim wrapper of SymEngine
Nim
5
star
27

Arrayfire-Nim

Fork of https://github.com/bitstormFA/ArrayFire-Nim
Nim
5
star
28

runtimeGrad

A playground for different runtime based automatic differentiation implementations
Nim
4
star
29

netcdf

NetCDF bindings for Nim
Nim
2
star
30

mcpl

A wrapper for the Monte Carlo Particle Lists (MCPL) library
Nim
2
star
31

mclimit

A Nim port of the mclimit confidence limit computation library
C++
1
star