Datamancer
Comparison to other dataframe libraries
Check out the following gist for a comparison of this library with dplyr (R) and pandas (Python):
https://gist.github.com/Vindaar/6908c038707c7d8293049edb3d204f84
Documentation
The documentation is found at:
https://scinim.github.io/Datamancer/
with a short introduction under:
https://scinim.github.io/Datamancer/datamancer.html
and a tutorial for data wrangling with Datamancer available at:
https://scinim.github.io/getting-started/basics/data_wrangling.html
Installation & dependencies
Installation should be just a
nimble install datamancer
away.
Features and formulas
The data frame provides the “5 verbs” of dplyr and more. Main implemented functions:
filter
mutate
,transmute
select
,rename
arrange
summarize
group_by
arrange
inner_join
set_diff
count
bind_rows
gather
unique
,
which are all based on the FormulaNode
object. Basically they all
receive varargs[FormulaNode]
, which is evaluated in context of the
given dataframe.
Supported types
By default the DataFrame
type supports these types as builtins:
float
int
string
bool
Value
(a variant object that can store either of the above)
As of version v0.3.0
very experimental support to extend the
supported types has landed. See the changelog.org for an explanation or
the playground/non_generic_generics.nim example to get an idea.
Essentially any arbitrary type can be used in a DF by extending the
Column
type using some macro magic.
f{}
macro to create formulas
A few words on the Use:
- no infix symbol and only code, which does not involve a column in
the sense defined below in Column access:
f{1 + 2} f{"foo"} f{true}
a
FormulaNode
of kindfkVariable
. Stores the values as aValue
variant object. <-
for assignmentf{"newName" <- "oldName"}
a
FormulaNode
of kindfkAssign
. This does not involve a closure and is just a simple object storing a LHS as a string and the RHS as aValue
(to also support constant columns viaf{"constantCol" <- 5}
). Typically used forrename
or as an argument fortransmute
andmutate
to just rename a column or to assign a constant column.<<
for reduce operationsf{"meanHwy" << mean(`hwy`)}
a
FormulaNode
of kindfkScalar
. Used only forsummarize
and means we reduce a full column to a singleValue
. This generates a closure, which computes the RHS and assigns it to a result variable of typeValue
. Type hints are required (for now) if only a single proc call is involved on the RHS to tell the macro as what to read the column “hwy” and what the result variable is.~
for vector like procf{"xSquared" ~ `x` * `x`}
a
FormulaNode
of kindfkVector
. Used inmutate
,transmute
to calculate a full column. This also generates a closure as the reduce operations<<
does, except here we loop over the length of the DF and access each read tensor via[idx]
.- a formula without any infix symbols will be considered:
fkVariable
if no column involvedfkVector
else
Column access
To access columns in the context of formula, the biggest change occured. In the old formula system, a literal string was attempted to be resolved as a DF column dynamically. Since the new formulas are compiled to closures, this would involve overhead and is thus avoided for clearer separation between columns and real strings. This also helps readers of a formula.
This means:
`columnName`
: accented quotes refer to a DF column. Be careful to only use this for simple letters (no non letter characters or spaces).c"columnName"= : call string literals (by convention use a =c
before the string) are interpreted as a column in the same way as accented quotes, but allow for column names with spaces / non letter characters.idx("columnName"), idx(`columnName`), idx(nimExpressionReturningString)
: refers to a specific element of the referred columncol("columnName"), col(`columnName`), col(nimExpressionReturningString)
: refers to a the full tensor of the referred column- or directly via:
df[nimExpressionReturningString] / df[nimExpressionReturningString][idx]
: to access columns / indices using identifiers / symbols / general expressions that return a string quotes, call string literals or just string literals). This is equivalent toidx
/col
, so the latter are preferred.
The closures take a data frame as an argument, which is named
df
. The df["columnName"]
refers to that argument, although not
literally (it is gen’symmed and df["columnName"]
refers to a
Column
). From that column we get the underlying Tensor
.
In the context of calling procedures, e.g.:
f{someProc(`columnName`)}
it may not be clear whether the procedure is supposed to take the
whole tensor as an argument or hand each element of the tensor in a
loop. Internally the macro tries to determine a suitable call for
either a scalar or tensor argument. If the called procedure is unique
this will likely succeed. In case of heavily overloaded symbols
(e.g. max
) it also tries to determine a match from (if any)
additional arguments given to that procedure (and uses their types if
they are not column references).
In case at cannot be resolved, you will get an error at compile time
to specify idx
(per index access) or col
(full column access) of the column.
So for example:
f{"asFloat" ~ parseFloat(idx("colName"))}
where parseFloat
acts on each element individually. If there is only
a single overload (as in case of parseFloat
), the input and output
types are inferred automatically to be:
- read tensor
colName
as astring
- result type is
float
Type information for columns
When accessing columns sometimes it may be necessary or desired to
specify the type with which a column should be read. To achieve this,
the idx
and col
helpers mentioned in the previous section can be
given an explicit type:
idx("foo", int) # <- reads "foo" as a `Tensor[int]`
col("bar", float) # <- read "baz" as a `Tensor[float]`)
These can be mixed and matched to read differently typed columns in the same formula:
f{"foo" ~ idx("x", int) + idx("yStr", string).parseInt}
for an example.
Type hints
To avoid specifying the types of every input column (as mentioned above
using idx
and col
) and to force a specific return type, type hints
can be used.
Type hints are required if the formula macro cannot determine the type required, either input or output. This is usually the case for ambiguous operations (overloaded procedures, only a single column without any operations, etc.). They are of the form:
<type>: <actualFormula>
: simple type hint for the type of the underlying tensor of the columns involved in the formula.<type> -> <resDtype>: <actualFormula>
: full type for closure.<type>
is the dtype used for input tensors,<resDtype>
the resulting type.
For example:
f{int -> int: `x` * `y`}
# ^--- type of the tensors involved on the RHS. Will be read as integers
# ^--- type of the resulting tensor
In this case the type would be determined to be float by the macro, so type hints are required in case we need them to be integers.
In addition to looking at symbols in the scope, there is a step
involving some simple heuristic rules, e.g. if *
, /
is involved, it’s
assumed that the input tensors are floats and the output as well. If
&
or $
is involved, it’s assumed to be strings.
Finally if and
and other logic keywords are used, the result is
assumed to be bool
(not the input thought!).
const floatSet = toSet(@["+", "-", "*", "/", "mod"])
const stringSet = toSet(@["&", "$"])
const boolSet = toSet(@["and", "or", "xor", ">", "<", ">=", "<=", "==", "!=",
"true", "false", "in", "notin"])
Notes on formula macro internals
For an insight into the implementation details, ideas and development notes, check out the following document:
https://github.com/SciNim/Datamancer/blob/master/notes/formula_dev_notes.org