  • Stars
    289
  • Rank 143,419 (Top 3 %)
  • Language
    HTML
  • License
    MIT License
  • Created over 4 years ago
  • Updated 3 months ago

Repository Details

Dataset manipulation library built on top of tech.ml.dataset

Versions

tech.ml.dataset 7.x (master branch)

tech.ml.dataset 4.x (4.0 branch)

[scicloj/tablecloth "4.04"]

Introduction

tech.ml.dataset is a great and fast library which brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as part of the bigger tech.ml stack.

I started by testing the library and helping to fix the bugs this uncovered. My main goal was to compare its functionality with established tools on other platforms. I focused on R solutions: dplyr, tidyr and data.table.

While converting the examples, I worked out how to reorganize existing tech.ml.dataset functions into a simple-to-use API. The main goals were:

  • Focus on dataset manipulation functionality, leaving out other parts of tech.ml like pipelines, datatypes, readers, ML, etc.
  • A single entry point for common operations: one function dispatching on the given arguments.
  • group-by results in a special kind of dataset: a dataset containing the subsets created by grouping as a column.
  • Most operations recognize both regular and grouped datasets and process the data accordingly.
  • One function form, with the dataset as the first argument, to enable thread-first on a dataset.
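
The goals above can be sketched on a tiny in-memory dataset. This is only an illustration, assuming tablecloth and its dependencies are on the classpath; the column names :symbol and :price and all values are made up for the example.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical in-memory data, used only for illustration.
(def ds (tc/dataset {:symbol ["AAPL" "AAPL" "MSFT"]
                     :price  [10.0 20.0 30.0]}))

;; group-by yields a grouped dataset: the rows of each group are kept
;; as a sub-dataset inside a column.
(def grouped (tc/group-by ds (fn [row] {:symbol (:symbol row)})))

;; aggregate is called the same way on a regular or a grouped dataset;
;; on a grouped dataset the function is applied to every group.
(def overall   (tc/aggregate ds      #(dfn/mean (% :price))))
(def per-group (tc/aggregate grouped #(dfn/mean (% :price))))

;; The dataset is always the first argument, so steps chain with ->.
(def pipeline-result
  (-> ds
      (tc/group-by (fn [row] {:symbol (:symbol row)}))
      (tc/aggregate #(dfn/mean (% :price)))
      (tc/order-by [:symbol])))
```

Here overall holds a single summary row for the whole dataset, while per-group and pipeline-result hold one row per symbol.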

Important! This library is neither a replacement for tech.ml.dataset nor a separate library. It should be considered an addition on top of tech.ml.dataset.

If you want to know more about tech.ml.dataset and dtype-next, please refer to their documentation.

Join the discussion on Zulip

Documentation

Please refer to the detailed documentation with examples.

Usage example

(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.datetime :as dtype-dt]
         '[tech.v3.datatype.functional :as dfn])

;; Load the stocks CSV, group by symbol and year, then average the price per group.
(-> "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
    (tc/dataset {:key-fn keyword})
    (tc/group-by (fn [row]
                   {:symbol (:symbol row)
                    :year (dtype-dt/long-temporal-field :years (:date row))}))
    (tc/aggregate #(dfn/mean (% :price)))
    (tc/order-by [:symbol :year])
    (tc/head 10))

_unnamed [10 3]:

:symbol  :year       summary
AAPL      2000   21.74833333
AAPL      2001   10.17583333
AAPL      2002    9.40833333
AAPL      2003    9.34750000
AAPL      2004   18.72333333
AAPL      2005   48.17166667
AAPL      2006   72.04333333
AAPL      2007  133.35333333
AAPL      2008  138.48083333
AAPL      2009  150.39333333

Contributing

Tablecloth is open to contributions. The best way to start is a discussion on Zulip.

Development tools for documentation

The documentation is written in RMarkdown, which means you need R to create the html/md/pdf files. It contains around 600 code snippets, which are run during the build. There are two files:

  • README.Rmd
  • docs/index.Rmd

Prepare the following software and render the documentation:

  1. Install R
  2. Install rep, an nREPL client
  3. Install pandoc
  4. Run nREPL
  5. Run R and install R packages: install.packages(c("rmarkdown","knitr"), dependencies=T)
  6. Load rmarkdown: library(rmarkdown)
  7. Render readme: render("README.Rmd","md_document")
  8. Render documentation: render("docs/index.Rmd","all")

API file generation

The tablecloth.api namespace is generated from api-template. Please run the following before building the documentation:

(exporter/write-api! 'tablecloth.api.api-template
                     'tablecloth.api
                     "src/tablecloth/api.clj"
                     '[group-by drop concat rand-nth first last shuffle])

Guideline

  1. Before committing changes, please run the tests. I usually do: lein do clean, check, test and then build the documentation as described above (which also tests the whole library).
  2. Keep the API as simple as possible:
    • the first argument should be a dataset
    • if parametrization is complex, the last argument should accept a map of optional function arguments
    • avoid variadic associative destructuring for function arguments
    • a function should usually work on a grouped dataset as well; in that case accept a parallel? argument (if applicable).
  3. Follow the potemkin pattern and import functions into the API namespace using the tech.v3.datatype.export-symbols/export-symbols function.
  4. Functions which are composed out of API functions to cover specific case(s) should go to the tablecloth.utils namespace.
  5. Always update README.Rmd, CHANGELOG.md and docs/index.Rmd; tests and function docs are highly welcome.
  6. Always discuss changes and PRs first.
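
As a rough illustration of the API guidelines in point 2, here is a hypothetical helper, top-rows, which is not part of tablecloth. It takes the dataset as the first argument and an optional map as the last one, and it avoids variadic keyword arguments. The helper name, the :n option and its default of 5 are assumptions made for this sketch.

```clojure
(require '[tablecloth.api :as tc])

(defn top-rows
  "Hypothetical helper (not part of tablecloth): returns the rows with the
  highest values in `column`. Options map: :n, number of rows (default 5)."
  ([ds column] (top-rows ds column nil))
  ([ds column {:keys [n] :or {n 5}}]
   (-> ds
       (tc/order-by column :desc) ; sort descending by the given column
       (tc/head n))))             ; keep the first n rows

;; Made-up data for illustration.
(def example (tc/dataset {:x [3 1 2]}))

;; The dataset-first signature lets the helper sit in a thread-first pipeline.
(def top-two (-> example (top-rows :x {:n 2})))
```

Because the options arrive as a single map, callers can build or merge them as ordinary data before passing them in.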

TODO

  • tests
  • tutorials

Licence

Copyright (c) 2020 Scicloj

The MIT Licence
