
Repository Details

Data frames for tabular data.

Frames

Data Frames for Haskell

User-friendly, type-safe, runtime-efficient tooling for working with tabular data deserialized from comma-separated values (CSV) files. The type of each row is inferred from the data, which can then be streamed from disk or worked with in memory.

We provide streaming and in-memory interfaces for efficiently working with datasets that can be safely indexed by column names found in the data files themselves. This type safety of column access and manipulation is checked at compile time.

Use Cases

For a running example, we will use variations of the prestige.csv data set. Each row includes 7 columns, but we just want to compute the average ratio of income to prestige.

Clean Data

If you have CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. Frames provides TemplateHaskell machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.

We generate a collection of definitions by inspecting the data file at compile time (using tableTypes). At runtime, we load that data into column-oriented storage in memory with a row-oriented interface (an in-core array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the foldl library. Our fold will project the columns we want and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that program.

{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
module UncurryFold where
import qualified Control.Foldl                 as L
import           Data.Vinyl.Curry               ( runcurryX )
import           Frames

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes "Row" "test/data/prestige.csv"

loadRows :: IO (Frame Row)
loadRows = inCoreAoS (readTable "test/data/prestige.csv")

-- | Compute the ratio of income to prestige for a record containing
-- only those fields.
ratio :: Record '[Income, Prestige] -> Double
ratio = runcurryX (\i p -> fromIntegral i / p)

averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength
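
The averaging fold above combines a sum and a length so that both are computed in a single traversal. As a rough illustration of why that works, here is a minimal sketch that uses only base. The Fold type, fold, premap, sumF, and lengthF below are simplified stand-ins written for this example, not the foldl library's own definitions, and the sample (income, prestige) pairs are made up.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
module Main where

import Data.List (foldl')

-- | Simplified stand-in for a left fold: a step function, an initial
-- accumulator, and a function that extracts the final result.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- | Strict pair so that both accumulators are forced at every step.
data Pair a b = Pair !a !b

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  (Fold stepL beginL doneL) <*> (Fold stepR beginR doneR) =
    Fold step (Pair beginL beginR) done
    where
      -- Both folds see each input element once, in the same pass.
      step (Pair xL xR) a = Pair (stepL xL a) (stepR xR a)
      done (Pair xL xR) = doneL xL (doneR xR)

-- | Run a fold over a list.
fold :: Fold a b -> [a] -> b
fold (Fold step begin done) = done . foldl' step begin

-- | Apply a function to each input before it reaches the fold.
premap :: (a -> b) -> Fold b r -> Fold a r
premap f (Fold step begin done) = Fold (\x a -> step x (f a)) begin done

sumF :: Num a => Fold a a
sumF = Fold (+) 0 id

lengthF :: Num b => Fold a b
lengthF = Fold (\n _ -> n + 1) 0 id

-- | Same shape as the avg definition in the program above.
average :: Fold Double Double
average = (/) <$> sumF <*> lengthF

-- | Made-up (income, prestige) pairs, for illustration only.
sampleRows :: [(Int, Double)]
sampleRows = [(100, 50), (60, 20), (40, 10)]

averageRatio :: Double
averageRatio = fold (premap (\(i, p) -> fromIntegral i / p) average) sampleRows

main :: IO ()
main = print averageRatio
```

Because the Applicative instance pairs the two accumulators, the input is walked once no matter how many folds are combined.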

Missing Header Row

Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type and our own column names. For the sake of demonstration, we will also specify a prefix to be added to every column-based identifier. A prefix is particularly useful when the column names do come from a header row and you want to work with multiple CSV files, some of whose column names coincide. We customize behavior by updating whichever fields of the record produced by rowGen we care to change, then passing the result to tableTypes'.

{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
module UncurryFoldNoHeader where
import qualified Control.Foldl                 as L
import           Data.Vinyl.Curry               ( runcurryX )
import           Frames
import           Frames.TH                      ( rowGen
                                                , RowGen(..)
                                                )

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes' (rowGen "test/data/prestigeNoHeader.csv")
            { rowTypeName = "NoH"
            , columnNames = [ "Job", "Schooling", "Money", "Females"
                            , "Respect", "Census", "Category" ]
            , tablePrefix = "NoHead"}

loadRows :: IO (Frame NoH)
loadRows = inCoreAoS (readTableOpt noHParser "test/data/prestigeNoHeader.csv")

-- | Compute the ratio of money to respect for a record containing
-- only those fields.
ratio :: Record '[NoHeadMoney, NoHeadRespect] -> Double
ratio = runcurryX (\m r -> fromIntegral m / r)

averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength

Missing Data

Sometimes not every row has a value for every column. I went ahead and blanked the prestige column of every row whose type column was NA in prestige.csv. For example, the first such row now reads,

"athletes",11.44,8206,8.13,,3373,NA

We can no longer parse a Double for that row, so we will work with row types parameterized by a Maybe type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check whether the prestige column was parsed, keeping only those rows for which it was not, then project the income column from those rows, and finally throw away Nothing elements.

{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications, TypeOperators #-}
module UncurryFoldPartialData where
import qualified Control.Foldl as L
import Data.Maybe (isNothing)
import Data.Vinyl.XRec (toHKD)
import Frames
import Pipes (Producer, (>->))
import qualified Pipes.Prelude as P

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
-- The prestige column has been left blank for rows whose "type" is
-- listed as "NA".
tableTypes "Row" "test/data/prestigePartial.csv"

-- | A pipes 'Producer' of our 'Row' type with a column functor of
-- 'Maybe'. That is, each element of each row may have failed to parse
-- from the CSV file.
maybeRows :: MonadSafe m => Producer (Rec (Maybe :. ElField) (RecordColumns Row)) m ()
maybeRows = readTableMaybe "test/data/prestigePartial.csv"

-- | Return the number of rows with unknown prestige, and the average
-- income of those rows.
incomeOfUnknownPrestige :: IO (Int, Double)
incomeOfUnknownPrestige =
  runSafeEffect . L.purely P.fold avg $
    maybeRows >-> P.filter prestigeUnknown >-> P.map getIncome >-> P.concat
  where avg = (\s l -> (l, s / fromIntegral l)) <$> L.sum <*> L.length
        getIncome = fmap fromIntegral . toHKD . rget @Income
        prestigeUnknown :: Rec (Maybe :. ElField) (RecordColumns Row) -> Bool
        prestigeUnknown = isNothing . toHKD . rget @Prestige
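
The filter, project, and drop-Nothing steps of that pipeline can also be sketched over an ordinary list, with no streaming. This is only an illustration of the logic: PartialRow and its field names are hypothetical, hand-written stand-ins for the record that tableTypes generates, and every sample value other than the athletes row quoted above is made up.

```haskell
module Main where

import Data.List (foldl')
import Data.Maybe (isNothing, mapMaybe)

-- | Hypothetical row type: each field may have failed to parse.
data PartialRow = PartialRow
  { rowIncome   :: Maybe Int
  , rowPrestige :: Maybe Double
  }

-- | Sample rows. The first mirrors the athletes row from the text
-- (income 8206, prestige blank); the rest are invented.
sampleRows :: [PartialRow]
sampleRows =
  [ PartialRow (Just 8206) Nothing
  , PartialRow (Just 5000) (Just 40.0)
  , PartialRow (Just 4000) Nothing
  , PartialRow Nothing     Nothing   -- dropped: no income to average
  ]

-- | Number of rows with unknown prestige and a known income, and the
-- average income of those rows. The average is NaN if no row qualifies.
incomeOfUnknownPrestige :: [PartialRow] -> (Int, Double)
incomeOfUnknownPrestige rows = (count, total / fromIntegral count)
  where
    -- Keep rows whose prestige failed to parse, project the income,
    -- and discard incomes that are themselves Nothing.
    incomes = map fromIntegral
            . mapMaybe rowIncome
            . filter (isNothing . rowPrestige)
            $ rows
    (total, count) = foldl' step (0, 0) incomes
    -- Force both accumulators so no thunks build up.
    step (t, c) x =
      let t' = t + x
          c' = c + 1
      in t' `seq` c' `seq` (t', c')

main :: IO ()
main = print (incomeOfUnknownPrestige sampleRows)
```

As in the streaming program, a row only contributes to the count once it has passed the filter and its income is present.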

Tutorial

For comparison to working with data frames in other languages, see the tutorial.

Demos

There are various demos in the repository. Be sure to run the getdata build target to download the data files used by the demos! You can also download the data files manually and put them in a data directory in the directory from which you will be running the executables.

Contribute

You can build Frames via nix with the following command:

nix build .#Frames-8107  # or nix build .#Frames-921

This creates a ./result link in the current folder.

To get a development shell with all libraries, you can run:

nix develop .#Frames-921

To get just ghc and cabal in your shell, a simple nix develop will do.

Benchmarks

The benchmark shows several ways of dealing with data when you want to perform multiple traversals.

Another demo shows how to fuse multiple passes into one so that the full data set is never resident in memory. A Pandas version of a similar program is also provided for comparison.

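This is a trivial program, but it shows that performance is comparable to Pandas and that the memory savings of a compiled program are substantial. In the runs below, Frames finishes in roughly half the elapsed time (0.37 s versus 0.72 s) and has a maximum resident size of about 5 MB versus about 79 MB.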

First with Pandas,

$ nix-shell -p 'python3.withPackages (p: [p.pandas])' --run '$(which time) -f "%Uuser %Ssystem %Eelapsed %PCPU; %Mmaxresident KB" python benchmarks/panda.py'
28.087476512228815
-81.90356506136422
0.67user 0.04system 0:00.72elapsed 99%CPU; 79376maxresident KB

Then with Frames,

$ $(which time) -f '%Uuser %Ssystem %Eelapsed %PCPU; %Mmaxresident KB' dist-newstyle/build/x86_64-linux/ghc-8.10.4/Frames-0.7.2/x/benchdemo/build/benchdemo/benchdemo
28.087476512228815
-81.90356506136422
0.36user 0.00system 0:00.37elapsed 100%CPU; 5088maxresident KB
