ParquetSharp is a .NET library for reading and writing Apache Parquet files.

Introduction

ParquetSharp is a cross-platform .NET library for reading and writing Apache Parquet files.

It is implemented in C# as a PInvoke wrapper around Apache Parquet C++ to provide high performance and compatibility. Check out ParquetSharp.DataFrame if you need a convenient integration with the .NET DataFrames.

Supported platforms:

Chip    Linux   Windows   macOS
x64     ✔       ✔         ✔
arm64   ✔                 ✔

Quickstart

The following examples show how to write and then read a Parquet file with three columns representing a timeseries of object-value pairs. These use the low-level API, which is the recommended API and closely maps to the API of Apache Parquet C++.

Writing a Parquet File:

var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = new float[] { /* ... */ };

var columns = new Column[]
{
    new Column<DateTime>("Timestamp"),
    new Column<int>("ObjectId"),
    new Column<float>("Value")
};

using var file = new ParquetFileWriter("float_timeseries.parquet", columns);
using var rowGroup = file.AppendRowGroup();

using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
{
    timestampWriter.WriteBatch(timestamps);
}
using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    objectIdWriter.WriteBatch(objectIds);
}
using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(values);
}

file.Close();

Reading the file back:

using var file = new ParquetFileReader("float_timeseries.parquet");

for (int rowGroup = 0; rowGroup < file.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = file.RowGroup(rowGroup);
    var groupNumRows = checked((int) rowGroupReader.MetaData.NumRows);

    var groupTimestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(groupNumRows);
    var groupObjectIds = rowGroupReader.Column(1).LogicalReader<int>().ReadAll(groupNumRows);
    var groupValues = rowGroupReader.Column(2).LogicalReader<float>().ReadAll(groupNumRows);
}

file.Close();

Documentation

For more detailed information on how to use ParquetSharp, see the project documentation.

Rationale

We desired a Parquet implementation with the following properties:

  • Cross platform (originally Windows and Linux - but now also macOS).
  • Callable from .NET Core.
  • Good performance.
  • Well maintained.
  • Close to official Parquet reference implementations.

Not finding an existing solution meeting these requirements, we decided to implement a .NET wrapper around apache-parquet-cpp (now part of Apache Arrow) starting at version 1.4.0. The library tries to stick closely to the existing C++ API, although it does provide higher level APIs to facilitate its usage from .NET. The user should always be able to access the lower-level API.

Performance

The following benchmarks can be reproduced by running ParquetSharp.Benchmark.csproj. The relative performance of ParquetSharp 10.0.1 is compared to Parquet.NET 4.6.2, an alternative open-source .NET library that is fully managed. The Decimal tests focus purely on handling the C# decimal type, while the TimeSeries tests benchmark three columns of the types {int, DateTime, float}. Results are from a Ryzen 5900X on Linux 6.2.7 using the dotnet 6.0.14 runtime.

If performance is a concern for you, we recommend benchmarking your own workloads and testing different encodings and compression methods. For example, disabling dictionary encoding for floating point columns can often significantly improve performance.
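As a rough illustration of the kind of tuning mentioned above, the following sketch writes the quickstart's three columns with dictionary encoding disabled for the floating point column only. It assumes that WriterPropertiesBuilder exposes Compression and DisableDictionary(string) and that ParquetFileWriter accepts a WriterProperties argument; check these against the ParquetSharp version you have installed. The class name, method name and the choice of Snappy are illustrative only, not a recommendation from the benchmarks.

```csharp
using System;
using ParquetSharp;

public static class DictionaryEncodingSketch
{
    // Hypothetical helper: writes one row group with dictionary encoding
    // turned off for the "Value" column and left at the default elsewhere.
    public static void Write(string path, DateTime[] timestamps, int[] objectIds, float[] values)
    {
        var columns = new Column[]
        {
            new Column<DateTime>("Timestamp"),
            new Column<int>("ObjectId"),
            new Column<float>("Value")
        };

        // Assumption: these builder methods exist with these signatures.
        using var builder = new WriterPropertiesBuilder();
        using var properties = builder
            .Compression(Compression.Snappy)
            .DisableDictionary("Value")
            .Build();

        using var file = new ParquetFileWriter(path, columns, properties);
        using var rowGroup = file.AppendRowGroup();

        using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
        {
            timestampWriter.WriteBatch(timestamps);
        }
        using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
        {
            objectIdWriter.WriteBatch(objectIds);
        }
        using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
        {
            valueWriter.WriteBatch(values);
        }

        file.Close();
    }
}
```

Whether this helps depends on the data, so it is worth timing the same write with and without the DisableDictionary call on a representative sample.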

               Decimal (Read)   Decimal (Write)   TimeSeries (Read)   TimeSeries (Write)
Parquet.NET    1.0x             1.0x              1.0x                1.0x
ParquetSharp   4.0x Faster      3.0x Faster       2.8x Faster         1.5x Faster

Known Limitations

Because this library is a thin wrapper around the Parquet C++ library, misuse can cause native memory access violations.

Typically this can arise when attempting to access an instance whose owner has been disposed. Because some objects and properties are exposed by Parquet C++ via regular pointers (instead of consistently using std::shared_ptr), dereferencing these after the owner class instance has been destructed will lead to an invalid pointer access.
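One way to stay clear of this, sketched below using only the reader calls from the quickstart, is to copy the data into a managed array while every owner is still alive and to dispose the readers from the innermost outwards. The class and method names are hypothetical, and the column index assumes the quickstart's file layout.

```csharp
using ParquetSharp;

public static class DisposalOrderSketch
{
    // Hypothetical helper: returns the "Value" column (index 2) of the
    // first row group as a managed array.
    public static float[] ReadFirstRowGroupValues(string path)
    {
        using var file = new ParquetFileReader(path);
        float[] values;

        // Stacked using statements dispose in reverse order when the block
        // ends: logical reader, then column reader, then row group reader.
        // The file reader, which owns them all, outlives the block.
        using (var rowGroupReader = file.RowGroup(0))
        using (var columnReader = rowGroupReader.Column(2))
        using (var logicalReader = columnReader.LogicalReader<float>())
        {
            var numRows = checked((int) rowGroupReader.MetaData.NumRows);
            values = logicalReader.ReadAll(numRows);
        }

        file.Close();

        // The array is ordinary managed memory, so it remains valid after
        // the native readers have been released.
        return values;
    }
}
```

The pattern to avoid is the opposite one: keeping a row group reader, column reader or metadata object in a field or returning it to the caller after the ParquetFileReader that produced it has been disposed.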

As only 64-bit runtimes are available, ParquetSharp cannot be referenced by a 32-bit project. For example, using the library from F# Interactive requires running fsiAnyCpu.exe rather than fsi.exe.

Building

Building ParquetSharp for Windows requires the following dependencies:

  • Visual Studio 2022 (17.0 or higher)
  • Apache Arrow (12.0.1)

For building Arrow (including Parquet) and its dependencies, we recommend using Microsoft's vcpkg. The build scripts will use an existing vcpkg installation if either the VCPKG_INSTALLATION_ROOT or the VCPKG_ROOT environment variable is defined; otherwise, vcpkg will be downloaded into the build directory. Note that the Windows build must be run from a Visual Studio Developer PowerShell for the build script to succeed.

Windows (Visual Studio 2022 Win64 solution)

> build_windows.ps1
> dotnet build csharp.test --configuration=Release

Linux and macOS (Makefile)

> ./build_unix.sh
> dotnet build csharp.test --configuration=Release

We have had to write our own FindPackage macros for most of the dependencies to get us going - it clearly needs more love and attention and is likely to be redundant with some vcpkg helper tools.

Contributing

We welcome new contributors! We will happily receive PRs for bug fixes or small changes. If you're contemplating something larger, please get in touch first by opening a GitHub Issue describing the problem and how you propose to solve it.

License

Copyright 2018-2021 G-Research

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
