Centurion
Introduction
Centurion is a JVM toolkit, written in Kotlin, for columnar and streaming formats.
This library allows you to read, write and convert between the following formats: Avro, Parquet, Orc and Arrow.
Readers and writers are compatible with data generated by Apache Spark and do not require you to start a cluster to perform I/O operations.
Schema Conversions
Centurion allows easy conversion of schemas between any of the supported formats, via Centurion's own internal format.
This internal format is a superset of the functionality of all the supported formats, and is intended only as an intermediate format for conversions.
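
The idea of converting through an intermediate format can be sketched as follows. This is a minimal illustration only, not Centurion's actual API: `InternalType`, `SchemaCodec`, `AvroCodec`, `ArrowCodec` and `convert` are hypothetical names, and type names are plain strings for brevity. Each format needs only one codec to and from the internal model, rather than a dedicated converter for every pair of formats.

```kotlin
// Hypothetical internal model (NOT Centurion's real classes), covering three types only.
enum class InternalType { STRING, INT64, BOOLEAN }

// One codec per external format: into the internal model and back out again.
interface SchemaCodec {
    fun toInternal(name: String): InternalType
    fun fromInternal(type: InternalType): String
}

object AvroCodec : SchemaCodec {
    override fun toInternal(name: String): InternalType = when (name) {
        "String" -> InternalType.STRING
        "Long" -> InternalType.INT64
        "Boolean" -> InternalType.BOOLEAN
        else -> throw IllegalArgumentException("Unsupported Avro type: $name")
    }

    override fun fromInternal(type: InternalType): String = when (type) {
        InternalType.STRING -> "String"
        InternalType.INT64 -> "Long"
        InternalType.BOOLEAN -> "Boolean"
    }
}

object ArrowCodec : SchemaCodec {
    override fun toInternal(name: String): InternalType = when (name) {
        "Utf8" -> InternalType.STRING
        "Int64 Signed" -> InternalType.INT64
        "Bool" -> InternalType.BOOLEAN
        else -> throw IllegalArgumentException("Unsupported Arrow type: $name")
    }

    override fun fromInternal(type: InternalType): String = when (type) {
        InternalType.STRING -> "Utf8"
        InternalType.INT64 -> "Int64 Signed"
        InternalType.BOOLEAN -> "Bool"
    }
}

// Any-to-any conversion is the composition of two codecs via the internal type.
fun convert(name: String, from: SchemaCodec, to: SchemaCodec): String =
    to.fromInternal(from.toInternal(name))

fun main() {
    println(convert("Long", AvroCodec, ArrowCodec)) // prints "Int64 Signed"
}
```

With this shape, adding a new format means writing one more codec; the existing codecs stay untouched.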
The following table shows how types map between each of the formats.
Centurion Type | Avro | Parquet | Orc | Arrow |
---|---|---|---|---|
Strings | String | Binary (String) | String | Utf8 |
UUID | String (UUID) | Binary (String) | String | Utf8 |
Booleans | Boolean | Boolean | Boolean | Bool |
Int64 | Long | Int64 | Long | Int64 Signed |
Int32 | Int | Int32 | Int | Int32 Signed |
Int16 | N/A (Int) | Int32 (Signed Int16) | Short | Int16 Signed |
Int8 | N/A (Int) | Int32 (Signed Int8) | Byte | Int8 Signed |
Float64 | Double | Double | Double | FloatingPointDouble |
Float32 | Float | Float | Float | FloatingPointSingle |
Enum | Enum | Enum | String | String |
Decimal | Binary / Fixed with annotation Decimal | Decimal(precision, scale) | Decimal | Decimal |
Varchar | Fixed | N/A (String) | Varchar | N/A (String) |
TimestampMillis | Long (TimestampMillis) | Int64 (Timestamp) | Timestamp | Timestamp (Millis) |
TimestampMicros | Long (TimestampMicros) | Int64 (Timestamp) | Unsupported | Timestamp (Micros) |
Map | Map | Map | Map | Map |
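
Rows marked N/A in the table imply a lossy mapping: the target format has no exact equivalent, so a fallback type is used. The sketch below illustrates this for the integer rows and Avro. It assumes that an Avro Int is read back as Int32, which the table suggests but does not state; `IntWidth`, `toAvroName`, `fromAvroName` and `roundTripViaAvro` are hypothetical names, not part of Centurion.

```kotlin
// Illustrative subset of the integer rows; names are hypothetical.
enum class IntWidth { INT64, INT32, INT16, INT8 }

// Avro has no 8- or 16-bit integer type, so the narrow widths fall back to Int.
fun toAvroName(t: IntWidth): String = when (t) {
    IntWidth.INT64 -> "Long"
    IntWidth.INT32, IntWidth.INT16, IntWidth.INT8 -> "Int"
}

// Assumption: an Avro Int carries no width hint, so it can only come back as Int32.
fun fromAvroName(name: String): IntWidth = when (name) {
    "Long" -> IntWidth.INT64
    "Int" -> IntWidth.INT32
    else -> throw IllegalArgumentException("Not an Avro integer type: $name")
}

// Writes a type out to Avro and reads it back in.
fun roundTripViaAvro(t: IntWidth): IntWidth = fromAvroName(toAvroName(t))

fun main() {
    for (t in IntWidth.values()) {
        val back = roundTripViaAvro(t)
        val note = if (back != t) " (widened)" else ""
        println("$t -> ${toAvroName(t)} -> $back$note")
    }
}
```

Under these assumptions, Int64 and Int32 survive a round trip through Avro unchanged, while Int16 and Int8 both come back as Int32.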