  • Language
    Go
  • License
    Apache License 2.0
  • Created over 7 years ago
  • Updated 4 months ago

Repository Details

A binary for parallel copying of CSV data into a TimescaleDB hypertable

timescaledb-parallel-copy

timescaledb-parallel-copy is a command-line program that parallelizes PostgreSQL's built-in COPY functionality for bulk inserting data into TimescaleDB.

Getting started

You need the Go runtime (1.13+) installed; then install the binary with go install (note that the pkg@latest form used here requires Go 1.16 or later):

$ go install github.com/timescale/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy@latest

Before using this program to bulk insert data, the TimescaleDB extension should be installed in your database and the target table should already be a hypertable.

Using timescaledb-parallel-copy

To bulk insert data from a file named foo.csv into a (hyper)table named sample in a database called test:

# single-threaded
$ timescaledb-parallel-copy --db-name test --table sample --file foo.csv

# 2 workers
$ timescaledb-parallel-copy --db-name test --table sample --file foo.csv \
    --workers 2

# 2 workers, report progress every 30s
$ timescaledb-parallel-copy --db-name test --table sample --file foo.csv \
    --workers 2 --reporting-period 30s

# Treat literal string 'NULL' as NULLs:
$ timescaledb-parallel-copy --db-name test --table sample --file foo.csv \
    --copy-options "NULL 'NULL' CSV"

Other options and flags are also available:

$ timescaledb-parallel-copy --help

Usage of timescaledb-parallel-copy:
  -batch-size int
        Number of rows per insert (default 5000)
  -columns string
        Comma-separated columns present in CSV
  -connection string
        PostgreSQL connection url (default "host=localhost user=postgres sslmode=disable")
  -copy-options string
        Additional options to pass to COPY (e.g., NULL 'NULL') (default "CSV")
  -db-name string
        Database where the destination table exists
  -escape character
        The ESCAPE character to use during COPY (default '"')
  -file string
        File to read from rather than stdin
  -header-line-count int
        Number of header lines (default 1)
  -limit int
        Number of rows to insert overall; 0 means to insert all
  -log-batches
        Whether to time individual batches.
  -quote character
        The QUOTE character to use during COPY (default '"')
  -reporting-period duration
        Period to report insert stats; if 0s, intermediate results will not be reported
  -schema string
        Destination table's schema (default "public")
  -skip-header
        Skip the first line of the input
  -split string
        Character to split by (default ",")
  -table string
        Destination table for insertions (default "test_table")
  -truncate
        Truncate the destination table before insert
  -verbose
        Print more information about copying statistics
  -version
        Show the version of this tool
  -workers int
        Number of parallel requests to make (default 1)
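
To make the relationship between these flags and PostgreSQL's COPY more concrete, here is a minimal Go sketch of how options such as -schema, -table, -columns, -split, and -copy-options could be combined into a single COPY ... FROM STDIN statement. This is an illustration under stated assumptions, not the tool's actual implementation: the copyConfig type and the buildCopyStatement, quoteIdent, and quoteLiteral helpers are hypothetical names introduced here, and the exact statement the tool issues may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// copyConfig is a hypothetical struct mirroring a few flags from the help text.
type copyConfig struct {
	Schema      string // -schema
	Table       string // -table
	Columns     string // -columns (comma-separated, may be empty)
	Split       string // -split
	CopyOptions string // -copy-options
}

// quoteIdent wraps an identifier in double quotes, doubling embedded quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// quoteLiteral wraps a string in single quotes, doubling embedded quotes.
func quoteLiteral(s string) string {
	return `'` + strings.ReplaceAll(s, `'`, `''`) + `'`
}

// buildCopyStatement assembles a COPY ... FROM STDIN statement from the config.
// Assumption: options are simply appended after the delimiter clause.
func buildCopyStatement(c copyConfig) string {
	var sb strings.Builder
	sb.WriteString("COPY " + quoteIdent(c.Schema) + "." + quoteIdent(c.Table))
	if c.Columns != "" {
		sb.WriteString("(" + c.Columns + ")")
	}
	sb.WriteString(" FROM STDIN WITH DELIMITER " + quoteLiteral(c.Split))
	if c.CopyOptions != "" {
		sb.WriteString(" " + c.CopyOptions)
	}
	return sb.String()
}

func main() {
	stmt := buildCopyStatement(copyConfig{
		Schema:      "public",
		Table:       "sample",
		Split:       ",",
		CopyOptions: "NULL 'NULL' CSV",
	})
	fmt.Println(stmt)
}
```

Under these assumptions, each worker would issue a statement of this shape on its own connection and stream one batch of rows into it.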

Purpose

PostgreSQL's native COPY command is transactional and single-threaded, and may not be suitable for ingesting large amounts of data. Assuming the file is at least loosely chronologically ordered with respect to the hypertable's time dimension, this tool can substantially speed up ingestion by parallelizing the operation, allowing users to take full advantage of their hardware.

This tool also takes care to ingest data in a more efficient manner by roughly preserving the order of the rows. Because inserts are shared between parallel workers in a "round-robin" fashion, the database has to switch between chunks less often. This improves memory management and keeps operations on the disk as sequential as possible.
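
The round-robin idea can be sketched in a few lines of Go. The sketch below is only an illustration, not the tool's actual code: splitBatches and assignRoundRobin are hypothetical helper names, rows are represented as plain strings, and the real tool may hand batches to workers differently (for example through a shared queue). It shows how consecutive batches dealt out in turn keep each worker's input roughly in the original, time-ordered sequence.

```go
package main

import "fmt"

// splitBatches groups rows into consecutive batches of at most batchSize rows,
// preserving the input order. It returns nil if batchSize is not positive.
func splitBatches(rows []string, batchSize int) [][]string {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		batches = append(batches, rows[start:end])
	}
	return batches
}

// assignRoundRobin deals batches to workers in turn:
// batch i goes to worker i % workers.
func assignRoundRobin(batches [][]string, workers int) [][][]string {
	perWorker := make([][][]string, workers)
	for i, b := range batches {
		w := i % workers
		perWorker[w] = append(perWorker[w], b)
	}
	return perWorker
}

func main() {
	// Rows are assumed to be loosely ordered by time (t1 is the oldest).
	rows := []string{"t1", "t2", "t3", "t4", "t5", "t6", "t7"}
	batches := splitBatches(rows, 2)
	for w, assigned := range assignRoundRobin(batches, 2) {
		fmt.Printf("worker %d: %v\n", w, assigned)
	}
}
```

With two workers and a batch size of 2, worker 0 receives the first and third batches and worker 1 the second and fourth, so at any moment the workers are inserting rows from neighbouring time ranges rather than from opposite ends of the file.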

Contributing

We welcome contributions to this utility, which, like TimescaleDB, is released under the Apache 2.0 open source license. The same Contributor License Agreement (CLA) applies; please sign it if you're a new contributor.

Running Tests

Some of the tests require a running Postgres database. Set the TEST_CONNINFO environment variable to point at the database you want to run tests against. (Assume that the tests may be destructive; in particular it is not advisable to point the tests at any production database.)

For example:

$ createdb gotest
$ TEST_CONNINFO='dbname=gotest user=myuser' go test -v ./...
