  • Stars: 878
  • Rank: 49,802 (Top 2 %)
  • Language: Scala
  • License: Apache License 2.0
  • Created almost 7 years ago
  • Updated 8 days ago

Repository Details

TiSpark is built for running Apache Spark on top of TiDB/TiKV

TiSpark

TiSpark is a thin layer built for running Apache Spark on top of TiDB/TiKV/TiFlash to answer complex OLAP queries. It combines the strengths of the Spark platform and the distributed TiKV/TiFlash clusters while integrating seamlessly with TiDB.

The figure below shows the architecture of TiSpark.

[Figure: TiSpark architecture]

  • TiSpark integrates well with the Spark Catalyst Engine. It provides precise control of computing, which allows Spark to read data from TiKV efficiently. It also supports index seek, which significantly improves the performance of point queries.
  • It uses several strategies to push down computing and so reduce the size of the dataset handled by Spark SQL, which accelerates query execution. It also uses TiDB's built-in statistics for query plan optimization.
  • From the perspective of data integration, TiSpark + TiDB provides a solution that performs both transactions and analysis directly on the same platform, without building and maintaining any ETL pipelines. This simplifies the system architecture and reduces the cost of maintenance.
  • In addition, you can deploy and use tools from the Spark ecosystem for further data processing and manipulation on TiDB. For example, you can use TiSpark for data analysis and ETL, retrieve data from TiKV as a data source for machine learning, or generate reports from a scheduling system.

TiSpark relies on the availability of TiKV clusters and PDs. You also need to set up and use the Spark clustering platform.

Most of the TiSpark logic is inside a thin layer, namely, the tikv-client library.

About mysql-connector-java

TiSpark does not bundle the mysql-connector-java dependency because of GPL license restrictions.

The following versions of the TiSpark jar no longer include mysql-connector-java:

  • TiSpark > 3.0.1
  • TiSpark > 2.5.1 for TiSpark 2.5.x
  • TiSpark > 2.4.3 for TiSpark 2.4.x

TiSpark needs mysql-connector-java for writing and authentication. Import mysql-connector-java manually when you need either feature:

  • Put the jar into the Spark jars directory.

  • Alternatively, pass it when you submit the Spark job, for example:

spark-submit --jars tispark-assembly-3.0_2.12-3.1.0-SNAPSHOT.jar,mysql-connector-java-8.0.29.jar
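
One way to confirm that the connector is visible before running a write job is a small classpath check. The sketch below is not part of TiSpark: the object name DriverCheck and the helper isOnClasspath are hypothetical, and com.mysql.cj.jdbc.Driver is assumed to be the driver class shipped in mysql-connector-java 8.x.

```scala
import scala.util.Try

object DriverCheck {
  // Assumption: mysql-connector-java 8.x exposes its JDBC driver under this class name.
  val DriverClass = "com.mysql.cj.jdbc.Driver"

  /** Returns true when the named class can be loaded from the current classpath. */
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  def main(args: Array[String]): Unit = {
    if (isOnClasspath(DriverClass))
      println(s"$DriverClass is on the classpath.")
    else
      println(s"$DriverClass is missing: add the mysql-connector-java jar via --jars or the Spark jars directory.")
  }
}
```

Running the same check inside the Spark driver (for example from spark-shell started with the same --jars list) shows whether the jar was actually picked up by that job.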

Feature Support

Feature Support                   | TiSpark 2.4.x | TiSpark 2.5.x | TiSpark 3.0.x | TiSpark master
SQL select without tidb_catalog   | ✔             | ✔             |               |
SQL select with tidb_catalog      |               | ✔             | ✔             | ✔
SQL delete from with tidb_catalog |               |               | ✔             | ✔
DataFrame append                  | ✔             | ✔             | ✔             | ✔
DataFrame reads                   | ✔             | ✔             | ✔             | ✔

See the TiSpark feature documentation for more details.

Limitations

  • TiDB has supported views since tidb-3.0, but TiSpark currently does not. Users cannot observe or access data through a view with TiSpark.

  • The Spark config spark.sql.runSQLOnFiles should not be set to false, or you may get the error "Error in query: Table or view not found".

  • Conditions written in the "{db}.{table}.{colname}" style are not supported, e.g. select * from t where db.t.col1 = 1.

  • Null in aggregation is not supported, e.g. select sum(null) from t group by col1.

  • The dependency tispark-assembly should not be packaged into a JAR-of-JARs file (for example, one built with spring-boot-maven-plugin), or you will get a ClassNotFoundException. To work around this, add spark-wrapper-spark-version to your dependencies or build a different form of jar file.

  • TiSpark does not support the GBK character set.

  • TiSpark does not support all collations. Currently, TiSpark only supports the following collations: utf8_bin, utf8_general_ci, utf8_unicode_ci, utf8mb4_bin, utf8mb4_general_ci and utf8mb4_unicode_ci.

  • If spark.sql.ansi.enabled is false, an overflow of sum(bigint) does not cause an error but "wraps" the result. To avoid the overflow, cast bigint to decimal.

  • TiSpark supports retrieving data from a table with an Expression Index, but the TiSpark planner does not use the Expression Index.
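
The sum(bigint) limitation above can be illustrated without Spark. The sketch below uses plain Scala Long arithmetic as a stand-in for a non-ANSI bigint sum, and BigDecimal as a stand-in for the cast to decimal. The object and method names are hypothetical, and the code only models the wrap-around, not TiSpark's or Spark's implementation.

```scala
object SumOverflow {
  /** Sums as 64-bit longs; an overflow wraps around silently instead of failing. */
  def wrappingSum(values: Seq[Long]): Long = values.sum

  /** Widens each value to BigDecimal first, mirroring a cast from bigint to decimal. */
  def decimalSum(values: Seq[Long]): BigDecimal = values.map(BigDecimal(_)).sum

  def main(args: Array[String]): Unit = {
    val values = Seq(Long.MaxValue, 1L)
    println(wrappingSum(values)) // prints -9223372036854775808 (wrapped)
    println(decimalSum(values))  // prints 9223372036854775808 (exact)
  }
}
```

In a query, the equivalent precaution is to apply the cast inside the aggregate, so that the addition itself happens on decimals rather than on 64-bit integers.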

Follow us

Twitter

@PingCAP

Forums

For English users, go to TiDB internals.

For Chinese users, go to AskTUG.

License

TiSpark is under the Apache 2.0 license. See the LICENSE file for details.
