Library for exploring and validating machine learning data

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer that shows which features have anomalies and gives enough detail to correct them.
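The features listed above map onto a small number of library calls. The following is a minimal sketch rather than the canonical workflow: the function name `validate_csv` and both file-path parameters are hypothetical, and it assumes the data is stored as CSV files with a header row.

```python
def validate_csv(train_csv: str, eval_csv: str):
    """Infer a schema from training data and check evaluation data against it."""
    # Imported lazily so this module loads even where TFDV is not installed;
    # calling the function requires `pip install tensorflow-data-validation`.
    import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics for each dataset (runs on Apache Beam).
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # 2. Infer a schema (types, domains, presence) from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # 3. Check the evaluation statistics against that schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the viewers mentioned in the list are available through `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies`.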

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.

NOTE: These nightly packages are unstable and breakages are likely. A fix can take a week or more, depending on the complexity involved.
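When the nightly install is driven from a script instead of a shell, the environment variable and the extra index URL both need to reach pip. The sketch below only builds the command and environment; the helper name `nightly_install_invocation` is hypothetical, and nothing is installed until the result is actually executed.

```python
import os
import sys

# Index URL taken from the install command above.
NIGHTLY_INDEX = "https://pypi-nightly.tensorflow.org/simple"


def nightly_install_invocation(base_env=None):
    """Return (command, environment) for installing nightly TFDV with pip."""
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Ask TFDV's setup to depend on nightly TFX-BSL and TFMD as well.
    env["TFX_DEPENDENCY_SELECTOR"] = "NIGHTLY"
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--extra-index-url", NIGHTLY_INDEX,
        "tensorflow-data-validation",
    ]
    return cmd, env
```

Passing the pair to `subprocess.run(cmd, env=env, check=True)` would then perform the install in the current interpreter's environment.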

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install docker and docker-compose by following their official installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {38, 39}.

A wheel will be produced under dist/.
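The PYTHON_VERSION value is the interpreter's major and minor version run together (38 for Python 3.8, 39 for Python 3.9). A small sketch for deriving it from a running interpreter follows; the helper names are hypothetical and the supported set simply mirrors the values listed above.

```python
import sys

# Values accepted by the Docker build, as listed in the instructions above.
SUPPORTED_TAGS = ("38", "39")


def python_version_tag(version_info=None):
    """Return the PYTHON_VERSION tag, e.g. '39' for Python 3.9."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return f"{major}{minor}"


def require_supported_tag(tag):
    """Return the tag unchanged, or raise if the build does not list it."""
    if tag not in SUPPORTED_TAGS:
        raise ValueError(
            f"PYTHON_VERSION={tag} is not one of {', '.join(SUPPORTED_TAGS)}"
        )
    return tag
```

Calling `require_supported_tag(python_version_tag())` before starting the build gives an early error when the local interpreter is not one the build targets.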

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is specific to a Python version. To build a pip package for a particular Python version, use that version's Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 12.5 (Monterey) or later.
  • Ubuntu 20.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV relies on it for efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
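Switching between local and distributed execution is done through Beam pipeline flags. The sketch below only assembles the flag list; the helper name `beam_flags` is hypothetical, and treating project, region, and temp_location as mandatory for Dataflow is an assumption about a typical setup, not a rule stated here.

```python
def beam_flags(runner="DirectRunner", project=None, region=None,
               temp_location=None):
    """Build Apache Beam command-line flags for local or Dataflow runs."""
    flags = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        # Assumption: a Dataflow job needs these three settings.
        required = (("project", project), ("region", region),
                    ("temp_location", temp_location))
        missing = [name for name, value in required if not value]
        if missing:
            raise ValueError(f"DataflowRunner needs: {', '.join(missing)}")
        flags += [f"--{name}={value}" for name, value in required]
    return flags
```

The resulting list can be wrapped in `apache_beam.options.pipeline_options.PipelineOptions(flags)` and passed as the `pipeline_options` argument of `tfdv.generate_statistics_from_csv`.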

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can use vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl
GitHub master | 2.40.0 | 6.0.0 | nightly (2.x) | 1.13.1 | n/a | 1.13.0
1.13.0 | 2.40.0 | 6.0.0 | 2.12 | 1.13.1 | n/a | 1.13.0
1.12.0 | 2.40.0 | 6.0.0 | 2.11 | 1.12.0 | n/a | 1.12.0
1.11.0 | 2.40.0 | 6.0.0 | 1.15 / 2.10 | 1.11.0 | n/a | 1.11.0
1.10.0 | 2.40.0 | 6.0.0 | 1.15 / 2.9 | 1.10.0 | n/a | 1.10.1
1.9.0 | 2.38.0 | 5.0.0 | 1.15 / 2.9 | 1.9.0 | n/a | 1.9.0
1.8.0 | 2.38.0 | 5.0.0 | 1.15 / 2.8 | 1.8.0 | n/a | 1.8.0
1.7.0 | 2.36.0 | 5.0.0 | 1.15 / 2.8 | 1.7.0 | n/a | 1.7.0
1.6.0 | 2.35.0 | 5.0.0 | 1.15 / 2.7 | 1.6.0 | n/a | 1.6.0
1.5.0 | 2.34.0 | 5.0.0 | 1.15 / 2.7 | 1.5.0 | n/a | 1.5.0
1.4.0 | 2.32.0 | 4.0.1 | 1.15 / 2.6 | 1.4.0 | n/a | 1.4.0
1.3.0 | 2.32.0 | 2.0.0 | 1.15 / 2.6 | 1.2.0 | n/a | 1.3.0
1.2.0 | 2.31.0 | 2.0.0 | 1.15 / 2.5 | 1.2.0 | n/a | 1.2.0
1.1.1 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.1
1.1.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.0
1.0.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.0.0 | n/a | 1.0.0
0.30.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.30.0 | n/a | 0.30.0
0.29.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.29.0 | n/a | 0.29.0
0.28.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.28.0 | n/a | 0.28.1
0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0
0.26.1 | 2.28.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0
0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0
0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0
0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1
0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0
0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0
0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0
0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1
0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1
0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0
0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3
0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3
0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0
0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a
0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a
0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a
0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a
0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a
0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a
0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a
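One way to use the table is to turn a row into pinned requirements for a reproducible environment. The sketch below hard-codes three rows copied from the table; the names `COMPATIBLE` and `pinned_requirements` are hypothetical. TensorFlow is left out because the table gives a release series rather than an exact version, and exact pins may be stricter than what the packages themselves require.

```python
# Three rows copied from the compatibility table above (TFDV version -> deps).
COMPATIBLE = {
    "1.13.0": {"apache-beam[gcp]": "2.40.0", "pyarrow": "6.0.0",
               "tensorflow-metadata": "1.13.1", "tfx-bsl": "1.13.0"},
    "1.12.0": {"apache-beam[gcp]": "2.40.0", "pyarrow": "6.0.0",
               "tensorflow-metadata": "1.12.0", "tfx-bsl": "1.12.0"},
    "1.9.0": {"apache-beam[gcp]": "2.38.0", "pyarrow": "5.0.0",
              "tensorflow-metadata": "1.9.0", "tfx-bsl": "1.9.0"},
}


def pinned_requirements(tfdv_version):
    """Return requirements.txt-style lines for a tested combination."""
    try:
        deps = COMPATIBLE[tfdv_version]
    except KeyError:
        raise KeyError(f"no table row recorded for TFDV {tfdv_version}") from None
    lines = [f"tensorflow-data-validation=={tfdv_version}"]
    # Dict insertion order is preserved, so output follows the table columns.
    lines += [f"{name}=={version}" for name, version in deps.items()]
    return lines
```

For example, `"\n".join(pinned_requirements("1.13.0"))` yields text that can be written to a requirements file.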

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
