• Stars
    star
    8,376
  • Rank 4,387 (Top 0.09 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created about 8 years ago
  • Updated 3 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Automatic extraction of relevant features from time series:

tsfresh

Documentation Status Build Status codecov license Binder Downloads

This repository contains the TSFRESH python package. The abbreviation stands for

"Time Series Feature extraction based on scalable hypothesis tests".

The package provides systematic time-series feature extraction by combining established algorithms from statistics, time-series analysis, signal processing, and nonlinear dynamics with a robust feature selection algorithm. In this context, the term time-series is interpreted in the broadest possible sense, such that any types of sampled data or even event sequences can be characterised.

Spend less time on feature engineering

Data Scientists often spend most of their time either cleaning data or building features. While we cannot change the first thing, the second can be automated. TSFRESH frees your time spent on building features by extracting them automatically. Hence, you have more time to study the newest deep learning paper, read hacker news or build better models.

Automatic extraction of 100s of features

TSFRESH automatically extracts 100s of features from time series. Those features describe basic characteristics of the time series such as the number of peaks, the average or maximal value or more complex features such as the time reversal symmetry statistic.

The features extracted from a exemplary time series

The set of features can then be used to construct statistical or machine learning models on the time series to be used for example in regression or classification tasks.

Forget irrelevant features

Time series often contain noise, redundancies or irrelevant information. As a result most of the extracted features will not be useful for the machine learning task at hand.

To avoid extracting irrelevant features, the TSFRESH package has a built-in filtering procedure. This filtering procedure evaluates the explaining power and importance of each characteristic for the regression or classification tasks at hand.

It is based on the well developed theory of hypothesis testing and uses a multiple test procedure. As a result the filtering process mathematically controls the percentage of irrelevant extracted features.

The TSFRESH package is described in the following open access paper:

  • Christ, M., Braun, N., Neuffer, J., and Kempa-Liehr A.W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing 307, p. 72-77, doi: 10.1016/j.neucom.2018.03.067.

The FRESH algorithm is described in the following whitepaper:

  • Christ, M., Kempa-Liehr, A.W., and Feindt, M. (2017). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-print 1610.07717, https://arxiv.org/abs/1610.07717.

Systematic time-series feature extraction even works for unsupervised problems:

  • Teh, H.Y., Wang, K.I-K., Kempa-Liehr, A.W. (2021). Expect the Unexpected: Unsupervised feature selection for automated sensor anomaly detection. IEEE Sensors Journal 15.16, p. 18033-18046, doi: 10.1109/JSEN.2021.3084970.

Due to the fact that tsfresh basically provides time-series feature extraction for free, you can now concentrate on engineering new time-series, like e.g. differences of signals from synchronous measurements, which provide even better time-series features:

  • Kempa-Liehr, A.W., Oram, J., Wong, A., Finch, M., Besier, T. (2020). Feature engineering workflow for activity recognition from synchronized inertial measurement units. In: Pattern Recognition. ACPR 2019. Ed. by M. Cree et al. Vol. 1180. Communications in Computer and Information Science (CCIS). Singapore: Springer, p. 223–231. doi: 10.1007/978-981-15-3651-9_20.

  • Simmons, S., Jarvis, L., Dempsey, D., Kempa-Liehr, A.W. (2021). Data Mining on Extremely Long Time-Series. In: 2021 International Conference on Data Mining Workshops (ICDMW). Ed. by B. Xue et al. Los Alamitos: IEEE, p. 1057-1066. doi: 10.1109/ICDMW53433.2021.00137.

Systematic time-series features engineering allows to work with time-series samples of different lengths, because every time-series is projected into a well-defined feature space. This approach allows the design of robust machine learning algorithms in applications with missing data.

  • Kennedy, A., Gemma, N., Rattenbury, N., Kempa-Liehr, A.W. (2021). Modelling the projected separation of microlensing events using systematic time-series feature engineering. Astronomy and Computing 35.100460, p. 1–14, doi: 10.1016/j.ascom.2021.100460

Is your time-series classification problem imbalanced? There is a good chance that undersampling of time-series feature matrices might solve your problem:

  • Dempsey, D.E., Cronin, S.J., Mei, S., Kempa-Liehr, A.W. (2020). Automatic precursor recognition and real-time forecasting of sudden explosive volcanic eruptions at Whakaari, New Zealand. Nature Communications 11.3562, p. 1-8, doi: 10.1038/s41467-020-17375-2.

Natural language processing of written texts is an example of applying systematic time-series feature engineering to event sequences, which is described in the following open access paper:

  • Tang, Y., Blincoe, K., Kempa-Liehr, A.W. (2020). Enriching Feature Engineering for Short Text Samples by Language Time Series Analysis. EPJ Data Science 9.26, p. 1–59. doi: 10.1140/epjds/s13688-020-00244-9

Advantages of tsfresh

TSFRESH has several selling points, for example

  1. it is field tested
  2. it is unit tested
  3. the filtering process is statistically/mathematically correct
  4. it has a comprehensive documentation
  5. it is compatible with sklearn, pandas and numpy
  6. it allows anyone to easily add their favorite features
  7. it both runs on your local machine or even on a cluster

Next steps

If you are interested in the technical workings, go to see our comprehensive Read-The-Docs documentation at http://tsfresh.readthedocs.io.

The algorithm, especially the filtering part are also described in the paper mentioned above.

We appreciate any contributions, if you are interested in helping us to make TSFRESH the biggest archive of feature extraction methods in python, just head over to our How-To-Contribute instructions.

If you want to try out tsfresh quickly or if you want to integrate it into your workflow, we also have a docker image available:

docker pull nbraun/tsfresh

Backwards compatibility

If you need to reproduce or update time-series features, which were computed with the matrixprofile feature calculators, you need to create a Python 3.8 environment:

conda create --name tsfresh__py_3.8 python=3.8
conda activate tsfresh__py_3.8
pip install tsfresh[matrixprofile]

Acknowledgements

The research and development of TSFRESH was funded in part by the German Federal Ministry of Education and Research under grant number 01IS14004 (project iPRODICT).

More Repositories

1

turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
C++
612
star
2

devpi-builder

Automatically build and upload python packages to devpi
Python
76
star
3

bonfire

A CLI Graylog Client with Follow Mode
Python
72
star
4

azure-cost-mon

Prometheus exporter for the Azure billing API
Python
61
star
5

documents

Slides produced by Engineers and Data Scientists of Blue Yonder
Jupyter Notebook
51
star
6

airflow-plugin-demo

Example for an airflow plugin
Python
50
star
7

mesos-threshold-oversubscription

Simple, threshold-based oversubscription modules for Apache Mesos
C++
43
star
8

pydse

Python package for dynamic system estimation of time series
Python
40
star
9

postgraas_server

Postgraas is a super simple PostgreSQL-as-a-service
Python
29
star
10

urlquote

Fast url quoting / unquoting for Python
Rust
19
star
11

devpi-cleaner

Batch removal of (development) packages from devpi servers
Python
18
star
12

vikos

A machine learning library for supervised training of parametrized models
Rust
10
star
13

devpi-plumber

devpi-plumber helps to automate and test large devpi installations
Python
10
star
14

presentation_data_science_in_python

Material for the talk "Data Science in Python"
Jupyter Notebook
8
star
15

di-csv2xml

di-csv2xml - a command line tool for converting CSV to Blue Yonder Supply & Demand API compatible XML
Rust
8
star
16

cee_syslog_handler

Python Syslog Logging Handler with CEE Support
Python
7
star
17

gridka2018-dask-workshop

Jupyter Notebook
5
star
18

grammar-of-graphics-minard-example

Recreation of Charles Minard's Data Visualisation in Python using Altair
Python
5
star
19

medbot

Repository of the EuroPython 2015 session: "It's about time to take your medication!" or how to write a friendly reminder bot ;-)
HTML
5
star
20

qsqla

qSQLA is a query builder for SQLAlchemy Core Selectables
Python
4
star
21

NeurobayesCppInterface

Interface to NeuroBayes(R) for academic purpose
C++
3
star
22

typecase

Library providing Algebraic Data Types for Python.
Python
3
star
23

vmprof-viewer-client

Statistical Python profiler with interactive Web UI
Python
3
star
24

vmprof-viewer-server

Statistical Python profiler with interactive Web UI
HTML
1
star
25

thumbling

Python
1
star
26

tdd_exercise_roman

A small exercise for practicing test-driven development (TDD)
1
star
27

ansible-remove-omit-entries

Omit for dictionary and list entries.
Python
1
star