  • Stars: 169
  • Rank: 220,130 (Top 5 %)
  • Language: Java
  • License: Apache License 2.0
  • Created over 10 years ago
  • Updated 7 months ago

Repository Details

The source repository of the Metanome tool

Metanome

The Metanome project is a joint project between the Hasso-Plattner-Institut (HPI) and the Qatar Computing Research Institute (QCRI). Metanome provides a fresh view on data profiling by developing and integrating efficient algorithms into a common tool, expanding on the functionality of data profiling, and addressing performance and scalability issues for Big Data. A vision of the project appears in SIGMOD Record: "Data Profiling Revisited". The BibTeX, EndNote, and ACM Ref citations for referencing the Metanome project in scientific works are available on the Metanome project website.

The Metanome tool is supplied under the Apache License. You can use and extend the tool to develop your own profiling algorithms. The profiling algorithms that we provide on our Algorithm releases page and in the metanome-algorithms repository are covered by the Apache License as well.

The Metanome platform itself is a backend service that communicates over an HTTP REST API with a user-facing web frontend. This frontend is provided in a separate Metanome Frontend repository. Building the Metanome tool therefore requires you to pull the frontend as a submodule (see below).

Building Metanome Locally

Metanome is a Java Maven project. To build the sources, the following development tools are needed:

  1. Java JDK 1.8 or later
  2. Maven 3.1.0
  3. Git

Make sure that all three are on your system's PATH variable when running the build.

Note: The build might have issues with the most recent JDK versions, so we recommend a Java version between 8 and 11.
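The prerequisites above can be checked before starting a build. The following is a minimal sketch, not part of the Metanome repository: the helper names (missing_tools, java_major, java_version_recommended, check_build_env) are hypothetical, and the script assumes that java -version prints the version as the first quoted string on stderr, which is the common format but not guaranteed for every JDK vendor.

```shell
#!/bin/sh
# Sketch of a pre-build check; all function names here are hypothetical helpers.

# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Map a Java version string to its major version,
# e.g. "1.8.0_292" gives 8 and "11.0.2" gives 11.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;
  esac
}

# Succeed only if the version lies in the recommended range 8 to 11.
java_version_recommended() {
  major=$(java_major "$1")
  [ "$major" -ge 8 ] 2>/dev/null && [ "$major" -le 11 ]
}

# Run both checks against the local installation.
check_build_env() {
  missing=$(missing_tools java mvn git)
  if [ -n "$missing" ]; then
    echo "Not on PATH: $missing" >&2
    return 1
  fi
  # java -version writes to stderr; the version is the first quoted string.
  version=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  java_version_recommended "$version" ||
    echo "Java $version is outside the recommended range 8 to 11" >&2
}
```

Sourcing the file and calling check_build_env reports missing tools as an error and an out-of-range Java version as a warning only, since the note above is a recommendation rather than a hard requirement.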

Pull Metanome Frontend Submodule

Before executing the build, you have to clone the Metanome Frontend into the project:

git submodule init
git submodule update

Build Metanome

Metanome can be built by executing:

mvn clean install

or, for a parallel build:

mvn -T 1C clean install

If the frontend build fails due to missing or incompatible Angular packages, it often helps to re-run the build.

When the build has finished, Metanome can be packaged together with a Tomcat webserver, some test data, and some test algorithms. To speed up builds, this package is not created in the default Maven profile. The deployment package can be created by executing the build with the deployment-local profile:

mvn verify -P deployment-local

or by executing package on the deployment project directly:

mvn -f deployment/pom.xml package

Note that if Metanome has not been installed before creating the package (via mvn clean install), dependencies will be retrieved online, which can result in an outdated package!
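To avoid packaging against artifacts retrieved online, the two commands can be chained so that packaging only runs after a successful local install. This is a small sketch using only the commands given above; the function name build_deployment_package is a hypothetical wrapper.

```shell
# Install all modules first, then package the deployment project,
# so the package is built from the freshly installed local artifacts.
# The && ensures packaging is skipped if the install step fails.
build_deployment_package() {
  mvn clean install && mvn -f deployment/pom.xml package
}
```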

To start the Metanome frontend you then have to execute the following steps in the deployment folder:

  1. Unzip deployment/target/deployment-1.1-SNAPSHOT-package_with_tomcat.zip
  2. Go into the unzipped folder and start the run script, either run.sh or run.bat (Windows systems)
  3. Open a browser at http://localhost:8080/
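On a Unix-like system, the first two steps above can be sketched as a small shell function. This is an illustration under assumptions: the function name start_metanome and the default target directory metanome-deployment are hypothetical, and because the layout inside the archive is not described here, the sketch searches for run.sh instead of assuming its location.

```shell
# Unpack the deployment package and start the bundled Tomcat via run.sh.
# $1: path to the package zip
# $2: target directory (hypothetical default: metanome-deployment)
start_metanome() {
  package=$1
  target=${2:-metanome-deployment}
  if [ ! -f "$package" ]; then
    echo "Package not found: $package (run the deployment build first)" >&2
    return 1
  fi
  unzip -q "$package" -d "$target" || return 1
  # Locate run.sh rather than assuming where the archive places it.
  script=$(find "$target" -name run.sh | head -n 1)
  if [ -z "$script" ]; then
    echo "run.sh not found in $target" >&2
    return 1
  fi
  (cd "$(dirname "$script")" && sh ./run.sh)
}
```

A call such as start_metanome deployment/target/deployment-1.1-SNAPSHOT-package_with_tomcat.zip from the project root would then leave only the last step, opening http://localhost:8080/ in a browser.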

Downloads

All Metanome releases can be found on the Metanome releases page.

Current profiling algorithms are available at the Algorithm releases page. The sources of all these algorithms are available on GitHub in the metanome-algorithms repository.

Developing a profiling algorithm for Metanome

If you want to build your own profiling algorithm for the Metanome tool, the best way to get started is our Skeleton Project. It contains an algorithm frame and a test runner project, with which you can run and test your code (without a running Metanome tool instance). For more details, check out the contained README.txt file.

Since many profiling algorithms use similar techniques for the discovery of dependencies, it's worth checking out the following resources as well:

Documentation

Documentation of the Metanome tool, as well as information for algorithm developers and contributors to the project, can be found in the GitHub wiki.

Deploy Metanome Remote

It is possible to deploy Metanome using PaaS providers such as Amazon Beanstalk, Heroku, or Google App Engine. We provide additional configs and documentation on how to deploy Metanome on these platforms in the GitHub wiki.

Development

The Metanome modules are continuously deployed to Sonatype and can be used by adding the following repository:

<repositories>
    <repository>
        <id>snapshots-repo</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>
</repositories>

Git Commit Hooks

The project uses license-maintainer as a pre-commit Git hook to keep the license information in all Java, XML, and Python files up to date. To use it, execute the ./add_hooks.sh shell script, which creates a pre-commit hook symlink to the license-maintainer script.

Coding style

The project follows the google-styleguide. Please make sure that all contributions adhere to the correct format. Formatting settings for common IDEs can be found at: http://code.google.com/p/google-styleguide/ All files should contain the Apache copyright header, which can be found in the COPYRIGHT_HEADER file.

More Repositories

  1. TimeEval-algorithms (Python, 60 stars): Time series anomaly detection algorithm implementations for TimeEval (Docker-based)
  2. metanome-algorithms (Java, 45 stars): Source code for several Metanome data profiling algorithms
  3. timeeval-evaluation-paper (44 stars): Supporting material and website for the paper "Anomaly Detection in Time Series: A Comprehensive Evaluation"
  4. TimeEval (Jupyter Notebook, 38 stars): Evaluation Tool for Anomaly Detection Algorithms on Time Series
  5. gutentag (Python, 37 stars): GutenTAG is an extensible tool to generate time series datasets with and without anomalies.
  6. snowman (TypeScript, 37 stars): Welcome to Snowman App – a Data Matching Benchmark Platform.
  7. TimeEval-GUI (Python, 14 stars): [Read-Only Mirror] Benchmarking Toolkit for Time Series Anomaly Detection Algorithms using TimeEval and GutenTAG
  8. inclusion-dependency-algorithms (Java, 12 stars): This repository provides the implementation of several well-known INDs discovery algorithms
  9. akka-tutorial (Java, 11 stars): Code for the Akka tutorial
  10. Quagga (Python, 10 stars): An email segmentation system (reference implementation of ECIR 2018 paper)
  11. QuaggaLib (Python, 9 stars): An Email Segmentation System
  12. DADS (Java, 9 stars): Distributed detection of sequential anomalies in univariate time series
  13. Pollock (Python, 8 stars): Pollock is a benchmark for data loading on character-delimited files.
  14. Mondrian (Python, 7 stars): Code repository for Mondrian, a project for multiregion template recognition in spreadsheets.
  15. DECENT (Python, 7 stars)
  16. enno (JavaScript, 6 stars): Text Annotation tool that is hopefully less painful to use than others
  17. ProLOD (Scala, 5 stars): ProLOD++ contains algorithms to perform data profiling on Linked Data.
  18. DQ4AI (Jupyter Notebook, 4 stars): Experimental study of the effects of data quality dimensions on machine learning performance
  19. ELEX (JavaScript, 4 stars): A graph exploration tool that makes it easier to understand graphs.
  20. DataGossip (Jupyter Notebook, 3 stars): DataGossip is an extension for asynchronous distributed data parallel machine learning that improves the training on imbalanced partitions.
  21. NumbER (Python, 3 stars): Entity Resolution for Numerical Data
  22. ExtracTable (Jupyter Notebook, 2 stars): Extract tables from Plain-Text Files.
  23. S2Gpp (Rust, 2 stars)
  24. GAP-Gender-Analysis-for-Publications (Python, 2 stars): The code behind the platform csgender.org
  25. GenderAnalysis (Python, 2 stars): An analysis of gender distribution in scientific publications
  26. wikipedia_cleanup (Jupyter Notebook, 2 stars): MP 21/22
  27. Sawfish (Java, 2 stars): sIND Discovery
  28. s2gpp_experiments (Python, 2 stars): Experiments for Series2Graph++ Paper
  29. ner-text-quality-impact (Python, 1 star)
  30. ChangeTimeSeriesClustering (Scala, 1 star): Scala-Spark Framework to cluster changes in data
  31. winepi_serial_length2 (Jupyter Notebook, 1 star)
  32. flink-kmeans (Scala, 1 star)
  33. Metanome-Frontend (CSS, 1 star): Frontend for the Metanome Project
  34. SURAGH (Java, 1 star): The source repository of the SURAGH.
  35. cmt_statistics_tool (Python, 1 star): CMTStat - a CMT Statistics Tool
  36. spark-tutorial (Scala, 1 star): Code for the Spark tutorial
  37. spark-kmeans (Scala, 1 star)
  38. TAHARAT (Java, 1 star)
  39. wikipediatablevandalism (Jupyter Notebook, 1 star): A vandalism detection system for table edits on Wikipedia
  40. art-ner-dataset (Python, 1 star): Data and code from the paper "Generation of Training Data for Named Entity Recognition of Artworks"
  41. DBS1-Exercise (Kotlin, 1 star)
  42. Armadillo (Python, 1 star): Table Overlap Approximation and Datasets