• This repository has been archived on 13/Aug/2024

Repository Details

A system to programmatically run data pipelines

Factotum

A DAG-running tool designed to run complex jobs with non-trivial dependency trees efficiently.

The zen of Factotum

  1. A Turing-complete job is not a job, it's a program
  2. A job must be composable from other jobs
  3. A job exists independently of any job schedule

User quickstart

Assuming you're running 64-bit Linux:

wget https://github.com/snowplow/factotum/releases/download/0.6.0/factotum_0.6.0_linux_x86_64.zip
unzip factotum_0.6.0_linux_x86_64.zip
./factotum --version

Factotum's run command requires one argument: a factfile that describes the job to run. For example, to run the sample sleep.factfile:

wget https://raw.githubusercontent.com/snowplow/factotum/master/samples/sleep.factfile
./factotum run sleep.factfile

To set variables used in the job file, pass --env JSON (or -e JSON). The JSON is free-form and must correspond to the placeholders you've set in your job.

For example, the following will print "hello world!":

wget https://raw.githubusercontent.com/snowplow/factotum/master/samples/variables.factfile
./factotum run variables.factfile --env '{ "message": "hello world!" }'
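When the variables come from another program, quoting the JSON by hand is error-prone. The following is a minimal Python sketch that builds the same command line, assuming the factotum binary sits in the current directory as in the quickstart above. The helper name build_run_command is hypothetical and is not part of Factotum.

```python
import json
import shlex


def build_run_command(factfile, env):
    """Return a shell-safe command line for `factotum run` with --env.

    `build_run_command` is a hypothetical helper for illustration only.
    """
    env_json = json.dumps(env)  # serialize the variables as a JSON object
    args = ["./factotum", "run", factfile, "--env", env_json]
    # shlex.quote wraps any argument containing shell-special characters
    return " ".join(shlex.quote(arg) for arg in args)


print(build_run_command("variables.factfile", {"message": "hello world!"}))
```

The printed line can be pasted into a shell. To launch the job directly from Python instead, the same argument list could be handed to subprocess.run without any quoting step.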

Starting from an arbitrary task can be done using the --start TASK or -s TASK arguments, where TASK is the name of the task you'd like to start at.

For example, to start at the "echo beta" task in this job, you can run the following:

wget https://raw.githubusercontent.com/snowplow/factotum/master/samples/echo.factfile
./factotum run echo.factfile --start "echo beta"

To get a quick overview of the options provided, you can use the --help or -h argument:

./factotum --help

For more information on this file format and how to write your own jobs, see the Factfile format section below.

Factfile format

Factfiles are self-describing JSON which declare a series of tasks and their dependencies. For example:

{
    "schema": "iglu:com.snowplowanalytics.factotum/factfile/jsonschema/1-0-0",
    "data": {
        "name": "Factotum demo",
        "tasks": [
            {
                "name": "echo alpha",
                "executor": "shell",
                "command": "echo",
                "arguments": [ "alpha" ],
                "dependsOn": [],
                "onResult": {
                    "terminateJobWithSuccess": [],
                    "continueJob": [ 0 ]
                }
            },
            {
                "name": "echo beta",
                "executor": "shell",
                "command": "echo",
                "arguments": [ "beta" ],
                "dependsOn": [ "echo alpha" ],
                "onResult": {
                    "terminateJobWithSuccess": [],
                    "continueJob": [ 0 ]
                }
            },
            {
                "name": "echo omega",
                "executor": "shell",
                "command": "echo",
                "arguments": [ "and omega!" ],
                "dependsOn": [ "echo beta" ],
                "onResult": {
                    "terminateJobWithSuccess": [],
                    "continueJob": [ 0 ]
                }
            }
        ]
    }
}

This example defines three tasks that run shell commands: echo alpha, echo beta and echo omega. echo alpha has no dependencies, so it will run immediately. echo beta depends on the echo alpha task, and so will wait for echo alpha to complete. echo omega depends on the echo beta task, and so will wait for echo beta to complete before executing.
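Because factfiles are plain JSON, they can also be generated by a script. The following is a minimal Python sketch that produces a chain of echo tasks with the same fields as the example above. The function name echo_chain_factfile is hypothetical, and the generated task arguments are simplified (the third task echoes "omega" rather than "and omega!").

```python
import json

# Schema URI copied from the example factfile above
SCHEMA = "iglu:com.snowplowanalytics.factotum/factfile/jsonschema/1-0-0"


def echo_chain_factfile(job_name, words):
    """Build a factfile dict in which each echo task depends on the previous one.

    `echo_chain_factfile` is a hypothetical helper, not part of Factotum.
    """
    tasks = []
    previous = None
    for word in words:
        name = "echo " + word
        tasks.append({
            "name": name,
            "executor": "shell",
            "command": "echo",
            "arguments": [word],
            # the first task has no dependencies; later ones wait on their predecessor
            "dependsOn": [previous] if previous else [],
            "onResult": {"terminateJobWithSuccess": [], "continueJob": [0]},
        })
        previous = name
    return {"schema": SCHEMA, "data": {"name": job_name, "tasks": tasks}}


factfile = echo_chain_factfile("Factotum demo", ["alpha", "beta", "omega"])
print(json.dumps(factfile, indent=4))
```

Writing the printed JSON to a file gives something that can be passed to ./factotum run in the same way as the sample factfiles.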

Given the above, the tasks will be executed in the following sequence: echo alpha, echo beta and finally echo omega. Tasks can have multiple dependencies in Factotum, and tasks that are parallelizable will be run concurrently. Check out the samples for more factfiles, or the wiki for a more complete description of the factfile format.
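To make the ordering concrete, the following Python sketch groups tasks into "waves": every task in a wave has all of its dependencies satisfied by earlier waves, so the tasks within one wave could in principle run concurrently. This is an illustration of dependency ordering under that assumption, not Factotum's actual scheduler, and the function name execution_waves is hypothetical.

```python
def execution_waves(tasks):
    """Group task names into waves whose dependencies are all in earlier waves.

    Illustrative only: this is not Factotum's scheduler.
    """
    remaining = {t["name"]: set(t["dependsOn"]) for t in tasks}
    done = set()
    waves = []
    while remaining:
        # a task is ready once every dependency has already completed
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown dependency among: %s" % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


demo_tasks = [
    {"name": "echo alpha", "dependsOn": []},
    {"name": "echo beta", "dependsOn": ["echo alpha"]},
    {"name": "echo omega", "dependsOn": ["echo beta"]},
]
print(execution_waves(demo_tasks))
```

For the three-task example each wave holds a single task, which matches the strictly sequential order described above. If two tasks both depended only on echo alpha, they would land in the same wave.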

Developer quickstart

Factotum is written in Rust.

Using Vagrant

  • Clone this repository - git clone git@github.com:snowplow/factotum.git
  • cd factotum
  • Set up a Vagrant box and ssh into it - vagrant up && vagrant ssh
    • This will take a few minutes
  • cd /vagrant
  • Compile and run a demo - cargo run -- run samples/echo.factfile

Using stable Rust without Vagrant

  • Install Rust
    • on Linux/Mac - curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  • Clone this repository - git clone git@github.com:snowplow/factotum.git
  • cd factotum
  • Compile and run a demo - cargo run -- run samples/echo.factfile

Copyright and license

Factotum is copyright 2016-2021 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
