  • Stars: 333
  • Rank: 126,599 (Top 3 %)
  • Language: Python
  • License: MIT License
  • Created over 9 years ago
  • Updated almost 2 years ago

Repository Details

A light-weight wrapper library around Spotify's Luigi workflow library to make writing scientific workflows more fluent, flexible and modular

Project updates

  • Update Jan 7, 2023: Versions 0.10.0 and 0.10.1 are released, and should work well with at least Python 3.9 and Luigi 3.1.1. Please report any issues!
  • A paper with the motivation and design decisions behind SciLuigi is now available
    • If you use SciLuigi in your research, please cite it like this:
      Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. J Cheminform. 2016. doi:10.1186/s13321-016-0179-6.
  • A Virtual Machine with a realistic, runnable, example workflow in a Jupyter Notebook, is available here
  • Watch a 10 minute screencast going through the basics of using SciLuigi here
  • See a poster describing the motivations behind SciLuigi here

About SciLuigi

Scientific Luigi (SciLuigi for short) is a light-weight wrapper library around Spotify's Luigi workflow system that aims to make writing scientific workflows more fluent, flexible and modular.

Luigi is a flexible and fun-to-use library. It has turned out, though, that its default way of defining dependencies, by hard-coding them in each task's requires() function, is not optimal for some types of workflows common in e.g. bioinformatics, where multiple inputs and outputs, complex dependencies, and the need to quickly try different workflow connectivity in an explorative fashion are central to the way of working.

SciLuigi was designed to solve some of these problems, by providing the following "features" over vanilla Luigi:

  • Separation of dependency definitions from the tasks themselves, for improved modularity and composability.
  • Inputs and outputs implemented as separate fields, a.k.a. "ports", to allow specifying dependencies between specific input and output-targets rather than just between tasks. This is again to let such details of the network definition reside outside the tasks.
  • Because inputs and outputs are object fields, auto-completion can be used to ease the work of connecting the network (this works great with e.g. jedi-vim).
  • Inputs and outputs are connected with an intuitive "single-assignment syntax".
  • "Good default" high-level logging of workflow tasks and execution times.
  • Produces an easy to read audit-report with high level information per task.
  • Integration with some HPC workload managers. (So far only SLURM though).

Because of Luigi's easy-to-use API, these changes have been implemented as a very thin layer on top of Luigi's own API, with no changes at all to the Luigi core. This means that you can continue leveraging the work already being put into maintaining and further developing Luigi by the team at Spotify and others.

Workflow code quick demo

For a brief 10 minute screencast going through the basics below, see this link

Just to give a quick feel for what a workflow definition might look like in SciLuigi, check this code example (the implementation of the tasks is hidden here for brevity; see the Usage section further below for more details):

import sciluigi as sl

class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        # Initialize tasks:
        foowrt = self.new_task('foowriter', MyFooWriter)
        foorpl = self.new_task('fooreplacer', MyFooReplacer,
            replacement='bar')

        # Here we do the *magic*: Connecting outputs to inputs:
        foorpl.in_foo = foowrt.out_foo

        # Return the last task(s) in the workflow chain.
        return foorpl

That's it! And again, see the "usage" section just below for a more detailed description of getting to this!

Support: Getting help

Please use the issue queue for any support questions, rather than mailing the author(s) directly, as the solutions can then help others who face similar issues (we are a very small team with very limited time, so this is important).

Prerequisites

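  • Python 2.7 - 3.4 (versions 0.10.x are reported to also work with Python 3.9; see Project updates)
  • Luigi 1.3.x - 2.0.1 (versions 0.10.x are reported to also work with Luigi 3.1.1; see Project updates)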

Install

  1. Install SciLuigi, including its dependencies (luigi etc), through PyPI:

    pip install sciluigi
  2. Now you can use the library by just importing it in your python script, like so:

    import sciluigi

    Note that you can alias it to a shorter name, for brevity and to save keystrokes:

    import sciluigi as sl

Usage

Creating workflows in SciLuigi differs slightly from how it is done in vanilla Luigi. Very briefly, it is done in these main steps:

  1. Create a workflow task class
  2. Create task classes
  3. Add the workflow definition in the workflow class's workflow() method.
  4. Add a run method at the end of the script
  5. Run the script

Create a Workflow task

The first thing to do when creating a workflow is to define a workflow task.

You do this by:

  1. Creating a subclass of sciluigi.WorkflowTask
  2. Implementing the workflow() method.

Example:

import sciluigi

class MyWorkflow(sciluigi.WorkflowTask):
    def workflow(self):
        pass # TODO: Implement workflow here later!

Create tasks

Then, you need to define some tasks that can be done in this workflow.

This is done by:

  1. Creating a subclass of sciluigi.Task (or sciluigi.SlurmTask if you want Slurm support)
  2. Adding fields named in_<yournamehere> for each input, in the new task class
  3. Defining methods named out_<yournamehere>() for each output, each returning a sciluigi.TargetInfo object. (sciluigi.TargetInfo is initialized with a reference to the task object itself, typically self, and a path name, which can be built from the paths of upstream tasks.)
  4. Defining luigi parameters for the task.
  5. Implementing the run() method of the task.

Example:

Let's define a simple task that just writes "foo" to a file named foo.txt:

class MyFooWriter(sciluigi.Task):
    # We have no inputs here
    # Define outputs:
    def out_foo(self):
        return sciluigi.TargetInfo(self, 'foo.txt')
    def run(self):
        with self.out_foo().open('w') as foofile:
            foofile.write('foo\n')

Then, let's create a task that replaces "foo" with "bar":

class MyFooReplacer(sciluigi.Task):
    replacement = sciluigi.Parameter() # Here, we take as a parameter
                                  # what to replace foo with.
    # Here we have one input, a "foo file":
    in_foo = None
    # ... and an output, a "bar file":
    def out_replaced(self):
        # As the path to the returned target(info), we
        # use the path of the foo file:
        return sciluigi.TargetInfo(self, self.in_foo().path + '.bar.txt')
    def run(self):
        with self.in_foo().open() as in_f:
            with self.out_replaced().open('w') as out_f:
                # Here we see that we use the parameter self.replacement:
                out_f.write(in_f.read().replace('foo', self.replacement))

Instead of the last lines, we could have used the command-line sed utility, available on Linux, by invoking it through the built-in ex() method:

    def run(self):
        # Here, we use the in-built self.ex() method, to execute commands:
        self.ex("sed 's/foo/{repl}/g' {inpath} > {outpath}".format(
            repl=self.replacement,
            inpath=self.in_foo().path,
            outpath=self.out_replaced().path))
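
To make it easier to see what ends up being executed, the following plain-Python sketch builds the same command string outside of any task. The helper name build_sed_command and the example paths are hypothetical, chosen only for illustration; the format string itself is the one from the run() method above. The shlex.quote() call at the end is an optional extra, not something the example above does, shown because paths with spaces would otherwise be split by the shell.

```python
import shlex


def build_sed_command(replacement, inpath, outpath):
    """Hypothetical helper: fills in the same format string as the run() method above."""
    return "sed 's/foo/{repl}/g' {inpath} > {outpath}".format(
        repl=replacement,
        inpath=inpath,
        outpath=outpath)


# The output path is derived from the input path, as in out_replaced():
inpath = 'foo.txt'
outpath = inpath + '.bar.txt'

cmd = build_sed_command('bar', inpath, outpath)
print(cmd)  # sed 's/foo/bar/g' foo.txt > foo.txt.bar.txt

# Optional: quote a path that contains characters the shell treats specially.
quoted = shlex.quote('my data/foo.txt')
print(quoted)  # 'my data/foo.txt'
```

Note that the replacement string is pasted directly into the sed expression, so a value containing a slash would need escaping or a different sed delimiter.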

Write the workflow definition

Now we can use the two tasks we created to build a simple workflow in the workflow class we defined above.

We do this by:

  1. Instantiating the tasks, using the self.new_task(<unique_taskname>, <task_class>, *args, **kwargs) method of the workflow task.
  2. Connecting the tasks together, by pointing the right out_* method to the right in_* field.
  3. Returning the last task in the chain from the workflow method.

Example:

import sciluigi
class MyWorkflow(sciluigi.WorkflowTask):
    def workflow(self):
        foowriter = self.new_task('foowriter', MyFooWriter)
        fooreplacer = self.new_task('fooreplacer', MyFooReplacer,
            replacement='bar')

        # Here we do the *magic*: Connecting outputs to inputs:
        fooreplacer.in_foo = foowriter.out_foo

        # Return the last task(s) in the workflow chain.
        return fooreplacer
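
The single-assignment connection can look a bit magical, so here is a toy sketch in plain Python of the mechanics it relies on. It does not use SciLuigi and is not SciLuigi's implementation; the classes ToyTargetInfo, ToyFooWriter and ToyFooReplacer are hypothetical stand-ins. The point is only that assigning an out_* method (without calling it) to an in_* field gives the downstream task a callable that returns the upstream target, including a reference to the task that produces it.

```python
class ToyTargetInfo:
    """Toy stand-in for a target: remembers the producing task and a path."""
    def __init__(self, task, path):
        self.task = task
        self.path = path


class ToyFooWriter:
    def out_foo(self):
        return ToyTargetInfo(self, 'foo.txt')


class ToyFooReplacer:
    in_foo = None  # Input "port", unconnected until the workflow wires it

    def out_replaced(self):
        # The output path is derived from the upstream path:
        return ToyTargetInfo(self, self.in_foo().path + '.bar.txt')


writer = ToyFooWriter()
replacer = ToyFooReplacer()

# The connection: the bound method itself is assigned, it is not called here.
replacer.in_foo = writer.out_foo

upstream = replacer.in_foo()
print(upstream.path)                 # foo.txt
print(upstream.task is writer)       # True
print(replacer.out_replaced().path)  # foo.txt.bar.txt
```

Because the method is only called later, the upstream path is resolved at the time the downstream task asks for it, which is what lets the wiring live in the workflow definition rather than inside the tasks.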

Add a run method to the end of the script

Now, the only thing that remains is adding a run method to the end of the script.

You can use luigi's own luigi.run(), or our own two methods:

  1. sciluigi.run()
  2. sciluigi.run_local()

The run_local() method is handy if you don't want to run a central scheduler daemon, but just want to run the workflow as a script.

Both of the above take the same options as luigi.run(), so you can for example set the main class to use (our workflow task):

# End of script ....
if __name__ == '__main__':
    sciluigi.run_local(main_task_cls=MyWorkflow)

Run the workflow

Now, you should be able to run the workflow as simply as:

python myworkflow.py

... provided of course, that the workflow is saved in a file named myworkflow.py.

More Examples

See the examples folder for more detailed examples!

More links, background info etc.

The basic idea behind SciLuigi, and a preceding solution to it, was presented in a talk at the e-Infra MPS 2015 workshop:

See also this collection of links to more of our reported experiences using Luigi, which led up to the creation of SciLuigi.

Known Limitations

  • Changing the workflow scheduling based on data sent as parameters is not possible.
  • Starting a full branch of the workflow for each output of a task, when the number of outputs is not known in advance, is not possible either.

Both limitations are due to the fact that Luigi does scheduling and execution separately (with the exception of Luigi's dynamic dependencies, but they work only for upstream tasks, not for the downstream tasks that we would need).
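
The consequence of separating the two phases can be illustrated with a deliberately simplified toy model in plain Python. This is not Luigi's scheduler; the functions plan() and execute() and the task names are hypothetical. The sketch assumes only what is stated above: the dependency graph is fixed before anything runs.

```python
def plan(final_task, requires):
    """Phase 1: walk the statically declared dependencies and return a run order."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for upstream in requires.get(task, []):
            visit(upstream)
        order.append(task)

    visit(final_task)
    return order


def execute(order, actions):
    """Phase 2: run the tasks in the planned order; the plan is not revisited."""
    results = {}
    for task in order:
        results[task] = actions[task](results)
    return results


# Static graph: 'summarise' depends on 'split'.
requires = {'summarise': ['split'], 'split': []}
actions = {
    # How many chunks 'split' produces is only known once it has run ...
    'split': lambda results: ['chunk_a', 'chunk_b', 'chunk_c'],
    # ... so downstream work can consume them, but no new tasks can be added.
    'summarise': lambda results: len(results['split']),
}

order = plan('summarise', requires)
results = execute(order, actions)
print(order)                 # ['split', 'summarise']
print(results['summarise'])  # 3
```

In this model, a separate downstream branch per chunk would have to appear in the plan before 'split' has produced anything, which is the kind of situation the limitations above describe.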

If you run into any of these problems, you might be interested in a new workflow engine we are developing to overcome these limitations: SciPipe.

Changelog

  • 0.9.3b4
    • Support for Python 3 (Thanks to @jeffcjohnson for contributing this!).
    • Bug fixes.

Contributors

Acknowledgements

This work has been supported by:

Many ideas and much inspiration for the API are taken from:

Publications using SciLuigi

Below is an incomplete list of publications using SciLuigi for computational analysis. If you are using SciLuigi in a publication, please consider adding your own here.

Schulz W, Durant T, Siddon A, Torres R. Use of application containers and workflows for genomic data analysis. J Pathol Inform. 2016;7(1):53. DOI: 10.4103/2153-3539.197197

See also: SciPipe

If you find yourself needing more advanced scheduling features such as dynamic scheduling, or run into performance problems with Python/Luigi/SciLuigi, you might be interested in checking out a new workflow engine we are developing in the Go programming language to cope with some of the limitations we have still faced with Python/Luigi/SciLuigi: SciPipe.

SciPipe leverages some of the successful parts of Luigi's API, such as the flexible file name formatting, but replaces the Luigi scheduler with a custom, novel and very light-weight implicit dataflow scheduler written in Go. We find that it makes life much easier for complex workflow constructs such as those involving cross-validation and/or nested parameter sweeps.
