Data Science Template

This is a starter template for data science projects in Equinor, although it may also be useful for others. It contains many of the essential artifacts that you will need and presents a number of best practices, including code setup, samples, MLOps using Azure, a standard document to guide and gather information relating to the data science process, and more.

As it is impossible to create a single template that will meet every project's needs, this example should be considered a starting point and adapted as your project evolves.

Before working with the contents of this template, or with data science projects in general, it is recommended to familiarise yourself with the Equinor Data Science Technical Standards (currently Equinor internal only).

Getting Started With This Template

This template is provided as a Cookiecutter template so you can quickly create an instance customised for your project. A working Python installation is assumed.

To get running, first install the latest Cookiecutter if you haven't already (version 1.4.0 or higher is required):

pip install -U cookiecutter

Create project

Then generate a new project for your own use based upon the template, answering the questions to customise the generated project:

cookiecutter https://github.com/equinor/data-science-template.git

The values you are prompted for are:

Value                 Description
project_name          A name for your project, used mostly within documentation
project_description   A description to include in the README.md
repo_name             The name of the GitHub repository where the project will be held
conda_name            The name of the conda environment to use
package_name          A name for the generated Python package
mlops_name            Default name for Azure ML
mlops_compute_name    Default Azure ML compute cluster name to use
author                The main author of the solution, included in the setup.py file
open_source_license   The type of open source license the project will be released under
devops_organisation   An Azure DevOps organisation; leave blank if you aren't using Azure DevOps

If you are uncertain about what to enter for any value then just accept the defaults. You can always change the generated project later.
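If you prefer to script project generation, Cookiecutter also supports a non-interactive mode via its standard --no-input flag with key=value overrides. A minimal sketch with illustrative values (substitute your own):

```shell
# Non-interactive generation -- the override values below (project_name,
# repo_name, package_name) are examples only; substitute your own.
cookiecutter --no-input \
    https://github.com/equinor/data-science-template.git \
    project_name="My Project" \
    repo_name=myproject \
    package_name=mypackage
```

Any values you don't override fall back to the template defaults.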

Having problems? You can always download this repository using the download button above and reference the local copy, e.g. cookiecutter c:\Downloads\data-science-template. Ideally, though, fix any git proxy or other issues that are causing the problems.

You are now ready to get started. First, however, create a new GitHub repository for your new project and push the generated project to it using the following commands (substitute myproject with the name of your project and REMOTE-REPOSITORY-URL with the remote repository URL):

cd myproject
git init
git add .
git commit -m "Initial commit"
git remote add origin REMOTE-REPOSITORY-URL
git remote -v
git push origin master

Continuous Integration

Continuous Integration (CI) increases quality by building, running tests, and performing other validation whenever code is committed. The template contains a build pipeline for Azure DevOps, but a couple of manual steps are required to set it up:

  • Log in to https://dev.azure.com and browse to, or create, an organisation and project. The project name should match your GitHub repository name.
  • Under Pipelines -> Builds, select New Pipeline
  • Select GitHub and then your repository. Log in / grant any permissions as prompted
  • In the review pane, click Run

You are now set up for CI and automated testing / building. Verify that the badge link in this README corresponds with your DevOps project; as a further step, you might set up release pipelines for automated deployment.

At this stage the build pipeline doesn't include MLOps steps, although these can be added based upon your needs.
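For orientation, a minimal Azure DevOps pipeline of the kind described above might look like the sketch below. This is an illustrative fragment only, not the template's actual pipeline definition, which should be preferred:

```yaml
# azure-pipelines.yml -- illustrative sketch only; step names and the
# Python version are assumptions, not the template's shipped pipeline.
trigger:
  - master

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.x'
  - script: pip install -r requirements.txt
    displayName: 'Install dependencies'
  - script: pytest tests
    displayName: 'Run tests'
  - script: flake8 --max-line-length=120
    displayName: 'Lint'
```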

Finally

  • Update the project readme file with additional project specific details including setup, configuration and usage.
  • The docs\process_documentation.md file should be completed phase by phase, and each phase's results shall be submitted for review and approval before the project moves on to the next phase. This assists with gathering the essential information required to deliver a correct and robust solution. The git repository shall be added to the script that populates the knowledge repository to ease future knowledge sharing.

Generated Project Contents

Depending upon the selected options when creating the project, the generated structure will look similar to the below:

├── .gitignore               <- Files that should be ignored by git. Add separate .gitignore files in sub folders if 
│                               needed
├── conda_env.yml            <- Conda environment definition for ensuring consistent setup across environments
├── LICENSE
├── README.md                <- The top-level README for developers using this project.
├── requirements.txt         <- The requirements file for reproducing the analysis environment, e.g.
│                               generated with `pip freeze > requirements.txt`. Might not be needed if using conda.
├── setup.py                 <- Metadata about your project for easy distribution.
│
├── data
│   ├── interim_[desc]       <- Interim files - give these folders whatever name makes sense.
│   ├── processed            <- The final, canonical data sets for modeling.
│   ├── raw                  <- The original, immutable data dump.
│   ├── temp                 <- Temporary files.
│   └── training             <- Files relating to the training process
│
├── docs                     <- Documentation
│   ├── data_science_code_of_conduct.md  <- Code of conduct.
│   ├── process_documentation.md         <- Standard template for documenting process and decisions.
│   └── writeup              <- Sphinx project for project writeup including auto generated API.
│      ├── conf.py           <- Sphinx configuration file.
│      ├── index.rst         <- Start page.
│      ├── make.bat          <- For generating documentation (Windows)
│      └── Makefile          <- For generating documentation (make)
│
├── examples                 <- Add folders as needed e.g. examples, eda, use case
│
├── extras                   <- Miscellaneous extras.
│   └── add_explorer_context_shortcuts.reg    <- Adds additional Windows Explorer context menus for starting jupyter.
│
├── notebooks                <- Notebooks for analysis and testing
│   ├── eda                  <- Notebooks for EDA
│   │   └── example.ipynb    <- Example python notebook
│   ├── features             <- Notebooks for generating and analysing features (1 per feature)
│   ├── modelling            <- Notebooks for modelling
│   └── preprocessing        <- Notebooks for preprocessing
│
├── scripts                  <- Standalone scripts
│   ├── deploy               <- MLOps scripts for deployment (WIP)
│   │   └── score.py         <- Scoring script
│   ├── train                <- MLOps scripts for training
│   │   ├── submit-train.py  <- Script for submitting a training run to Azure ML Service
│   │   ├── submit-train-local.py <- Script for local training using Azure ML
│   │   └── train.py         <- Example training script using the iris dataset
│   ├── example.py           <- Example script
│   └── MLOps.ipynb          <- End to end MLOps example (to be refactored into the above)
│
├── src                      <- Code for use in this project.
│   └── examplepackage       <- Example python package - place shared code in such a package
│       ├── __init__.py      <- Python package initialisation
│       ├── examplemodule.py <- Example module with functions and naming / commenting best practices
│       ├── features.py      <- Feature engineering functionality
│       ├── io.py            <- IO functionality
│       └── pipeline.py      <- Pipeline functionality
│
└── tests                    <- Test cases (named after module)
    ├── test_notebook.py     <- Example testing that Jupyter notebooks run without errors
    └── examplepackage       <- examplepackage tests
        ├── examplemodule    <- examplemodule tests (1 file per method tested)
        ├── features         <- features tests
        ├── io               <- io tests
        └── pipeline         <- pipeline tests
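A notebook smoke test like tests/test_notebook.py typically works in two halves: discover every committed notebook, then execute each and fail on any cell error. A minimal sketch of the discovery half, assuming the template's actual test may differ (find_notebooks is a hypothetical helper; execution would use a library such as nbclient or nbconvert):

```python
from pathlib import Path


def find_notebooks(root):
    """Return all .ipynb files under root, skipping checkpoint copies."""
    return sorted(
        p for p in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )

# Each discovered notebook would then be executed (e.g. with nbclient's
# NotebookClient) and the test fails if any cell raises an error.
```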

Contributing to This Template

Contributions to this template are greatly appreciated and encouraged.

To contribute an update simply:

  • Submit an issue describing your proposed change to the repo in question.
  • The repo owner will respond to your issue promptly.
  • Fork the desired repo, develop and test your code changes.
  • Check that your code follows the PEP8 guidelines (line lengths up to 120 are ok) and other general conventions within this document.
  • Ensure that your code adheres to the existing style. Refer to the Google Cloud Platform Samples Style Guide for the recommended coding standards for this organization.
  • Ensure that as far as possible there are unit tests covering the functionality of any new code.
  • Check that all existing unit tests still pass.
  • Edit this document and the template README.md if needed to describe new files or other important information.
  • Submit a pull request.

Template development environment

To develop this template further you might want to set up a virtual environment.

Set it up using:

cd data-science-template
python -m venv dst-env

Activate environment

Mac / Linux

source dst-env/bin/activate

Windows

dst-env\Scripts\activate

Install Dependencies

pip install -r requirements.txt

Testing

To run the template tests, install pytest using pip or conda, and then run the following from the repository root:

pytest tests

Linting

To verify that your code adheres to python standards run linting as shown below:

flake8 --max-line-length=120 *.py hooks/ tests/
