  • Stars
    244
  • Rank 164,879 (Top 4 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created almost 8 years ago
  • Updated about 1 month ago

Repository Details

Jupyter Notebooks in S3 - Jupyter Contents Manager implementation

S3Contents - Jupyter Notebooks in S3

A transparent, drop-in replacement for Jupyter's standard filesystem-backed storage system. With this implementation of a Jupyter Contents Manager you can save all your notebooks, files, and directory structure directly to an S3/GCS bucket on AWS/GCP, or to a self-hosted S3-API-compatible service such as MinIO.

Installation

pip install s3contents

Install with GCS dependencies:

pip install s3contents[gcs]

s3contents vs X

While there are other implementations of an S3 Jupyter Contents Manager, such as s3nb or s3drive, s3contents is the only one tested against new versions of Jupyter. It also supports more authentication methods and Google Cloud Storage.

This aims to be a fully tested implementation and it's based on PGContents.

Configuration

Create a jupyter_notebook_config.py file in one of the Jupyter config directories, for example ~/.jupyter/jupyter_notebook_config.py.

Jupyter Notebook Classic: If you plan to use the Classic Jupyter Notebook interface, change ServerApp to NotebookApp in all the examples on this page.

AWS S3

from s3contents import S3ContentsManager

c = get_config()

# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "<S3 bucket name>"

# Fix JupyterLab dialog issues
c.ServerApp.root_dir = ""

Authentication

Additionally you can configure multiple authentication methods:

Access and secret keys:

c.S3ContentsManager.access_key_id = "<AWS Access Key ID / IAM Access Key ID>"
c.S3ContentsManager.secret_access_key = "<AWS Secret Access Key / IAM Secret Access Key>"

Session token:

c.S3ContentsManager.session_token = "<AWS Session Token / IAM Session Token>"

AWS EC2 role auth setup

It is also possible to use IAM role-based access to the S3 bucket from an Amazon EC2 instance or another AWS resource.

To do that, leave the authentication options (access_key_id, secret_access_key) at their default of None and ensure that the EC2 instance has an IAM role with sufficient permissions (read and write) for the bucket.

Optional settings

# A prefix in the S3 bucket to use as the root of the Jupyter file system
c.S3ContentsManager.prefix = "this/is/a/prefix/on/the/s3/bucket"

# Server-Side Encryption
c.S3ContentsManager.sse = "AES256"

# Authentication signature version
c.S3ContentsManager.signature_version = "s3v4"

# See AWS key refresh
c.S3ContentsManager.init_s3_hook = init_function
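
As a rough illustration of what the prefix setting means, the sketch below joins a prefix and a Jupyter path into a single object key. The helper to_object_key is hypothetical and written only for this example; it is not part of s3contents, and the library's own path handling may differ in details.

```python
def to_object_key(prefix: str, path: str) -> str:
    """Join a bucket prefix and a Jupyter path into one object key.

    Hypothetical helper for illustration only, not the s3contents API.
    """
    # Drop leading/trailing slashes and skip empty parts so an empty
    # prefix does not produce a key that starts with "/".
    parts = [p for p in (prefix.strip("/"), path.strip("/")) if p]
    return "/".join(parts)


print(to_object_key("this/is/a/prefix/on/the/s3/bucket", "reports/analysis.ipynb"))
```

With an empty prefix the Jupyter path is used as the key unchanged, which corresponds to using the bucket root as the file system root.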

AWS key refresh

The optional init_s3_hook configuration can be used to enable AWS key rotation (described here and here) as follows:

from aiobotocore.credentials import AioRefreshableCredentials
from aiobotocore.session import get_session
from configparser import ConfigParser

from s3contents import S3ContentsManager

def refresh_external_credentials():
    config = ConfigParser()
    config.read('/home/jovyan/.aws/credentials')
    return {
        "access_key": config['default']['aws_access_key_id'],
        "secret_key": config['default']['aws_secret_access_key'],
        "token": config['default']['aws_session_token'],
        "expiry_time": config['default']['aws_expiration']
    }

async def async_refresh_credentials():
    return refresh_external_credentials()

def make_key_refresh_boto3(this_s3contents_instance):
    session_credentials = AioRefreshableCredentials.create_from_metadata(
        metadata=refresh_external_credentials(),
        refresh_using=async_refresh_credentials,
        method='custom-refreshing-key-file-reader',
    )
    refresh_session = get_session()  # from aiobotocore.session
    refresh_session._credentials = session_credentials
    this_s3contents_instance.boto3_session = refresh_session

# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager

c.S3ContentsManager.init_s3_hook = make_key_refresh_boto3
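
The refresh function above reads four entries from the [default] profile of the credentials file, including an aws_expiration entry. The sketch below shows a file layout that would satisfy that lookup, parsed with the same ConfigParser calls. All values are placeholders, parse_credentials is a hypothetical helper, and the ISO 8601 timestamp format is an assumption rather than something the source specifies.

```python
from configparser import ConfigParser

# Placeholder file contents; whatever rotates your keys would need to
# write these four entries. The timestamp format is an assumption.
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = examplesecret
aws_session_token = exampletoken
aws_expiration = 2030-01-01T00:00:00Z
"""


def parse_credentials(text: str, profile: str = "default") -> dict:
    """Map one profile to the metadata keys used by the refresh function above."""
    config = ConfigParser()
    config.read_string(text)
    section = config[profile]
    return {
        "access_key": section["aws_access_key_id"],
        "secret_key": section["aws_secret_access_key"],
        "token": section["aws_session_token"],
        "expiry_time": section["aws_expiration"],
    }


creds = parse_credentials(SAMPLE_CREDENTIALS)
print(sorted(creds))
```

If any of the four entries is missing, the lookup raises KeyError, so it may be worth checking the file before starting the server.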

MinIO playground example

You can test this using the play.minio.io:9000 playground. Just be sure to create the bucket first.

from s3contents import S3ContentsManager

c = get_config()

# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "Q3AM3UQ867SPQQA43P2F"
c.S3ContentsManager.secret_access_key = "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG"
c.S3ContentsManager.endpoint_url = "https://play.minio.io:9000"
c.S3ContentsManager.bucket = "s3contents-demo"
c.S3ContentsManager.prefix = "notebooks/test"

Access local files

To access local files as well as remote files in S3, you can use hybridcontents.

Install it:

pip install hybridcontents

Use a configuration similar to this:

from s3contents import S3ContentsManager
from hybridcontents import HybridContentsManager
from notebook.services.contents.largefilemanager import LargeFileManager

c = get_config()

c.ServerApp.contents_manager_class = HybridContentsManager

c.HybridContentsManager.manager_classes = {
    # Associate the root directory with an S3ContentsManager.
    # This manager will receive all requests that don't fall under any of the
    # other managers.
    "": S3ContentsManager,
    # Associate /local_directory with a LargeFileManager.
    "local_directory": LargeFileManager,
}

c.HybridContentsManager.manager_kwargs = {
    # Args for root S3ContentsManager.
    "": {
        "access_key_id": "<AWS Access Key ID / IAM Access Key ID>",
        "secret_access_key": "<AWS Secret Access Key / IAM Secret Access Key>",
        "bucket": "<S3 bucket name>",
    },
    # Args for the LargeFileManager mapped to /local_directory
    "local_directory": {
        "root_dir": "/Users/danielfrg/Downloads",
    },
}
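
To make the mapping above more concrete, here is a minimal sketch of prefix-based routing: a path under local_directory goes to the manager registered for that key, and everything else falls through to the root ("") entry. The route function is hypothetical and only illustrates the idea; it is not how hybridcontents is implemented.

```python
def route(path: str, prefixes: list[str]) -> tuple[str, str]:
    """Return (matching prefix, remaining path); "" is the catch-all root.

    Hypothetical illustration of prefix routing, not hybridcontents code.
    """
    path = path.strip("/")
    # Try longer prefixes first so the most specific mapping wins.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and (path == prefix or path.startswith(prefix + "/")):
            return prefix, path[len(prefix):].lstrip("/")
    return "", path


prefixes = ["", "local_directory"]
print(route("local_directory/data.csv", prefixes))
print(route("notebooks/analysis.ipynb", prefixes))
```

Matching on the prefix plus a trailing slash avoids sending a path such as local_directory_2/x to the local manager by accident.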

GCP - Google Cloud Storage

Install the extra dependencies with:

pip install s3contents[gcs]

from s3contents.gcs import GCSContentsManager

c = get_config()

c.ServerApp.contents_manager_class = GCSContentsManager
c.GCSContentsManager.project = "<your-project>"
c.GCSContentsManager.token = "~/.config/gcloud/application_default_credentials.json"
c.GCSContentsManager.bucket = "<GCP bucket name>"

Note that the path ~/.config/gcloud/application_default_credentials.json assumes a POSIX system on which you have run gcloud init.

Other configuration

File Save Hooks

If you want to use pre/post file save hooks, here are some examples.

A pre_save_hook is written in the exact same way as normal, operating on the file in local storage before committing it to the object store.

def scrub_output_pre_save(model, **kwargs):
    """
    Scrub output before saving notebooks
    """

    # only run on notebooks
    if model["type"] != "notebook":
        return

    # only run on nbformat v4
    if model["content"]["nbformat"] != 4:
        return

    for cell in model["content"]["cells"]:
        if cell["cell_type"] != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None

c.S3ContentsManager.pre_save_hook = scrub_output_pre_save
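
To see what the hook does without starting Jupyter, it can be called directly on a hand-built model dictionary. The sketch below repeats the hook's logic in compact form so that it runs on its own; the sample model is made up and contains only the fields the hook reads.

```python
def scrub_output_pre_save(model, **kwargs):
    """Same logic as the hook above, repeated so this snippet is self-contained."""
    if model["type"] != "notebook" or model["content"]["nbformat"] != 4:
        return
    for cell in model["content"]["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None


# Made-up notebook model with one markdown cell and one executed code cell.
model = {
    "type": "notebook",
    "content": {
        "nbformat": 4,
        "cells": [
            {"cell_type": "markdown", "source": "# Title"},
            {
                "cell_type": "code",
                "source": "1 + 1",
                "execution_count": 3,
                "outputs": [{"output_type": "execute_result"}],
            },
        ],
    },
}

scrub_output_pre_save(model)
code_cell = model["content"]["cells"][1]
print(code_cell["outputs"], code_cell["execution_count"])
```

The hook mutates the model in place and returns nothing, so the markdown cell is left untouched and only code cells lose their outputs.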

A post_save_hook instead operates on the file in object storage. Because of this, it is useful to use the file methods on the contents_manager for data manipulation. In addition, the hook must use the following function signature (unique to s3contents):

def make_html_post_save(model, s3_path, contents_manager, **kwargs):
    """
    Convert notebooks to HTML after saving via nbconvert
    """
    import os

    import nbformat
    from nbconvert import HTMLExporter

    if model["type"] != "notebook":
        return

    content, _format = contents_manager.fs.read(s3_path, format="text")
    my_notebook = nbformat.reads(content, as_version=4)

    html_exporter = HTMLExporter()
    html_exporter.template_name = "classic"

    (body, resources) = html_exporter.from_notebook_node(my_notebook)

    base, ext = os.path.splitext(s3_path)
    contents_manager.fs.write(path=(base + ".html"), content=body, format=_format)

c.S3ContentsManager.post_save_hook = make_html_post_save
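
The signature described above can be exercised without S3 by passing in a stand-in object. In the sketch below, InMemoryFS and backup_post_save are hypothetical names invented for this example. The stub mirrors only the fs.read and fs.write calls used in the hook above, so the real file system object may behave differently.

```python
import os
from types import SimpleNamespace


class InMemoryFS:
    """Hypothetical stand-in that mimics only the read/write calls used above."""

    def __init__(self):
        self.files = {}

    def read(self, path, format="text"):
        return self.files[path], format

    def write(self, path, content, format="text"):
        self.files[path] = content


def backup_post_save(model, s3_path, contents_manager, **kwargs):
    """Write a .bak copy next to each saved notebook."""
    if model["type"] != "notebook":
        return
    content, fmt = contents_manager.fs.read(s3_path, format="text")
    base, _ext = os.path.splitext(s3_path)
    contents_manager.fs.write(path=base + ".bak", content=content, format=fmt)


fs = InMemoryFS()
fs.files["notebooks/demo.ipynb"] = '{"cells": []}'
manager = SimpleNamespace(fs=fs)  # only the .fs attribute is needed here

backup_post_save({"type": "notebook"}, "notebooks/demo.ipynb", manager)
print(sorted(fs.files))
```

Testing a hook against a stub like this can catch signature mistakes early, though it says nothing about permissions or latency on a real bucket.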
