scrapbook

A library for recording and reading data in notebooks.

The scrapbook library records a notebook’s data values and generated visual content as "scraps". Recorded scraps can be read at a future time.

See the scrapbook documentation for more information on how to use scrapbook.

Use Cases

Notebook users may wish to record data produced during a notebook's execution. This recorded data, called scraps, can be used at a later time or passed as input to another notebook in a workflow.

Namely, scrapbook lets you:

  • persist data and visual content displays in a notebook as scraps
  • recall any persisted scrap of data
  • summarize collections of notebooks
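
Taken together, a minimal end-to-end sketch of these capabilities might look like the following (the file paths and scrap names are hypothetical, and the glue calls are assumed to run inside the notebook being executed):

import scrapbook as sb

# inside the executed notebook: persist scraps
sb.glue("model_name", "logreg")
sb.glue("metrics", {"accuracy": 0.91})

# later, from any Python process: recall a scrap and summarize a collection
nb = sb.read_notebook("out/train.ipynb")
nb.scraps["metrics"].data      # -> {"accuracy": 0.91}
book = sb.read_notebooks("out/")
book.scraps_report()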

Python Version Support

This library's long-term support target is Python 3.6+. It also supports Python 2.7 until Python 2 reaches end-of-life in 2020. After that date, Python 2 support will stop and only 3.x versions will be maintained.

Installation

Install using pip:

pip install scrapbook

For installing optional IO dependencies, you can specify individual store bundles, like s3 or azure:

pip install scrapbook[s3]

or use all:

pip install scrapbook[all]

Models and Terminology

Scrapbook defines the following items:

  • scraps: serializable data values and visualizations such as strings, lists of objects, pandas dataframes, charts, images, or data references.
  • notebook: a wrapped nbformat notebook object with extra methods for interacting with scraps.
  • scrapbook: a collection of notebooks with an interface for asking questions of the collection.
  • encoders: registered translators of data to/from notebook storage formats.

scrap model

The scrap model houses a few key attributes in a tuple, including:

  • name: The name of the scrap
  • data: Any data captured by the scrapbook api call
  • encoder: The name of the encoder used to encode/decode data to/from the notebook
  • display: Any display data used by IPython to display visual content
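
For example, a minimal sketch of reading one of these tuples back (this assumes a notebook.ipynb that already contains a glued scrap named "hello"):

import scrapbook as sb

nb = sb.read_notebook('notebook.ipynb')
scrap = nb.scraps["hello"]   # scraps is a name -> Scrap lookup
scrap.name      # "hello"
scrap.data      # the decoded value that was glued
scrap.encoder   # the encoder name used to record it, e.g. "text" or "json"
scrap.display   # display content captured by IPython, or None if none was recorded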

API

Scrapbook adds a few basic API commands that enable saving and retrieving data, including:

  • glue to persist scraps with or without display output
  • read_notebook reads one notebook
  • scraps provides a searchable dictionary of all scraps by name
  • reglue which copies a scrap from another notebook to the current notebook
  • read_notebooks reads many notebooks from a given path
  • scraps_report displays a report about collected scraps
  • papermill_dataframe and papermill_metrics for backward compatibility for two deprecated papermill features

The following sections provide more detail on these API commands.

glue to persist scraps

Records a scrap (data or display value) in the given notebook cell.

The scrap (recorded value) can be retrieved during later inspection of the output notebook.

"""glue example for recording data values"""
import scrapbook as sb

sb.glue("hello", "world")
sb.glue("number", 123)
sb.glue("some_list", [1, 3, 5])
sb.glue("some_dict", {"a": 1, "b": 2})
sb.glue("non_json", df, 'arrow')

The scrapbook library can be used later to recover scraps from the output notebook:

# read a notebook and get previously recorded scraps
nb = sb.read_notebook('notebook.ipynb')
nb.scraps

scrapbook infers the storage format from the value type against the registered data encoders. Alternatively, the inferred encoding can be overridden by setting the encoder argument to the registered name (e.g. "json") of a particular encoder.
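
As a short sketch of both behaviors (the scrap names here are only illustrative):

sb.glue("inferred_dict", {"a": 1})                   # dict value, encoder typically inferred as "json"
sb.glue("explicit_dict", {"a": 1}, encoder="json")   # same data, encoder named explicitly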

This data is persisted by generating a display output with a special media type identifying the content encoding format and data. These outputs are not always visible in notebook rendering but still exist in the document. Scrapbook can then rehydrate the data associated with the notebook in the future by reading these cell outputs.
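
As a rough sketch of where this data lives, the saved notebook can be opened with nbformat and the media types present in each cell's outputs listed; the scrapbook-specific media type will appear alongside any ordinary display data:

import nbformat

raw = nbformat.read('notebook.ipynb', as_version=4)
for cell in raw.cells:
    for output in cell.get("outputs", []):
        print(list(output.get("data", {}).keys()))  # media types recorded in this output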

With display output

To display a named scrap with visible display outputs, you need to indicate that the scrap is directly renderable.

This can be done by toggling the display argument.

# record a UI message along with the input string
sb.glue("hello", "Hello World", display=True)

The call will save the data and the display attributes of the Scrap object, making it visible as well as encoding the original data. This leans on the IPython.core.formatters.format_display_data function to translate the data object into a display and metadata dict for the notebook kernel to parse.

Another pattern that can be used is to specify that only the display data should be saved, and not the original object. This is achieved by setting the encoder to be display.

import IPython.display

# record an image without the original input object
sb.glue("sharable_png",
  IPython.display.Image(filename="sharable.png"),
  encoder='display'
)

Finally the media types that are generated can be controlled by passing a list, tuple, or dict object as the display argument.

sb.glue("media_as_text_only",
  media_obj,
  encoder='display',
  display=('text/plain',) # This passes [text/plain] to format_display_data's include argument
)

sb.glue("media_without_text",
  media_obj,
  encoder='display',
  display={'exclude': 'text/plain'} # forward to format_display_data's kwargs
)

Like data scraps, these can be retrieved at a later time by accessing the scrap's display attribute. Though usually one will just use the Notebook's reglue method (described below).
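
As a quick sketch, the raw display bundle recorded above could be fetched later like this (assuming the "sharable_png" scrap from the earlier example):

nb = sb.read_notebook('notebook.ipynb')
nb.scraps["sharable_png"].display   # the saved mime bundle (data and metadata) for re-rendering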

read_notebook reads one notebook

Reads a Notebook object loaded from the location specified at path. You've already seen how this function is used in the above API call examples, but essentially it provides a thin wrapper over nbformat's NotebookNode with the ability to extract scrapbook scraps.

nb = sb.read_notebook('notebook.ipynb')

This Notebook object adheres to the nbformat JSON schema, allowing access to its required fields.

nb.cells # The cells from the notebook
nb.metadata
nb.nbformat
nb.nbformat_minor

There are a few additional methods provided, most of which are outlined in more detail below:

nb.scraps
nb.reglue

The abstraction also makes saved content available as a dataframe referencing each key and source. More of these methods will be made available in later versions.

# Produces a data frame with ["name", "data", "encoder", "display", "filename"] as columns
nb.scrap_dataframe # Warning: This might be a large object if data or display is large

The Notebook object also has a few legacy functions for backwards compatibility with papermill's Notebook object model. As a result, it can be used to read papermill execution statistics as well as scrapbook abstractions:

nb.cell_timing # List of cell execution timings in cell order
nb.execution_counts # List of cell execution counts in cell order
nb.papermill_metrics # Dataframe of cell execution counts and times
nb.papermill_record_dataframe # Dataframe of notebook records (scraps with only data)
nb.parameter_dataframe # Dataframe of notebook parameters
nb.papermill_dataframe # Dataframe of notebook parameters and cell scraps

The notebook reader relies on papermill's registered iorw to enable access to a variety of sources, such as -- but not limited to -- S3, Azure, and Google Cloud.
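
For example, a remote path can be passed directly (a hypothetical S3 location, assuming the matching optional IO dependencies are installed):

nb = sb.read_notebook('s3://bucket/key/prefix/to/output.ipynb')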

scraps provides a name -> scrap lookup

The scraps method allows for access to all of the scraps in a particular notebook.

nb = sb.read_notebook('notebook.ipynb')
nb.scraps # Prints a dict of all scraps by name

This object has a few additional methods as well for convenient conversion and execution.

nb.scraps.data_scraps # Filters to only scraps with `data` associated
nb.scraps.data_dict # Maps `data_scraps` to a `name` -> `data` dict
nb.scraps.display_scraps # Filters to only scraps with `display` associated
nb.scraps.display_dict # Maps `display_scraps` to a `name` -> `display` dict
nb.scraps.dataframe # Generates a dataframe with ["name", "data", "encoder", "display"] as columns

These methods let simple use cases avoid digging through the model abstractions.

reglue copies a scrap into the current notebook

Using reglue, one can take any scrap glued into one notebook and glue it into the current one.

nb = sb.read_notebook('notebook.ipynb')
nb.reglue("table_scrap") # This copies both data and displays

Any data or display information will be copied verbatim into the currently executing notebook as though the user called glue again on the original source.

It's also possible to rename the scrap in the process.

nb.reglue("table_scrap", "old_table_scrap")

Finally, if one wishes to attempt a reglue without first checking for existence, raise_on_missing can be set to False so that only a message is displayed on failure.

nb.reglue("maybe_missing", raise_on_missing=False)
# => "No scrap found with name 'maybe_missing' in this notebook"

read_notebooks reads many notebooks

Reads all notebooks located in a given path into a Scrapbook object.

# create a scrapbook named `book`
book = sb.read_notebooks('path/to/notebook/collection/')
# get the underlying notebooks as a list
book.notebooks # Or `book.values`

The path reuses papermill's registered iorw to list and read files from various sources, so that non-local URLs can load data.

# create a scrapbook named `book`
book = sb.read_notebooks('s3://bucket/key/prefix/to/notebook/collection/')

The Scrapbook (book in this example) can be used to recall all scraps across the collection of notebooks:

book.notebook_scraps # Dict of shape `notebook` -> (`name` -> `scrap`)
book.scraps # merged dict of shape `name` -> `scrap`

scraps_report displays a report about collected scraps

The Scrapbook collection can be used to generate a scraps_report on all the scraps from the collection as markdown-structured output.

book.scraps_report()

This display can filter on scrap and notebook names, as well as enable or disable an overall header for the display.

book.scraps_report(
  scrap_names=["scrap1", "scrap2"],
  notebook_names=["result1"], # matches `/notebook/collections/result1.ipynb` pathed notebooks
  header=False
)

By default, the report will only populate with visual elements. To also report on data elements, set include_data=True.

book.scraps_report(include_data=True)

papermill support

Finally, scrapbook provides two backwards-compatible features for deprecated papermill capabilities:

book.papermill_dataframe
book.papermill_metrics

Encoders

Encoders are accessible by key name as Encoder objects registered against the encoders.registry object. To register a new data encoder, simply call:

from scrapbook.encoders import registry as encoder_registry
# add encoder to the registry
encoder_registry.register("custom_encoder_name", MyCustomEncoder())

The encoder class must implement two methods, encode and decode:

class MyCustomEncoder(object):
    def encode(self, scrap):
        # scrap.data is any type, usually specific to the encoder name
        pass  # Return a `Scrap` with `data` type one of [None, list, dict, *six.integer_types, *six.string_types]

    def decode(self, scrap):
        # scrap.data is one of [None, list, dict, *six.integer_types, *six.string_types]
        pass  # Return a `Scrap` with `data` type as any type, usually specific to the encoder name

These methods transform scraps into a JSON-compatible object representing their contents or location, and load those representations back into the original data objects.
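
As a concrete sketch (not part of the library), a round-trip encoder could serialize JSON-serializable payloads to a string and back; this assumes Scrap is the named tuple described in the scrap model section (so _replace is available) and uses the registry import shown above:

import json

class JsonStringEncoder(object):
    def encode(self, scrap):
        # store the payload as a JSON string, one of the allowed storage types
        return scrap._replace(data=json.dumps(scrap.data))

    def decode(self, scrap):
        # rebuild the original Python object from the stored string
        return scrap._replace(data=json.loads(scrap.data))

encoder_registry.register("json_string", JsonStringEncoder())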

text

A basic string storage format that saves data as python strings.

sb.glue("hello", "world", "text")

json

sb.glue("foo_json", {"foo": "bar", "baz": 1}, "json")

pandas

sb.glue("pandas_df",pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}), "pandas")

papermill's deprecated record feature

scrapbook provides a robust and flexible recording schema. This library replaces papermill's existing record functionality.

Documentation for papermill record exists on ReadTheDocs. In brief, the deprecated record function:

pm.record(name, value): enables values to be saved with the notebook [API documentation]

pm.record("hello", "world")
pm.record("number", 123)
pm.record("some_list", [1, 3, 5])
pm.record("some_dict", {"a": 1, "b": 2})

pm.read_notebook(notebook): pandas could be used later to recover recorded values by reading the output notebook into a dataframe. For example:

nb = pm.read_notebook('notebook.ipynb')
nb.dataframe

Rationale for Papermill record deprecation

Papermill's record function was deprecated due to these limitations and challenges:

  • The record function didn't follow papermill's pattern of linear execution of a notebook. It was awkward to describe record as an additional feature of papermill; it really felt like describing a second, less developed library.
  • Recording / Reading required data translation to JSON for everything. This is a tedious, painful process for dataframes.
  • Reading recorded values into a dataframe would result in unintuitive dataframe shapes.
  • Less modularity and flexibility than other papermill components, where custom operators can be registered.

To overcome these limitations in Papermill, a decision was made to create Scrapbook.
