• This repository has been archived on 20/Oct/2020
  • Language
    Python
  • License
    MIT License
  • Created almost 10 years ago
  • Updated almost 8 years ago

iterstuff

Useful tools for working with iterators

If the python2 itertools module is the Swiss Army Knife of functions for iterables, iterstuff is the cut-down single-blade version that you can keep on your keyring.

You can install iterstuff from PyPI using pip:

pip install iterstuff

Lookahead

The Lookahead class is the main feature of iterstuff. It 'wraps' an iterable and allows:

  • Detection of the end of the generator using the atend property
  • 'Peeking' at the next item to be yielded using the peek property

Note that 'wrapping' means that the Lookahead will advance the wrapped iterable (by calling next) as needed. As per the comments on the Lookahead __init__, creating the Lookahead will advance to the first element of the wrapped iterable immediately. After that, iterating over the Lookahead will also iterate over the wrapped iterable.

We'll look at examples in a moment, but first here's a summary of usage:

>>> # Create an iterable that will yield three integers
>>> g = xrange(3)
>>> # Wrap it in a Lookahead
>>> from iterstuff import Lookahead
>>> x = Lookahead(g)

Now we can use the properties of the Lookahead to check whether we're at the start and/or end of the generator sequence, and to look at the next element that would be yielded:

>>> x.atstart
True
>>> x.atend
False
>>> x.peek
0

Let's grab the first element and see how the properties change:

>>> x.next()
0
>>> x.atstart
False
>>> x.atend
False
>>> x.peek
1

We have two ways to iterate over a sequence wrapped in a Lookahead:

>>> # The usual way
>>> x = Lookahead(xrange(3))
>>> for y in x: print y
0
1
2

>>> # By checking for the end of the sequence
>>> x = Lookahead(xrange(3))
>>> while not x.atend:
...     y = x.next()
...     print y
...     
0
1
2

And we can detect a completely empty Lookahead:

>>> if x.atstart and x.atend:
...     # x is an empty Lookahead
...     pass
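
To make the behaviour above concrete, here is a minimal sketch of a Lookahead-style wrapper. It is an illustration only, not the iterstuff source: the class name PeekableSketch and its internals are assumptions, and it covers just atstart, atend, peek and iteration.

```python
class PeekableSketch(object):
    """Illustrative stand-in for a Lookahead-style wrapper (hypothetical)."""

    # Private marker meaning "the wrapped iterable is exhausted"
    _SENTINEL = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._atstart = True
        # Fetch the first element immediately, as the text describes
        self._next = next(self._it, self._SENTINEL)

    @property
    def atstart(self):
        return self._atstart

    @property
    def atend(self):
        return self._next is self._SENTINEL

    @property
    def peek(self):
        # None at the end, consistent with the ('i', None) pair shown
        # in the pairwise example later in this README
        return None if self.atend else self._next

    def __iter__(self):
        return self

    def __next__(self):
        if self.atend:
            raise StopIteration
        value = self._next
        # Advance the wrapped iterable by one element
        self._next = next(self._it, self._SENTINEL)
        self._atstart = False
        return value

    next = __next__  # Python 2 spelling of the same method
```

The one-element buffer is the whole trick: because the next element has already been fetched, atend and peek can be answered without touching the wrapped iterable again.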

The obvious question is: how is this useful?

Repeating a takewhile

The itertools.takewhile function yields items from an iterable for as long as some condition is satisfied. However, to discover that the condition no longer holds, it has to read and test one more element, and that element is consumed even though it is never yielded. Let's see what happens if we want to use it to break a sequence of characters into letters and digits.

>>> from itertools import takewhile
>>> # Build a generator that returns a sequence
>>> data = iter('abcd123ghi')
>>>
>>> # Ok, let's get the characters that are not digits
>>> print list(takewhile(lambda x: not x.isdigit(), data))
['a', 'b', 'c', 'd']
>>> 
>>> # Great, now let's get the digits
>>> print list(takewhile(lambda x: x.isdigit(), data))
['2', '3']

What happened to '1'? When we were processing the non-digits, the takewhile function read the '1' from data, passed it to the lambda and when that returned False, terminated. But of course, by then the '1' had already been consumed, so when we started the second takewhile, the first character it got was '2'.
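
The loss can be confirmed by looking at what is left in the underlying iterator after the first takewhile. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from itertools import takewhile

data = iter('abcd123ghi')

# takewhile stops at the first character that fails the test...
letters = list(takewhile(lambda c: not c.isdigit(), data))

# ...but it had to read that character ('1') from data in order to
# test it, so the underlying iterator has already moved past it.
remaining = ''.join(data)

print(letters)    # ['a', 'b', 'c', 'd']
print(remaining)  # 23ghi
```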

We can solve this with a Lookahead. Here's a repeatable takewhile equivalent (that's in the iterstuff module):

def repeatable_takewhile(predicate, iterable):
    """
    Like itertools.takewhile, but does not consume the first
    element of the iterable that fails the predicate test.
    """
    
    # Assert that the iterable is a Lookahead. The act of wrapping
    # an iterable in a Lookahead consumes the first element, so we
    # cannot do the wrapping inside this function.
    if not isinstance(iterable, Lookahead):
        raise TypeError("The iterable parameter must be a Lookahead")
    
    # Use 'peek' to check if the next element will satisfy the
    # predicate, and yield while this is True, or until we reach
    # the end of the iterable.
    while (not iterable.atend) and predicate(iterable.peek):
        yield iterable.next()

Let's see how this behaves:

>>> from iterstuff import repeatable_takewhile, Lookahead
>>> data = Lookahead('abcd123ghi')
>>> print list(repeatable_takewhile(lambda x: not x.isdigit(), data))
['a', 'b', 'c', 'd']
>>> print list(repeatable_takewhile(lambda x: x.isdigit(), data))
['1', '2', '3']

Examine data before it's used

The pandas library can build a DataFrame from almost any sequence of records. The DataFrame constructor checks the first record to determine the data types of the columns. If we pass a generator data to the DataFrame constructor, almost the first thing that happens is that data is turned into a list, so that pandas can access data[0] to examine the data types. If your generator yields many records, though, this is bad - it's just built a list of those many records in memory, effectively doubling the amount of memory used (memory to hold the list plus memory to hold the DataFrame).

A Lookahead allows code to peek ahead at the next row. So we could do the same job as pandas in a different way:

from iterstuff import Lookahead

def examine(data):
    # 'examine' is a hypothetical wrapper, added so that 'return' is valid

    # Wrap the data in a Lookahead so we can peek at the first row
    peekable = Lookahead(data)

    # If we're at the end of the Lookahead, there's no data
    if peekable.atend:
        return

    # Grab the first row so we can look at the data types
    first_row = peekable.peek

    # ...process the data types...
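
For comparison, a similar "look at the first row without building a list" can be sketched with the standard library alone, by reading one row and chaining it back onto the front. This is a different technique from Lookahead, and the helper names peek_first and column_types, as well as the sample rows, are hypothetical.

```python
from itertools import chain

def peek_first(rows):
    """Return (first_row, rows) with the first row restored to the stream.

    first_row is None when rows is empty.
    """
    it = iter(rows)
    try:
        first = next(it)
    except StopIteration:
        return None, iter(())
    # Push the consumed row back onto the head of the stream
    return first, chain([first], it)

def column_types(row):
    # Map each column name to the type name of its value in this row
    return dict((name, type(value).__name__) for name, value in row.items())

# A generator of rows; nothing is materialised as a list
records = ({'id': i, 'load_ms': i * 1.5} for i in range(3))

first, records = peek_first(records)
types = column_types(first)
print(types['id'], types['load_ms'])  # int float
```

After the peek, records still yields all three rows, so it can be handed on to whatever consumes the data.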

Simple pairwise

There's a beautiful recipe in the itertools documentation for yielding pairs from an iterable:

from itertools import tee, izip

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

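Beautiful, but a little complex. We can make a simpler version with a Lookahead. Note that, unlike the recipe, this version also yields a final pair whose second element is None, because peek has nothing left to return once the last element has been taken: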

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    it = Lookahead(iterable)
    while not it.atend:
        yield it.next(), it.peek

Let's try it:

>>> data = iter('abcd123ghi')
>>> print list(pairwise(data))
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', '1'), ('1', '2'), ('2', '3'), ('3', 'g'), ('g', 'h'), ('h', 'i'), ('i', None)]
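
For comparison, here is the itertools recipe in a form that should run on both Python 2 and Python 3 (izip does not exist on Python 3, where the built-in zip is already lazy). The names pairwise_recipe and lazy_zip are illustrative.

```python
from itertools import tee

try:
    from itertools import izip as lazy_zip  # Python 2
except ImportError:
    lazy_zip = zip  # Python 3: the built-in zip is already lazy

def pairwise_recipe(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    # Advance the second copy by one so the two copies are offset
    next(b, None)
    return lazy_zip(a, b)

pairs = list(pairwise_recipe(iter('abcd')))
print(pairs)  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

Because zip stops at the shorter input, the recipe's output has no trailing (x, None) pair.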

Chunking

Chunking is like using the repeatable takewhile, but for a specific use-case.

Suppose you're reading data from a database: the results of a big query over a LEFT OUTER JOIN between several tables. Let's create a simplified (but real-world) example.

We store data that relates to timing of web pages. We store an event for each page, and for each event we store multiple values. Our tables look something like:

Event
ID  Created             Session URL
01  2014-12-17 01:00:00 ab12f43 http://www.mobify.com/
02  2014-12-17 01:00:01 ab12f43 http://www.mobify.com/jobs
...and so on for millions of events...

Value
Event_ID  Name              Value
01        DOMContentLoaded     83
01        Load                122
02        DOMContentLoaded     64
02        Load                345
...and so on for millions of values for millions of events...

At the end of every day, we process the records for that day, by doing a query like:

SELECT *
FROM Event LEFT OUTER JOIN Value ON Event.ID = Value.Event_ID
ORDER BY Event.ID

We'll probably end up with something like a SQLAlchemy ResultProxy or a Django QuerySet - an iterable thing that yields records (and here we're assuming that your database will stream the results back to your Python client so that you can process much more data than you could ever fit into memory). Let's call that iterable thing records.

What we want to do is to process each event. The problem is that if we just iterate over the records:

for record in records:
    print record.ID, record.Created, record.Name, record.Value

...we'll get one record per value - more than one record per event:

01 2014-12-17 01:00:00 DOMContentLoaded 83
01 2014-12-17 01:00:00 Load 122
02 2014-12-17 01:00:01 DOMContentLoaded 64
02 2014-12-17 01:00:01 Load 345

It'd be better if we could handle all the records for one event together, then all the records for the next event, and so on.

We could use repeatable_takewhile to grab all the records belonging to the same event:

it = Lookahead(records)

while not it.atend:
    current_event_id = it.peek.ID
    event_records = list(
        repeatable_takewhile(
            lambda r: r.ID == current_event_id,
            it
        )
    )
    
    # Now we have just the records for the next event
    ...process...

But because this is a common use case, iterstuff has a helper function to make this even easier. The chunked function takes an iterable and a function that extracts a 'key' value from each element, and yields successive iterables, each of which contains consecutive records with the same key value.

from iterstuff import chunked
for records_for_events in chunked(
        records,
        lambda r: r.ID
    ):
    # records_for_events is a sequence of records for
    # one event.
    ...process...

In fact, we can use chunking in the character class problem we showed earlier:

>>> data = (x for x in 'abcd123ghi')
>>> for charset in chunked(data, lambda c: c.isdigit()):
...     print list(charset)
...     
['a', 'b', 'c', 'd']
['1', '2', '3']
['g', 'h', 'i']
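
The per-event grouping can also be tried without a database. The sketch below uses itertools.groupby from the standard library as a stand-in for chunked (an assumption: groupby likewise groups consecutive elements by key, but it is not iterstuff's implementation). The Record type is hypothetical and holds the sample rows from the tables above.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical record type standing in for rows from the joined query
Record = namedtuple('Record', 'ID Created Name Value')

records = iter([
    Record('01', '2014-12-17 01:00:00', 'DOMContentLoaded', 83),
    Record('01', '2014-12-17 01:00:00', 'Load', 122),
    Record('02', '2014-12-17 01:00:01', 'DOMContentLoaded', 64),
    Record('02', '2014-12-17 01:00:01', 'Load', 345),
])

# Collect the timing values for each event, one event at a time
timings = {}
for event_id, event_records in groupby(records, key=lambda r: r.ID):
    timings[event_id] = dict((r.Name, r.Value) for r in event_records)

print(timings['01']['Load'])  # 122
```

With groupby, each group shares the underlying iterator, so a group has to be consumed before moving on to the next one. This only works because the query is ordered by Event.ID.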

Batching

The batch function is a simplification of a common use for itertools.islice.

Suppose your generator yields records that you're reading from a file, or a database. Suppose that there may be many hundreds of thousands of records, or even millions, so you can't fit them all into memory, and you need to do them in batches of 1000.

Here's one way to do this using islice:

from itertools import islice
CHUNK = 1000
while True:
    # Listify the records so that we can check if
    # there were any returned.
    chunk = list(islice(records, CHUNK))
    if not chunk:
        break
    
    # Process the records in this chunk
    for record in chunk:
        process(record)

Or the iterstuff batch function will do this for you in a simpler way:

from iterstuff import batch
CHUNK = 1000
for chunk in batch(records, CHUNK):
    # Chunk is an iterable of up to CHUNK records
    for record in chunk:
        process(record)

Here's an elegant batch solution provided by Hamish Lawson for ActiveState recipes (http://code.activestate.com/recipes/303279-getting-items-in-batches/):

from itertools import islice, chain
def batch(iterable, size):
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        yield chain([batchiter.next()], batchiter)

Note how this uses a call to batchiter.next() to cause StopIteration to be raised when the source iterable is exhausted. Because this consumes an element, itertools.chain needs to be used to 'push' that element back onto the head of the chunk. Using a Lookahead allows us to peek at the next element of the iterable and avoid the push. Here's how iterstuff.batch works:

def batch(iterable, size):
    # Wrap an enumeration of the iterable in a Lookahead so that it
    # yields (count, element) tuples
    it = Lookahead(enumerate(iterable))

    while not it.atend:
        # Set the end_count using the count value
        # of the next element.
        end_count = it.peek[0] + size

        # Yield a generator that will then yield up to
        # 'size' elements from 'it'.
        yield (
            element
            for counter, element in repeatable_takewhile(
                # t[0] is the count part of each element
                lambda t: t[0] < end_count,
                it
            )
        )
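
One caveat for the islice/chain recipe shown earlier: it relies on StopIteration escaping from inside a generator, which Python 3.7 and later turn into a RuntimeError (PEP 479), and batchiter.next() is Python 2 only. Here is a sketch of the same idea that should run on both Python 2 and Python 3. The name batch_chain is illustrative, and this is not how iterstuff.batch is written.

```python
from itertools import islice, chain

def batch_chain(iterable, size):
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        try:
            # Consume one element to find out whether anything is left
            first = next(batchiter)
        except StopIteration:
            # Source exhausted: end the generator explicitly
            return
        # Push the consumed element back onto the head of the batch
        yield chain([first], batchiter)

batches = [list(b) for b in batch_chain(range(7), 3)]
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

As with the original recipe, each batch draws from the shared source iterator, so a batch should be fully consumed before the next one is requested.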

A Conclusion

Python generators are a wonderful, powerful, flexible language feature. The atend and peek properties of the Lookahead class enable a whole set of simple recipes for working with generators.

You can see examples of use in the unit tests for this package, and run them by executing the tests.py file directly.

Thanks

...to the Engineering Gang at Mobify
...to https://github.com/landonjross for Python3 support
