• Stars
    star
    529
  • Rank 83,810 (Top 2 %)
  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created over 11 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Better directory iterator and faster os.walk(), now in the Python 3.5 stdlib

scandir, a better directory iterator and faster os.walk()

scandir on PyPI (Python Package Index) GitHub Actions Tests

scandir() is a directory iteration function like os.listdir(), except that instead of returning a list of bare filenames, it yields DirEntry objects that include file type and stat information along with the name. Using scandir() increases the speed of os.walk() by 2-20 times (depending on the platform and file system) by avoiding unnecessary calls to os.stat() in most cases.

Now included in a Python near you!

scandir has been included in the Python 3.5 standard library as os.scandir(), and the related performance improvements to os.walk() have also been included. So if you're lucky enough to be using Python 3.5 (release date September 13, 2015) you get the benefit immediately, otherwise just download this module from PyPI, install it with pip install scandir, and then do something like this in your code:

# Use the built-in version of scandir/walk if possible, otherwise
# use the scandir module version
try:
    from os import scandir, walk
except ImportError:
    from scandir import scandir, walk

PEP 471, which is the PEP that proposes including scandir in the Python standard library, was accepted in July 2014 by Victor Stinner, the BDFL-delegate for the PEP.

This scandir module is intended to work on Python 2.7+ and Python 3.4+ (and it has been tested on those versions).

Background

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling listdir() on each directory -- it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we're not talking about micro-optimizations. See more benchmarks in the "Benchmarks" section below.

Somewhat relatedly, many people have also asked for a version of os.listdir() that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.

So as well as a faster walk(), scandir adds a new scandir() function. They're pretty easy to use, but see "The API" below for the full docs.

Benchmarks

Below are results showing how many times as fast scandir.walk() is than os.walk() on various systems, found by running benchmark.py with no arguments:

System version Python version Times as fast
Windows 7 64-bit 2.7.7 64-bit 10.4
Windows 7 64-bit SSD 2.7.7 64-bit 10.3
Windows 7 64-bit NFS 2.7.6 64-bit 36.8
Windows 7 64-bit SSD 3.4.1 64-bit 9.9
Windows 7 64-bit SSD 3.5.0 64-bit 9.5
Ubuntu 14.04 64-bit 2.7.6 64-bit 5.8
Mac OS X 10.9.3 2.7.5 64-bit 3.8

All of the above tests were done using the fast C version of scandir (source code in _scandir.c).

Note that the gains are less than the above on smaller directories and greater on larger directories. This is why benchmark.py creates a test directory tree with a standardized size.

The API

walk()

The API for scandir.walk() is exactly the same as os.walk(), so just read the Python docs.

scandir()

The full docs for scandir() and the DirEntry objects it yields are available in the Python documentation here. But below is a brief summary as well.

scandir(path='.') -> iterator of DirEntry objects for given path

Like listdir, scandir calls the operating system's directory iteration system calls to get the names of the files in the given path, but it's different from listdir in two ways:

  • Instead of returning bare filename strings, it returns lightweight DirEntry objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned.
  • It returns a generator instead of a list, so that scandir acts as a true iterator instead of returning the full list immediately.

scandir() yields a DirEntry object for each file and sub-directory in path. Just like listdir, the '.' and '..' pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each DirEntry object has the following attributes and methods:

  • name: the entry's filename, relative to the scandir path argument (corresponds to the return values of os.listdir)
  • path: the entry's full path name (not necessarily an absolute path) -- the equivalent of os.path.join(scandir_path, entry.name)
  • is_dir(*, follow_symlinks=True): similar to pathlib.Path.is_dir(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; don't follow symbolic links if follow_symlinks is False
  • is_file(*, follow_symlinks=True): similar to pathlib.Path.is_file(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; don't follow symbolic links if follow_symlinks is False
  • is_symlink(): similar to pathlib.Path.is_symlink(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases
  • stat(*, follow_symlinks=True): like os.stat(), but the return value is cached on the DirEntry object; does not require a system call on Windows (except for symlinks); don't follow symbolic links (like os.lstat()) if follow_symlinks is False
  • inode(): return the inode number of the entry; the return value is cached on the DirEntry object

Here's a very simple example of scandir() showing use of the DirEntry.name attribute and the DirEntry.is_dir() method:

def subdirs(path):
    """Yield directory names not starting with '.' under given path."""
    for entry in os.scandir(path):
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name

This subdirs() function will be significantly faster with scandir than os.listdir() and os.path.isdir() on both Windows and POSIX systems, especially on medium-sized or large directories.

Further reading

  • The Python docs for scandir
  • PEP 471, the (now-accepted) Python Enhancement Proposal that proposed adding scandir to the standard library -- a lot of details here, including rejected ideas and previous discussion

Flames, comments, bug reports

Please send flames, comments, and questions about scandir to Ben Hoyt:

http://benhoyt.com/

File bug reports for the version in the Python 3.5 standard library here, or file bug reports or feature requests for this module at the GitHub project page:

https://github.com/benhoyt/scandir

More Repositories

1

inih

Simple .INI file parser in C, good for embedded systems
C
2,402
star
2

goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support
Go
1,920
star
3

pygit

Just enough git (written in Python) to create a repo and push to GitHub
Python
314
star
4

countwords

Playing with counting word frequencies (and performance) in various languages.
Rust
306
star
5

protothreads-cpp

Protothread.h, a C++ port of Adam Dunkels' protothreads library
C++
178
star
6

pybktree

Python BK-tree data structure to allow fast querying of "close" matches
Python
169
star
7

ht

Simple hash table implemented in C
C
148
star
8

pyast64

Compile a subset of the Python AST to x64-64 assembler
Python
136
star
9

go-routing

Different approaches to HTTP routing in Go
Go
120
star
10

loxlox

Lox interpreter written in Lox
Python
112
star
11

mugo

Mugo, a toy compiler for a subset of Go that can compile itself
HTML
108
star
12

littlelang

A little language interpreter written in Go
Go
92
star
13

third

Third, a small Forth compiler for 8086 DOS
Forth
75
star
14

go-1brc

My Go solutions to the One Billion Row Challenge
Go
74
star
15

prig

Prig is for Processing Records In Go. Like AWK, but snobbish.
Go
64
star
16

gogit

Just enough of a git client (in Go) to init a repo, commit, and push to GitHub
Go
51
star
17

cdnupload

Upload your site's static files to a directory or CDN, using content-based hashing
Python
50
star
18

web-service-stdlib

Rewrite of Go RESTful API tutorial using only the stdlib
Go
49
star
19

simplelists

Tiny to-do list web app written in Go
Go
48
star
20

pas2go

Pascal to Go converter (converts a subset of Turbo Pascal 5.5)
Pascal
42
star
21

symplate

Symplate, a simple and fast Python template language (NOTE: no longer maintained; use Jinja2 or Mako instead)
Python
30
star
22

nibbleforth

A very compact stack machine (Forth) bytecode
Python
29
star
23

zztgo

Port of ZZT to Go (using a Pascal-to-Go converter)
Go
26
star
24

gosnip

Run small snippets of Go code from the command line
Go
24
star
25

python-pentomino

Pentomino puzzle solver using Python code generation
Python
20
star
26

benhoyt.github.com

Source code for my website
HTML
18
star
27

betterwalk

BetterWalk, a better and faster os.walk() for Python -- DEPRECATED, see my "scandir" project
Python
17
star
28

fe

Bruce Hoyt's Forth Editor (Dad's editor that I grew up coding with)
Forth
16
star
29

namedmutex

namedmutex.py, a simple ctypes wrapper for Win32 named mutexes
Python
15
star
30

soft404

Soft 404 (dead page) detector in Python
Python
13
star
31

io-performance

Code repo for https://benhoyt.com/writings/io-is-no-longer-the-bottleneck/
Go
13
star
32

awkmake

Code to go with my article "The AWK book's 60-line version of Make"
Awk
13
star
33

repike

Rob Pike's simple regex matcher converted to Go
Go
11
star
34

benos

A tiny 32-bit Forth operating system I wrote when I was 16
Forth
8
star
35

counter

Fast hash table for counting short strings in Go
Go
7
star
36

false-forth

A False compiler and interpreter written in ANS Forth
Forth
7
star
37

py-1brc

Optimising the One Billion Row Challenge (1BRC) in Python
Python
5
star
38

mro

MRO is not an ORM - Map Rows to Objects with web.py
Python
2
star
39

snappass-test

Demo of Juju K8s sidecar charm with Pebble
Python
2
star
40

circle

Draw circles using the Bresenham Circle Algorithm in Go
Go
2
star
41

interpspeed

Test interpreter speed of various language VMs
Python
1
star
42

boggle

Boggle solver competition
Python
1
star