D3DShot
D3DShot is a pure Python implementation of the Windows Desktop Duplication API. It leverages DXGI and Direct3D system libraries to enable extremely fast and robust screen capture functionality for your Python scripts and applications on Windows.
D3DShot:
- Is by far the fastest way to capture the screen with Python on Windows 8.1+
- Is very easy to use. If you can remember 10-ish methods, you know the entire thing.
- Covers all common scenarios and use cases:
  - Screenshot to memory
  - Screenshot to disk
  - Screenshot to memory buffer every X seconds (threaded; non-blocking)
  - Screenshot to disk every X seconds (threaded; non-blocking)
  - High-speed capture to memory buffer (threaded; non-blocking)
- Captures to PIL Images out of the box. Gracefully adds output options if NumPy or PyTorch can be found.
- Detects displays in just about any configuration: Single monitor, multiple monitors on one adapter, multiple monitors on multiple adapters.
- Handles display rotation and scaling for you
- Supports capturing specific regions of the screen
- Is robust and very stable. You can run it for hours / days without performance degradation
- Is even able to capture DirectX 11 / 12 exclusive fullscreen applications and games!
TL;DR Quick Code Samples
Screenshot to Memory
import d3dshot
d = d3dshot.create()
d.screenshot()
Out[1]: <PIL.Image.Image image mode=RGB size=2560x1440 at 0x1AA7ECB5C88>
Screenshot to Disk
import d3dshot
d = d3dshot.create()
d.screenshot_to_disk()
Out[1]: './1554298682.5632973.png'
Screen Capture for 5 Seconds and Grab the Latest Frame
import d3dshot
import time
d = d3dshot.create()
d.capture()
time.sleep(5) # Capture is non-blocking so we wait explicitly
d.stop()
d.get_latest_frame()
Out[1]: <PIL.Image.Image image mode=RGB size=2560x1440 at 0x1AA044BCF60>
Screen Capture the Second Monitor as NumPy Arrays for 3 Seconds and Grab the 4 Latest Frames as a Stack
import d3dshot
import time
d = d3dshot.create(capture_output="numpy")
d.display = d.displays[1]
d.capture()
time.sleep(3) # Capture is non-blocking so we wait explicitly
d.stop()
frame_stack = d.get_frame_stack((0, 1, 2, 3), stack_dimension="last")
frame_stack.shape
Out[1]: (1080, 1920, 3, 4)
This is barely scratching the surface... Keep reading!
Requirements
- Windows 8.1+ (64-bit)
- Python 3.6+ (64-bit)
Installation
pip install d3dshot
D3DShot leverages DLLs that are already available on your system so the dependencies are very light. Namely:
- comtypes: Internal use. To preserve developer sanity while working with COM interfaces.
- Pillow: Default Capture Output. Also used to save to disk as PNG and JPG.
These dependencies will automatically be installed alongside D3DShot, so there is no need to worry about them!
Extra Step: Laptop Users
Windows has a quirk when using Desktop Duplication on hybrid-GPU systems. Please see the wiki article before attempting to use D3DShot on your system.
Concepts
Capture Outputs
The desired Capture Output is defined when creating a D3DShot instance. It defines the type of all captured images. By default, all captures will return PIL.Image objects. This is a good option if you mostly intend to take screenshots.
# Captures will be PIL.Image in RGB mode
d = d3dshot.create()
d = d3dshot.create(capture_output="pil")
D3DShot is however quite flexible! As your environment meets certain optional sets of requirements, more options become available.
If NumPy is available
# Captures will be np.ndarray of dtype uint8 with values in range (0, 255)
d = d3dshot.create(capture_output="numpy")
# Captures will be np.ndarray of dtype float64 with normalized values in range (0.0, 1.0)
d = d3dshot.create(capture_output="numpy_float")
If NumPy and PyTorch are available
# Captures will be torch.Tensor of dtype uint8 with values in range (0, 255)
d = d3dshot.create(capture_output="pytorch")
# Captures will be torch.Tensor of dtype float64 with normalized values in range (0.0, 1.0)
d = d3dshot.create(capture_output="pytorch_float")
If NumPy and PyTorch are available + CUDA is installed and torch.cuda.is_available()
# Captures will be torch.Tensor of dtype uint8 with values in range (0, 255) on device cuda:0
d = d3dshot.create(capture_output="pytorch_gpu")
# Captures will be torch.Tensor of dtype float64 with normalized values in range (0.0, 1.0) on device cuda:0
d = d3dshot.create(capture_output="pytorch_float_gpu")
Trying to use a Capture Output for which your environment does not meet the requirements will result in an error.
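The availability rules above can be approximated with a small helper. This is a minimal sketch, not D3DShot's actual detection logic: the function name `available_capture_outputs` is hypothetical, and it only checks whether the optional packages can be found and whether CUDA is reported as available.

```python
import importlib.util


def available_capture_outputs():
    """Return the capture output names this environment could support.

    Hypothetical helper for illustration; D3DShot performs its own checks.
    """
    outputs = ["pil"]  # Pillow is installed with D3DShot, so "pil" is always listed

    has_numpy = importlib.util.find_spec("numpy") is not None
    has_torch = importlib.util.find_spec("torch") is not None

    if has_numpy:
        outputs += ["numpy", "numpy_float"]

    if has_numpy and has_torch:
        outputs += ["pytorch", "pytorch_float"]

        import torch  # imported lazily; only reached when PyTorch is installed

        if torch.cuda.is_available():
            outputs += ["pytorch_gpu", "pytorch_float_gpu"]

    return outputs


print(available_capture_outputs())
```

Checking the list before calling d3dshot.create() lets a script fall back to "pil" instead of hitting the error described above.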
Singleton
Windows only allows 1 instance of Desktop Duplication per process. To stay within that limitation, the D3DShot class acts as a singleton. Any subsequent calls to d3dshot.create() will always return the existing instance.
d = d3dshot.create(capture_output="numpy")
# Attempting to create a second instance
d2 = d3dshot.create(capture_output="pil")
# Only 1 instance of D3DShot is allowed per process! Returning the existing instance...
# Capture output remains 'numpy'
d2.capture_output.backend
# Out[1]: <d3dshot.capture_outputs.numpy_capture_output.NumpyCaptureOutput at 0x2672be3b8e0>
d == d2
# Out[2]: True
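The behavior shown above can be reproduced with a plain Python factory function. This is only a sketch of the pattern: the `Capturer` class and this `create` function are hypothetical stand-ins, and D3DShot's own implementation may differ.

```python
class Capturer:
    """Stand-in class; hypothetical, not the real D3DShot class."""

    _instance = None  # the single shared instance for this process

    def __init__(self, capture_output="pil"):
        self.capture_output = capture_output


def create(capture_output="pil"):
    """Return the existing Capturer if there is one, otherwise build it."""
    if Capturer._instance is None:
        Capturer._instance = Capturer(capture_output=capture_output)
    else:
        print("Only 1 instance is allowed per process! Returning the existing instance...")
    return Capturer._instance


first = create(capture_output="numpy")
second = create(capture_output="pil")  # arguments are ignored on later calls

print(first is second)        # True
print(second.capture_output)  # numpy
```

The practical consequence is the same as in the session above: decide on the capture output once, at the first create() call.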
Frame Buffer
When you create a D3DShot instance, a frame buffer is also initialized. It is meant as a thread-safe, first-in, first-out way to hold a certain quantity of captures and is implemented as a collections.deque.
By default, the size of the frame buffer is set to 60. You can customize it when creating your D3DShot object.
d = d3dshot.create(frame_buffer_size=100)
Be mindful of RAM usage with larger values: you will be dealing with uncompressed images, which can use up to 100 MB each depending on the resolution.
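A rough estimate of the memory involved is simple arithmetic. The sketch below assumes 3 channels per pixel and 1 byte (uint8) or 8 bytes (float64) per value, as described under Capture Outputs; the helper names are hypothetical and any per-object overhead is ignored.

```python
def frame_bytes(width, height, channels=3, bytes_per_value=1):
    """Size in bytes of one uncompressed frame."""
    return width * height * channels * bytes_per_value


def buffer_bytes(frame_buffer_size, width, height, channels=3, bytes_per_value=1):
    """Worst-case size in bytes of a completely full frame buffer."""
    return frame_buffer_size * frame_bytes(width, height, channels, bytes_per_value)


# One 2560x1440 RGB frame as uint8 (1 byte per value)
print(frame_bytes(2560, 1440))                     # 11059200 (about 11 MB)
# The same frame as float64 (8 bytes per value)
print(frame_bytes(2560, 1440, bytes_per_value=8))  # 88473600 (about 88 MB)
# A full default buffer of 60 uint8 frames
print(buffer_bytes(60, 2560, 1440))                # 663552000 (about 664 MB)
```

Float capture outputs are the ones that approach the per-frame figure quoted above, so a smaller frame_buffer_size is worth considering with them.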
The frame buffer can be accessed directly with d.frame_buffer, but using the utility methods instead is recommended.
The buffer is used by the following methods:
d.capture()
d.screenshot_every()
It is always automatically cleared before starting one of these operations.
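To see how a bounded collections.deque behaves as a frame buffer, here is a small standalone sketch. Strings stand in for frames, and pushing new frames to the left (so that index 0 is the latest) is an assumption based on the quick code samples, where indices 0 to 3 are the 4 latest frames.

```python
import collections

frame_buffer_size = 3
frame_buffer = collections.deque(maxlen=frame_buffer_size)

# Strings stand in for captured frames
for frame in ["frame-1", "frame-2", "frame-3", "frame-4"]:
    frame_buffer.appendleft(frame)  # newest frame goes to index 0

# The oldest frame was dropped once the buffer was full
print(list(frame_buffer))  # ['frame-4', 'frame-3', 'frame-2']
print(frame_buffer[0])     # frame-4 (the latest frame)

frame_buffer.clear()       # comparable to the automatic clearing described above
print(len(frame_buffer))   # 0
```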
Displays
When you create a D3DShot instance, your available displays will automatically be detected along with all their relevant properties.
d.displays
Out[1]:
[<Display name=BenQ XL2730Z (DisplayPort) adapter=NVIDIA GeForce GTX 1080 Ti resolution=2560x1440 rotation=0 scale_factor=1.0 primary=True>,
<Display name=BenQ XL2430T (HDMI) adapter=Intel(R) UHD Graphics 630 resolution=1920x1080 rotation=0 scale_factor=1.0 primary=False>]
By default, your primary display will be selected. At all times you can verify which display is set to be used for capture.
d.display
Out[1]: <Display name=BenQ XL2730Z (DisplayPort) adapter=NVIDIA GeForce GTX 1080 Ti resolution=2560x1440 rotation=0 scale_factor=1.0 primary=True>
Selecting another display for capture is as simple as setting d.display to another value from d.displays:
d.display = d.displays[1]
d.display
Out[1]: <Display name=BenQ XL2430T (HDMI) adapter=Intel(R) UHD Graphics 630 resolution=1080x1920 rotation=90 scale_factor=1.0 primary=False>
Display rotation and scaling are detected and handled for you by D3DShot:
- Captures on rotated displays will always be in the correct orientation (i.e. matching what you see on your physical displays)
- Captures on scaled displays will always be in full, non-scaled resolution (e.g. 1280x720 at 200% scaling will yield 2560x1440 captures)
Regions
All capture methods (screenshots included) accept an optional region kwarg. The expected value is a 4-length tuple of integers that is to be structured like this:
(left, top, right, bottom) # values represent pixels
For example, if you want to only capture a 200px by 200px region offset by 100px from both the left and top, you would do:
d.screenshot(region=(100, 100, 300, 300))
If you are capturing a scaled display, the region will be computed against the full, non-scaled resolution.
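Since regions are expressed as edges rather than as an offset plus a size, a small helper can make them easier to build. The sketch below is hypothetical: `make_region` and `scale_region` are not part of D3DShot, and the bounds check is an assumption about what a sensible region looks like, not a description of D3DShot's own validation.

```python
def make_region(left, top, width, height, display_resolution):
    """Build a (left, top, right, bottom) tuple and check it fits the display.

    Hypothetical helper; `display_resolution` is (width, height) in pixels
    of the full, non-scaled display.
    """
    right, bottom = left + width, top + height
    display_width, display_height = display_resolution

    if left < 0 or top < 0 or width <= 0 or height <= 0:
        raise ValueError("region must have a positive size and non-negative offsets")
    if right > display_width or bottom > display_height:
        raise ValueError("region extends past the display")

    return (left, top, right, bottom)


def scale_region(region, scale_factor):
    """Convert a region measured in scaled (logical) pixels to full-resolution pixels."""
    return tuple(round(value * scale_factor) for value in region)


region = make_region(100, 100, 200, 200, display_resolution=(2560, 1440))
print(region)                     # (100, 100, 300, 300)
print(scale_region(region, 2.0))  # (200, 200, 600, 600)
```

The resulting tuple can be passed as-is, for example d.screenshot(region=region).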
If you go through the source code, you will notice that the region cropping happens after a full display capture. That might seem sub-optimal but testing has revealed that copying a region of the GPU D3D11Texture2D to the destination CPU D3D11Texture2D using CopySubresourceRegion is only faster when the region is very small. In fact, it doesn't take long for larger regions to actually start becoming slower than the full display capture using this method. To make things worse, it adds a lot of complexity by having the surface pitch not match the buffer size and treating rotated displays differently. It was therefore decided that it made more sense to stick to CopyResource in all cases and crop after the fact.
Usage
Create a D3DShot instance
import d3dshot
d = d3dshot.create()
create accepts 2 optional kwargs:
- capture_output: Which capture output to use. See the Capture Outputs section under Concepts
- frame_buffer_size: The maximum size the frame buffer can grow to. See the Frame Buffer section under Concepts
Do NOT import the D3DShot class directly and attempt to initialize it yourself! The create helper function initializes and validates a bunch of things for you behind the scenes.
Once you have a D3DShot instance in scope, you can start doing stuff with it!
List the detected displays
d.displays
Select a display for capture
Your primary display is selected by default but if you have a multi-monitor setup, you can select another entry in d.displays
d.display = d.displays[1]
Take a screenshot
d.screenshot()
screenshot accepts 1 optional kwarg:
- region: A region tuple. See the Regions section under Concepts
Returns: A screenshot with a format that matches the capture output you selected when creating your D3DShot object
Take a screenshot and save it to disk
d.screenshot_to_disk()
screenshot_to_disk accepts 3 optional kwargs:
- directory: The path / directory where to write the file. If omitted, the working directory of the program will be used
- file_name: The file name to use. Permitted extensions are: .png, .jpg. If omitted, the file name will be <time.time()>.png
- region: A region tuple. See the Regions section under Concepts
Returns: A string representing the full path to the saved image file
Take a screenshot every X seconds
d.screenshot_every(X) # Where X is a number representing seconds
This operation is threaded and non-blocking. It will keep running until d.stop() is called. Captures are pushed to the frame buffer.
screenshot_every accepts 1 optional kwarg:
- region: A region tuple. See the Regions section under Concepts
Returns: A boolean indicating whether or not the capture thread was started
Take a screenshot every X seconds and save it to disk
d.screenshot_to_disk_every(X) # Where X is a number representing seconds
This operation is threaded and non-blocking. It will keep running until d.stop() is called.
screenshot_to_disk_every accepts 2 optional kwargs:
- directory: The path / directory where to write the file. If omitted, the working directory of the program will be used
- region: A region tuple. See the Regions section under Concepts
Returns: A boolean indicating whether or not the capture thread was started
Start a high-speed screen capture
d.capture()
This operation is threaded and non-blocking. It will keep running until d.stop() is called. Captures are pushed to the frame buffer.
capture accepts 2 optional kwargs:
- target_fps: How many captures per second to aim for. The effective capture rate will go under if the system can't keep up but it will never go over this target. It is recommended to set this to a reasonable value for your use case in order not to waste system resources. Default is set to 60.
- region: A region tuple. See the Regions section under Concepts
Returns: A boolean indicating whether or not the capture thread was started
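The "never over, possibly under" behavior of target_fps is typical of a throttled loop running in a background thread. The sketch below illustrates that general technique with the standard library only; `capture_loop` and its arguments are hypothetical, and timestamps stand in for frames. It is not D3DShot's actual capture loop.

```python
import threading
import time


def capture_loop(grab_frame, frames, target_fps, stop_event):
    """Call grab_frame() at most target_fps times per second until stopped."""
    frame_time = 1.0 / target_fps

    while not stop_event.is_set():
        started = time.perf_counter()
        frames.append(grab_frame())

        # Sleep off whatever is left of this frame's time slot. If grabbing
        # took longer than the slot, skip the sleep: the rate drops below
        # the target but never exceeds it.
        remaining = frame_time - (time.perf_counter() - started)
        if remaining > 0:
            time.sleep(remaining)


frames = []
stop_event = threading.Event()
thread = threading.Thread(
    target=capture_loop,
    args=(time.perf_counter, frames, 20, stop_event),  # timestamps stand in for frames
    daemon=True,
)

thread.start()    # non-blocking: the main thread keeps running
time.sleep(0.5)   # do other work here
stop_event.set()  # plays the role of d.stop()
thread.join()

print(len(frames))  # close to 20 FPS * 0.5 s = 10
```

A lower target leaves more time asleep in each slot, which is why a modest target_fps saves system resources.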
Grab the latest frame from the buffer
d.get_latest_frame()
Returns: A frame with a format that matches the capture output you selected when creating your D3DShot object
Grab a specific frame from the buffer
d.get_frame(X) # Where X is the index of the desired frame. Needs to be < len(d.frame_buffer)
Returns: A frame with a format that matches the capture output you selected when creating your D3DShot object
Grab specific frames from the buffer
d.get_frames([X, Y, Z, ...]) # Where X, Y, Z are valid indices to desired frames
Returns: A list of frames with a format that matches the capture output you selected when creating your D3DShot object
Grab specific frames from the buffer as a stack
d.get_frame_stack([X, Y, Z, ...], stack_dimension="first|last") # Where X, Y, Z are valid indices to desired frames
Stacking only has an effect on NumPy and PyTorch capture outputs.
get_frame_stack accepts 1 optional kwarg:
- stack_dimension: One of first, last. Which axis / dimension to perform the stack on
Returns: A single array stacked on the specified dimension with a format that matches the capture output you selected when creating your D3DShot object. If the capture output is not stackable, returns a list of frames.
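The effect of the stack dimension on the resulting shape can be previewed with plain NumPy. This sketch uses np.stack on blank 1080p frames; the "last" shape matches the quick code sample earlier in this document, while the "first" shape is an assumption based on symmetry.

```python
import numpy as np

# Four blank 1080p RGB frames standing in for the frame buffer contents
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(4)]

stacked_first = np.stack(frames, axis=0)   # new axis in front
stacked_last = np.stack(frames, axis=-1)   # new axis at the end

print(stacked_first.shape)  # (4, 1080, 1920, 3)
print(stacked_last.shape)   # (1080, 1920, 3, 4)
```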
Dump the frame buffer to disk
The files will be named according to this convention: <frame buffer index>.png
d.frame_buffer_to_disk()
frame_buffer_to_disk accepts 1 optional kwarg:
- directory: The path / directory where to write the files. If omitted, the working directory of the program will be used
Returns: None
Performance
Measuring the exact performance of the Windows Desktop Duplication API proves to be a little complicated because it only returns new texture data if the contents of the screen have changed. This is optimal for performance but makes it difficult to express results in frames per second, the measurement people tend to expect from benchmarks. Ultimately, the solution was to run a high-FPS video game on the captured display so that the screen contents differ at all times while benchmarking.
As always, remember that benchmarks are inherently flawed and highly depend on your individual hardware configuration and other circumstances. Use the numbers below as a relative indication of what to expect from D3DShot, not as some sort of absolute truth.
Capture Output | 2560x1440 on NVIDIA GTX 1080 Ti | 1920x1080 on Intel UHD Graphics 630 | 1080x1920 (vertical) on Intel UHD Graphics 630 |
---|---|---|---|
"pil" | 29.717 FPS | 47.75 FPS | 35.95 FPS |
"numpy" | 57.667 FPS | 58.1 FPS | 58.033 FPS |
"numpy_float" | 18.783 FPS | 29.05 FPS | 27.517 FPS |
"pytorch" | 57.867 FPS | 58.1 FPS | 34.817 FPS |
"pytorch_float" | 18.767 FPS | 28.367 FPS | 27.017 FPS |
"pytorch_gpu" | 27.333 FPS | 35.767 FPS | 34.8 FPS |
"pytorch_float_gpu" | 27.267 FPS | 37.383 FPS | 35.033 FPS |
The absolute fastest capture outputs appear to be "numpy" and unrotated "pytorch"; all averaging around 58 FPS. In Python land, this is FAST!
How is the "numpy" capture output performance that good?
NumPy arrays have a ctypes interface that can give you their raw memory address (X.ctypes.data). If you have the memory address and size of another byte buffer, which is what we end up with by processing what returns from the Desktop Duplication API, you can use ctypes.memmove to copy that byte buffer directly to the NumPy structure, effectively bypassing as much Python as possible.
In practice it ends up looking like this:
ctypes.memmove(np.empty((size,), dtype=np.uint8).ctypes.data, pointer, size)
This low-level operation is extremely fast, leaving everything else that would normally compete with NumPy in the dust.
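The same copy can be tried outside of D3DShot. In this sketch a small ctypes buffer stands in for the texture memory exposed by the Desktop Duplication API, and the 4 bytes per pixel are an assumption made for illustration. Note that the destination array is created first and kept in a variable so it can be used after the copy.

```python
import ctypes

import numpy as np

# A plain ctypes buffer stands in for the mapped Direct3D texture memory
height, width, channels = 2, 3, 4
size = height * width * channels
source = (ctypes.c_ubyte * size)(*range(size))
pointer = ctypes.addressof(source)

# Allocate the destination first and keep a reference to it
frame = np.empty((size,), dtype=np.uint8)
ctypes.memmove(frame.ctypes.data, pointer, size)  # raw byte copy, no Python loop

frame = frame.reshape((height, width, channels))
print(frame.shape)  # (2, 3, 4)
print(frame[0, 0])  # [0 1 2 3]
```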
Why is the "pytorch" capture output slower on rotated displays?
Don't tell anyone but the reason it can compete with NumPy in the first place is only because... it is generated from a NumPy array built from the method above! If you sniff around the code, you will indeed find torch.from_numpy() scattered around. This pretty much matches the speed of the "numpy" capture output 1:1, except when dealing with a rotated display. Display rotation is handled by np.rot90() calls, which yield negative strides on that array. Negative strides are understood and perform well under NumPy but are still unsupported in PyTorch at the time of writing. To address this, an additional copy operation is needed to bring it back to a contiguous array, which imposes a performance penalty.
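The negative strides are easy to observe with NumPy alone. The sketch below rotates a blank 1080p frame and then makes it contiguous again; np.ascontiguousarray is used here as one way to perform the extra copy and is not necessarily the call D3DShot uses.

```python
import numpy as np

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(frame.strides)       # (5760, 3, 1)

rotated = np.rot90(frame)  # a view: no pixel data is copied
print(rotated.shape)       # (1920, 1080, 3)
print(rotated.strides)     # (-3, 5760, 1)

contiguous = np.ascontiguousarray(rotated)  # the extra copy
print(contiguous.strides)  # (3240, 3, 1)
```

The rotation itself is nearly free because it only changes how the same memory is walked; the cost appears when that view has to be materialized into a new array.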
Why is the "pil" capture output, being the default, not the fastest?
PIL has no ctypes interface like NumPy so a bytearray needs to be read into Python first and then fed to PIL.Image.frombytes(). This is still fast in Python terms, but it just cannot match the speed of the low-level NumPy method.
It remains the default capture output because:
- PIL Image objects tend to be familiar to Python users
- It's a way lighter / simpler dependency for a library compared to NumPy or PyTorch
Why are the float versions of capture outputs slower?
The data of the Direct3D textures made accessible by the Desktop Duplication API is formatted as bytes. To represent this data as normalized floats instead, a type cast and an element-wise division need to be performed on the array holding those bytes. This imposes a major performance penalty. Interestingly, you can see this performance penalty mitigated on GPU PyTorch tensors since the element-wise division can be massively parallelized on the device.
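In NumPy terms, the conversion described above amounts to something like the following. This is a minimal sketch on a single pixel, assuming a division by 255 to reach the (0.0, 1.0) range; the exact expression used by D3DShot may differ.

```python
import numpy as np

frame = np.array([[[0, 128, 255]]], dtype=np.uint8)  # one RGB pixel

# Type cast to float64, then element-wise division
frame_float = frame.astype(np.float64) / 255.0

print(frame_float.dtype)                     # float64
print(frame_float.min(), frame_float.max())  # 0.0 1.0
```

Both steps touch every value of the frame and produce an array 8 times larger than the uint8 original, which is where the extra time goes.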