  • Stars: 602
  • Rank: 71,573 (Top 2 %)
  • Language: Assembly
  • License: BSD Zero Clause L...
  • Created almost 3 years ago
  • Updated about 1 year ago

Repository Details

Contains the source code examples described in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual"

Intel® 64 and IA-32 Architectures Optimization Reference Manual Code Samples

This repository contains buildable versions of the example source files in the Intel Optimization Manual available here (https://software.intel.com/en-us/articles/intel-sdm). Assembly source code is provided for GCC, Clang and MSVC, using the Intel syntax. Unit tests are also provided for each of the samples.

Building on Linux and macOS

To run the unit tests

  1. cd to the root folder of this project
  2. mkdir build
  3. cd build
  4. cmake ..
  5. make && make test

GCC 8.1 (or clang 12 on macOS) or higher is required to build the unit tests. However, many of the newer examples, e.g., those that use AMX or AVX-512 FP16 instructions, require newer compilers to build: GCC 12 or clang 14. No errors will be reported when building, but examples built with toolchains that do not support the instructions they test will simply report an error and exit when run.

The unit tests are compiled with -march=haswell and so a fourth-generation Intel® Core™ (Haswell) CPU or later is required to run them. Tests that execute instructions not present on fourth-generation Intel® Core™ (Haswell) will be skipped if the CPU on which they are run does not support those instructions.

The code samples can also be compiled with clang:

  1. cd to the root folder of this project
  2. mkdir clang-build
  3. cd clang-build
  4. CC=clang CXX=clang++ cmake ..
  5. make && make test

Building on Windows

To run the tests on a Windows machine (dependency: Visual Studio 2022)

  1. Go to the optimization repo on your local machine.
  2. mkdir bld
  3. cd bld
  4. Inside an x64 Native Tools Command Prompt, run cmake -G "Visual Studio 17 2022" .. to generate the Visual Studio solution files, then open optimization.sln in Visual Studio.
  5. To build, build the "ALL_BUILD" project.
  6. To run the tests, build the "RUN_TESTS" project.

Building the Benchmarks

Benchmark code is supplied for some of the code samples. These benchmarks are built using Google's Benchmark project. If Benchmark is installed and discoverable by CMake, the benchmarks for the code samples will be automatically built when you type make.

On Windows, build the benchmark code with the same build type (Release/Debug) as Google's Benchmark to prevent debug-level mismatch errors while linking.

CPU Requirements

The code samples assume that they are being run on a fourth-generation Intel® Core™ (Haswell) processor or later. They do not perform runtime checks for the instructions they use that are present in fourth-generation Intel® Core™ (Haswell), for example FMA or AVX2. Some of the code samples may therefore crash if they are run on a device that does not support these instructions.

The code samples do, however, check for post fourth-generation Intel® Core™ (Haswell) instruction sets such as AVX-512 and VNNI before running. Tests will be skipped if they detect that the post fourth-generation Intel® Core™ (Haswell) instructions they need are not present. Some of the newest examples use new instructions only found in seventh-generation Intel® Core™ (SkylakeX) or later processors. If you have an older CPU in your PC, you may find that everything builds on your system but that some of the tests are skipped, or crash (if you don't have AVX2), when run. In this case, to fully run the tests, you need to run them under the Intel® Software Development Emulator (SDE).

https://software.intel.com/en-us/articles/intel-software-development-emulator
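
The skip behaviour described above can be sketched in C as follows. This is a minimal illustration, not the repository's actual test harness: cpu_has_avx512f, run_or_skip, demo_sample and sample_runs are hypothetical names, and the check relies on the __builtin_cpu_supports builtin that GCC and clang provide for x86 targets.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the repository): reports whether the CPU
 * the program is running on supports AVX-512 Foundation.  Uses a GCC/clang
 * builtin on x86 and conservatively returns false on other targets.
 */
bool cpu_has_avx512f(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx512f") != 0;
#else
	return false;
#endif
}

/* Placeholder for a code sample; it only counts how often it was run. */
int sample_runs;

void demo_sample(void)
{
	sample_runs++;
}

/*
 * Hypothetical driver: runs the sample only when the required feature is
 * present and reports a skip otherwise.  Returns true if the sample ran.
 */
bool run_or_skip(bool supported, void (*sample)(void))
{
	if (!supported) {
		puts("SKIPPED: required instructions not supported");
		return false;
	}

	sample();
	return true;
}
```

A caller would write run_or_skip(cpu_has_avx512f(), demo_sample). Note that this kind of check only helps for post-Haswell features: code compiled with -march=haswell may already contain AVX2 or FMA instructions before any check is reached.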

Code Sample Constraints

Many of the code samples in the Optimization Manual are code snippets. They contain the minimum amount of code needed to illustrate a particular concept that is discussed in the manual. The code samples typically make assumptions about the data they process. These assumptions are often not documented in the manual. They are, however, documented in this repository. Each code sample is implemented as a function, and each of these functions is accompanied by a wrapper function that documents and enforces the assumptions of the code sample. For example, two functions are defined for Chapter 18, Example 22:

void lookup128_novbmi(const uint8_t *in, uint8_t *dict, uint8_t *out,
		      size_t len);
bool lookup128_novbmi_check(const uint8_t *in, uint8_t *dict, uint8_t *out,
			    size_t len);

lookup128_novbmi corresponds to the code in the Optimization Manual and lookup128_novbmi_check is a wrapper function that checks the validity of its parameters and then calls lookup128_novbmi. The code for lookup128_novbmi_check is as follows.

bool lookup128_novbmi_check(const uint8_t *in, uint8_t *dict, uint8_t *out,
			    size_t len)
{
	/*
	 * in, dict and out must be non-NULL.  dict must contain at least 128
	 * bytes.
	 */

	if (!in || !dict || !out)
		return false;

	/*
	 * len must be > 0 and a multiple of 32.
	 */

	if (len == 0 || len % 32 != 0)
		return false;

	lookup128_novbmi(in, dict, out, len);

	return true;
}

Note how the input constraints are documented and, where possible, enforced.
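
A caller might use the wrapper along the following lines. This is a sketch under stated assumptions: the real lookup128_novbmi is written in assembly, so a scalar placeholder is substituted here purely so the example is self-contained, and its table-lookup behaviour is assumed rather than taken from the manual. lookup_demo is a hypothetical caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * PLACEHOLDER, not the manual's code: a scalar stand-in for the assembly
 * routine.  ASSUMPTION: each output byte is the dict entry selected by the
 * low 7 bits of the corresponding input byte.
 */
void lookup128_novbmi(const uint8_t *in, uint8_t *dict, uint8_t *out,
		      size_t len)
{
	for (size_t i = 0; i < len; i++)
		out[i] = dict[in[i] & 0x7f];
}

/* Wrapper as shown above: validates the arguments, then calls the sample. */
bool lookup128_novbmi_check(const uint8_t *in, uint8_t *dict, uint8_t *out,
			    size_t len)
{
	if (!in || !dict || !out)
		return false;

	if (len == 0 || len % 32 != 0)
		return false;

	lookup128_novbmi(in, dict, out, len);

	return true;
}

/*
 * Hypothetical caller: fills a 128-byte dictionary and a 64-byte input,
 * then translates the first len bytes into out.  Returns 0 on success and
 * -1 if len is too large or the wrapper rejected the arguments.
 */
int lookup_demo(size_t len, uint8_t *out)
{
	uint8_t dict[128];
	uint8_t in[64];

	if (len > sizeof(in))
		return -1;

	for (size_t i = 0; i < sizeof(dict); i++)
		dict[i] = (uint8_t)(127 - i);
	for (size_t i = 0; i < sizeof(in); i++)
		in[i] = (uint8_t)i;

	return lookup128_novbmi_check(in, dict, out, len) ? 0 : -1;
}
```

Calling the _check variant rather than the raw sample means a bad length such as 33, or a NULL buffer, is reported through the return value instead of reaching code that assumes whole 32-byte blocks.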

Register usage

Assembly language code samples in the .s files, which are designed to be compiled by gcc or clang on Linux, contain almost exact copies of the code snippets that appear in the manual. The core of each function uses the same set of registers as the corresponding example in the manual. Sometimes the code samples in the repository contain additional setup code that ensures the registers are set up in the way the code snippets in the manual expect. This setup code is kept to a minimum by carefully choosing the order of the parameters in the prototypes for the code samples, which is why the ordering of the parameters may seem odd and inconsistent from one example to the next.

The MASM versions of the code samples in the .asm files use the same prototypes as the samples in the .s files, but Windows has a different calling convention from Linux. Large amounts of setup code would therefore be needed in the .asm files for the MASM versions to use the same set of registers as the code snippets in the manual and the .s files. Consequently, the MASM versions of the code samples tend to use different sets of registers to keep the setup code to a minimum.
