  • Language: C++
  • License: MIT License

Repository Details

A system to flag anomalous source code expressions by learning typical expressions from training data

A friendly request: Thanks for visiting the control-flag GitHub repository! If you find control-flag useful, we would appreciate a note from you (to [email protected]). And, of course, we love testimonials!

-- The ControlFlag Team


ControlFlag: A Self-supervised Idiosyncratic Pattern Detection System for Software Control Structures

ControlFlag is a self-supervised idiosyncratic pattern detection system that learns the typical patterns that occur in the control structures of high-level programming languages, such as C/C++, by mining these patterns from open-source repositories (on GitHub and other version control systems). It then applies the learned patterns to detect anomalous patterns in a user's code.

Brief technical description

ControlFlag's pattern anomaly detection system can be used for various problems, such as detecting typographical errors or flagging a missing NULL check, to name a few. This PoC demonstrates ControlFlag's application to typographical error detection.

The figure below shows ControlFlag's two main phases: (1) the pattern mining phase, and (2) the scanning phase for anomalous patterns. The pattern mining phase is a "training phase" that mines typical patterns in the user-provided GitHub repositories and then builds a decision tree from the mined patterns. The scanning phase, on the other hand, applies the mined patterns to flag anomalous expressions in the user-specified target repositories.

Figure: ControlFlag design

More details can be found in our MAPS paper (https://arxiv.org/abs/2011.03616).
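
The two phases can be illustrated with a deliberately simplified sketch. It is not ControlFlag's implementation: it assumes the patterns have already been abstracted into strings, it keeps them in a plain frequency table rather than a decision tree, and it uses a hypothetical percentage cutoff. The names mine_patterns, scan, and min_share_percent are made up for illustration.

```python
from collections import Counter


def mine_patterns(training_patterns):
    """Mining phase: count how often each abstracted pattern occurs."""
    return Counter(training_patterns)


def scan(target_patterns, mined, min_share_percent=3.0):
    """Scanning phase: flag patterns that are rare or unseen in the mined data.

    min_share_percent is a hypothetical cutoff, not ControlFlag's actual rule.
    """
    total = sum(mined.values())
    flagged = []
    for pattern in target_patterns:
        # Counter returns 0 for patterns that were never mined.
        share = 100.0 * mined[pattern] / total if total else 0.0
        if share < min_share_percent:
            flagged.append(pattern)
    return flagged


# Toy training data: abstracted forms of `if` conditions.
training = ["(x == NULL)"] * 60 + ["(x > y)"] * 39 + ["(x = NULL)"] * 1
mined = mine_patterns(training)
anomalies = scan(["(x == NULL)", "(x = NULL)", "(x | y)"], mined)
print(anomalies)
```

In this toy run, the common comparison passes while the rare assignment and the unseen bitwise-or are flagged, which mirrors the kind of typographical errors this PoC targets.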

Directory structure (evolving)

  • src: Source code for the ControlFlag typographical error detection system
  • scripts: Scripts for pattern mining and scanning for anomalies
  • quick_start: Scripts to run quick start tests
  • github: Scripts and data for downloading GitHub repos
  • tests: Unit tests

Install

ControlFlag can be built on Linux and MacOS.

Requirements

  • CMake 3.4.3 or above
  • C++17 compatible compiler
  • Tree-sitter parser (downloaded automatically as part of cmake)
  • GNU parallel (optional, if you want to generate your own training data)

Tested build configuration on Linux-based systems

  • CentOS-7.6/Ubuntu-20.04 with g++-v10.2.0 for x86_64

Tested build configuration on MacOS

  • MacOS Mojave v10.14.6 with clang-1001.0.46.4 (Apple LLVM version 10.0.1) for x86_64 (obtained from The Command Line Tools Package)

Build

$ cd control-flag
$ cmake .
$ make -j
$ make test

All tests in make test should pass.

Using ControlFlag

Quick start

Using patterns obtained from several GitHub repos to scan repository of your choice

Download the training data for the language of interest depending on the memory constraints of your device. Note, however, that using smaller datasets may lead to reduced accuracy in the results ControlFlag produces and possibly an increase in the number of false positives it generates.

Language  Dataset name  Size on disk  Memory requirements  Direct link  MD5 checksum
C         Small         ~100MB        ~400MB               link         2825f209aba0430993f7a21e74d99889
C         Medium        ~450MB        ~1.3GB               link         aab2427edebe9ed4acab75c3c6227f24
C         Large         ~9GB          ~13GB                link         1ba954d9716765d44917445d3abf8e85
C++       Small         ~200MB        ~500MB               link         f954486e20961f0838ac08e5d4dbf312
C++       Medium        ~500MB        ~1.3GB               link         a5c18ea1cdbe354b93aabf9ecaa5b07a
C++       Large         ~1.2GB        ~3GB                 link         4f5ffc1ab942eaba399cafd5be8bb45f
PHP       Small         ~120MB        ~1GB                 link         5a1cc4c24a20de7dad1b9f40661d517a

# Download <tgz_file> from the link above.
$ md5sum <tgz_file>   # optional: compare with the MD5 checksum listed above
$ tar -zxf <tgz_file>
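
On systems without md5sum, the optional checksum step can also be done from Python. The following is a minimal sketch using only the standard library; the helper names md5_of_file and checksum_matches are our own and are not part of ControlFlag.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 digest with the value listed for the dataset."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, checksum_matches("<tgz_file>", "2825f209aba0430993f7a21e74d99889") should return True for an intact download of the small C dataset.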

To scan C code of your choice, use the command below:

$ scripts/scan_for_anomalies.sh -d <directory_to_be_scanned_for_anomalies> -t <training_data>.ts -o <output_directory_to_store_log_files> -l 1

To scan C++ code of your choice, use the command below:

$ scripts/scan_for_anomalies.sh -d <directory_to_be_scanned_for_anomalies> -t <training_data>.ts -o <output_directory_to_store_log_files> -l 4

Once the run is complete (which could take some time, depending on your system and the number of programs in your repository that ControlFlag can scan), refer to the "Understanding scan output" section below to interpret the results.
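
If you scan several repositories, it can be convenient to drive the script from Python. The sketch below only assembles the same command line as shown above; the flags and language numbers are taken from the script's usage text, while the helper names build_scan_command and run_scan are hypothetical and not part of ControlFlag.

```python
import subprocess

# Language numbers as listed in the scripts' usage text.
LANGUAGE_NUMBERS = {"c": 1, "verilog": 2, "php": 3, "c++": 4}


def build_scan_command(target_dir, training_data, output_dir, language):
    """Build the argument list for scripts/scan_for_anomalies.sh."""
    return [
        "scripts/scan_for_anomalies.sh",
        "-d", target_dir,
        "-t", training_data,
        "-o", output_dir,
        "-l", str(LANGUAGE_NUMBERS[language.lower()]),
    ]


def run_scan(target_dir, training_data, output_dir, language):
    """Run the scan from the control-flag checkout; raises if the script fails."""
    command = build_scan_command(target_dir, training_data, output_dir, language)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.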

Mining patterns from a small repo and applying them to another small repo

In this test for C language programs, we will mine patterns from GitHub's Glb-director project and apply them to flag anomalies in GitHub's brubeck project.

Simply run the command below:

$ cd quick_start && ./test1_c.sh

If everything goes well, you can see the scanner output in the test1_scan_output directory. Look for the "Potential anomaly" label in it by running grep "Potential anomaly" -C 5 *.log, and you should see output like the following:

thread_6.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (identifier) (non_terminal_expression))) found in training dataset:
Source file: brubeck/src/server.c:266:5:(s == sizeof(fdsi))
thread_6.log-Autocorrect search took 0.000 secs
thread_6.log:Potential anomaly
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 1
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (null))) with editing cost:1 and occurrences: 25
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (identifier))) with editing cost:1 and occurrences: 5
thread_6.log-Did you mean:(parenthesized_expression (binary_expression (">=") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 3
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (non_terminal_expression))) with editing cost:1 and occurrences: 2

The anomaly is flagged for brubeck/src/server.c at line number 266.
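
To post-process such output, the suggestions can be pulled out of the log text with a regular expression. This is a sketch that assumes the line format is exactly as in the sample above; ControlFlag does not document this format as stable, and parse_suggestions is a hypothetical helper.

```python
import re

SUGGESTION_RE = re.compile(
    r"Did you mean:(?P<pattern>.+) with editing cost:(?P<cost>\d+)"
    r" and occurrences:\s*(?P<count>\d+)"
)


def parse_suggestions(log_text):
    """Extract (pattern, editing cost, occurrences) from 'Did you mean' lines."""
    results = []
    for line in log_text.splitlines():
        match = SUGGESTION_RE.search(line)
        if match:
            results.append(
                (match["pattern"], int(match["cost"]), int(match["count"]))
            )
    return results


sample = """\
thread_6.log:Potential anomaly
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (null))) with editing cost:1 and occurrences: 25
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (identifier))) with editing cost:1 and occurrences: 5
"""
suggestions = parse_suggestions(sample)
# Pick the suggestion that occurs most often in the training data.
best = max(suggestions, key=lambda item: item[2])
print(best)
```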

Detailed steps

  1. Pattern Mining phase (if you want to generate training data yourself)

If you do not want to generate training data yourself, go to the Evaluation step below.

In this phase, we mine the idiosyncratic patterns that appear in the control structures of a high-level language such as C. This PoC mines patterns from if statements that appear in C programs.

If you want to use your own repository for mining patterns, jump to Step 1.2.

1.1 Downloading GitHub repos for C language having more than 100 stars

Steps below show how to download GitHub repos for C language that have more than 100 stars (c100.txt) and generate training data. training_repo_dir is a directory where the command below will clone all the repos.

$ cd github
$ python download_repos.py -f c100.txt -o <training_repo_dir> -m clone -p 5

1.2 Mining patterns from downloaded repositories

You can use your own repository to mine for expressions by passing it in place of <training_repo_dir>.

The mine_patterns.sh script does this. Its usage is as follows:

Usage: ./mine_patterns.sh -d <directory_to_mine_patterns_from> -o <output_file_to_store_training_data>
Optional:
[-n number_of_processes_to_use_for_mining]  (default: num_cpus_on_system)
[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++))
[-g github_repo_id] (default: 0) A unique identifier for GitHub repository, if any

We use it as:

$ scripts/mine_patterns.sh -d <training_repo_dir> -o <training_data_file> -l 1

<training_data_file> contains the conditional expressions in C that are found in the specified GitHub repos, together with their AST (abstract syntax tree) representations. It is a plain text file, so you can inspect it in any text viewer.

  2. Evaluation (or scanning for anomalies)

We can run the scan_for_anomalies.sh script to scan a target directory of interest. Its usage is as follows:

Usage: ./scan_for_anomalies.sh -t <training_data> -d <directory_to_scan_for_anomalous_patterns>
Optional:
 [-c max_cost_for_autocorrect]              (default: 2)
 [-n max_number_of_results_for_autocorrect] (default: 5)
 [-j number_of_scanning_threads]            (default: num_cpus_on_systems)
 [-o output_log_dir]                        (default: /tmp)
 [-l source_language_number]                (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++))
 [-a anomaly_threshold]                     (default: 3.0)

As a part of scanning for anomalies, ControlFlag also suggests possible corrections when a conditional expression is flagged as an anomaly. max_cost_for_autocorrect (-c, default: 2) is the maximum editing cost allowed for a suggested correction, that is, how close the suggestion must be to the possibly mistyped expression. Increasing it leads to more suggested corrections. If you feel that the number of reported anomalies is high, consider reducing anomaly_threshold (-a) to 1.0 or less.
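
The effect of the maximum cost and the maximum number of results can be sketched as follows. This is an illustration under stated assumptions, not ControlFlag's algorithm: it uses a token-level Levenshtein distance as a stand-in for the editing cost, operates on toy space-separated patterns, and ranks candidates by cost and then by occurrence count. The names edit_distance and suggest are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (token_a != token_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suggest(expression, mined_counts, max_cost=2, max_results=5):
    """Return up to max_results mined patterns within max_cost edits.

    Results are ordered by cost, then by descending occurrence count.
    """
    target = expression.split()
    candidates = []
    for pattern, count in mined_counts.items():
        cost = edit_distance(target, pattern.split())
        if cost <= max_cost:
            candidates.append((cost, -count, pattern))
    candidates.sort()
    return [(pattern, cost, -neg) for cost, neg, pattern in candidates[:max_results]]


# Toy mined patterns with their occurrence counts (made-up numbers).
mined = {"( x == NULL )": 25, "( x == y )": 5, "( x >= y )": 3}
narrow = suggest("( x = NULL )", mined, max_cost=1)
wide = suggest("( x = NULL )", mined, max_cost=2)
print(len(narrow), len(wide))
```

With the toy data, raising max_cost from 1 to 2 admits the two more distant patterns, which is the trade-off described above: more suggestions, but less similar to the flagged expression.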

Understanding scan output

Under output_log_dir you will find multiple log files corresponding to the scan output from different scanner threads. Potential anomalies are reported with "Potential anomaly" as a label. The command below reports the log files that contain at least one anomaly.

$ grep "Potential anomaly" <output_log_dir>/thread_*.log

A sample anomaly report looks like this:

Level:<ONE or TWO> Expression: <AST_for_anomalous_expression>
Source file and line number: <Source code expression with line number having the anomaly>
Potential anomaly
Did you mean ...

The text after "Did you mean" shows possible corrections to the anomalous expression.
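
For a quick per-thread summary, the logs can also be tallied from Python. This is a small sketch that relies only on the thread_*.log naming and the "Potential anomaly" label described above; count_anomalies is a hypothetical helper, not something ControlFlag provides.

```python
from pathlib import Path


def count_anomalies(output_log_dir, label="Potential anomaly"):
    """Map each thread_*.log file name to the number of lines containing label."""
    counts = {}
    for log_file in sorted(Path(output_log_dir).glob("thread_*.log")):
        with log_file.open(errors="replace") as handle:
            hits = sum(1 for line in handle if label in line)
        if hits:
            counts[log_file.name] = hits
    return counts
```

Log files without any anomaly are left out of the result, so an empty dictionary means the scan reported nothing.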

Success stories

In the spirit of community service, we routinely scan open-source packages using ControlFlag, and we have found several programming errors in various open-source projects. Below are some of the errors that the respective developers have confirmed.

Issue link         Language  Erroneous expression                                                  Comment
curl/curl#6193     C         if (s->keepon > TRUE)                                                 Comparison between a variable and a boolean using >
vrpn/vrpn#263      C         (l_inbuf[2] | 1), if (l_inbuf[3] | 1)                                 Incorrect use of | instead of &
vlm/asn1c#443      C         if(!saved_aid && 0)                                                   Dead code
shoes/shoes3#468   C         if ((attr == 39) || (attr = 49))                                      Incorrect use of = instead of ==
IoLanguage/io#455  C         if (UArray_greaterThan_(self, other) | UArray_equals_(self, other))   Inefficient use of | instead of ||
IoLanguage/io#455  C         if( ln = (SFG_Node *)node->Next ), if( ln = (SFG_Node *)node->Prev )  Missing parenthesis
elua/elua#170      C         if (Protection_Level_1_Register &= FMI_Sector_Mask)                   Missing parenthesis
