  • Language
    Python
  • License
    GNU General Publi...
  • Created over 4 years ago
  • Updated about 2 years ago

An extensively configurable tool providing a summary of the changes between two files or directories, ignoring all the fluff you don't care about.

Diffware

The goal of this tool is to provide a summary of the changes between two files or directories. It can be extensively configured to keep only the changes that matter to you, and be combined with tools like diffoscope to dive into those differences.

[Figure: example usage of diffware combined with diffoscope. Panels: Diffware CLI example, Diffoscope example.]


Check out this file for a use-case example and an overview of the tool's capabilities.

Table of contents

  1. Installing
  2. Usage
  3. Configuration
  4. Optimizing
  5. Tools
  6. Examples

Installing

Python 3.8 or newer is recommended.

Minimal

The minimal install doesn't allow for automatic file extraction, but can work on already extracted files and directories.

Simply set up a virtual environment and install the requirements through pip:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Extended signatures

Optionally, you can install fact_helper_file, which provides filemagick with custom signatures. If available, this module will be used instead of python-magic.

Install the latest version from GitHub:

git clone https://github.com/fkie-cad/fact_helper_file.git
cd fact_helper_file
pip3 install .

Full

The full install adds an automatic extraction tool.

Install fact_extractor:

git clone https://github.com/fkie-cad/fact_extractor.git ~/fact_extractor
cd ~/fact_extractor
fact_extractor/install/pre_install.sh
fact_extractor/install.py

Then, set up a virtual environment and install the requirements through pip:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

usage: main.py [-h] [-o DATA_FILE] [-L {DEBUG,INFO,WARNING,ERROR}] [-d] [-C CONFIG_FILE] [-j JOBS] [--exclude GLOB_PATTERN] [--exclude-mime GLOB_PATTERN] [--blacklist MIME_TYPE]
               [--fuzzy-threshold FUZZY_THRESHOLD] [--max_depth MAX_DEPTH] [--no-extract] [--no-specialize] [--no-distance] [--order-by {none,path,distance}] [--min_dist MIN_DIST]
               [--binutils-prefix BINUTILS_PREFIX] [--no-progress] [--clean-extracted] [--enable-statistics] [--profile]
               FILE_PATH_1 FILE_PATH_2

positional arguments:
 FILE_PATH_1           Path to first file
 FILE_PATH_2           Path to second file

optional arguments:
 -h, --help            show this help message and exit
 -o DATA_FILE, --output DATA_FILE
                       Path to file in which to write the list of files (- for stdout)
 -L {DEBUG,INFO,WARNING,ERROR}, --log_level {DEBUG,INFO,WARNING,ERROR}
                       Define the log level
 -d, --debug           Print debug messages
 -C CONFIG_FILE, --config_file CONFIG_FILE
                       Path to config File
 -j JOBS, --jobs JOBS  Number of job to run in parallel (default is number of cpus)
 --exclude GLOB_PATTERN
                       Exclude files paths that match GLOB_PATTERN.
 --exclude-mime GLOB_PATTERN
                       Exclude files with mime types that match GLOB_PATTERN.
 --blacklist MIME_TYPE
                       Don't attempt to extract files that match MIME_TYPE (unused when combined with --no-extract).
 --fuzzy-threshold FUZZY_THRESHOLD
                       Threshold for fuzzy-matching to detect moved files (<= 0 to disable, default is 80)
 --max_depth MAX_DEPTH
                       Maximum depth for recursive unpacking (< 0 for no limit, default is 8)
 --no-extract          Consider all files are already extracted, and only compare them
 --no-specialize       Do not use specific content comparison for known file types, but use simple binary data comparison
 --no-distance         Disable computing the distance between two modified files using TLSH
 --order-by {none,path,distance}
                       Define the sort order for the output. Note: setting this to anything other than "none" will disable progressive output
 --min_dist MIN_DIST   Ignore files with a difference lower than the one given (< 0 for no limit)
 --binutils-prefix BINUTILS_PREFIX
                       Prefix for binutils program names (for example, "aarch64-linux-gnu-").
 --no-progress         Hide progress messages
 --clean-extracted     Delete temporary container files which have been extracted
 --enable-statistics   Compute statistics or check for unpack data loss
 --profile             Measure the number of calls and time spent in different methods

Configuration

Most parameters can be set either from the CLI or in the config file (see config.cfg for an example).

While settings in the diff section are specific to this tool, the ones in the unpack and ExpertSettings sections are shared with fact_extractor, so you should check its documentation.

Here's a list of options that can be set in the config file:

diff section

Option name       Default value  Description
data_file         -              Path to file in which to write the list of files (- for stdout)
debug             False          Print debug messages
log_level         "INFO"         Define the log level
jobs              <cpu_count>    Number of jobs to run in parallel
exclude_mime      []             Exclude files with mime types that match the given glob pattern
fuzzy_threshold   80             Threshold for fuzzy-matching to detect moved files (<= 0 to disable)
max_depth         8              Maximum depth for recursive unpacking (< 0 for no limit)
extract           True           Whether to try to unpack files
specialize        True           Whether to use file-specific comparison (if False, always compare file binary data)
compute_distance  True           Whether to compute the distance between two modified files using TLSH
sort_order        "none"         Define the sort order for the output
min_dist          -1             Ignore files with a difference lower than the one given (< 0 for no limit)
binutils_prefix   ""             Prefix for binutils program names (for example, "aarch64-linux-gnu-")
show_progress     True           Whether to output progress messages in the console or not
clean_extracted   False          Delete temporary container files which have been extracted
profile           False          Whether to measure the number of calls and time spent in different methods

unpack section

Option name    Default value    Description
exclude        []               Exclude files with paths that match the given glob pattern
blacklist      []               Don't attempt to unpack files with the given mime-types
data_folder_1  /tmp/extractor1  Folder in which to unpack the data of the first file
data_folder_2  /tmp/extractor2  Folder in which to unpack the data of the second file
statistics     False            Whether fact_extractor should compute statistics after extracting files

ExpertSettings section

Option name            Default value  Description
statistics             False          Whether fact_extractor should compute statistics after extracting files
unpack_threshold       0.8            Threshold to detect data loss when unpacking
header_overhead        256            Size of header for unpacked data, used to detect data loss
compressed_file_types  []             List of files used when computing statistics to know whether data was lost
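
As a rough illustration of how the options above fit together, here is a minimal sketch that reads a diff section and fills in the documented defaults. It assumes the config file is INI-style text that Python's configparser can read; that is a guess based on the section names, so check config.cfg for the real layout. The names load_diff_settings, DIFF_DEFAULTS and EXAMPLE_CFG are hypothetical and not part of the tool.

```python
import configparser

# Defaults copied from the "diff section" table. The INI layout itself is an
# assumption; the real config.cfg may differ.
DIFF_DEFAULTS = {
    "log_level": "INFO",
    "fuzzy_threshold": "80",
    "max_depth": "8",
    "extract": "True",
    "specialize": "True",
    "compute_distance": "True",
    "sort_order": "none",
    "min_dist": "-1",
}

# Hypothetical config content, used only for this illustration.
EXAMPLE_CFG = """
[diff]
extract = False
fuzzy_threshold = 70

[unpack]
data_folder_1 = /tmp/extractor1
data_folder_2 = /tmp/extractor2
"""


def load_diff_settings(text):
    """Parse INI text and return typed diff settings, defaults included."""
    parser = configparser.ConfigParser()
    parser.read_dict({"diff": DIFF_DEFAULTS})  # start from the defaults
    parser.read_string(text)                   # values in the file override them
    section = parser["diff"]
    return {
        "log_level": section.get("log_level"),
        "fuzzy_threshold": section.getint("fuzzy_threshold"),
        "max_depth": section.getint("max_depth"),
        "extract": section.getboolean("extract"),
        "specialize": section.getboolean("specialize"),
        "compute_distance": section.getboolean("compute_distance"),
        "sort_order": section.get("sort_order"),
        "min_dist": section.getint("min_dist"),
    }


settings = load_diff_settings(EXAMPLE_CFG)
print(settings)
```

Options missing from the file keep the default from the table, which mirrors how a value left off the command line would behave.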

Optimizing

Extracting

For faster analysis, you should try to avoid extracting files on every run by using the --no-extract option. Since the tool can work on directories, you can either manually extract the content beforehand, or run the script once and then run it again on the extracted folder.

Specializing

Some types of files have specific comparing mechanisms to make the output more robust. As this can add significant overhead, they can be disabled using the --no-specialize option.

Disabling specialized comparison (that is, passing --no-specialize) has the side effect of making the comparison tool follow symlinks. Though it shouldn't fail regardless of what the link points to, it may result in symlinks being reported as different and timeouts being shown while reading from them. In that case, you may want to ignore symlinks by using the --exclude-mime inode/symlink option.

Ignoring files

You should also try to exclude as many files as possible, either based on their mime-type:

--exclude-mime "audio/*" --exclude-mime "image/*" --exclude-mime "video/*"

... or based on their path:

--exclude "*/build/*" --exclude "*.txt" --exclude "*.json"

You can also tweak the blacklist option from the config file to prevent unpacking attempts of known mime-types for which it's unnecessary.
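
To get a feel for what such patterns match before running a long comparison, the following sketch tests paths and mime types against the patterns above. It assumes shell-style glob semantics as implemented by Python's fnmatch module, where * also matches /; the tool's own matching may differ in details. The is_excluded helper is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the two command lines above.
EXCLUDE_MIMES = ["audio/*", "image/*", "video/*"]
EXCLUDE_PATHS = ["*/build/*", "*.txt", "*.json"]


def is_excluded(path, mime, path_patterns=EXCLUDE_PATHS, mime_patterns=EXCLUDE_MIMES):
    """Return True if the path or the mime type matches any exclusion pattern."""
    if any(fnmatchcase(path, pattern) for pattern in path_patterns):
        return True
    return any(fnmatchcase(mime, pattern) for pattern in mime_patterns)


# Sample (path, mime type) pairs invented for the illustration.
samples = [
    ("src/build/out.o", "application/octet-stream"),
    ("www/logo.png", "image/png"),
    ("etc/config.json", "application/json"),
    ("usr/bin/tool", "application/x-executable"),
]
for path, mime in samples:
    print(path, mime, "excluded" if is_excluded(path, mime) else "kept")
```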

Saving time for moved detection

If folders have been renamed (apart from the root file), try renaming them back to their old name so the overall hierarchy of both files matches. Otherwise, many files will have to be compared to attempt to detect the ones that have been moved.
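
The cost comes from pairwise comparison: each unmatched file on one side may have to be compared with every unmatched file on the other. The sketch below illustrates this with a stand-in technique, comparing path strings with difflib from the standard library. It is not the tool's actual algorithm, which is not described here. The find_moved helper and its 0 to 100 score are assumptions chosen to resemble the fuzzy_threshold option.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0 and 100 for two strings."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def find_moved(removed, added, threshold=80):
    """Pair each removed path with its most similar added path.

    Every removed path is scored against every added path, so the work grows
    with len(removed) * len(added). Pairs scoring below the threshold are dropped.
    """
    moved = {}
    for old in removed:
        best = max(added, key=lambda new: similarity(old, new), default=None)
        if best is not None and similarity(old, best) >= threshold:
            moved[old] = best
    return moved


# Invented paths: the root folder name contains the version number.
removed = ["openwrt-19.07.2/etc/banner"]
added = ["openwrt-19.07.3/etc/banner", "openwrt-19.07.3/usr/bin/new_tool"]
print(find_moved(removed, added))
```

When both trees share the same hierarchy, most files pair up by path directly and far fewer candidates remain for this kind of search.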

Tools

Diffoscope

The output of this script can be parsed to run diffoscope on the identified changes:

./tools/diffoscope.py path-to-output-diff

Any option other than the path to the file will be passed to diffoscope. When possible, the modified files won't be copied, but a hardlink will be created in a temporary folder.
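
The hardlink-or-copy behaviour can be sketched as follows. This is an illustration of the general technique, not the code of tools/diffoscope.py; the link_or_copy helper is hypothetical. Hardlinks fail across filesystems, among other cases, which is why a copy fallback is needed.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst_dir):
    """Place src in dst_dir as a hardlink, falling back to a copy on failure."""
    dst = os.path.join(dst_dir, os.path.basename(src))
    try:
        os.link(src, dst)        # cheap: no data is duplicated
    except OSError:
        shutil.copy2(src, dst)   # e.g. different filesystem or unsupported
    return dst


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "changed.bin")
    with open(source, "wb") as handle:
        handle.write(b"example")
    staging = tempfile.mkdtemp(dir=workdir)
    staged = link_or_copy(source, staging)
    with open(staged, "rb") as handle:
        staged_content = handle.read()
    print(staged, staged_content)
```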

You can also use the elf.py and decompile.py files from the tools folder with recent versions of diffoscope to reduce noise in the comparisons.

Examples

OpenWRT

Let's say we want to find out what changes have been made between two firmware versions, to know if some features have been added or some vulnerabilities have been patched. In this example, we'll work with two releases of OpenWRT. Though the source code is publicly available, it serves as a useful illustration of how this tool can be used.

Here's the result of comparing the rootfs-squashfs.img.gz of versions 19.07.2 and 19.07.3 for the x86-64 architecture:

$ ./main.py ~/openwrt-19.07.2-x86-64-rootfs-squashfs.img.gz ~/openwrt-19.07.3-x86-64-rootfs-squashfs.img.gz --output /dev/null
[WARNING] Found 2250 files with different paths (and 0 with similar paths), looking for moved files may take a while. Did a folder name change?                                               

As you can see, the files have been decompressed and the squashfs filesystem read automatically by fact_extractor. The extracted files should be available in /tmp/extractor1/files and /tmp/extractor2/files. However, a warning shows that no files with similar paths have been found.

This is because the folder extracted from the archive contains the version number. Thankfully, this is easy to fix. Let's just run the script again on the extracted subfolders, which have the same hierarchy:

$ mv /tmp/extractor1/files/openwrt-19.07.2-x86-64-rootfs-squashfs.img_extracted ~/openwrt-19.07.2-x86-64-rootfs-squashfs
$ mv /tmp/extractor2/files/openwrt-19.07.3-x86-64-rootfs-squashfs.img_extracted ~/openwrt-19.07.3-x86-64-rootfs-squashfs
$ ./main.py ~/openwrt-19.07.2-x86-64-rootfs-squashfs ~/openwrt-19.07.3-x86-64-rootfs-squashfs --no-extract
Found 9 added files, 0 removed files and 267 changed files (276 files in total)

Much better! When looking at the output, we notice quite a few images, which we'd like to exclude. We can run the script again:

$ ./main.py ~/openwrt-19.07.2-x86-64-rootfs-squashfs ~/openwrt-19.07.3-x86-64-rootfs-squashfs --no-extract --exclude-mime "image/*"
Found 10 added files, 0 removed files and 241 changed files (251 files in total)

Once again, better. There are also some changes related to package versions, which we can decide to exclude as well:

$ ./main.py ~/openwrt-19.07.2-x86-64-rootfs-squashfs ~/openwrt-19.07.3-x86-64-rootfs-squashfs --no-extract --exclude-mime "image/*" --exclude "*.control"
Found 10 added files, 0 removed files and 134 changed files (144 files in total)

Now that we're happy with the output, we can save it to a file and run diffoscope to dive into the changes:

$ ./main.py ~/openwrt-19.07.2-x86-64-rootfs-squashfs ~/openwrt-19.07.3-x86-64-rootfs-squashfs --no-extract --exclude-mime "image/*" --exclude "*.control" --output ~/openwrt-19.07.2_vs_19.07.3.diff
$ ./tools/diffoscope.py ~/openwrt-19.07.2_vs_19.07.3.diff --html-dir ~/openwrt-diff --exclude-command "^stat .*"

Note: The --exclude-command option of diffoscope is not mandatory, but it makes the output less noisy. --diff-mask can also prove quite useful to ignore version strings or dates, for example.

In the end, we have obtained:

  • A list of files containing only the differences that matter to our use-case,
  • A quicker look at their content by running diffoscope on this script's output,
  • A set of options that can be turned into a config file and later reused for other versions of OpenWRT so this work doesn't have to be done each time.
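
Besides a config file, one lightweight way to reuse such a set of options is to keep them in a small script that rebuilds the command line. The sketch below only uses flags listed in the Usage section; the build_command helper and the OPENWRT_OPTIONS dictionary are hypothetical names introduced for the illustration.

```python
import shlex


def build_command(old, new, exclude_mime=(), exclude=(), extract=True, output=None):
    """Build the main.py argument list from a reusable set of options."""
    argv = ["./main.py", old, new]
    if not extract:
        argv.append("--no-extract")
    for pattern in exclude_mime:
        argv += ["--exclude-mime", pattern]
    for pattern in exclude:
        argv += ["--exclude", pattern]
    if output is not None:
        argv += ["--output", output]
    return argv


# Options settled on during the OpenWRT walkthrough above.
OPENWRT_OPTIONS = {
    "extract": False,
    "exclude_mime": ["image/*"],
    "exclude": ["*.control"],
}

command = build_command(
    "openwrt-19.07.2-x86-64-rootfs-squashfs",
    "openwrt-19.07.3-x86-64-rootfs-squashfs",
    output="openwrt-19.07.2_vs_19.07.3.diff",
    **OPENWRT_OPTIONS,
)
# shlex.join (Python 3.8+) quotes the glob patterns for safe pasting in a shell.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as-is, and the same dictionary can be reused for the next pair of releases by changing only the two paths.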

FRRouting

A use case example can be found in the doc folder. It shows how to use both this tool and diffoscope to identify a vulnerability fix in an upgrade.
