A guide on how to write fast and memory friendly YARA rules

YARA Performance Guidelines

When creating your rules for YARA keep in mind the following guidelines in order to get the best performance from them. This guide is based on ideas and recommendations by Victor M. Alvarez and WXS.

  • Revision 1.5, February 2021, applies to all YARA versions higher than 3.7

The Basics

To get a better grip on what and where YARA performance can be optimized, it's useful to understand the scanning process. It is roughly divided into 4 steps, which are explained here in a very simplified way using this example rule:

import "math"
rule example_php_webshell_rule
{
    meta:
        description = "Just an example php webshell rule"
        date = "2021/02/16"
    strings:
        $php_tag = "<?php"
        $input1   = "GET"
        $input2   = "POST"
        $payload = /assert[\t ]{0,100}\(/
    condition:
        filesize < 20KB and
        $php_tag and
        $payload and
        any of ( $input* ) and
        math.entropy(500, filesize-500) >= 5
}

1. Compiling the rules

This step happens before the actual scan. YARA looks for so-called atoms in the search strings to feed the Aho-Corasick automaton. The details are explained in the chapter Atoms, but for now it's enough to know that they're at most 4 bytes long and that YARA picks them quite cleverly to avoid too many matches. In our example YARA might pick the following 4 atoms:

  • <?ph
  • GET
  • POST
  • sser (out of assert)

2. Aho-Corasick automaton

Here the scan has started. Steps 2 to 4 are executed on every file. YARA looks in each file for the 4 atoms defined above, using a prefix tree called an Aho-Corasick automaton. Any matches are handed over to the bytecode engine.

3. Bytecode engine

If there's, e.g., a match on sser, YARA checks whether it is preceded by an a and followed by a t. If that is true, it continues with the regex [\t ]{0,100}\(. With this clever approach YARA avoids running a slow regex engine over the complete file and only takes a closer look at certain parts.

4. Conditions

After all pattern matching is done, the conditions are checked. YARA has another optimization mechanism that performs the CPU-intensive math.entropy check from our example rule only if the 4 conditions before it are satisfied. This is explained in more detail in the chapter Conditions and Short-Circuit Evaluation.

If the conditions are satisfied, a match is reported. The scan continues with the next file in step 2.

Atoms

YARA extracts short substrings of up to 4 bytes from the strings; these are called "atoms". Atoms can be extracted from any place within the string. YARA searches for those atoms while scanning the file, and if it finds one of them, it verifies that the string actually matches.

For example, consider these strings:

/abc.*cde/

=> possible atoms are abc and cde; either one can be used. The abc atom is currently preferred because both have the same quality and it is the first of the two.

/(one|two)three/

=> possible atoms are one, two, thre and hree. We can search for thre (or hree) alone, or for both one and two. The atom thre is preferred because it leads to fewer potential matches than one and two (which are shorter) and because it does not contain a double e (the more unique letters, the better).

YARA does its best to select the best atoms from each string, for example:

{ 00 00 00 00 [1-4] 01 02 03 04 }

=> here YARA uses the atom 01 02 03 04, because 00 00 00 00 is too common

{ 01 02 [1-4] 01 02 03 04 }

=> 01 02 03 04 is preferred over 01 02 because it's longer

So, the important point is that strings should contain good atoms. These are bad strings because they contain either too short or too common atoms:

{00 00 00 00 [1-2] FF FF [1-2] 00 00 00 00}
{AB  [1-2] 03 21 [1-2] 01 02}
/a.*b/
/a(c|d)/

The worst strings are those that don't contain any atoms at all, like:

/\w.*\d/
/[0-9]+\n/

These regular expressions don't contain any fixed substring that can be used as an atom, so they must be evaluated at every offset of the file to see if they match there.
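
One way to repair such a string, sketched below under the assumption that the digits of interest always follow a known keyword, is to include that keyword in the pattern so that YARA has a fixed substring to extract atoms from. The keyword "Port: ", the rule name and the bounds are hypothetical and only illustrate the idea.

```
rule atom_anchor_sketch
{
    strings:
        // No fixed substring: must be tried at every offset of the file
        // $no_atom = /[0-9]+\n/

        // Hypothetical fixed prefix "Port: " gives YARA fixed bytes to pick an atom from
        $with_atom = /Port: [0-9]{1,5}\n/
    condition:
        $with_atom
}
```

Whether such a keyword exists depends on the data being matched; if it doesn't, the check may be better expressed in the condition than as a string.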

Too Many Loop Iterations

Another important recommendation is to avoid for loops with too many iterations, especially if the statement within the loop is complex, for example:

strings:
	$a = {00 00}
condition:
	for all i in (1..#a) : (@a[i] < 10000)

This rule has two problems. The first is that the string $a is too common. The second is that, because $a is too common, #a can be very high, so the loop body can be evaluated thousands of times.

This other condition is also inefficient because the number of iterations depends on filesize, which can also be very high:

for all i in (1..filesize) : ($a at i)
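
A possible way to keep such a loop cheap, shown here only as a sketch, is to use a less common string and to cap both the file size and the match count before the loop runs. The hex pattern, the rule name and the limits (1MB, 20 matches) are hypothetical values chosen for illustration.

```
rule bounded_loop_sketch
{
    strings:
        // Hypothetical 8-byte pattern, far less common than { 00 00 }
        $a = { 4D 5A 90 00 03 00 00 00 }
    condition:
        filesize < 1MB and          // cheap check first
        #a > 0 and #a < 20 and      // caps the number of loop iterations
        for all i in (1..#a) : ( @a[i] < 10000 )
}
```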

Magic Module

Avoid using the "magic" module, which is not available on the Windows platform. Using the "magic" module slows down scanning but provides exact matches.

Custom GIF magic header definition:

rule gif_1 {
  condition:
    (uint32be(0) == 0x47494638 and uint16be(4) == 0x3961) or
    (uint32be(0) == 0x47494638 and uint16be(4) == 0x3761)
}

Using the "magic" module:

import "magic"
rule gif_2 {
  condition:
    magic.mime_type() == "image/gif"
}
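
The same header-check approach can be applied to other file types. The sketch below writes the GIF check in a slightly more compact form and adds a PNG check based on the 8-byte PNG signature (89 50 4E 47 0D 0A 1A 0A). The rule names are hypothetical.

```
rule gif_compact_sketch {
  condition:
    // "GIF8" followed by "9a" or "7a"
    uint32be(0) == 0x47494638 and
    (uint16be(4) == 0x3961 or uint16be(4) == 0x3761)
}

rule png_header_sketch {
  condition:
    // 89 50 4E 47 0D 0A 1A 0A
    uint32be(0) == 0x89504E47 and uint32be(4) == 0x0D0A1A0A
}
```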

Too Short Strings

Avoid defining strings that are too short. Any string with fewer than 4 bytes will probably appear in a lot of files or as uniform content in an XORed file.
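
As a sketch of the alternative: when a short marker sits at a known offset, it can be checked in the condition instead of being defined as a string, and the strings section can be reserved for longer patterns. The rule name and the choice of the DOS stub message as the longer string are only illustrative.

```
rule short_string_sketch
{
    strings:
        // Avoid: a 2-byte string that occurs in a huge number of files
        // $mz = "MZ"

        // Illustrative longer string with enough bytes for a 4-byte atom
        $stub = "This program cannot be run in DOS mode"
    condition:
        uint16(0) == 0x5A4D and $stub
}
```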

Uniform Content

Some strings are long enough but shouldn't be used for a different reason: uniformity. These are some examples of strings that shouldn't be used because they could cause too many matches in files.

$s1 = "22222222222222222222222222222222222222222222222222222222222222"
$s2 = "\x00\x20\x00\x20\x00\x20\x00\x20\x00\x20\x00\x20\x00\x20"  // wide formatted spaces

The error message would look like this:

error scanning yara-killer.dat: string "$mz" in rule "shitty_mz" caused too many matches

String Advice

Try to define strings as narrowly as possible. Avoid the "nocase" attribute if possible, because many atoms will be generated and searched for (higher memory usage, more iterations). Remember that in the absence of modifiers, "ascii" is assumed by default. The possible combinations are:

LOW - only one atom is generated

$s1 = "cmd.exe"                // (ascii only)
$s2 = "cmd.exe" ascii          // (ascii only, same as $s1)
$s3 = "cmd.exe" wide           // (UTF-16 only)
$s4 = "cmd.exe" ascii wide     // (both ascii and UTF-16) two atoms will be generated
$s5 = { 63 6d 64 2e 65 78 65 } // ascii char code in hex

HIGH - all combinations of upper- and lowercase letters for the 4 bytes chosen by YARA are generated as atoms

$s6 = "cmd.exe" nocase         // (all different cases, e.g. "Cmd.", "cMd.", "cmD." ...)

If you want to match scripting commands, check whether the language is case insensitive at all (e.g. PHP, Windows batch) before using nocase. If you need different casing for only one or two letters, you're better off with a regex, e.g.

$re = /[Pp]assword/

Be careful when working with alternation such as:

$re = /(a|b)cde/
$hex = {C7 C3 00 (31 | 33)}

These strings generate short atoms that can slow down scanning. When there is only a small number of variants, it is recommended to write the strings separately:

$re1 = /acde/
$re2 = /bcde/
$hex1 = {C7 C3 00 31}
$hex2 = {C7 C3 00 33}

Regular Expressions

Use regular expressions only when necessary. Regular expression evaluation is inherently slower than plain string matching and consumes a significant amount of memory. Don't use them if hex strings with jumps and wild-cards can solve the problem.

If you have to use regular expressions, avoid greedy .* and even reluctant quantifiers such as .*?. Instead, use exact ranges like .{1,30} or even .{1,3000}. Also, do not forget the upper bound (avoid e.g. .{2,}).
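
The following sketch shows both recommendations on a hypothetical pattern: a bounded quantifier instead of .*, and a hex string with a bounded jump as a regex-free alternative. The strings "powershell" and "-enc", the gap of 30 bytes and the rule name are assumptions made for illustration. The two variants are not strictly equivalent: a jump matches any byte, whereas . in a YARA regular expression does not match a newline unless the s modifier is used.

```
rule bounded_gap_sketch
{
    strings:
        // Avoid: unbounded gap
        // $slow = /powershell.*-enc/

        // Regex with an explicit upper bound on the gap
        $re = /powershell.{0,30}-enc/

        // Hex string with a bounded jump: "powershell", 0 to 30 arbitrary bytes, "-enc"
        $hex = { 70 6F 77 65 72 73 68 65 6C 6C [0-30] 2D 65 6E 63 }
    condition:
        $re or $hex
}
```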

When using quantifiers, two situations can occur:

If the beginning of the regular expression is anchored at one position and only the suffix can vary, YARA will return the longest possible match. With .*, .+ or .{2,}, this can lead to very long matches and to "slowing down scanning" problems.

If there are several possible beginnings of the regular expression, YARA will match all of them.

$re1 = /Tom.{0,2}/      // will find Tomxx in "Tomxx"
$re2 = /.{0,2}Tom/      // will find Tom, xTom, xxTom in "xxTom"

The number of shorter matches can easily exceed the limit and cause a "too many matches" error.

The following example is a regular expression for an e-mail address. When [-a-z0-9._%+] is used with quantifiers, YARA will match one address multiple times, which is not ideal. In this case, it is recommended to match a reasonably small part of each address that still provides enough information for analysis.

USE

/[-a-z0-9._%+]@[-a-z0-9.]{2,10}\.[a-z]{2,4}/
OR
/@[-a-z0-9.]{2,10}\.[a-z]{2,4}/ 

AVOID

/[-a-z0-9._%+]*@[-a-z0-9.]{2,10}\.[a-z]{2,4}/
/[-a-z0-9._%+]+@[-a-z0-9.]{2,10}\.[a-z]{2,4}/
/[-a-z0-9._%+]{x,y}@[-a-z0-9.]{2,10}\.[a-z]{2,4}/

If you want to make sure that e.g. exec is followed by /bin/sh, you can use the offsets supplied by the @ symbol. This would be the slow regex version:

$ = /exec.*\/bin\/sh/

This is the faster offset way:

strings:
  $exec = "exec"
  $sh   = "/bin/sh"
condition:
  $exec and $sh and
  @exec < @sh

Also try to include long sequences of fixed characters that can serve as anchors in the matching process. Again, the longer the better.
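
Note that @exec and @sh without an index refer to the first occurrence of each string, so the condition above only compares the first matches. If the two strings also have to be close to each other, a bounded loop over the matches of one string is one option. The sketch below assumes a maximum distance of 100 bytes and caps the number of iterations at 50; both values and the rule name are hypothetical.

```
rule exec_then_sh_sketch
{
    strings:
        $exec = "exec"
        $sh   = "/bin/sh"
    condition:
        $exec and $sh and
        #exec < 50 and      // hypothetical cap to keep the loop short
        for any i in (1..#exec) : ( $sh in (@exec[i]..@exec[i] + 100) )
}
```

The loop is more expensive than the plain offset comparison, so it is only worth using when the distance between the strings actually matters.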

BAD

$s1 = /http:\/\/[.]*\.hta/	// greedy [.]*

BETTER

$s1 = /http:\/\/[a-z0-9\.\/]{3,70}\.hta/ 	// better, with an upper bound

BEST

$s1 = /mshta\.exe http:\/\/[a-z0-9\.\/]{3,70}\.hta/

Too Many Matches and Slowing Down Scanning Error

Too many matches errors are caused by strings that are too general and appear too often in the input, or by YARA matching one instance multiple times.

Slowing down scanning is caused by strings that generate atoms that are too short, or none at all. As a result, YARA falls back to a naïve pattern matching algorithm, which causes the slowdown.

Both of these problems can, in some cases, be fixed with these steps:

  1. Check for the quantifiers .*, .+ and .*?
  2. Check for quantifiers without an upper bound, such as x{14,}
  3. Check for ranges that are too large (e.g. x{1,300000})
  4. Check for big jumps in hexadecimal strings
  5. Check for wild-card characters: can they be specified more precisely, or could the string be split into 2, omitting the wild-card character?
  6. Check for alternations: can they be split into 2 or more strings?
  7. Try to restrict matches to whole words (fullword, \b, ...)

Note that the next chapter, Conditions and Short-Circuit Evaluation, mentions a few tips for conditions. However, changes to conditions will not solve the too many matches and slowing down scanning errors.
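
As an illustration of step 7, the sketch below restricts a string to whole-word matches, once with the fullword modifier and once with \b inside a regular expression. The string "passwd" and the rule name are hypothetical.

```
rule word_boundary_sketch
{
    strings:
        // Matches only when delimited by non-alphanumeric characters
        $s  = "passwd" fullword
        // Same idea inside a regular expression
        $re = /\bpasswd\b/
    condition:
        any of them
}
```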

Conditions and Short-Circuit Evaluation

Try to write condition statements in which the elements that are most likely to be "False" are placed first. The condition is evaluated from left to right. The sooner the engine identifies that a rule is not satisfied, the sooner it can skip the current rule and evaluate the next one. The speed improvement gained by ordering the condition statements this way depends on the difference in CPU cycles needed to process each statement. If all statements are more or less equally expensive, reordering them causes no noticeable improvement. If one of the statements can be processed very fast, it is recommended to place it first so that the expensive statement is skipped whenever the first statement is false.

Changing the order in the following statement does not cause a significant improvement:

$string1 and $string2 and uint16(0) == 0x5A4D

However, if the execution time of the statements is very different, reordering in order to trigger the short-circuit will improve the scan speed significantly:

SLOW

// EXPENSIVE and CHEAP
math.entropy(0, filesize) > 7.0 and uint16(0) == 0x5A4D

FAST

// CHEAP and EXPENSIVE
uint16(0) == 0x5A4D and math.entropy(0, filesize) > 7.0
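
Written out as a complete rule, the fast ordering could look like the sketch below. The rule name and the 5MB size limit are assumptions added for illustration; the size check is another cheap statement that bounds the amount of data the entropy calculation has to process.

```
import "math"

rule high_entropy_pe_sketch
{
    condition:
        uint16(0) == 0x5A4D and             // cheap: 2 bytes at offset 0
        filesize < 5MB and                  // cheap: hypothetical size limit
        math.entropy(0, filesize) > 7.0     // expensive: reads the whole file
}
```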

Short-circuit evaluation was introduced to help optimize expensive statements, particularly "for" statements. Some people were using conditions like the one in the following example:

strings:
	$mz = "MZ"
	...
condition:
	$mz at 0 and for all i in (1..filesize) : ( whatever )

Because filesize can be a very big number, "whatever" can be executed a lot of times, slowing down the execution. Now, with short-circuit evaluation, the "for" statement is executed only if the first part of the condition is met, so this rule will be slow only for MZ files. An additional improvement could be:

$mz at 0 and filesize < 100KB and for all i in (1..filesize) : ( whatever )

This way an upper bound is set on the number of iterations.

Since version 3.10, integer range loops have also been optimized:

for all i in (0..100): (false)
for any i in (0..100): (true)

Both of these loops will stop iterating after the first time through.

No Short-Circuit for Regular Expressions

Sadly this does not work with regular expressions because they're all initially fed into the string matching engine. The following example will slow down the scan of every file, not just of those with a filesize smaller than 200 bytes:

strings:
  $expensive_regex = /\$[a-z0-9_]+\(/ nocase
condition:
  filesize < 200 and
  $expensive_regex

This "short-circuit" evaluation has been applied since YARA version 3.4.

Metadata

Any data in the metadata section is read into RAM by YARA. (You can easily test this by inserting 100,000 hashes into a rule and checking the RAM usage of the YARA scan before and after.) Of course you don't want to permanently remove the metadata from the rules, but if you're short on RAM, you could remove some unneeded parts of it in your workflow directly prior to scanning.
