• Stars
    star
    160
  • Rank 226,231 (Top 5 %)
  • Language
    PHP
  • License
    Other
  • Created over 11 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Lexing experiments in PHP

Phlexy

This project is a followup to my post on fast lexing in PHP. It contains a few lexer implementations (both stateless and stateful) and related performance tests.

Usage

Lexers are created from a lexer definition using a factory class.

For example, if you want to create a MARK based stateless CSV lexer, you can use the following code:

<?php
require 'path/to/vendor/autoload.php';

$factory = new Phlexy\LexerFactory\Stateless\UsingMarks(
    new Phlexy\LexerDataGenerator
);

$lexer = $factory->createLexer(array(
    '[^",\r\n]+'                     => 0, // 0, 1, 2, 3 are the tokens
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1, // they should really be constants
    ','                              => 2,
    '\r?\n'                          => 3,
));

$tokens = $lexer->lex("hallo world,foo bar,more foo,more bar,\"rare , escape\",some more,stuff\n...");

Similarly a stateful lexer:

<?php
require 'path/to/lib/Phlexy/bootstrap.php';

$factory = new Phlexy\LexerFactory\Stateful\UsingMarks(
    new Phlexy\LexerDataGenerator
);

// The "i" is an additional modifier (all createLexer methods accept it)
$lexer = $factory->createLexer($lexerDefinition, 'i');

For an example of a stateful lexer definition, you can look the definition for lexing PHP source code.

Performance

A performance comparison for the different lexer implementations can be done using the performance testing script:

$ php-7.2 examples/performanceTests.php

Timing lexing of CVS data:
Took 0.55736708641052 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.526859998703 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.49272608757019 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.5570011138916 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.46333193778992 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of all "a":
Took 0.58650183677673 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.754310131073 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.70682787895203 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.76406478881836 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.62837815284729 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of all "z":
Took 0.79967403411865 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.30202317237854 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.29198718070984 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.36609601974487 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.12433409690857 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of random string:
Took 1.1720998287201 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.5946900844574 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.55696296691895 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.6708779335022 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.33155107498169 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing PHP lexing of this file:
Took 0.151211977005 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.025480031967163 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.007037878036499 seconds (Phlexy\Lexer\Stateful\UsingMarks)

Timing PHP lexing of larger TestAbstract file:
Took 0.49794602394104 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.083348035812378 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.019592046737671 seconds (Phlexy\Lexer\Stateful\UsingMarks)

Stateless\Simple and Stateful\Simple are trivial lexer implementations (which loop through the regular expressions).

Stateless\WithoutCapturingGroups, Stateless\WithCapturingGroups and Stateful\UsingCompiledRegex use the compiled regex approach described in the blog post mentioned above.

Stateless\UsingPregReplace is an extension of the compiled regex approach, where the looping through the regular expression is done by (mis)using preg_replace_callback.

Stateless\UsingMarks and Stateful\UsingMark use the (*MARK) mechanism that was exposed in PHP 5.5.

As the above performance measurments show, the Simple approach is a good bit slower than using a compiled regex approach. Mark based implementation perform much better than group offset based ones. The benefits increase with lexer size: For the CSV lexer there is relatively little difference, while for the PHP lexer the mark based implementation is 25x faster than the naive one.

More Repositories

1

PHP-Parser

A PHP parser written in PHP
PHP
16,432
star
2

FastRoute

Fast request router for PHP
PHP
4,986
star
3

scalar_objects

Extension that adds support for method calls on primitive types in PHP
PHP
1,120
star
4

iter

Iteration primitives using generators
PHP
1,108
star
5

php-ast

Extension exposing PHP 7 abstract syntax tree
PHP
900
star
6

PHP-Fuzzer

Experimental fuzzer for PHP libraries
PHP
394
star
7

DB

Very simple and secure PDO wrapper class
PHP
92
star
8

php-crater

Like crater, but for PHP
PHP
68
star
9

include-interceptor

Library to intercept and dynamically transform PHP includes. Forked from icewind1991/interceptor.
PHP
68
star
10

sample_prof

Sampling profiler for PHP
C
58
star
11

comparable

PHP extension implementing a magic "Comparable" interface
C
49
star
12

TypeUtil

Utility for adding PHP 7 scalar types and return types
PHP
47
star
13

ditaio

Cooperative task scheduler using coroutines
PHP
42
star
14

buffer

PHP extension for buffer based typed arrays
C
39
star
15

popular-package-analysis

Utilities to analyze most popular composer packages
PHP
38
star
16

nikic.github.com

CSS
36
star
17

llvm-compile-time-tracker

LLVM compile-tracking tracking infrastructure
PHP
33
star
18

Phuzzy

Fuzzer for PHP internal functions
PHP
31
star
19

prephp

Preprocesses PHP code to allow custom syntax and syntax of newer PHP versions, especially PHP 5.3
PHP
29
star
20

PHP-Backporter

Converts PHP 5.3 code to PHP 5.2 code
PHP
22
star
21

llvm-compile-time-data

LLVM compile-time performance data over time (repo 0).
18
star
22

tokenstream

Framework around PHP's Tokenizer to allow easier working with it
PHP
8
star
23

SPL-Datastructures

C
8
star
24

pear

Repo for publishing my PEAR packages
CSS
3
star
25

llvm-ir-diffs

LLVM
1
star
26

test-repo

For testing things.
1
star