• Stars: 320
• Rank: 131,126 (Top 3 %)
• Language: PHP
• Created over 11 years ago
• Updated 12 months ago

Repository Details

PHP Class to detect languages from any free text

LanguageDetector

PHP Class to detect languages from any free text.

It follows the approach described in the paper: a given text is tokenized into N-grams (whitespace is cleaned up before this step). The tokens are then sorted and compared against a language model.
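The tokenization and sorting steps above can be sketched as follows. This is a minimal illustration, not the library's code: the function name `ngramProfile` and its parameters are hypothetical, and the ranking uses plain frequency counts rather than the PageRank sorter the library uses by default.

```php
<?php
// Hypothetical helper for illustration only; it is NOT part of LanguageDetector.
// Builds a ranked N-gram profile: the most frequent N-grams come first.
function ngramProfile(string $text, int $minN = 1, int $maxN = 3, int $limit = 300): array
{
    // Clean up whitespace: trim, then collapse every run into a single space.
    $clean = preg_replace('/\s+/u', ' ', trim($text));

    // Split into UTF-8 characters without requiring the mbstring extension.
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $len   = count($chars);

    // Count every N-gram of length $minN..$maxN.
    $counts = [];
    for ($n = $minN; $n <= $maxN; $n++) {
        for ($i = 0; $i + $n <= $len; $i++) {
            $gram = implode('', array_slice($chars, $i, $n));
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }

    // PHP turns numeric string keys into integers, so cast them back.
    $grams = array_map('strval', array_keys($counts));

    // Sort by count (descending), breaking ties by the N-gram itself (ascending)
    // so the result is deterministic.
    usort($grams, fn($a, $b) => [$counts[$b], $a] <=> [$counts[$a], $b]);

    return array_slice($grams, 0, $limit);
}

print_r(ngramProfile("aab", 1, 2));
```

The position of each N-gram in the returned array is its rank, which is what a rank-based comparison against a language model needs.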

How it works

The first thing we need is a language model (which looks like this file) that texts are compared against at classification time. The model must be generated before anything else, using a script similar to this file.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// it could use a little bit of memory, but it's fine
// because this process runs once.
ini_set('memory_limit', '1G');

// we load the configuration (which will be serialized
// later into our language model file)
$config = new LanguageDetector\Config;

$c = new LanguageDetector\Learn($config);
foreach (glob(__DIR__ . '/samples/*') as $file) { 
    // feed with examples ('language', 'text');
    $c->addSample(basename($file), file_get_contents($file));
}

// some callback so we know where the process is 
$c->addStepCallback(function($lang, $status) {
    echo "Learning {$lang}: $status\n";
});

// save it in `datafile`.
// we currently support the `php` serialization but it's trivial
// to add other formats, just extend `\LanguageDetector\Format\AbstractFormat`.
// You can check an example at https://github.com/crodas/LanguageDetector/blob/master/lib/LanguageDetector/Format/PHP.php
$c->save(LanguageDetector\Format\AbstractFormat::initFormatByPath('language.php'));

Once we have our language model file (in this case language.php) we're ready to classify texts by their language.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// we load the language model; this also creates
// the $config object for us.
$detect = LanguageDetector\Detect::initByPath('language.php');

$lang = $detect->detect("Agricultura (-ae, f.), sensu latissimo, 
est summa omnium artium et scientiarum et technologiarum quae de 
terris colendis et animalibus creandis curant, ut poma, frumenta, 
charas, carnes, textilia, et aliae res e terra bene producantur. 
Specialius, agronomia est ars et scientia quae terris colendis student, 
agricultio autem animalibus creandis.");

var_dump($lang);

And that's it.

Algorithms

The project is designed to work with modules, which means you can provide your own algorithms for sorting and comparing the N-grams. By default the library uses PageRank as the sorting algorithm and the out-of-place measure (described in the paper) as the comparison algorithm.
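To make the out-of-place comparison concrete, here is a minimal sketch. It assumes that profiles are plain arrays of N-grams ordered by rank, and that an N-gram missing from the language profile costs a fixed maximum penalty (here, the size of the language profile). The function names `outOfPlace` and `closestLanguage` are hypothetical and are not the library's API.

```php
<?php
// Hypothetical helpers for illustration only; they are NOT part of LanguageDetector.

// Sums, for every N-gram in the document profile, how far its rank is from
// the rank of the same N-gram in the language profile.
function outOfPlace(array $docProfile, array $langProfile, ?int $maxPenalty = null): int
{
    $docProfile  = array_values($docProfile);
    $langProfile = array_values($langProfile);
    $maxPenalty  = $maxPenalty ?? count($langProfile);

    // Map each N-gram to its rank in the language profile.
    $rankInLang = array_flip($langProfile);

    $distance = 0;
    foreach ($docProfile as $rank => $gram) {
        $distance += isset($rankInLang[$gram])
            ? abs($rank - $rankInLang[$gram])
            : $maxPenalty; // N-gram unknown to this language
    }
    return $distance;
}

// Returns the language whose profile has the smallest distance,
// or null when no models are given.
function closestLanguage(array $docProfile, array $models): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($models as $lang => $profile) {
        $distance = outOfPlace($docProfile, $profile);
        if ($distance < $bestDistance) {
            $best = (string) $lang;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Toy profiles, made up for this example.
$models = ['en' => ['t', 'h', 'e'], 'es' => ['e', 'l', 'a']];
echo closestLanguage(['t', 'h', 'e'], $models), "\n";
```

A lower distance means the document's N-gram ranking is closer to that language's ranking, so the language with the smallest total wins.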

To supply your own algorithms, you must change the $config at the learning stage to load your own classes (which, by the way, should implement some interfaces).

More Repositories

1. ActiveMongo (PHP, 142 stars): Simple and efficient ActiveRecord data abstraction for MongoDB
2. Haanga (PHP, 136 stars): Template compiler for PHP, Django-style (as much as possible). Pretty efficient by avoiding having anything at run-time.
3. TextRank (PHP, 104 stars): Extract relevant keywords from a given text
4. InfluxPHP (PHP, 60 stars): Simple PHP client for InfluxDB
5. php-git (PHP, 53 stars): Pure PHP script that performs read-only operations plus repository cloning (over HTTP).
6. Notoj (PHP, 37 stars): Yet another annotation parser (DocBlocks)
7. MongoFS (PHP, 22 stars): PHP Streams Wrappers for MongoDB GridFS
8. phpcluster (PHP, 16 stars): Unsupervised learning algorithm in PHP
9. SQLParser (PHP, 13 stars): SQL-Parser
10. textrank-old (PHP, 13 stars): Class to extract relevant words from a given text
11. Autoloader (PHP, 11 stars): Proper autoloader for PHP
12. Bancard (PHP, 11 stars): Library for payments with Bancard
13. PHPJS (C, 11 stars): Run javascript inside PHP (powered by duktape.org)
14. WatchFiles (PHP, 9 stars): Stateless way of watching files and directories for changes
15. corruptos.net (JavaScript, 9 stars): Source code of the website
16. CRouting (JavaScript, 8 stars): Pretty efficient URL Router in PHP.
17. PHPTextCat (C, 8 stars): TextCat (www.github.com/crodas/TextCat) binding for PHP.
18. Artifex (PHP, 8 stars): PHP Code generator for mere mortals
19. Dispatcher (PHP, 8 stars): Creates a URL dispatcher for your project
20. EasySQL (PHP, 7 stars): Easiest SQL abstraction ever.
21. php-hadoop (PHP, 7 stars): Simple PHP set of scripts to wrap Hadoop
22. ActiveMongo2 (PHP, 7 stars): PHP database abstraction for MongoDB
23. TextCat (C, 6 stars): Simple and lightweight library to classify text using N-Grams (useful to detect language)
24. Haanga2 (PHP, 6 stars): Clean rewrite of crodas/Haanga :-). This isn't finished. It is just a repository to show the progress. Do not use it yet.
25. Phar-Builder (PHP, 5 stars): Build Phar like a boss
26. SimpleView (PHP, 5 stars): Simple view engine based on Laravel's Blade view Engine
27. Haanga-web (PHP, 5 stars): Haanga Web site source code
28. ClassInfo (PHP, 5 stars): Get classes and functions defined in a given file
29. Slides (JavaScript, 5 stars): The source code of my slides
30. brhackday (PHP, 4 stars)
31. py-languess (Python, 4 stars): Language detector implemented in Python
32. PHP-HttpParser (C, 4 stars): Pretty Optimized Http parser for PHP (binding for node's http parser)
33. Eath (PHP, 4 stars): Super simple package installer for PHP
34. DbBackup (PHP, 4 stars): Easiest incremental backup tools for (My?)SQL databases.
35. framework (PHP, 3 stars): Tiny framework that I use daily; it is basically my libraries glued together
36. woocommerce-bancard (PHP, 3 stars)
37. textcat-rs (Rust, 3 stars): N-Gram-Based Text Categorization
38. Validator (PHP, 3 stars): Generate static validators to validate your data in PHP.
39. SimpleAssetManager (PHP, 3 stars): Extremely simple asset manager for PHP
40. pagerank-rs (Rust, 3 stars): Generic PageRank algorithm implementation in Rust with no external dependency
41. old-Tuicha (PHP, 2 stars): Simple mongodb abstraction for PHP. Tuicha means huge (humongous) in Guarani.
42. Autocomplete (PHP, 2 stars): Simple and efficient autocomplete
43. base64-secret-rs (Rust, 2 stars): Base64 encoder/decoder with custom alphabet. The alphabet is sorted by a given key. The sorting is always deterministic.
44. ServiceProvider (PHP, 2 stars): Little configuration manager and dependency injection
45. php-ctypes (2 stars): libffi extension wrapper for PHP
46. cli (PHP, 2 stars): Simple and silly abstraction on top of symfony/console
47. Packed (2 stars): Pack your PHP-libraries, share with everybody
48. FileUtils (PHP, 2 stars): Useful libraries to manipulate files, paths and anything related to files
49. crodas.github.com (2 stars): My personal home page at github
50. ServerTimeSync.js (C, 2 stars): Use the server time at the client side
51. HashRouter.js (JavaScript, 2 stars): Super simple URL dispatcher for single page apps.
52. Meneame.net (PHP, 2 stars): Méneame mirror
53. ActiveMongo2Laravel (PHP, 2 stars): ActiveMongo2 for Laravel4
54. wbp (PHP, 2 stars): Wordpress Blogs' Planetarium
55. Sorting (JavaScript, 2 stars): Simple class to sort elements
56. simple-calc (C, 2 stars): University homework, simple calculator implemented in C
57. Base64Secret (PHP, 2 stars): Library to encode/decode data with base64 with a custom alphabet, which is determined by a given secret key
58. Tuicha (PHP, 1 star): Simple ORM for MongoDB (PHP and HHVM)
59. microredis (Tcl, 1 star): Redis server implemented in rust.
60. php-odm-benchmark (PHP, 1 star)
61. Remember (PHP, 1 star): Easiest way to remember things across requests in PHP
62. redis-protocol-parser (Rust, 1 star): Redis Protocol Parser. A zero copy stream-friendly parser
63. SitemapGenerator (PHP, 1 star): Very simple sitemap (offline) generator.
64. level-blob (JavaScript, 1 star): Store blobs of data in streams in LevelUp
65. js-test (JavaScript, 1 star)
66. FSANode (JavaScript, 1 star)
67. notification (C, 1 star): C Webserver to send notifications in real time efficiently
68. Worker (PHP, 1 star): Queue-agnostic way of doing asynchronous jobs in PHP
69. Dataset (1 star): Some datasets I gathered over the years
70. crodas (1 star)
71. ApiServer (PHP, 1 star): Microframework that makes creating API servers child's play.
72. lingo-rs (Rust, 1 star): Standalone library and program for natural language classification.
73. Build (PHP, 1 star): Simple building scripts