• Stars
    star
    1
  • Language
    Julia
  • License
    Other
  • Created over 10 years ago
  • Updated over 10 years ago

Repository Details

An implementation of the n-gram language model in Julia.

NGram

Linear interpolation

This implementation builds the model with linear interpolation. To see why, consider a plain trigram model, which estimates the probability of a word given the two preceding words directly from counts:

p("book" | "the", "green") = count("the green book") / count("the green")
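The count-based estimate above can be sketched in a few lines of Julia. This is only an illustration: `count_ngrams` and `trigram_mle` are hypothetical helper names, not part of this package's API, and the sketch assumes whitespace tokenization with no sentence-boundary markers.

```julia
# Hypothetical helpers for illustration; not the NGram package's API.

# Count every run of `n` consecutive words in each text.
function count_ngrams(texts, n)
    counts = Dict{String, Int}()
    for text in texts
        words = split(text)                      # whitespace tokenization (assumption)
        for i in 1:(length(words) - n + 1)       # empty range if the text is too short
            gram = join(words[i:i + n - 1], " ")
            counts[gram] = get(counts, gram, 0) + 1
        end
    end
    return counts
end

# Estimate p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
function trigram_mle(texts, w1, w2, w3)
    trigrams = count_ngrams(texts, 3)
    bigrams = count_ngrams(texts, 2)
    context = get(bigrams, "$w1 $w2", 0)
    context == 0 && return 0.0                   # unseen context: return 0 by convention
    return get(trigrams, "$w1 $w2 $w3", 0) / context
end

texts = ["the green book", "my blue book", "his green house", "book"]

p_seen = trigram_mle(texts, "the", "green", "book")     # trigram occurs in the corpus
p_unseen = trigram_mle(texts, "the", "green", "house")  # trigram never occurs
```

On this tiny corpus the trigram that was seen gets probability 1 and the one that was never seen gets probability 0, even though "green house" does appear elsewhere in the corpus.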

This estimate has some limitations:

  • Training a trigram model reliably requires a larger corpus than training a bigram or unigram model
  • count(trigram) is often zero, because many valid word sequences never appear in the corpus
  • Bigram and unigram models are easier to estimate, but they capture less context

The idea is then to combine the trigram estimate with the bigram and unigram estimates. More generally, an n-gram probability is computed as a weighted sum of the estimates of order n, n-1, ..., down to bigram and unigram. Here is an example for a trigram model.

p("book" | "the", "green") = a * count("the green book") / count("the green")
                          +  b * count("green book") / count("green")
                          +  c * count("book") / count()
    where
        a + b + c = 1
        a >= 0
        b >= 0
        c >= 0

# For example: a = b = c = 1 / 3
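As a rough sketch of how such an interpolated estimate could be computed, the snippet below uses the standard formulation, in which the bigram term conditions on the most recent word and the unigram term is the relative frequency of the predicted word. The names `count_ngrams` and `interpolated` are hypothetical and not part of this package's API; the sketch also assumes whitespace tokenization and that `count()` denotes the total number of words in the corpus.

```julia
# Hypothetical helpers for illustration; not the NGram package's API.

# Count every run of `n` consecutive words in each text.
function count_ngrams(texts, n)
    counts = Dict{String, Int}()
    for text in texts
        words = split(text)                      # whitespace tokenization (assumption)
        for i in 1:(length(words) - n + 1)
            gram = join(words[i:i + n - 1], " ")
            counts[gram] = get(counts, gram, 0) + 1
        end
    end
    return counts
end

# Interpolated p(w3 | w1, w2) with weights a, b, c (expected to sum to 1).
function interpolated(texts, w1, w2, w3; a = 1/3, b = 1/3, c = 1/3)
    uni = count_ngrams(texts, 1)
    bi = count_ngrams(texts, 2)
    tri = count_ngrams(texts, 3)
    total = sum(values(uni))                     # count(): total number of words (assumption)

    # Treat a ratio with an unseen denominator as 0 instead of dividing by zero.
    ratio(num, den) = den == 0 ? 0.0 : num / den

    p3 = ratio(get(tri, "$w1 $w2 $w3", 0), get(bi, "$w1 $w2", 0))  # trigram term
    p2 = ratio(get(bi, "$w2 $w3", 0), get(uni, w2, 0))             # bigram term
    p1 = ratio(get(uni, w3, 0), total)                             # unigram term
    return a * p3 + b * p2 + c * p1
end

texts = ["the green book", "my blue book", "his green house", "book"]

p = interpolated(texts, "the", "green", "book")
p_unseen = interpolated(texts, "my", "green", "book")   # "my green" never occurs
```

With equal weights, the first query works out to (1 + 0.5 + 0.3) / 3, roughly 0.6. The second query has an unseen trigram context, yet it still receives a non-zero probability from the bigram and unigram terms, which is the point of interpolating.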

Example

using NGram

texts = String["the green book", "my blue book", "his green house", "book"]

# Train a trigram model on the documents
model = NGramModel(texts, 3)

# Query on the model
# p(book | the, green)
model["the green book"]
