• Stars: 127
• Rank: 282,790 (Top 6%)
• Language: Python
• License: Other
• Created over 11 years ago
• Updated over 8 years ago

Repository Details

A sentence aligner for comparable corpora

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
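
For environments without wget or tar, the same two steps can be sketched with the Python standard library alone. This is a minimal sketch, not part of Yalign: the helper names `download` and `unpack` and the constant `MODEL_URL` are hypothetical, and the URL is the one given above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Same model archive as in the wget command above.
MODEL_URL = ("https://raw.githubusercontent.com/machinalis/yalign/"
             "develop/data/models/0.1/en-es.tar.gz")


def download(url, archive_path):
    """Fetch url and save it to archive_path (the wget step)."""
    urllib.request.urlretrieve(url, archive_path)
    return Path(archive_path)


def unpack(archive_path, dest="."):
    """Extract a gzip-compressed tar archive into dest (the tar -xvzf step).

    Returns the list of member names found in the archive.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling `unpack(download(MODEL_URL, "en-es.tar.gz"))` would mirror the two shell commands. As with tar itself, only unpack archives from a source you trust.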

Now use the yalign-align script with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
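
Note that the Spanish URL is percent-encoded (`%C3%AD` is the UTF-8 encoding of "í"). When driving the script from Python, the command can be assembled as below. This is a rough sketch under stated assumptions: `wikipedia_url`, `align_command` and `run_align` are hypothetical helpers, not part of Yalign, and the sketch only wraps the command line shown above without assuming anything about the script's output format.

```python
import subprocess
from urllib.parse import quote


def wikipedia_url(lang, title):
    """Build a Wikipedia article URL, percent-encoding non-ASCII characters."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def align_command(model_dir, url_a, url_b):
    """Return the argument list for the yalign-align call shown above."""
    return ["yalign-align", model_dir, url_a, url_b]


def run_align(model_dir, url_a, url_b):
    """Run yalign-align (must be installed and on PATH) and return its stdout."""
    result = subprocess.run(align_command(model_dir, url_a, url_b),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For instance, `run_align("en-es", wikipedia_url("en", "Antiparticle"), wikipedia_url("es", "Antipartícula"))` issues the same command as above.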

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine
Gonzalo García Berrotarán
Rafael Carrascosa
Elías Andrawos
Laura Alonso Alemany

More Repositories

1. quepy (Python, 1,254 stars): A python framework to transform natural language questions to queries in a database query language.
2. iepy (Python, 905 stars): Information Extraction in Python
3. featureforge (Python, 381 stars): A set of tools for creating and testing machine learning features, with a scikit-learn compatible API
4. mypy-django (Python, 223 stars): PEP-484 type hints bindings for the Django web framework
5. telegraphy (JavaScript, 202 stars): Telegraphy provides real time events for WSGI Python applications
6. refo (Python, 143 stars): Regular expressions for objects
7. satimg (Jupyter Notebook, 117 stars): Satellite data processing experiments
8. mypy-data (Python, 86 stars): mypy typesheds for the Python data stack
9. bidderd (Go, 45 stars): RTBKIT Agent using Go and the HTTPInterface
10. django-i18n-helper (Python, 35 stars)
11. django-fasttest (Python, 21 stars): A variant on django.test.TestCase optimized for postgres
12. slides (TeX, 18 stars): Public talks by Machinalis
13. django-template-previewer (Python, 17 stars): A Django app to allow developers preview templates
14. mypy-django-example (Python, 15 stars): A usage example for mypy-django
15. django-test-autocomplete (Python, 12 stars)
16. eff (Python, 9 stars): Time tracking and report generation
17. ninja-django-plugin (Python, 4 stars): Django plugin for Ninja-IDE
18. inventor (HTML, 3 stars): Inventor a very simple django based inventory system.
19. protobuf-python3 (C++, 2 stars): Google protobuf port to python3
20. jquery_simple_progressbar (2 stars)
21. django-migration-tools (Python, 2 stars): Scripts for helping with routine tasks while migration from 0.96 django versions to 1.x
22. code_time_tracker (Python, 1 star)
23. ninja_ipython_console (Python, 1 star): An IPython console plugin for Ninja
24. machinalis-movie-reviews (Python, 1 star)
25. alfajor (Python, 1 star): A site to collect shopping orders for packages of items, designed for an alfajor seller