

Japanese Natural Language Processing Libraries

Japanese NLP Library



1   Requirements

1.1   Links

  • All code at jProcessing Repo GitHub
  • PyPi Python Package
git clone git@github.com:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha (ISO-8859-1 configured) in Python
    • Longest Common Substring between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize, jReads
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha, configured with its original EUC-JP or ISO-8859-1 encoding, from Python code that works in UTF-8.

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私</tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">は</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼</tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">を</tok>
 </chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5</tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日</tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前</tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、</tok>
 </chunk>
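
The XML that cabocha() returns can be walked with the standard library. The following is a minimal sketch in Python 3, assuming the output is a complete, well-formed <sentence> element (the listing above is cut short). SAMPLE and parse_chunks are illustrative names and are not part of jNlp.

```python
import xml.etree.ElementTree as ET

# A small, well-formed excerpt in the same shape as the Cabocha output.
SAMPLE = """<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私</tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">は</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼</tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">を</tok>
 </chunk>
</sentence>"""


def parse_chunks(xml_text):
    """Return a list of (chunk_id, link, [(surface, pos), ...]) tuples."""
    root = ET.fromstring(xml_text)
    chunks = []
    for chunk in root.iter("chunk"):
        tokens = [(tok.text, tok.get("pos")) for tok in chunk.iter("tok")]
        chunks.append((int(chunk.get("id")), int(chunk.get("link")), tokens))
    return chunks


chunks = parse_chunks(SAMPLE)
for chunk_id, link, tokens in chunks:
    # Join the token surfaces to rebuild the chunk text.
    surface = "".join(text for text, _ in tokens)
    print(chunk_id, "->", link, surface)
```

The link attribute is the id of the chunk that the current chunk depends on, so the (chunk_id, link) pairs are enough to rebuild the dependency tree.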

2.3   Kanji / Katakana / Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart. See katakanaChart.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]
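
Chart-driven conversion can be illustrated without the library. The sketch below (Python 3) is only an assumption about how such a lookup might work: CHART is a hypothetical handful of entries, and the real data/katakanaChart.txt may use a different layout and romanization. katakana_to_romaji is an illustrative name, not a jNlp function.

```python
# Hypothetical excerpt of a katakana chart (not the real file contents).
CHART = {
    "キ": "ki", "シ": "si", "テ": "te", "ン": "n", "ウ": "u",
    "ショ": "syo", "チョ": "tyo",
}


def katakana_to_romaji(text, chart=CHART):
    """Convert katakana to romaji by greedy longest-match lookup."""
    longest = max(len(key) for key in chart)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that digraphs such as
        # "ショ" win over the single character "シ".
        for size in range(longest, 0, -1):
            piece = text[i:i + size]
            if piece in chart:
                out.append(chart[piece])
                i += len(piece)
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(katakana_to_romaji("キショウチョウ"))  # kisyoutyou
print(katakana_to_romaji("テンキ"))          # tenki
```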

katakanaChart.txt

2.4   Longest Common Substring in Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'There was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず
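
For reference, the longest common substring of two strings can be computed with a standard dynamic-programming table. This is a generic sketch in Python 3 and is not taken from long_substr, whose implementation may differ. longest_common_substring is an illustrative name.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("これでアナタも冷え知らず", "これでア冷え知らずナタも"))
# 冷え知らず
```

Because Python 3 strings are sequences of Unicode code points, the same function works on Japanese and English text without any encoding step.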

2.5   Similarity between two sentences jProcessing.py

Uses MinHash to estimate the overlap between two sentences: http://en.wikipedia.org/wiki/MinHash

English Strings:

>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a,b)
...0.444444444444

Japanese Strings:

>>> from jNlp.jTokenize import jTokenize
>>> from jNlp.jProcessing import *
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789
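
The idea behind MinHash can be sketched independently of the library. The code below (Python 3) is not the implementation of Similarities.minhash and is not expected to reproduce the scores above. The character-bigram shingles, the 128 hash functions and the MD5-based hashing are all assumptions made for illustration.

```python
import hashlib


def shingles(text, size=2):
    """Return the set of overlapping character n-grams in text."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def _hash(seed, item):
    # One hash function per seed, derived from MD5 for determinism.
    data = "{0}:{1}".format(seed, item).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def signature(items, num_hashes=128):
    """MinHash signature: the minimum hash value under each hash function."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def minhash_similarity(a, b, num_hashes=128):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 1.0 if sa == sb else 0.0
    sig_a, sig_b = signature(sa, num_hashes), signature(sb, num_hashes)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / num_hashes


print(jaccard("There was", "There is"))             # 0.5
print(minhash_similarity("There was", "There is"))  # an estimate of the value above
```

The estimate converges on the exact Jaccard value as num_hashes grows. The benefit of the signature is that it has a fixed size, so many sentences can be compared without keeping their full shingle sets.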

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Output Demo

3.2   Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

The Edict parser is by Paul Goins; see edict_search.py. The Edict example sentences are parsed by query (Pulkit Kathuria); see edict_examples.py. Pickle files of the Edict examples are provided, but the latest example files can be downloaded from the links provided.

3.3   Charset

Two files are used:

  • A UTF-8 example file, if you are not using src/jNlp/data/edict_examples

    To convert EUCJP/ISO-8859-1 to UTF-8:

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8

  • An ISO-8859-1 edict_dictionary file

For a query in Japanese, example sentences are output only for ambiguous words.
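
The iconv conversion can also be done from Python. The sketch below (Python 3) assumes the source file is EUC-JP. convert_encoding is an illustrative helper that is not part of jNlp, and the demo runs on a temporary file rather than on the real edict_examples data.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding="euc_jp", dst_encoding="utf-8"):
    """Re-encode a text file line by line, leaving line endings untouched."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Round-trip demo on a temporary EUC-JP file.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "examples.eucjp")
    dst_file = os.path.join(tmp, "examples.utf8")
    with open(src_file, "wb") as handle:
        handle.write("認める\n".encode("euc_jp"))
    convert_encoding(src_file, dst_file)
    with open(dst_file, "rb") as handle:
        converted = handle.read()

print(converted.decode("utf-8"))
```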

3.4   Links

Latest Dictionary files can be downloaded here

3.5   edict_search.py

author: Paul Goins
License included
linkToOriginal:

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')
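
To show what a dictionary entry looks like once parsed, here is a small sketch in Python 3. It does not use or reproduce the Parser class from edict_search.py. It assumes the common EDICT line layout of a headword, an optional reading in square brackets, and slash-separated glosses. SAMPLE_LINE is an illustrative line written for this example, not copied from the dictionary file.

```python
import re

# headword, optional [reading], then /gloss/gloss/.../
ENTRY_RE = re.compile(
    r"^(?P<word>\S+)(?:\s+\[(?P<reading>[^\]]+)\])?\s+/(?P<glosses>.*)/\s*$"
)


def parse_edict_line(line):
    """Split one EDICT-style line into word, reading and glosses."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    glosses = [g for g in match.group("glosses").split("/") if g]
    return {
        "word": match.group("word"),
        "reading": match.group("reading"),
        "glosses": glosses,
    }


# Illustrative line only; real entries carry more markers and glosses.
SAMPLE_LINE = "認める [みとめる] /(v1,vt) to recognize/to observe/to admit/"

entry = parse_edict_line(SAMPLE_LINE)
print(entry["word"], entry["reading"], entry["glosses"])
```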

3.6   edict_examples.py

Note: Only outputs example sentences for ambiguous words (words with more than one sense).
author: Pulkit Kathuria

>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers sentiment analysis of Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese WordNet, file name wnjpn-all.tab) and SentiWordNet (English SentiWordNet, file name SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is a baseline: it uses Wordnet as a simple mapping between English and Japanese, then classifies on the polarity scores from SentiWordnet.

  • All parts of speech are included (adnouns, nouns, verbs, ...)
  • No WSD module is applied to the Japanese sentence
  • Each word's most common sense is used for its polarity score

>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive
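
The baseline logic described above can be sketched with toy data. This is a Python 3 illustration only and is not the code in jSentiments. JP_WORDNET and SENTIWORDNET are hypothetical stand-ins for the two resource files, the synset IDs are made up, and only the scores for 全部 follow the values shown in this document.

```python
# Toy stand-ins for wnjpn-all.tab and SentiWordNet (made-up IDs and scores).
JP_WORDNET = {"最高": "synset-a", "全部": "synset-b", "最悪": "synset-c"}
SENTIWORDNET = {
    "synset-a": (0.75, 0.0),
    "synset-b": (0.625, 0.0),
    "synset-c": (0.0, 0.875),
}


def polarity(tokens, jp_wordnet=JP_WORDNET, sentiwordnet=SENTIWORDNET):
    """Sum positive and negative scores over the tokens found in the lexicon."""
    pos_total = neg_total = 0.0
    for token in tokens:
        synset = jp_wordnet.get(token)
        if synset is None:
            continue  # word not covered by the lexicon
        pos, neg = sentiwordnet.get(synset, (0.0, 0.0))
        pos_total += pos
        neg_total += neg
    return pos_total, neg_total


def classify(tokens):
    """Label the text by comparing the summed scores."""
    pos, neg = polarity(tokens)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"


tokens = ["全部", "最高"]  # in practice these would come from a tokenizer
print(polarity(tokens), classify(tokens))
```

Since no word sense disambiguation is applied, each word contributes the score of a single sense, which is the main limitation of a baseline of this kind.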

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0

5   Contacts

Author: pulkit[at]jaist.ac.jp (replace [at] with @)
