Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

Install

$ pip install -U bunkai

Disambiguation without Models

$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
  • Feed each document as a single line, using ▁ (U+2581) in place of line breaks.
  • The output marks sentence boundaries with │ (U+2502).
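
The input and output conventions described above can be handled with a few lines of Python. The following is a minimal sketch and not part of Bunkai itself: the helper names to_bunkai_line and split_output_line are hypothetical, and it assumes the command-line format shown above (one document per line, U+2581 for line breaks, U+2502 for sentence boundaries).

```python
from typing import List

LINE_BREAK = "\u2581"  # ▁ stands for a line break inside a document
BOUNDARY = "\u2502"    # │ marks a sentence boundary in the output


def to_bunkai_line(document: str) -> str:
    """Flatten a multi-line document into the one-line input format."""
    return document.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def split_output_line(line: str) -> List[str]:
    """Split one output line into sentences at the boundary marker."""
    return [s for s in line.rstrip("\n").split(BOUNDARY) if s]


if __name__ == "__main__":
    print(to_bunkai_line("2文書目の先頭行です。\n改行はU+2581で表現します。"))
    print(split_output_line("2文書目の先頭行です。▁│改行はU+2581で表現します。"))
```

A document prepared with to_bunkai_line can be piped to the bunkai command, and each line of the result can then be passed to split_output_line.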

Disambiguation for Line Breaks with a Model

To disambiguate sentence boundaries at line breaks as well, add the --model option with the path to a model.

First, install the extras required by the --model option.

$ pip install -U 'bunkai[lb]'

Second, set up a model. This will take some time.

$ bunkai --model bunkai-model-directory --setup

Then, pass the model directory when you run bunkai.

$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。

Morphological Analysis Result

You can get morphological analysis results with the --ma option.
It can be combined with the --model option.

$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma --model bunkai-model-directory
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ

ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
 	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
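
Output in this shape is straightforward to post-process. The sketch below is an illustration only and not part of Bunkai: the function name parse_ma_output is hypothetical, and it assumes each morpheme line is a surface form, a tab, and comma-separated features, with EOS closing each sentence, as in the output above. It does not handle a surface form or feature that itself contains a comma.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]  # (surface form, feature fields)


def parse_ma_output(text: str) -> List[List[Token]]:
    """Group tab-separated morpheme lines into sentences ending at EOS."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if line == "EOS":
            # EOS closes the current sentence.
            sentences.append(current)
            current = []
        elif "\t" in line:
            surface, features = line.split("\t", 1)
            current.append((surface, features.split(",")))
        # Lines without a tab (for example an empty line) are skipped.
    if current:
        # Keep trailing tokens even if the final EOS is missing.
        sentences.append(current)
    return sentences
```

Each returned sentence is a list of (surface, features) pairs, so the part of speech of a token is features[0].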

Python Library

You can also use Bunkai as a Python library.

from bunkai import Bunkai
bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)

To disambiguate line breaks as well, pass the path of the model you set up.

from pathlib import Path

from bunkai import Bunkai

bunkai = Bunkai(path_model=Path("bunkai-model-directory"))
for sentence in bunkai("そうなんです▁このように▁pythonライブラリとしても▁使えます!"):
    print(sentence)

"""
Output:
そうなんです▁
このように▁pythonライブラリとしても▁使えます!
"""

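When the source text contains real newlines, the U+2581 convention has to be applied before splitting and undone afterwards. The following is a minimal sketch under stated assumptions: the helper name sentences_with_newlines is hypothetical, and the splitter argument is any callable that takes one string and yields sentences, which is how the Bunkai object is used in the examples above.

```python
from typing import Callable, Iterable, List

LINE_BREAK = "\u2581"  # ▁ stands for a line break inside a document


def sentences_with_newlines(
    splitter: Callable[[str], Iterable[str]], text: str
) -> List[str]:
    """Split text into sentences, preserving its original newlines."""
    # Encode newlines the way the splitter expects them.
    line = text.replace("\n", LINE_BREAK)
    # Decode the marker back to a newline in each returned sentence.
    return [sentence.replace(LINE_BREAK, "\n") for sentence in splitter(line)]


# Assumed usage with the model set up earlier (not executed here):
#   from pathlib import Path
#   from bunkai import Bunkai
#   bunkai = Bunkai(path_model=Path("bunkai-model-directory"))
#   sentences = sentences_with_newlines(bunkai, "そうなんです\nこのように使えます!")
```

Taking the splitter as a parameter keeps the helper independent of how the model is loaded.
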
For more information, see examples.

Documents

References

  • Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp. 71-75. November 2020.

License

Apache License 2.0
