  • Stars: 1,077
  • Rank 42,945 (Top 0.9 %)
  • Language: Python
  • License: MIT License
  • Created almost 3 years ago
  • Updated 7 months ago

Repository Details

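A very comprehensive Classical Chinese (ancient Chinese) to Modern Chinese parallel corpus

Classical Chinese (Ancient Chinese) - Modern Chinese Parallel Corpus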

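1. Corpus Overview

This is a very comprehensive Classical Chinese (ancient Chinese) to Modern Chinese parallel corpus that covers most of the classic ancient works. From a literary standpoint, the project collects all original Classical Chinese texts in the folder 古文原文 (original texts). Each book is divided and presented by section/chapter, and the body text is stored in a text.txt file under each chapter, for example 论语/学而篇/text.txt and 孟子/梁惠王章句上/第一节/text.txt. The parallel data is collected in the folder 双语数据 (bilingual data) and is split at the sentence level. Three data formats are provided (source text, translation, and bilingual), for example 论语/学而篇/source.txt, 论语/学而篇/target.txt, and 论语/学而篇/bitext.txt. Note: all data preserves, line by line, the relative order of the original Classical Chinese text; that is, the data is not shuffled.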
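As an illustration of the layout described above, the following minimal sketch reads one chapter folder into (Classical, Modern) sentence pairs. It assumes that line i of source.txt corresponds to line i of target.txt and that the files are UTF-8 encoded; neither detail is stated explicitly here, and the helper names `read_lines` and `load_pairs` are hypothetical rather than part of the project's scripts.

```python
from pathlib import Path
from typing import List, Tuple


def read_lines(path: Path) -> List[str]:
    """Return the lines of a text file without trailing newlines (UTF-8 assumed)."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(chapter_dir: Path) -> List[Tuple[str, str]]:
    """Pair source.txt and target.txt line by line for one chapter folder.

    Assumption: the two files are aligned by line number.
    """
    source = read_lines(chapter_dir / "source.txt")
    target = read_lines(chapter_dir / "target.txt")
    if len(source) != len(target):
        raise ValueError(
            f"line count mismatch in {chapter_dir}: {len(source)} vs {len(target)}"
        )
    return list(zip(source, target))
```

A call such as `load_pairs(Path("双语数据/论语/学而篇"))` would then return the pairs in their original order, since the data is not shuffled. The bitext.txt format is not described in this section, so the sketch does not attempt to parse it.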

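The corpus data was collected from the internet. The raw crawled data is bilingual data aligned at the passage level. Scripts then split it into sentences and aligned them, producing sentence-level aligned bilingual (parallel) data totaling 972,467 sentence pairs. The core alignment approach uses a normalized edit distance algorithm together with a length-ratio metric.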
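The two measures named above can be sketched as follows. This is only an illustration of the general techniques, not the project's alignment script: the text does not say how the two scores are combined, which strings are compared, or what thresholds are used, so none of that is shown. Normalizing by the longer string's length and defining the ratio as target length over source length are both assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling row."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1].

    Assumption: normalization by max length; other conventions exist.
    """
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def length_ratio(source: str, target: str) -> float:
    """Character-length ratio target/source (assumed definition)."""
    return len(target) / len(source) if source else float("inf")
```

In an aligner of this kind, a candidate pair would typically be accepted or rejected by comparing such scores against tuned thresholds; the actual values and decision rule are in the project's reproduction scripts, not here.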

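Note that the 双语数据 folder contains less Classical Chinese data than the 古文原文 folder. Some Classical Chinese texts in the data sources have no translation, and others have incomplete translations, so the 双语数据 folder includes only data that has bilingual sentence pairs.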

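2. Reproduction

This project provides the processing pipeline for the corpus along with the related scripts. For details, see 复现 (Reproduction).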

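3. Statistics

The original Classical Chinese texts comprise 327 books. The bilingual data comprises 97 books, with a total of 972,467 sentence pairs aligned at the sentence level. For detailed statistics, see 统计信息 (Statistics).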
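A rough way to check a local copy against these figures is to walk the bilingual folder and count lines. The sketch below assumes that every line of every source.txt under the folder is one sentence pair and that files are UTF-8 encoded; `count_pairs` is a hypothetical helper, and whether its result matches the published total exactly depends on details such as blank lines that are not described here.

```python
from pathlib import Path


def count_pairs(bilingual_root: Path) -> int:
    """Sum the line counts of every source.txt found under bilingual_root.

    Assumption: one line in source.txt equals one sentence pair.
    """
    total = 0
    for source_file in bilingual_root.rglob("source.txt"):
        with source_file.open(encoding="utf-8") as handle:
            total += sum(1 for _ in handle)
    return total
```

For example, `count_pairs(Path("双语数据"))` would give the total for the whole bilingual folder, and passing a single book's folder would give that book's count.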

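4. Statement

All data in this corpus comes from the internet. The source of all data is indicated; see the file 数据来源.txt under each book. The right of final interpretation of the raw data belongs to the respective data sources.

Thanks to the members who contributed to this corpus: 谈修泽 and 罗应峰.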

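5. Update History

v2.0 March 2023: Reorganized the data, retaining more detailed information about the raw data and indicating its sources

v1.0 February 2022: Initial organization of the data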

More Repositories

1. MTBook
   《机器翻译:基础与模型》 by 肖桐 and 朱靖波 - Machine Translation: Foundations and Models
   Language: TeX, Stars: 2,712

2. ABigSurvey
   A collection of 1000+ survey papers on Natural Language Processing (NLP) and Machine Learning (ML).
   Stars: 1,981

3. CNSurvey
   A list of Chinese-language survey papers (natural language processing & machine learning)
   Stars: 548

4. NiuTensor
   NiuTensor is an open-source toolkit developed by a joint team from NLP Lab. at Northeastern University and the NiuTrans Team. It provides tensor utilities to create and train neural networks.
   Language: C++, Stars: 379

5. ABigSurveyOfLLMs
   A collection of 150+ surveys on LLMs
   Stars: 172

6. NiuTrans.SMT
   NiuTrans.SMT is an open-source statistical machine translation system developed by a joint team from NLP Lab. at Northeastern University and the NiuTrans Team. The NiuTrans system is developed entirely in C++, so it runs fast and uses less memory. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string, and tree-to-tree) models for research-oriented studies.
   Language: C++, Stars: 144

7. NiuTrans.NMT
   A Fast Neural Machine Translation System developed in C++.
   Language: C++, Stars: 136

8. MT-paper-lists
   MT paper lists (by conference)
   Stars: 123

9. NASPapers
   Paper lists of neural architecture search (NAS)
   Stars: 121

10. LanguageCodes
    A list of languages with their codes, families, regions, and so on, plus a list of multilingual corpora (with URLs).
    Stars: 79

11. compiler-notes
    Stars: 60

12. Introduction-to-Transformers
    An introduction to basic concepts of Transformers and key techniques of their recent advances.
    Stars: 46

13. Vision-LLM-Alignment
    This repository contains the code for SFT, RLHF, and DPO, designed for vision-based LLMs, including the LLaVA models and the LLaMA-3.2-vision models.
    Language: Python, Stars: 41

14. MTVenues
    A list of conferences and journals relevant to machine translation
    Stars: 33

15. Hands-on-GEMM
    A tutorial on GEMM
    Language: Cuda, Stars: 7