Analysis of the novel "The Story of a Stone"

Author: Yu Lou

Written for Python 3.

Blog post: Analyzing "Dream of the Red Chamber" with Python (original title: 用 Python 分析《红楼梦》)

preprocess.py

Preprocess text.

Input

  • hlm.txt: original text

Output

  • preprocessing.txt: preprocessed text.
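
The README does not say what the preprocessing does. As a minimal sketch, assuming it means keeping only Chinese characters and turning every run of punctuation or whitespace into a line break, it could look like this (the function name and the rule itself are assumptions, not the author's code):

```python
import re

# Assumption: anything outside the basic CJK ideograph range is treated
# as a separator (punctuation, whitespace, Latin letters, digits).
NON_CJK = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text: str) -> str:
    """Collapse every run of non-Chinese characters into one newline."""
    return NON_CJK.sub("\n", text).strip("\n")


if __name__ == "__main__":
    # Hypothetical usage with the file names listed above.
    with open("hlm.txt", encoding="utf-8") as src:
        cleaned = preprocess(src.read())
    with open("preprocessing.txt", "w", encoding="utf-8") as dst:
        dst.write(cleaned)
```

Putting each punctuation-free fragment on its own line keeps later steps from joining characters across a sentence boundary.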

preprocess_chapters.py

Split text into chapters and preprocess them.

Input

  • hlm.txt: original text

Output

  • chapters(folder): preprocessed text. One file for each chapter, numbered from "1.text".
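
One way to split the text into chapters is sketched below. It assumes every chapter begins on a line starting with a heading of the form "第…回" written in Chinese numerals. The heading pattern and the function names are illustrative assumptions. Only the folder name and the "1.text" numbering come from the description above.

```python
import re
from pathlib import Path

# Assumption: a chapter heading is "第" + Chinese numerals + "回"
# at the start of a line, e.g. "第一回".
CHAPTER_HEADING = re.compile(r"^第[零〇一二三四五六七八九十百]+回", re.MULTILINE)


def split_chapters(text: str) -> list[str]:
    """Return the chapters in order; text before the first heading is dropped."""
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


def write_chapters(chapters: list[str], folder: str = "chapters") -> None:
    """Write one file per chapter, numbered from 1.text."""
    out = Path(folder)
    out.mkdir(exist_ok=True)
    for number, chapter in enumerate(chapters, start=1):
        (out / f"{number}.text").write_text(chapter, encoding="utf-8")
```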

dict_creat.py

Create the dictionary.

Input

  • preprocessing.txt: preprocessed text.

Output

  • dict.csv: dictionary.
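
The README does not describe how the dictionary is built. As an illustration only, the sketch below counts every short character sequence in the preprocessed text and keeps the frequent ones as word candidates. The thresholds, the two-column CSV layout (word, count) and the function names are assumptions. The actual script may use different statistics.

```python
import csv
from collections import Counter


def build_dictionary(lines, max_len=4, min_count=2):
    """Count substrings of 2..max_len characters and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        for n in range(2, max_len + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return {word: c for word, c in counts.items() if c >= min_count}


def save_dictionary(dictionary, path="dict.csv"):
    """Write one 'word,count' row per entry, most frequent first."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for word, count in sorted(dictionary.items(), key=lambda kv: -kv[1]):
            writer.writerow([word, count])
```

Counting per line means a candidate never spans the separators introduced during preprocessing.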

word_split.py

Split the text into words.

Input

  • preprocessing.txt: preprocessed text.
  • dict.csv: dictionary.

Output

  • word_split.text: split text.
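
A simple dictionary-based splitter can be sketched with forward maximum matching: at each position, take the longest dictionary entry that fits, and fall back to a single character. This technique is swapped in for illustration; the README does not say which algorithm word_split.py uses, and the function name is hypothetical.

```python
def split_words(line, dictionary, max_len=4):
    """Greedy forward maximum matching against a set of known words."""
    words = []
    i = 0
    while i < len(line):
        # Try the longest candidate first; a single character always matches.
        for n in range(min(max_len, len(line) - i), 0, -1):
            piece = line[i:i + n]
            if n == 1 or piece in dictionary:
                words.append(piece)
                i += n
                break
    return words
```

For example, split_words("林黛玉来了", {"林黛玉"}) yields the three pieces "林黛玉", "来" and "了". Greedy matching is fast but can pick a wrong boundary when dictionary entries overlap.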

word_split_chapters.py

Split the text of every chapter into words.

Input

  • preprocessing.txt: preprocessed text.
  • dict.csv: dictionary.
  • chapters(folder): preprocessed text for all chapters.

Output

  • chapter_split(folder): split text. One file for each chapter, numbered from "1.text".

word_count.py

Count words.

Input

  • word_split.text: split text.

Output

  • word_count.csv: counting result, sorted by number of occurrences.
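
Counting can be sketched with collections.Counter, assuming the words in word_split.text are separated by whitespace (the README does not state the separator). The function names and the two-column CSV layout are likewise assumptions.

```python
import csv
from collections import Counter


def count_words(split_text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    return Counter(split_text.split()).most_common()


def save_counts(pairs, path="word_count.csv"):
    """Write one 'word,count' row per word."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(pairs)
```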

word_count_chapters.py

Count words in each chapter.

Input

  • chapter_split(folder): split text for each chapter.

Output

  • word_count_chapters.csv: counting result. One row per word and one column per chapter.
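
The word-by-chapter table can be sketched as below, again assuming whitespace-separated words in each chapter file. The function name and the sorted row order are assumptions.

```python
from collections import Counter


def count_by_chapter(chapters):
    """Build rows of [word, count_in_chapter_1, count_in_chapter_2, ...].

    chapters: list of strings, one per chapter, words separated by whitespace.
    """
    per_chapter = [Counter(chapter.split()) for chapter in chapters]
    vocabulary = sorted(set().union(*per_chapter))
    # A Counter returns 0 for words that do not occur in a chapter.
    return [[word] + [c[word] for c in per_chapter] for word in vocabulary]
```

Each returned row can be passed directly to csv.writer to produce the file.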

analysis.py

Run principal component analysis (PCA) and show the result on screen.

Prerequisite

"sklearn", "numpy" and "matplotlib" are required to run this program.

Input

  • word_count_chapters.csv: word counting result for each chapter.

Output

  • components.csv: weights for each component.
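
A minimal sketch of the PCA step with scikit-learn follows. It assumes each chapter is treated as one sample and that raw counts are converted to per-chapter relative frequencies first. Both choices and the function name are assumptions, not necessarily what analysis.py does.

```python
import numpy as np
from sklearn.decomposition import PCA


def chapter_pca(counts, n_components=2):
    """Project chapters onto principal components.

    counts: array-like of shape (n_words, n_chapters), as in the CSV layout
    above. Every chapter must contain at least one counted word.
    """
    counts = np.asarray(counts, dtype=float)
    samples = counts.T                                       # (n_chapters, n_words)
    samples = samples / samples.sum(axis=1, keepdims=True)   # relative frequencies
    pca = PCA(n_components=n_components)
    projected = pca.fit_transform(samples)                   # (n_chapters, n_components)
    return projected, pca.components_                        # weights: (n_components, n_words)
```

The first returned array can be drawn as a scatter plot with matplotlib, and the second holds the per-word weights of each component.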

suffix_tree.py

Library for suffix tree.
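
To illustrate the kind of query such a structure supports, here is a naive, depth-limited suffix trie that counts how often a short substring occurs. It is a simplified stand-in, not a compressed suffix tree, and the class and method names are hypothetical rather than the library's API.

```python
class SuffixTrie:
    """Depth-limited trie over all suffixes of a text, with occurrence counts."""

    def __init__(self, text, max_depth=5):
        self.root = {}
        for start in range(len(text)):
            node = self.root
            # Insert the first max_depth characters of each suffix.
            for ch in text[start:start + max_depth]:
                entry = node.setdefault(ch, [0, {}])
                entry[0] += 1
                node = entry[1]

    def count(self, pattern):
        """Occurrences of pattern; returns 0 if longer than max_depth or empty."""
        node = self.root
        hits = 0
        for ch in pattern:
            if ch not in node:
                return 0
            hits, node = node[ch]
        return hits
```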

correctness_calculate.py

Calculate the correctness of the word-splitting algorithm.

Input

  • *_answer.txt: reference answer (the expected word splitting).
  • *_result.txt: result produced by the program.

("*" stands for the file prefix.)
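
One common way to score a word splitter is to compare word spans between the answer and the result and report precision and recall. The sketch below assumes both files contain the same text with words separated by whitespace. The metric and the function names are assumptions, since the README does not define "correctness".

```python
def word_spans(segmented):
    """Map a whitespace-separated segmentation to a set of (start, end) spans."""
    spans = set()
    pos = 0
    for word in segmented.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def score(answer, result):
    """Return (precision, recall) of the result's words against the answer."""
    gold = word_spans(answer)
    predicted = word_spans(result)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

A word counts as correct only when both of its boundaries match the answer, so one misplaced split penalizes both neighbouring words.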