  • Stars: 804
  • Rank 56,681 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created over 6 years ago
  • Updated over 1 year ago


Repository Details

Crawl BookCorpus

Homemade BookCorpus

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Crawling may be difficult because of issues with the website. Please also consider other options, such as using publicly available files, at your own risk.

For example,

Also, a paper by Jack Bandy and Nicholas Vincent is valuable for understanding the deficiencies of "BookCorpus" and its replicas.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


These are scripts to reproduce BookCorpus by yourself.

BookCorpus is a popular large-scale text corpus, especially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...

This repository includes a crawler that collects data from smashwords.com, the original source of BookCorpus. The collected sentences may partially differ from the original, but their number will be larger or almost the same. If you use the new corpus in your work, please specify that it is a replica.

How to use

Prepare URLs of available books. This repository already includes such a list as url_list.jsonl, a snapshot I (@soskek) collected on Jan 19-20, 2019. You can use it instead if you'd like.

python -u download_list.py > url_list.jsonl &
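
Before downloading, it can help to check what the list contains. The following is a minimal sketch, assuming only that url_list.jsonl is in JSON Lines format (one JSON object per line), as the extension suggests. It makes no assumption about field names; `read_jsonl` and `summarize` are hypothetical helpers and are not part of this repository.

```python
import json
from typing import Iterator, List, Set, Tuple


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize(path: str) -> Tuple[int, List[str]]:
    """Return the number of records and the sorted union of their keys.

    Assumes every record is a JSON object (a dict after parsing).
    """
    count = 0
    keys: Set[str] = set()
    for record in read_jsonl(path):
        count += 1
        keys.update(record)  # adds the record's keys
    return count, sorted(keys)
```

Calling `summarize("url_list.jsonl")` would report how many entries the list holds and which fields each entry may carry.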

Download their files. A txt file is downloaded if available. Otherwise, the script tries to extract text from the epub. The additional argument --trash-bad-count filters out epub files whose word count differs greatly from the official stat (because this may imply a failed extraction).

python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

The results are saved into the directory given by --out (here, out_txts).
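
The idea behind a word-count filter such as --trash-bad-count can be sketched as follows. This is only an illustration under stated assumptions, not the logic in download_files.py: the function name, the whitespace-based word count, and the `tolerance` threshold are all hypothetical.

```python
def word_count_is_plausible(text: str, official_count: int,
                            tolerance: float = 0.5) -> bool:
    """Return True if the extracted word count is close to the official one.

    Words are counted by splitting on whitespace (an assumption). The text
    passes when the absolute difference is at most `tolerance` times the
    official count. `tolerance` is a hypothetical threshold.
    """
    if official_count <= 0:
        # Nothing to compare against; keeping the file is an assumption.
        return True
    extracted = len(text.split())
    return abs(extracted - official_count) <= tolerance * official_count
```

For instance, a two-word extraction checked against an official count of 10 would be rejected with the default tolerance, which is the kind of mismatch that suggests a failed epub extraction.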

Postprocessing

Make a concatenated text file in sentence-per-line format.

python make_sentlines.py out_txts > all.txt
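
To illustrate what sentence-per-line format means, here is a minimal sketch that uses a naive regular-expression splitter as a stand-in. The repository's script may split sentences differently and more robustly; `to_sentence_lines` is a hypothetical name, and the rule "split after `.`, `!`, or `?` followed by whitespace" is a simplifying assumption that mishandles abbreviations.

```python
import re
from typing import List

# Split at whitespace that follows sentence-final punctuation (naive rule).
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(text: str) -> List[str]:
    """Return the sentences of `text`, one list item per output line."""
    # Collapse all whitespace, including newlines, into single spaces.
    flat = " ".join(text.split())
    if not flat:
        return []
    return [s for s in _SENT_END.split(flat) if s]


if __name__ == "__main__":
    sample = "Hello world.  How are\nyou? Fine!"
    print("\n".join(to_sentence_lines(sample)))
```

Each element of the returned list corresponds to one line of a file like all.txt.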

If you want to tokenize them into segmented words with Microsoft's BlingFire, run the command below. You can also use another tokenizer of your choice.

python make_sentlines.py out_txts | python tokenize_sentlines.py > all.tokenized.txt

Disclaimer

Please check the terms of the source website; for example, you can refer to the terms of smashwords.com. Please use the code responsibly and adhere to the respective copyright and related laws. I am not responsible for any plagiarism or legal implication that arises as a result of this repository.

Requirement

  • python3 is recommended
  • beautifulsoup4
  • progressbar2
  • blingfire
  • html2text
  • lxml
pip install -r requirements.txt

Note on Errors

  • Some error messages are expected, e.g., Failed: epub and txt, File is not a zip file, or Failed to open. The number of failures will be much smaller than the number of successes, so don't worry.

Acknowledgement

epub2txt.py is derived and modified from https://github.com/kevinxiong/epub2txt/blob/master/epub2txt.py

Citation

If you find this code useful, please cite it with the URL.

@misc{soskkobayashi2018bookcorpus,
    author = {Sosuke Kobayashi},
    title = {Homemade BookCorpus},
    howpublished = {\url{https://github.com/soskek/bookcorpus}},
    year = {2018}
}

Also, the original papers that introduced the original BookCorpus are as follows:

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler. "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books." arXiv preprint arXiv:1506.06724, ICCV 2015.

@InProceedings{Zhu_2015_ICCV,
    title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},
    author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},
    booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
    month = {December},
    year = {2015}
}
@inproceedings{moviebook,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06724},
    year = {2015}
}

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. "Skip-Thought Vectors." arXiv preprint arXiv:1506.06726, NIPS 2015.

@article{kiros2015skip,
    title={Skip-Thought Vectors},
    author={Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja},
    journal={arXiv preprint arXiv:1506.06726},
    year={2015}
}
