Naver sentiment movie corpus v1.0
This is a movie review dataset in the Korean language. Reviews were scraped from Naver Movies.
The dataset was constructed following the method described for the Large Movie Review Dataset (Maas et al., 2011).
Data description
- Each file consists of three columns: id, document, label
  - id: The review id, provided by Naver
  - document: The actual review
  - label: The sentiment class of the review (0: negative, 1: positive)
  - Columns are delimited with tabs (i.e., .tsv format; the file extension is .txt for easy access for novices)
- 200K reviews in total
  - ratings.txt: All 200K reviews
  - ratings_test.txt: 50K reviews held out for testing
  - ratings_train.txt: 150K reviews for training
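Because the files are plain tab-delimited text with a header row, they can be read with Python's standard csv module. The following is a minimal sketch, not part of the original distribution: the `Review` type and `read_ratings` function are hypothetical names, and the in-memory sample stands in for a real file.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Review(NamedTuple):
    id: str        # review id as provided by Naver, kept as a string
    document: str  # the review text
    label: int     # 0 = negative, 1 = positive


def read_ratings(lines: Iterable[str]) -> List[Review]:
    """Parse tab-delimited lines that start with an `id, document, label` header."""
    # QUOTE_NONE keeps any quote characters inside a review as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Review(row["id"], row["document"], int(row["label"])) for row in reader]


# Tiny in-memory sample laid out like the data files (header + tab-separated rows).
sample = io.StringIO(
    "id\tdocument\tlabel\n"
    "9976970\t아 더빙.. 진짜 짜증나네요 목소리\t0\n"
    "3819312\t흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나\t1\n"
)
reviews = read_ratings(sample)
print(len(reviews), reviews[0].label, reviews[1].label)
```

For the actual files, passing an open file object should work the same way, for example `open("ratings_train.txt", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption here, since the description above does not state it.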
Characteristics
- All reviews are shorter than 140 characters
- Each sentiment class is sampled equally (i.e., random guess yields 50% accuracy)
  - 100K negative reviews (originally reviews of ratings 1-4)
  - 100K positive reviews (originally reviews of ratings 9-10)
  - Neutral reviews (originally reviews of ratings 5-8) are excluded
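The rating-to-label rule above can be sketched as a small function. This only illustrates the thresholds listed here and is not the authors' construction script; the name `rating_to_label` is hypothetical, and ratings are assumed to be integers from 1 to 10.

```python
from collections import Counter
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-10 rating to a sentiment label, or None for an excluded neutral rating."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 4:
        return 0  # ratings 1-4: negative
    if rating >= 9:
        return 1  # ratings 9-10: positive
    return None   # ratings 5-8: neutral, excluded from the corpus


# Apply the rule to every possible rating and keep only the labeled ones.
labels = [rating_to_label(r) for r in range(1, 11)]
kept = [label for label in labels if label is not None]
print(Counter(kept))
```

Note that the thresholds alone do not balance the classes. The equal 100K/100K split comes from sampling each class equally, as stated above.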
Quick peek
$ head ratings_train.txt
id	document	label
9976970	아 더빙.. 진짜 짜증나네요 목소리	0
3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
10265843	너무재밓었다그래서보는것을추천한다	0
9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다	1
5403919	막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.	0
7797314	원작의 긴장감을 제대로 살려내지못했다.	0
9443947	별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네	0
7156791	액션이 없는데도 재미 있는 몇안되는 영화	1