• Stars
    218
  • Rank 180,887 (Top 4 %)
  • Language
    Python
  • License
    MIT License
  • Created about 6 years ago
  • Updated 5 months ago

Repository Details

A Korean news crawler built to ingest large amounts of news data.

KoreaNewsCrawler

This crawler collects news articles from media outlets that are published on the Naver portal.
The article categories that can be crawled are politics, economy, living/culture, IT/science, society, world, and opinion.
For sports articles, the categories are world baseball, world soccer, Korean baseball, Korean soccer, basketball, volleyball, golf, general sports, and e-sports.

How to install

pip install KoreaNewsCrawler

Method

  • set_category(category_name)

์ด ๋ฉ”์„œ๋“œ๋Š” ์ˆ˜์ง‘ํ•˜๋ ค๊ณ ์ž ํ•˜๋Š” ์นดํ…Œ๊ณ ๋ฆฌ๋Š” ์„ค์ •ํ•˜๋Š” ๋ฉ”์„œ๋“œ์ž…๋‹ˆ๋‹ค.
ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๋Š” ์นดํ…Œ๊ณ ๋ฆฌ๋Š” '์ •์น˜', '๊ฒฝ์ œ', '์‚ฌํšŒ', '์ƒํ™œ๋ฌธํ™”', 'IT๊ณผํ•™', '์„ธ๊ณ„', '์˜คํ”ผ๋‹ˆ์–ธ'์ž…๋‹ˆ๋‹ค.
ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ์—ฌ๋Ÿฌ ๊ฐœ ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
category_name: ์ •์น˜, ๊ฒฝ์ œ, ์‚ฌํšŒ, ์ƒํ™œ๋ฌธํ™”, IT๊ณผํ•™, ์„ธ๊ณ„, ์˜คํ”ผ๋‹ˆ์–ธ or politics, economy, society, living_culture, IT_science, world, opinion

  • set_date_range(startyear, startmonth, endyear, endmonth)

This method sets the period of the news you want to collect. By default, data is collected from startmonth through endmonth.

  • start()

This method starts the crawl.

Article News Crawler Example

from korea_news_crawler.articlecrawler import ArticleCrawler

Crawler = ArticleCrawler()  
Crawler.set_category("์ •์น˜", "IT๊ณผํ•™", "economy")  
Crawler.set_date_range("2017-01", "2018-04-20")
Crawler.start()

News in the politics, IT/science, and economy categories from January 2017 through April 20, 2018 is crawled in parallel using multiprocessing.

Sports News Crawler Example

The methods are similar to those of ArticleCrawler().

from korea_news_crawler.sportcrawler import SportCrawler 

Spt_crawler = SportCrawler()
Spt_crawler.set_category('한국야구', '한국축구')
Spt_crawler.set_date_range("2017-01", "2018-04-20")
Spt_crawler.start()

News in the Korean baseball and Korean soccer categories from January 2017 through April 20, 2018 is crawled in parallel using multiprocessing.

Results

Column A: Article date & time
Column B: Article category
Column C: Press (news outlet)
Column D: Article headline
Column E: Article content
Column F: Article URL
All collected data is saved with the .csv extension.

KoreaNewsCrawler (English version)

This crawler collects news articles from the Naver portal.
Crawlable article categories include politics, economy, living/culture, world, IT/science, and society.
Sports categories include Korean baseball, Korean soccer, world baseball, world soccer, basketball, volleyball, golf, general sports, and e-sports.

The sports article crawler cannot be used at the moment because the HTML format of the sports pages has changed. I will update the sports article crawler as soon as possible.

How to install

pip install KoreaNewsCrawler

Method

  • set_category(category_name)

This method sets the categories that you want to crawl.
The categories accepted as parameters are politics, economy, society, living_culture, IT_science, world, and opinion. Multiple categories can be passed.

  • set_date_range(startyear, startmonth, endyear, endmonth)

This method sets the period of the news you want to collect.
Data is collected from startmonth through endmonth.
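
The examples in this README pass the range as two strings such as "2017-01" and "2018-04-20". As an illustration of how such strings could be turned into concrete dates and a month-by-month schedule, here is a minimal sketch. The functions `parse_date_range` and `iter_months` are hypothetical and not part of KoreaNewsCrawler. Treating a missing day as the first day of the month for the start and the last day of the month for the end is an assumption, not documented library behavior.

```python
import calendar
from datetime import date


def parse_date_range(start, end):
    """Parse 'YYYY-MM' or 'YYYY-MM-DD' strings into a (start, end) date pair.

    Hypothetical helper. Assumption: a missing day means the first day of
    the month for the start and the last day of the month for the end.
    """
    def parse(text, default_to_month_end):
        parts = [int(part) for part in text.split("-")]
        if len(parts) == 3:
            return date(*parts)
        if len(parts) != 2:
            raise ValueError(f"expected YYYY-MM or YYYY-MM-DD, got {text!r}")
        year, month = parts
        # monthrange returns (weekday of first day, number of days in month).
        day = calendar.monthrange(year, month)[1] if default_to_month_end else 1
        return date(year, month, day)

    start_date = parse(start, default_to_month_end=False)
    end_date = parse(end, default_to_month_end=True)
    if start_date > end_date:
        raise ValueError("start date is after end date")
    return start_date, end_date


def iter_months(start_date, end_date):
    """Yield (year, month) pairs covering the range, inclusive."""
    year, month = start_date.year, start_date.month
    while (year, month) <= (end_date.year, end_date.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


start_date, end_date = parse_date_range("2017-01", "2018-04-20")
months = list(iter_months(start_date, end_date))
print(start_date, end_date, len(months))
```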

  • start()

This method starts the crawl.

Article News Crawler Example

from korea_news_crawler.articlecrawler import ArticleCrawler

Crawler = ArticleCrawler()  
Crawler.set_category("politics", "IT_science", "economy")  
Crawler.set_date_range("2017-01", "2018-04-20") 
Crawler.start()

News in the politics, IT/science, and economy categories from January 2017 through April 20, 2018 is crawled in parallel using multiprocessing.

Sports News Crawler Example

The methods are similar to those of ArticleCrawler().

from korea_news_crawler.sportcrawler import SportCrawler 

Spt_crawler = SportCrawler()
Spt_crawler.set_category('korea baseball','korea soccer')
Spt_crawler.set_date_range("2017-01", "2018-04-20") 
Spt_crawler.start()

News in the Korean baseball and Korean soccer categories from January 2017 through April 20, 2018 is crawled in parallel using multiprocessing.

Results

Column A: Article date & time
Column B: Article category
Column C: Press (news outlet)
Column D: Article headline
Column E: Article content
Column F: Article URL

All collected data is saved as a CSV file.
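
Once a crawl has finished, the CSV output can be loaded with Python's standard csv module. The sketch below follows the column order listed above. The field names in `FIELDNAMES`, the helper `read_articles`, and the sample row are illustrative and not defined by KoreaNewsCrawler. It also assumes comma-delimited output with no header row, which may not match the actual files.

```python
import csv
import io

# Column order follows the A-F list above. The field names are illustrative
# labels chosen for this sketch, not names defined by KoreaNewsCrawler.
FIELDNAMES = ["datetime", "category", "press", "headline", "content", "url"]


def read_articles(fileobj):
    """Parse CSV rows into dictionaries keyed by FIELDNAMES.

    Assumption: the file is comma-delimited and has no header row.
    """
    reader = csv.DictReader(fileobj, fieldnames=FIELDNAMES)
    return list(reader)


# Stand-in for a real output file; the row below is made-up sample data.
sample = io.StringIO(
    '2018-04-20 09:30,politics,Example Press,'
    '"Headline, with a comma",Body text,https://example.com/a\n'
)
articles = read_articles(sample)
print(articles[0]["headline"])
```

To read a real output file, pass a file object opened with `newline=""` and an encoding that matches the file.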