wen-fei/CNKISpider

Stars
121
Rank 293,924 (Top 6 %)
Language
Python
Created about 7 years ago
Updated almost 7 years ago

wen-fei

A spider for CNKI patent content, for study and communication only; not for commercial use.

CNKISpider

A CNKI patent crawler, for learning and exchange only; not for commercial use.

Discovering a new crawl entry point

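Today a classmate suddenly told me he had crawled more than 1 million records (we need the 2014 patents, 1.9 million+ in total). When I asked for details, I learned that the URLs of CNKI patent detail pages follow a regular pattern.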

For example:

http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx?DBName=SCPD2014&FileName=CN203968251U&QueryID=28&CurRec=2

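For this patent URL, we only need to vary FileName=CN203968251U. The value after the = sign is the patent publication number, also called the patent document number. It is composed of "country code + category number + serial number + kind code"; for example, CN1340998A denotes China's invention patent No. 340998 (source: Baidu Baike).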
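The URL pattern described above can be sketched in Python with the standard library only. The helper names `with_file_name` and `split_publication_number` are hypothetical, and treating the first digit as the category number is an assumption inferred from the CN1340998A example, not a confirmed rule.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The example detail-page URL from the text above.
EXAMPLE_URL = (
    "http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx"
    "?DBName=SCPD2014&FileName=CN203968251U&QueryID=28&CurRec=2"
)

# Assumption: two-letter country code, one-digit category number,
# a numeric serial number, and a one-letter kind code.
PUBLICATION_NUMBER = re.compile(r"([A-Z]{2})(\d)(\d+)([A-Z])")


def with_file_name(url: str, publication_number: str) -> str:
    """Return `url` with only its FileName query parameter replaced."""
    parts = urlsplit(url)
    query = [
        (key, publication_number if key == "FileName" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def split_publication_number(number: str) -> tuple:
    """Split a publication number into (country, category, serial, kind)."""
    match = PUBLICATION_NUMBER.fullmatch(number)
    if match is None:
        raise ValueError(f"unexpected publication number: {number!r}")
    return match.groups()


if __name__ == "__main__":
    print(with_file_name(EXAMPLE_URL, "CN1340998A"))
    print(split_publication_number("CN1340998A"))
```

Only the FileName value changes; the other query parameters are carried over untouched from the example URL.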

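Suppose we need to crawl every patent from 2014. We can search for a patent from 1 January 2014 (one of the earliest publication numbers of the year) and one from 31 December 2014 (one of the latest), then walk through the range between the two numbers to crawl the vast majority of the patents we need.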

Here, CN is fixed, and the trailing letter is the patent kind code; China uses only three: A, S, and U.

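So this avoids crawling the URL list pages (which have heavy anti-crawler protection) and the complicated CAPTCHA problem; we can simply build a loop that crawls the detail pages directly.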
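A minimal sketch of such a loop, assuming the numeric part of the publication number can be treated as a plain integer range. The function name `candidate_urls` and the bounds in the usage example are hypothetical; the real bounds would come from the earliest and latest 2014 patents found by searching. Trying all three kind codes for every number is also an assumption, and many candidates will not correspond to a real patent.

```python
from typing import Iterator, Sequence

# Detail-page URL from the example above, with FileName left as a placeholder.
DETAIL_URL = (
    "http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx"
    "?DBName=SCPD2014&FileName={file_name}&QueryID=28&CurRec=2"
)

# Kind codes named in the text.
KIND_CODES = ("A", "S", "U")


def candidate_urls(
    first: int, last: int, kind_codes: Sequence[str] = KIND_CODES
) -> Iterator[str]:
    """Yield one detail-page URL per (number, kind code) pair, bounds inclusive."""
    for number in range(first, last + 1):
        for kind in kind_codes:
            yield DETAIL_URL.format(file_name=f"CN{number}{kind}")


if __name__ == "__main__":
    # Hypothetical bounds, chosen only to keep the output short.
    for url in candidate_urls(203968251, 203968252):
        print(url)
```

Because the function is a generator, the full 2014 range never has to be held in memory; each URL can be handed to the crawler as it is produced.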

Tools used in the project

The framework is Scrapy 1.3, running on Python 3.6.

choice

My graduation project: an intelligent school recommendation system for the postgraduate entrance examination, built on and implementing a number of simple machine learning algorithms.

SinaWeiboSpider

A web spider for Sina Weibo, based on the Scrapy framework and a MongoDB database.

ML-IN-ACTION

Code I wrote for the book 《Machine Learning In Action》, with the errors in the book corrected so that the code runs successfully.

TaiWanML

Reading notes on the course 《The basic of machine learning》 by Hung-yi Lee, National Taiwan University, drawing on many blogs on the web.

KaggleOrOthersJourney

Notes on Alibaba TianChi and Kaggle competitions, including code and lessons learned.

seu-thesis-latex-template

LaTeX template for Southeast University graduation theses

AngelEyes

A system for searching for information about missing children. You can look up text information and upload images to compare them against the images in the system database, so that you can tell whether a child is the person you are looking for.

PythonForDataAnalysis

Study notes on Python for Data Analysis, along with exercises on datasets I found myself

Jupyter Notebook

tensquare

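A hands-on microservices project, covering service discovery, service invocation, circuit breakers, service gateway, centralized configuration, message bus, container deployment, container monitoring, and more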

100-papers-recodes-plan

Re-coding papers to understand them better and to practice my own coding ability

Jupyter Notebook

DrugBank-csv

DrugBank data in CSV format and a Python processing demo

MySQLForOptimization

Database optimization strategies

MyLaTeX

LaTeX exercises and notes

wen-fei.github.io

Home page for writing

Biomedical-Dataset

Biomedical datasets, including UMLS, DrugBank, UniProtKB, etc.

deep-spring

deep spring, spring-boot

mall

CEGNN

Cross-graph embedding based on multi-view random walks and multimodal features

Jupyter Notebook

GEOM

Building Knowledge Graph based PICO for ERM

jvm-in-go

a toy jvm in Go

CodeMask

面码 (CodeMask), a memory aid for internet knowledge

elderly-acmer

A road for elderly ACMers

NetsPytorch

PyTorch implementations of classic networks

old-acmer

A road for old ACMers

myhadoop

Hadoop for practice

jvm-gctuning-guide

Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning Guide (translated edition)