• Stars: 121
• Rank 292,296 (Top 6 %)
• Language: Python
• Created almost 7 years ago
• Updated over 6 years ago

Repository Details

A spider for CNKI patent content, for study and communication only, not for commercial use.

CNKISpider

A CNKI patent crawler, for study and exchange only, not for commercial use.

Discovering a new crawl entry point

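Today a classmate suddenly told me they had crawled more than 1,000,000 records (we need to crawl the patents from 2014, more than 1.9 million in total). When I asked for details, I learned that the URLs of CNKI patent detail pages follow a regular pattern.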

For example:

http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx?DBName=SCPD2014&FileName=CN203968251U&QueryID=28&CurRec=2

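For this patent URL, we only need to vary the FileName=CN203968251U part. The value after the = sign is the patent publication number, also called the patent document number. It has the form "country code + classification number + serial number + identification code"; for example, CN1340998A denotes Chinese invention patent No. 340998 (from Baidu Baike).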
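The composition of a publication number described above can be illustrated with a small parser. This is a minimal sketch, not code from the project: the names `PublicationNumber` and `parse_publication_number` are hypothetical, and the pattern assumes a two-letter country code, a single-digit classification number (inferred from the CN1340998A example), a numeric serial number, and one trailing letter.

```python
import re
from typing import NamedTuple


class PublicationNumber(NamedTuple):
    """Parts of a publication number (hypothetical helper type)."""
    country: str         # country code, e.g. "CN"
    classification: str  # classification number, assumed to be one digit
    serial: str          # serial number, kept as text to preserve leading zeros
    code: str            # trailing identification code, e.g. "A"


# Assumed layout: 2 letters + 1 digit + digits + 1 letter.
_PATTERN = re.compile(r"^([A-Z]{2})(\d)(\d+)([A-Z])$")


def parse_publication_number(text: str) -> PublicationNumber:
    """Split a publication number such as CN1340998A into its parts."""
    match = _PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized publication number: {text!r}")
    return PublicationNumber(*match.groups())
```

For instance, `parse_publication_number("CN1340998A")` yields country `CN`, classification `1`, serial `340998`, and code `A`, which matches the reading given in the text.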

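Suppose we need to crawl all patents from 2014. By searching, we can find a publication number from January 1, 2014 (one of the earliest of the year) and one from December 31, 2014 (one of the latest of the year), then iterate over the numbers between them to crawl the vast majority of the patents we need.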

Here, CN is fixed, and the trailing letter is the patent identification code; for China there are only three: A, S, and U.

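Therefore, we avoid crawling the URL list pages (which have heavy anti-crawler measures) and the complicated CAPTCHA problem, and simply build a loop that crawls the detail pages directly.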
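The loop described above might look roughly like the following sketch. It is not the project's actual spider: the function name `detail_urls` and the sample range are hypothetical, and the `DBName`, `QueryID`, and `CurRec` values are copied unchanged from the example URL on the assumption that only `FileName` needs to vary.

```python
from typing import Iterator

# Template taken from the example URL; only FileName is substituted.
URL_TEMPLATE = (
    "http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx"
    "?DBName=SCPD2014&FileName={file_name}&QueryID=28&CurRec=2"
)


def detail_urls(first: int, last: int, code: str) -> Iterator[str]:
    """Yield one detail-page URL per number in [first, last].

    `first` and `last` are the numeric parts of the earliest and latest
    publication numbers found by searching (hypothetical inputs);
    `code` is the trailing identification code, e.g. "U".
    """
    for number in range(first, last + 1):
        file_name = f"CN{number}{code}"
        yield URL_TEMPLATE.format(file_name=file_name)


if __name__ == "__main__":
    # Tiny illustrative range starting at the number from the example URL.
    for url in detail_urls(203968251, 203968253, "U"):
        print(url)
```

Some numbers in a range may not correspond to an existing patent, so whatever code fetches these URLs would need to tolerate missing pages.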

Tools used in the project

The framework is Scrapy 1.3, running on Python 3.6.

More Repositories

1

choice

My graduation project: an intelligent school recommendation system for the postgraduate entrance examination, based on simple machine learning algorithms, several of which are implemented here.
Java
124
star
2

SinaWeiboSpider

A web spider for Sina Weibo, based on the Scrapy framework and a MongoDB database.
Python
110
star
3

ML-IN-ACTION

The code I wrote for the book "Machine Learning in Action", with the errors in the book corrected so that the code runs successfully.
Python
96
star
4

TaiWanML

Reading notes for the course "The basic of machine learning" by Hung-yi Lee, National Taiwan University, drawing on many blogs on the web.
91
star
5

KaggleOrOthersJourney

Notes on Alibaba TianChi and Kaggle competitions, including code and experiences.
Python
86
star
6

seu-thesis-latex-template

Southeast University thesis LaTeX template
TeX
83
star
7

AngelEyes

A system for searching for information about missing children. You can search text information and upload images to compare their similarity against the images in the system database, to check whether a child is the person you are looking for.
Java
54
star
8

PythonForDataAnalysis

Study notes for "Python for Data Analysis", plus exercises on datasets I found myself
Jupyter Notebook
44
star
9

tensquare

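A hands-on microservices project, covering service discovery, service invocation, circuit breakers, service gateway, centralized configuration, message bus, container deployment, container monitoring, and more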
Java
12
star
10

100-papers-recodes-plan

To better understand the papers and exercise my own coding ability
Jupyter Notebook
4
star
11

MySQLForOptimization

Database optimization strategies
3
star
12

DrugBank-csv

DrugBank data in CSV format and a Python processing demo
Python
3
star
13

MyLaTeX

LaTeX exercises and notes
2
star
14

wen-fei.github.io

home pages for writing
CSS
2
star
15

Biomedical-Dataset

Biomedical-related datasets, including UMLS, DrugBank, UniProtKB, etc.
1
star
16

deep-spring

deep spring, spring-boot
Java
1
star
17

mall

E-commerce project
Java
1
star
18

CEGNN

cross-graph embedding based on multi-view random walk and multimodal features
Jupyter Notebook
1
star
19

GEOM

Building Knowledge Graph based PICO for ERM
Python
1
star
20

jvm-in-go

a toy jvm in Go
Go
1
star
21

CodeMask

面码 (CodeMask), a memory aid for internet knowledge
Java
1
star
22

elderly-acmer

a road for elderly acmer
C++
1
star
23

NetsPytorch

Pytorch implementation of classic networks
Python
1
star
24

old-acmer

a road for old acmer
1
star
25

myhadoop

Hadoop for practice
Java
1
star
26

jvm-gctuning-guide

Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning Guide, translated edition
1
star