• Stars
    star
    366
  • Rank 116,547 (Top 3 %)
  • Language
    C
  • License
    MIT License
  • Created about 9 years ago
  • Updated almost 4 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A scalable and convenient crawler framework in C:).

CSpider

A scalable and convenient crawler framework based on C:).
中文文档.

Examples

Welcome any program using cspider to add links here.

INSTALL

make
  • Move .so and .h files to relevant directory:
make install
  • Finally, you could compile your code(such as test.c) using -lcspider:
gcc -o test test.c -lcspider -I /usr/include/libxml2

using -lcspider to link dynamic link library, and -I /usr/include/libxml2 could let compiler to find libxml2's head files.

API

Initial settings

  • cspider_t *init_cspider()
    At the beginning of your code, you can get cspider_t by this function. It's essential.

  • void cs_setopt_url(cspider_t *, char *)
    Insert urls into work queue when cspider start, passing url's string to second param. url's string could not contain http:// or https://. You could call this function many times to insert many urls.

  • void cs_setopt_cookie(cspider_t *, char *)
    Passing cookies to second param, the format is var1=abc; var2=qwe. Optional.

  • void cs_setopt_useragent(cspider_t *, char *)
    Passing user agent to second param. Optional.

  • void cs_setopt_proxy(cspider_t *, char *)
    Passing proxy to second param. Optional.

  • void cs_setopt_timeout(cspider_t *, long)
    Passing timeout value(ms) to second param. Optional.

  • void cs_setopt_logfile(cspider_t *, FILE *)
    Passing file pointer to second param, which is used to print log information. Optional.

  • void cs_setopt_process(cspider_t *, void (*process)(cspider_t *, char*, char *, void *), void *)
    Passing process function to second param, and user's custom context pointer to third param. You could use this custom context in process function's fourth param and string which is downloaded in second param.

  • void cs_setopt_save(cspider_t *, void (*save)(void*, void*), void*)
    Passing data persistence function to second param, and alse user custom context pointer to third param. In this custom function, you can get pointer to prepared data in second param.

  • void cs_setopt_threadnum(cspider_t *cspider, int , int )
    Setting the number of thread. The second param could be DOWNLOAD and SAVE, which indicates two kinds of thread. The third param could be the number of thread you want to set.

  • int cs_run(cspider_t *)
    Start cspider. Using this at the end of your code.

More functions

  • void saveString(cspider_t *, void *, int)
    Using this function in custom process function could pass data pointer to custom data persistence function. In the third param, you could use LOCK if you need thread safety. For exmaple, multi-thread write to same file. You could alse use NO_LOCK, if you don't need thread safety.

  • void saveStrings(cspider_t *, void **, int, int)
    Using this function could pass array of data pointer to custom data persistence function. Third param is the size of array. Fourth param could be LOCK and NO_LOCK.

  • void addUrl(cspider_t *cspider, char *url)
    Using this function in custom process function could insert url to work queue.

  • void addUrls(cspider_t *cspider, char *urls, int size)
    Using this function to insert many urls back to work queue. The third param is the number of urls which you want to insert.

  • void freeString(char *)
    Using this function to free string.

  • void freeStrings(char **, int)
    Using this function to free array of string. Second param is the size of array. You could use this function after regexAll and xpath to free the string array you get.

Tools

  1. Regular expressions:

    • int regexAll(const char *regex, char *str, char **res, int num, int flag);
      regex : regular matching rule
      str : the source string.
      res : array used for saving strings which is matched.
      num : array's size.
      flag : it could be REGEX_ALL and REGEX_NO_ALL, which means whether to return the whole string.
      This function returns the number of matched strings.

    • int match(char *regex, char *str);
      Whether it matches. Return 1 for yes, 0 for no.

  2. Using xpath to deal with html and xml:

    • int xpath(char *xml, char *path, char **res, int num);
      xml : prepared to parse.
      path : xpath's rule.
      res : array used for saving strings.
      num : array's size.
      This function returns the number of array which we get.
  3. Json:

    cspider contains cJSON. We could use it to parse json data. Usage is here

  4. Uriparser:

    • void joinall(char *baseuri, char **uris, int size); -> join all uris relative to baseuri

    baseuri: Base uri ( char *url in process func ) uris: regex / xpath extracted urls size: length of uris

    • char * join(char *baseuri, char *rel) -> join relative string to the base string

    baseuri: Base uri ( http://test.com/ ) rel: Relative url ( /a/b || ./a/b || ../a/b/./ and ... )

After regexAll and xpath, you should use freeStrings to free the string array which you get.

Example

Print the Github's main page's source code.

#include<cspider/spider.h>
/*
	custom process function
*/
void p(cspider_t *cspider, char *d, char *url, void *user_data) {

  printf("url -> %s\n", url);
  saveString(cspider, d, LOCK);
  
}
/*
	custom data persistence function
*/
void s(void *str, void *user_data) {
  char *get = (char *)str;
  FILE *file = (FILE*)user_data;
  fprintf(file, "%s\n", get);
  return;
}

int main() {
  cspider_t *spider = init_cspider(); 
  char *agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0";
  
  cs_setopt_url(spider, "github.com");
  
  cs_setopt_useragent(spider, agent);
  //
  cs_setopt_process(spider, p, NULL);
  cs_setopt_save(spider, s, stdout);
  //set the thread's number
  cs_setopt_threadnum(spider, DOWNLOAD, 2);
  cs_setopt_threadnum(spider, SAVE, 2);
  
  return cs_run(spider);
}

More Repositories

1

Chinese-uvbook

翻译的libuv的中文教程
C
1,482
star
2

jlitespider

A lite distributed Java spider framework :-)
Java
149
star
3

MyPaxos

My multi-paxos service implement :-)
Java
138
star
4

libel

An event-driven library.
C
70
star
5

Dior

A kind of Lisp
C
55
star
6

MyBlog

I want to write something fun...
25
star
7

LightComm4J

Yet another lightweight asynchronous network library for java
Java
24
star
8

batboy-mini-kernel

一个简单操作系统,仅供玩耍
C
21
star
9

algs4-Programming-Assignment

coursera上的普林斯顿大学算法课程的编程习题解答
Java
10
star
10

lambda-calculus

lambda calculus interpreter
Scheme
9
star
11

Lockfree

Lockfree implements
C++
8
star
12

DoMi

一个自己实现的好玩的脚本语言
C
7
star
13

FreshSoccer

新鲜足球android客户端v1.1
Java
6
star
14

Eior

A compiler for Eior which is just like scheme
Scheme
4
star
15

LegoKV

Distributed and pluginable KV store just like Lego toys :)
C++
4
star
16

lets-chat

A IM web application using socket.io to build it
JavaScript
4
star
17

Libapp_v1

An android app for share books
Java
3
star
18

JLiteSpider-Demo

Some JLiteSpider's demo crawlers
Java
1
star
19

ml-luoyixin

机器学习斯坦福公开课作业
MATLAB
1
star
20

DoubanRenEr

计划电影android客户端
Java
1
star
21

LuoReactDemo

用react和node写的简单的web小项目,学习用的
JavaScript
1
star
22

LuohahaTestCode

我的杂货库,各种乱七八糟的东西和代码
Python
1
star
23

c-funs

Some useful and funny things.
C
1
star
24

luohaha.github.io

自己写的博客
1
star
25

LuoORM

Yet another ORM framework
Java
1
star
26

some-interpreters

Just for fun!!!
Scheme
1
star
27

VideoWebCrawlerByWebMagic

使用webmagic框架写的一些爬虫
Java
1
star
28

Android-Component-By-Lyx

收集一些自己实现的,或者从网上学到的好的android部件,将他们汇集起来。
Java
1
star