  • Stars: 3,753
  • Rank: 11,749 (Top 0.3%)
  • Language: C#
  • License: MIT License
  • Created over 8 years ago
  • Updated about 1 year ago

Repository Details

DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.

DotnetSpider

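Disclaimer: This framework is intended to help developers simplify their development workflow and improve their efficiency. Do not use it for anything that violates national law. The authors of this framework bear no responsibility for anything users do with it.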

To get the latest beta packages, add the MyGet feed to your NuGet configuration:

<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
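
For context, an `<add>` element like the one above normally sits inside the `<packageSources>` section of a `NuGet.config` file. The following is a minimal sketch of a complete file. The nuget.org entry is an assumption added for illustration (this README does not mention it); the myget.org entry is copied from the line above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <!-- Assumption: keep the public nuget.org feed for stable packages. -->
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
    <!-- Feed from this README, used for the latest beta packages. -->
    <add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
  </packageSources>
</configuration>
```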

DESIGN

DESIGN IMAGE

DEVELOPMENT ENVIRONMENT

  1. Visual Studio 2017 (15.3 or later) or JetBrains Rider

  2. .NET Core 2.2 or later

  3. Docker

  4. MySQL

     docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
    
  5. Redis (optional)

     docker run --name redis -d -p 6379:6379 --restart always redis
    
  6. SQL Server

     docker run --name sqlserver -d -p 1433:1433 --restart always  -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
    
  7. PostgreSQL (optional)

     docker run --name postgres -d  -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
    
  8. MongoDB (optional)

     docker run --name mongo -d -p 27017:27017 --restart always mongo
    
  9. RabbitMQ

    docker run -d --restart always --name rabbitmq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \
           -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \
           rabbitmq:3-management
    
  10. Docker remote API for macOS

    docker run -d  --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
    
  11. HBase

    docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
    

MORE DOCUMENTATION

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the DotnetSpider.Sample project in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

public class EntitySpider : Spider
{
    public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
        options, services, logger)
    {
    }

    #region Nested type: CnblogsEntry

    [Schema("cnblogs", "news")]
    [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
    // "类别" means "category"; the Category property below refers to this name.
    [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
    [FollowRequestSelector(XPaths = new[]
    {
        "//div[@class='pager']"
    })]
    public class CnblogsEntry : EntityBase<CnblogsEntry>
    {
        public int Id { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
        public string Category { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
        public string WebSite { get; set; }

        [StringLength(200)]
        [ValueSelector(Expression = "//title")]
        // " - 博客园" is the " - Cnblogs" site-name suffix, replaced here with an empty string.
        [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
        public string Title { get; set; }

        [StringLength(40)]
        [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
        public string Guid { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
        public string News { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
        public string Url { get; set; }

        [ValueSelector(Expression = ".//div[@class='entry_summary']")]
        public string PlainText { get; set; }

        [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
        public DateTime CreationTime { get; set; }

        protected override void Configure()
        {
            HasIndex(x => x.Title);
            HasIndex(x => new
            {
                x.WebSite,
                x.Guid
            }, true);
        }
    }

    #endregion

    public static async Task RunAsync()
    {
        var builder = Builder.CreateDefaultBuilder<EntitySpider>();
        builder.UseSerilog();
        builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
        await builder.Build()
            .RunAsync();
    }

    protected override async Task InitializeAsync(CancellationToken stoppingToken)
    {
        AddDataFlow(new DataParser<CnblogsEntry>());
        AddDataFlow(GetDefaultStorage());
        // "网站" means "website" and "博客园" is the Chinese name of Cnblogs.
        await AddRequestsAsync(new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string>
        {
            {
                "网站", "博客园"
            }
        }), new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string>
        {
            {
                "网站", "博客园"
            }
        }));
    }

    protected override (string Id, string Name) GetIdAndName()
    {
        return (ObjectId.NewId.ToString(), "博客园");
    }
}
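
To make the selector attributes above easier to follow, here is a small stand-alone sketch of what the XPath expressions pick out. It is not DotnetSpider code: it uses only `System.Xml.Linq` from the base class library, so the input must be well-formed XML, and the markup is a hypothetical stand-in for a news list page. It also assumes that each value selector is evaluated relative to the node matched by the entity selector, which the leading `.//` in the expressions suggests. The names `SelectorSketch` and `Extract` are made up for this illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class SelectorSketch
{
    // Hypothetical, well-formed stand-in for a news list page (not real cnblogs markup).
    private const string Page = @"<html><body>
  <div class='news_block'>
    <h2 class='news_entry'><a href='/n/1/'>First post</a></h2>
    <div class='entry_summary'>Summary one</div>
  </div>
  <div class='news_block'>
    <h2 class='news_entry'><a href='/n/2/'>Second post</a></h2>
    <div class='entry_summary'>Summary two</div>
  </div>
</body></html>";

    public static List<(string News, string Url, string PlainText)> Extract(string xhtml)
    {
        var results = new List<(string News, string Url, string PlainText)>();
        var doc = XDocument.Parse(xhtml);

        // Same expression as the EntitySelector: one matched element per entity.
        foreach (var block in doc.XPathSelectElements(".//div[@class='news_block']"))
        {
            // Same expressions as the ValueSelectors, evaluated relative to the block.
            var news = block.XPathSelectElement(".//h2[@class='news_entry']/a")?.Value;
            var url = ((IEnumerable)block.XPathEvaluate(".//h2[@class='news_entry']/a/@href"))
                .Cast<XAttribute>()
                .FirstOrDefault()?.Value;
            var text = block.XPathSelectElement(".//div[@class='entry_summary']")?.Value;
            results.Add((news, url, text));
        }

        return results;
    }

    public static void Main()
    {
        foreach (var (news, url, text) in Extract(Page))
        {
            Console.WriteLine($"{news} | {url} | {text}");
        }
    }
}
```

Each `news_block` element yields one row, which corresponds to one `CnblogsEntry` in the sample above. Real pages are rarely well-formed XML, which is presumably why the framework depends on an HTML parser (HtmlAgilityPack appears in the dependency list below).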

Distributed spider

Read this document

Puppeteer downloader

Coming soon

NOTICE

When you use the Redis scheduler, update your Redis configuration:

timeout 0
tcp-keepalive 60

Dependencies

Package                                              License
Bert.RateLimiters                                    Apache 2.0
MessagePack                                          MIT
Newtonsoft.Json                                      MIT
Dapper                                               Apache 2.0
HtmlAgilityPack                                      MIT
ZCJ.HashedWheelTimer                                 MIT
murmurhash                                           Apache 2.0
Serilog.AspNetCore                                   Apache 2.0
Serilog.Sinks.Console                                Apache 2.0
Serilog.Sinks.RollingFile                            Apache 2.0
Serilog.Sinks.PeriodicBatching                       Apache 2.0
MongoDB.Driver                                       Apache 2.0
MySqlConnector                                       MIT
AutoMapper.Extensions.Microsoft.DependencyInjection  MIT
Docker.DotNet                                        MIT
BuildBundlerMinifier                                 Apache 2.0
Pomelo.EntityFrameworkCore.MySql                     MIT
Quartz.AspNetCore                                    Apache 2.0
Quartz.AspNetCore.MySqlConnector                     Apache 2.0
Npgsql                                               PostgreSQL License
RabbitMQ.Client                                      Apache 2.0
Polly                                                BSD 3-Clause

AREAS FOR IMPROVEMENT

QQ Group: 477731655 Email: [email protected]

More Repositories

  1. FastGithub (C#, 13,273 stars): A GitHub acceleration tool that fixes problems such as GitHub failing to open, user avatars not loading, releases failing to upload or download, and git clone, git pull, and git push failures.

  2. CAP (C#, 6,267 stars): A distributed transaction solution for microservices based on eventual consistency, and also an event bus with the Outbox pattern.

  3. Util (C#, 4,306 stars): Util is an application framework for the .NET platform that aims to improve the development capability of small and medium-sized teams. It consists of utility classes, layered-architecture base classes, UI components, accompanying code-generation templates, permissions, and more.

  4. WTM (C#, 4,234 stars): Use WTM to write .NET Core apps fast!

  5. FreeSql (C#, 4,071 stars): 🦄 .NET aot orm, C# orm, VB.NET orm, Mysql orm, Postgresql orm, SqlServer orm, Oracle orm, Sqlite orm, Firebird orm, Dameng orm, KingbaseES orm, ShenTong orm, HighGo orm, GBase orm, Xugu orm, Chinese domestic database orm, Clickhouse orm, QuestDB orm, MsAccess orm.

  6. osharp (C#, 2,758 stars): OSharp is a rapid development framework based on .NET 6.0. It provides a higher level of automated encapsulation over ASP.NET Core modules such as configuration, dependency injection, logging, caching, Entity Framework, MVC (Web API), authentication, function permissions, and data permissions. It also standardizes a code structure and workflow for implementing business logic, making .NET easier to apply in real project development.

  7. BootstrapBlazor (C#, 2,492 stars): Bootstrap Blazor is an enterprise-level UI component library based on Bootstrap and Blazor.

  8. Magicodes.IE (C#, 2,074 stars): A general import and export library that supports DTO import and export, template export, fancy export, and dynamic export, with support for Excel, CSV, Word, PDF, and HTML.

  9. WebApiClient (C#, 2,047 stars): A REST API library with better functionality, performance, and scalability than refit.

  10. NPOI (C#, 1,877 stars): A .NET library for reading and writing Microsoft Office binary and OOXML file formats.

  11. EasyCaching (C#, 1,736 stars): 💥 EasyCaching is an open source caching library that covers basic and some advanced caching usages, helping us handle caching more easily!

  12. AspectCore-Framework (C#, 1,684 stars): AspectCore is an AOP-based cross-platform framework for .NET Standard.

  13. AgileConfig (C#, 1,483 stars): A lightweight distributed configuration center (configuration server) built on .NET Core.

  14. Natasha (C#, 1,449 stars): A Roslyn-based library for building C# assemblies dynamically. It lets developers use C# code at runtime to build domains, assemblies, classes, structs, enums, interfaces, methods, and so on, so that a program can gain new modules and features while it is running. Natasha integrates domain management and plugin management, enabling domain isolation, domain unloading, and hot plugging. The library follows the complete compilation pipeline, provides complete error messages, and can add references automatically. Its templates for building data structures let developers focus solely on writing assembly scripts. It is compatible with standard2.0 / netcoreapp3.0+, is cross-platform, and offers a unified, simple chained API. The maintainers will fix your problems and reply to your issues as soon as possible.

  15. HttpReports (C#, 1,260 stars): HttpReports is an APM (application performance monitor) system for .NET Core.

  16. sharding-core (C#, 1,142 stars): A high-performance, lightweight solution for EF Core table sharding and database sharding with read-write separation support, featuring zero dependencies, zero learning cost, and zero intrusion into business code.

  17. SmartSql (C#, 1,098 stars): SmartSql = MyBatis in C# + .NET Core + Cache (Memory | Redis) + R/W Splitting + PropertyChangedTrack + Dynamic Repository + InvokeSync + Diagnostics

  18. FlubuCore (C#, 907 stars): A cross-platform build and deployment automation system for building projects and executing deployment scripts using C# code.

  19. Alipay.AopSdk.Core (C#, 778 stars): Alipay server-side SDK built on .NET Standard 2.0, supporting .NET Core >= 2.0, with interfaces identical to the official SDK, so you can develop entirely by following the official documentation. Beyond payments, this SDK supports every feature the official SDK supports, such as Life Accounts, Service Windows, and industry cooperation. Usage is almost the same, and the code in the official documentation can be used as a reference.

  20. SmartCode (C#, 572 stars): SmartCode = IDataSource -> IBuildTask -> IOutput => Build Everything!!!

  21. CanalSharp (C#, 559 stars): The .NET client for Canal, Alibaba's MySQL binlog subscription and consumption component.

  22. aspnetcore-doc-cn (C#, 521 stars): The Simplified Chinese edition of Microsoft ASP.NET Core documentation, translated by .NET Core Community and .NET China Community.

  23. Home (299 stars): Home repo of .NET Core Community.

  24. mocha (C#, 142 stars): Mocha is an application performance monitoring tool based on OpenTelemetry, which also provides a scalable platform for observability data analysis and storage.

  25. Collections (C#, 88 stars): Utilities and extensions for collections, including Collections.Paginable and so on.

  26. EntityFrameworkCore.KingbaseES (C#, 45 stars): Entity Framework Core provider for KingbaseES Database.

  27. EntityFrameworkCore.GaussDB (C#, 32 stars): Entity Framework Core provider for GaussDB Database.

  28. FlubuCore.Examples (C#, 32 stars): Examples for FlubuCore, a cross-platform build automation tool for building projects and executing deployment scripts using C# code.

  29. wind-rises (25 stars)

  30. Compile.Environment (10 stars): When using the Roslyn library for dynamic compilation, you can introduce this library to provide a dynamic compilation environment.

  31. SourceLink.Environment (7 stars): Provides an inheritable NuGet package for the SourceLink feature.

  32. projects (CSS, 5 stars): This repository is the site of NCC Projects, including both Top-Level projects and Sandbox projects.

  33. Natasha.Docs (JavaScript, 4 stars): The documentation for Natasha.

  34. DotNetCore.GaussDB (C#, 3 stars): The foundation of DotNetCore.EntityFrameworkCore.GaussDB.

  35. dotnetcore.github.io (HTML, 2 stars): .NET Core Community official website.