  • Stars: 327
  • Rank: 128,686 (Top 3 %)
  • Language: Python
  • License: GNU General Publi...
  • Created over 12 years ago
  • Updated over 2 years ago

Repository Details

Mini website crawler that builds a sitemap from a website.

Python-Sitemap

A simple script that crawls a website and creates a sitemap.xml of all the public links in it.

Warning: this script only works with Python 3.

Simple usage

$ python main.py --domain http://blog.lesite.us --output sitemap.xml

Advanced usage

Read a config file to set parameters. On the command line you can override any parameter defined in config.json (for list parameters, the command-line values are added to the list):

$ python main.py --config config/config.json
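
The override rule (scalar parameters are replaced, list parameters are added to) can be sketched in Python. This is a minimal illustration and not the script's actual code: the helper `merge_options` and the sample keys (`domain`, `skipext`, `debug`, `output`) are assumptions chosen to mirror the command-line flags.

```python
def merge_options(config, cli):
    """Merge command-line options into values read from a config file.

    Hypothetical helper: scalar options given on the command line replace
    the config value, list options are appended to the config list.
    """
    merged = dict(config)
    for key, value in cli.items():
        if value is None:
            continue  # option not given on the command line
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value  # add to the list
        else:
            merged[key] = value  # override the scalar
    return merged


# Assumed sample values, for illustration only.
config = {"domain": "https://blog.lesite.us", "skipext": ["pdf"], "debug": False}
cli = {"skipext": ["xml"], "debug": True, "output": None}
merged = merge_options(config, cli)
print(merged)
```

Here `skipext` ends up holding both extensions, `debug` is switched on, and `output` stays unset because it was not passed.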

Enable debug:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --debug

Enable verbose output:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --verbose

Enable image sitemap:

More information here: https://support.google.com/webmasters/answer/178636?hl=en

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --images

Enable a report that prints a summary of the crawl:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --report

Skip URLs by extension (here, both pdf and xml URLs are skipped):

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --skipext pdf --skipext xml
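
As a rough idea of what skipping by extension involves, the sketch below checks the extension of a URL's path against a set of skipped extensions. It is an assumption about the general technique, not the script's own implementation; `has_skipped_extension` is a hypothetical name.

```python
import os
from urllib.parse import urlparse


def has_skipped_extension(url, skipext):
    """Return True if the URL path ends with one of the skipped extensions.

    Hypothetical helper: only the path is inspected, so query strings
    such as ?page=2 do not affect the result.
    """
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext in skipext


skipext = {"pdf", "xml"}
urls = [
    "https://blog.lesite.us/files/doc.pdf",
    "https://blog.lesite.us/feed.xml?page=2",
    "https://blog.lesite.us/post/1",
]
kept = [u for u in urls if not has_skipped_extension(u, skipext)]
print(kept)
```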

Drop part of a URL via a regular expression:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --drop "id=[0-9]{5}"
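
To see what a pattern such as "id=[0-9]{5}" removes, the sketch below applies it with `re.sub`. This only illustrates the regular expression; `drop_patterns` is a hypothetical helper, and how the script itself applies the pattern is not shown in this README.

```python
import re


def drop_patterns(url, patterns):
    """Remove every match of each pattern from the URL (hypothetical helper)."""
    for pattern in patterns:
        url = re.sub(pattern, "", url)
    return url


cleaned = drop_patterns(
    "https://blog.lesite.us/post?id=12345&lang=en", ["id=[0-9]{5}"]
)
print(cleaned)  # https://blog.lesite.us/post?&lang=en
```

Note that in this sketch only the matched text is removed, so a leftover separator such as `?&` stays in the URL.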

Exclude URLs by filtering on part of the URL:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --exclude "action=edit"
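
A filter of this kind can be sketched as a substring test. The sketch assumes plain substring matching, which this README does not confirm; `is_excluded` is a hypothetical name.

```python
def is_excluded(url, excludes):
    """Return True if any exclude fragment occurs in the URL.

    Hypothetical helper that assumes plain substring matching.
    """
    return any(fragment in url for fragment in excludes)


excludes = ["action=edit"]
urls = [
    "https://blog.lesite.us/wiki/Home",
    "https://blog.lesite.us/wiki/Home?action=edit",
]
kept = [u for u in urls if not is_excluded(u, excludes)]
print(kept)
```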

Read robots.txt to ignore some URLs:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --parserobots
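
The effect of honoring robots.txt can be illustrated with the standard library's `urllib.robotparser`. The sketch parses an inline robots.txt rather than fetching one, and the rules shown are made up for the example; whether the script uses this module internally is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from memory instead of the network.
robots_txt = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules allow crawling the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://blog.lesite.us/post/1"))
print(is_allowed("https://blog.lesite.us/admin/login"))
```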

Human-readable XML

$ python3 main.py --domain https://blog.lesite.us --images --parserobots | xmllint --format -
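
If xmllint is not available, a similar result can be obtained in Python with `xml.dom.minidom`. This is an alternative sketch, not something the script provides; the sample sitemap string and the `pretty_xml` helper are assumptions made for the example.

```python
from xml.dom import minidom

# Assumed sample of a compact sitemap, as a crawler might emit it.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://blog.lesite.us/</loc></url>"
    "</urlset>"
)


def pretty_xml(xml_text):
    """Return an indented version of the XML document (hypothetical helper)."""
    document = minidom.parseString(xml_text.encode("utf-8"))
    return document.toprettyxml(indent="  ")


pretty = pretty_xml(raw)
print(pretty)
```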

Multithreaded

$ python3 main.py --domain https://blog.lesite.us --num-workers 4

With basic auth

You need to configure the username and password in your config.py beforehand:

$ python3 main.py --domain https://blog.lesite.us --auth

Output sitemap index file

Sitemaps with over 50,000 URLs should be split into an index file that points to sitemap files that each contain 50,000 URLs or fewer. Outputting as an index requires specifying an output file. An index will only be output if a crawl has more than 50,000 URLs:

$ python3 main.py --domain https://blog.lesite.us --as-index --output sitemap.xml
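
The splitting rule can be sketched as follows: the URL list is cut into chunks of at most 50,000 entries, and an index document lists one sitemap file per chunk. This is an illustration of the sitemap index format only; the helper names and the `sitemap-N.xml` file naming are assumptions, not the script's actual output names.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file limit stated above


def chunk_urls(urls, size=MAX_URLS):
    """Split the URL list into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(base_url, file_count):
    """Build a sitemap index pointing at sitemap-0.xml, sitemap-1.xml, ...

    The file naming scheme is hypothetical.
    """
    entries = "".join(
        f"<sitemap><loc>{escape(base_url)}/sitemap-{i}.xml</loc></sitemap>"
        for i in range(file_count)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        f"{entries}</sitemapindex>"
    )


urls = [f"https://blog.lesite.us/post/{i}" for i in range(120_000)]
chunks = chunk_urls(urls)
index_xml = build_index("https://blog.lesite.us", len(chunks))
print(len(chunks), [len(c) for c in chunks])
```

With 120,000 URLs this yields three files: two full ones and one with the remaining 20,000 URLs.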

Docker usage

Build the Docker image:

$ docker build -t python-sitemap:latest .

Run with the default domain:

$ docker run -it python-sitemap

Run with a custom domain:

$ docker run -it python-sitemap --domain https://www.graylog.fr

Run with a config file and output:

You need to configure the config.json file beforehand:

$ docker run -it -v `pwd`/config/:/config/ -v `pwd`:/home/python-sitemap/ python-sitemap --config config/config.json

More Repositories

1

pi-as-keyboard

Make your Raspberry act as a Keyboard
C
118
star
2

vuejs-cordova-sample

Sample VueJS app that uses Cordova capabilities
Java
62
star
3

raspberry-rtlsdr-server

Quick guide to set up an RTL SDR server on a Raspberry Pi.
43
star
4

bepo_developpeur

A few modifications to the Bépo keyboard layout to make it better suited to my usage. Update Bépo V1.1
Shell
20
star
5

bts-sio

Course materials and lab exercises for the BTS SIO
JavaScript
12
star
6

vuetify-vuejs-confirmDialog

Vuetify + VueJS Confirm Component
JavaScript
9
star
7

rpi-docker-lamp-stack

Multi-User Docker LAMP Stack for Raspberry Pi (32 bit)
PHP
8
star
8

YASB

YASB - A Simple Static Website Generator
Python
7
star
9

vuetify-vuejs-firebaseUploader

A dead simple Vue.js component to upload files to Firebase Storage
Vue
6
star
10

android-kiosk-mode-app

Simple Kiosk App sample code.
Java
6
star
11

browser-js-error-icon

Little icon to remind you that something has gone wrong on the current tab.
JavaScript
5
star
12

basehttpserver-wrapper

Simple layer to simplify the usage of BaseHTTPServer. Useful for simple APIs in embedded projects.
Python
5
star
13

firebase-realtime-maps-sample

Realtime Leaflet Maps + Firebase marker sync
Vue
5
star
14

typematrix

Replace the original Typematrix 2030 controller by the Teensy 2.0++
5
star
15

vetronic-esphome

Remote control of the VE-TRONIC charging station (via ESPHOME + RS232)
C++
4
star
16

simple_remote_gui

Python script for generating a simple menu.
Python
3
star
17

WireGuard-cli

CLI helper to handle the basic WireGuard configuration stuff - Wireguard Config Generator
Shell
3
star
18

pyIvr

Generate IVR (or webpage, or anything else) from a json.
Python
3
star
19

menubar-quick-access

Simple customisable MenuBar for OSX
Python
3
star
20

xd75re

Resources for building and customizing the XD75Re
HTML
3
star
21

mini-mvc-sample

Structure created for teaching purposes. An intermediate step for introducing the concepts of the Laravel framework on top of familiar PHP development basics.
PHP
3
star
22

gitbook-plugin-click-reveal

Add click to reveal in your Gitbook
JavaScript
2
star
23

pass-type

A pass extension for « typing » instead of copy/pasting
Roff
2
star
24

cordova-docker

Docker image to build (and test) Cordova app
Shell
2
star
25

string2hid

Python script to convert String / text to HID keycode
Python
2
star
26

pi-motion-detection

Test script to detect motion on camera feed
Python
2
star
27

gnome-shell-xkbswitcher

Xkbswitcher, a Gnome-Shell extension to easily switch between xkb files located in the user home directory
JavaScript
2
star
28

termux-tilling-client

Simple tmux tiling client
Shell
1
star
29

api-todo-flask

Flask API for the sample Todo (used to illustrate a course)
Python
1
star
30

Cliff-Height-Timer-VueJS

VueJS version of Cliff Height Timer
JavaScript
1
star
31

laurence-bot

A simple bot for Mattermost / Telegram
Python
1
star
32

dotfiles

Shell
1
star
33

flutter-list-sample

C++
1
star
34

cso

CSO is a simple Centralized Sign-on (using your internal LDAP)
Python
1
star
35

pxycache

PxyCache is a simple proxy with offline capabilities.
Python
1
star
36

cheatsheet

Cheat sheet for various tools
1
star
37

thermal-hue

Use your Hue Motion Sensor as thermal sensor
JavaScript
1
star
38

vue-yasb

Experimental Markdown and VueJS only with Browser ESM.
JavaScript
1
star
39

btssio-e5-timer

Vue
1
star
40

Android-Boilerplate-Koin-CoRoutines-OkHTTP

Android Boilerplate project that uses Kotlin, Koin, MVVM, Retrofit, OkHTTP, Coroutines
Kotlin
1
star
41

tweet_planner

Python script to automatically tweet a queue of messages to your Twitter account.
Python
1
star
42

raspberry-pi-hid-proxy

USB to USB with a Raspberry Pi Zero (HID Proxy)
Shell
1
star
43

simpleMouse

The simplest way to add support for mouse buttons 4 & 5 in OSX
Swift
1
star
44

KWM-Menubar-Status-Indicator

Simple Menubar Status Indicator for KWM
Python
1
star