Simple GoIndex Downloader

Recursive GoIndex Downloader by atlonxp

Features

  • Recursive crawler (atlonxp)
  • Download all folders and files in a given url (atlonxp)
  • Download all folders and files in sub-folders (atlonxp)
  • Adaptive delay in fetching url (atlonxp)
  • Store folders/files directly to your Google Drive (pankaj260)
  • Folder and file exclusion filters (atlonxp; see the filter sketch after this list)
  • Download queue supported (atlonxp)
  • Auto-domain URL detection (atlonxp)
  • API-based GoIndex crawler (atlonxp, ifvv)
  • Parallel/Multiple files downloader (atlonxp)
  • Auto-skip password-protected folders (cxu-fork)
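
The exclusion filters mentioned above might look something like this minimal sketch; the pattern style and variable names are assumptions, not the notebook's actual code:

```python
import fnmatch

# Hypothetical exclusion patterns; the notebook's real filter variables may differ.
EXCLUDED_FOLDERS = ["sample*", "extras"]
EXCLUDED_FILES = ["*.nfo", "*.srt"]

def is_excluded(name, patterns):
    """Return True if `name` matches any exclusion pattern (case-insensitive)."""
    return any(fnmatch.fnmatch(name.lower(), p.lower()) for p in patterns)

# Usage: filter a folder listing before queueing downloads.
entries = ["movie.mkv", "info.nfo", "subs.srt"]
queued = [e for e in entries if not is_excluded(e, EXCLUDED_FILES)]
print(queued)  # ['movie.mkv']
```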

Upcoming

  • Parallel crawlers

Version 2:

API-based crawler with parallel files downloader

28 April 2020 (v2.4.0)
---------------------

+ added feature: curl download mode as default (we found that requests.get sometimes produced corrupted files)
+ added feature: file size check; if a file's size does not match the metadata, we force a re-download
+ added feature: double file size check; once a file is downloaded, we re-check its size against the metadata (see the sketch below)
+ revised time delays while crawling and downloading
+ fixed major bugs in the file size check
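
A minimal sketch of how these checks could fit together, assuming a curl binary on PATH and metadata that reports each file's size (function and variable names are hypothetical):

```python
import os
import subprocess

def download_with_curl(url, dest, expected_size):
    """Download `url` to `dest` via curl, verifying size against metadata."""
    # Pre-check: force a download if the file is missing or its size
    # differs from the size reported in the GoIndex metadata.
    if os.path.exists(dest) and os.path.getsize(dest) == expected_size:
        return True
    subprocess.run(["curl", "-L", "-s", "-o", dest, url], check=False)
    # Double check: re-verify the size once the download has finished.
    return os.path.exists(dest) and os.path.getsize(dest) == expected_size
```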

26 April 2020 (v2.3.3)
---------------------

+ added downloaded size information

22 April 2020 (v2.3.2)
---------------------

+ added summary
+ added an exception when a file cannot be downloaded

21 April 2020 (v2.3.1)
---------------------
While crawling, a fetch may occasionally fail because requests are sent too quickly or the
server is busy. This causes an error when parsing the JSON response, so we re-fetch the URL
(up to MAX_RETRY_CRAWLING times) or until the key "files" appears in the response. If the
retries reach the maximum and the key "files" is still not found, we ignore this link
(return []); see the sketch below.

At the end, if any downloads failed, just re-run the download section. Files that already
exist are skipped, unless you set OVERWRITE = True, in which case all files are re-downloaded.

+ added MAX_RETRY_CRAWLING (v2.3)
+ fixed FILE_EXISTING_CHECK (stupid) bug
+ added a download task for failed links
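
A minimal sketch of this retry loop; MAX_RETRY_CRAWLING matches the constant named above, while the use of requests.post and the delay schedule are assumptions:

```python
import time
import requests

MAX_RETRY_CRAWLING = 5  # hypothetical default

def fetch_listing(url):
    """Fetch a GoIndex folder listing, retrying until 'files' appears."""
    for attempt in range(MAX_RETRY_CRAWLING):
        try:
            # Assumption: the GoIndex API endpoint answers a POST with JSON.
            data = requests.post(url, timeout=30).json()
            if "files" in data:
                return data["files"]
        except (requests.RequestException, ValueError):
            pass  # busy server or bad JSON: back off and retry
        time.sleep(1 + attempt)  # simple adaptive delay between retries
    return []  # retries exhausted and no 'files' key: ignore this link
```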

20 April 2020 (v2.2)
---------------------
Some sub-folders may be password-protected, which causes an error while crawling, so we skip these folders.

+ added auto-skip for password-protected folders
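
A minimal sketch of the skip, assuming a protected folder answers with a payload that lacks the "files" key (the exact error shape GoIndex returns is an assumption):

```python
import requests

def crawl_folder(url):
    """Crawl one folder, skipping it if it appears to be password-protected."""
    try:
        data = requests.post(url, timeout=30).json()
    except (requests.RequestException, ValueError):
        return []
    if "files" not in data:
        # Protected folders never expose 'files', so we skip them entirely.
        print(f"skipping (possibly password-protected): {url}")
        return []
    return data["files"]
```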

17 April 2020 (v2.1)
---------------------
+ fixed duplicated URLs when crawling
+ added a search for the 'files' key, since some websites do not return a proper file structure (see the sketch below)
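
A minimal sketch of what such a search could look like, using a depth-first walk over the JSON tree (the notebook's actual approach may differ):

```python
def find_files_key(node):
    """Depth-first search for the first 'files' entry anywhere in a JSON tree."""
    if isinstance(node, dict):
        if "files" in node:
            return node["files"]
        for value in node.values():
            found = find_files_key(value)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_files_key(item)
            if found is not None:
                return found
    return None
```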

16 April 2020 (v2.0)
---------------------
+ crawler_v2:
	* API-based GoIndex crawler
	* Collects all URLs to be downloaded
+ parallel downloader:
	* TQDM progress bar (see the sketch below)
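
A minimal sketch of the parallel downloader with a TQDM progress bar; the worker count and helper names are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from tqdm import tqdm

def download_one(task):
    """Stream one (url, dest) pair to disk."""
    url, dest = task
    with requests.get(url, stream=True, timeout=60) as resp:
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)

def download_all(tasks, workers=4):
    """tasks: (url, dest) pairs collected by the crawler."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download_one, t) for t in tasks]
        for _ in tqdm(as_completed(futures), total=len(futures)):
            pass  # the bar advances as each file finishes
```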

Version 1:

Simple HTTP-based crawler and simple series downloader

Version 1 was created and improved by adapting code from pankaj260: https://colab.research.google.com/drive/1tmsLGuswIZIZ_oM35EMW8TbJ6pQPt1rY#scrollTo=3bCnUMUg_SoT&forceEdit=true&sandboxMode=true

15 April 2020 (v1.1)
---------------------
-   Added auto-domain URL detection (see the sketch below)
-   Added simple download queue
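
A minimal sketch of auto-domain URL detection, deriving the site root from any pasted GoIndex link so relative paths can be resolved (the hostname shown is illustrative):

```python
from urllib.parse import urlparse

def detect_domain(url):
    """Return scheme://host for any GoIndex URL."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"

print(detect_domain("https://example.workers.dev/0:/Movies/"))
# -> https://example.workers.dev
```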

14 April 2020 (v1.0)
---------------------
-   initial release