  • Stars 4
  • Rank 3,304,323 (Top 66 %)
  • Language Roff
  • License GNU General Publi...
  • Created over 8 years ago
  • Updated about 1 year ago

Repository Details

A study on preserving R&D project websites, including methods to automatically identify their URLs.

More Repositories

1

pwa-technologies

Arquivo.pt's main goal is the preservation of, and access to, web content that is no longer available online. During the development of the PWA IR (information retrieval) system we faced limitations in search speed, quality of results, scalability and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The Nutchwax, Nutch and Wayback code was adapted to meet these requirements. Several optimizations were added, such as simplifying the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing, following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to allow different configurations to be deployed to meet different needs. Currently, it serves an archive collection of 180 million documents, searchable by full text and dating from 1996 to 2010.
Java
39
star
2

SafeImage

Tool to identify images with Not Suitable For Work (NSFW) content
Python
7
star
3

httrack2arc

HTTrack2Arc is a tool that converts crawls made by HTTrack to Internet Archive ARC files.
Java
7
star
4

arquivo-webapp

Arquivo.pt home page web application - no longer used
CSS
4
star
5

image-search-indexing

Image Search Indexing over web archived images using Apache Solr indexes.
Java
2
star
6

scripts

Scripts for the maintenance of the Portuguese web archive
Shell
2
star
7

BrozzlerAdmin

Simple UI to launch Brozzler jobs internally
Python
2
star
8

image-search-api

SOLR imagesearch API repository
Java
1
star
9

QAReplayProxy

QA Replay Measurement Proxy
Python
1
star
10

arquivo404

Soft 404
JavaScript
1
star
11

crawl-seeds

Roff
1
star
12

PwaProcessor

Processing archived files in the ARC format.
Java
1
star
13

dspace-link-extractor

Extracts links from DSpace repositories
Java
1
star
14

page-search

Arquivo.pt Page Search System
Java
1
star
15

CitationSaver

Repository containing the service to extract URLs from PDFs or text
Python
1
star