Tags: Browse Projects

Select a tag to browse associated projects and drill deeper into the tag cloud.

Heritrix: Internet Archive Web Crawler

  Analyzed 4 months ago

The archive-crawler project is building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

335K lines of code

9 current contributors

4 months since last commit

10 users on Open Hub

Activity Not Available

Ex-Crawler

  Analyzed 29 days ago

The Ex-Crawler project is divided into three subprojects. The main part is the Ex-Crawler daemon server, a highly configurable, flexible web crawler written in Java; it comes with its own socket server for managing the server, its own user management, distributed grid / volunteer computing, and much more. Crawled information is stored in a database; MySQL, PostgreSQL and MSSQL are currently supported. The second part is the graphical (Java Swing) distributed grid / volunteer computing client, which includes PC idle detection and more. The third part is the web search engine, written in PHP; it comes with a CMS, multi-language detection and support, templates using Smarty, and an application framework partly forked from Joomla so that Joomla components can be adapted quickly.

72.1K lines of code

0 current contributors

about 6 years since last commit

3 users on Open Hub

Inactive

crawler4j

  Analyzed about 1 year ago

crawler4j is an open source Java crawler that provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes.

Sample Usage

First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded pages. The following is a sample implementation:

    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        // Skip URLs that point to static resources such as images, media files and archives.
        private final Pattern filters = Pattern.compile(
                ".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        public MyCrawler() {
        }

        // Decides whether the given URL should be crawled.
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            if (filters.matcher(href).matches()) {
                return false;
            }
            return href.startsWith("http://www.ics.uci.edu/");
        }

        // Called after the content of a URL has been downloaded successfully.
        public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            String url = page.getWebURL().getURL();
            String text = page.getText();
            ArrayList links = page.getURLs();
        }
    }

As the code above shows, there are two main functions that should be overridden:

* shouldVisit: decides whether the given URL should be crawled or not.
* visit: called after the content of a URL has been downloaded successfully. You can easily get the text, links, URL and docid of the downloaded page.

You should also implement a controller class that specifies the seeds of the crawl, the folder in which crawl data should be stored, and the number of concurrent threads:

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    public class Controller {
        public static void main(String[] args) throws Exception {
            // Folder in which intermediate crawl data is stored.
            CrawlController controller = new CrawlController("/data/crawl/root");
            // Seed URL from which crawling starts.
            controller.addSeed("http://www.ics.uci.edu/");
            // Start the crawl with 10 concurrent threads.
            controller.start(MyCrawler.class, 10);
        }
    }

Politeness

crawler4j is designed to be very efficient and can crawl domains very fast (for example, it has been able to crawl 200 Wikipedia pages per second). However, since this violates crawling policies and puts a huge load on servers (which might block you), crawler4j waits at least 200 milliseconds between requests by default since version 1.3. This parameter can be tuned with the "setPolitenessDelay" function on the controller.

Dependencies

The following libraries are used in the implementation of crawler4j. To make life easier, all of them are bundled in the "crawler4j-dependencies-lib.zip" package:

* Berkeley DB Java Edition 4.0.71 or higher
* fastutil 5.1.5
* DSI Utilities 1.0.10 or higher
* Apache HttpClient 4.0.1
* Apache Log4j 1.2.15
* Apache Commons Logging 1.1.1
* Apache Commons Codec 1.4

Source Code

The source code is available for checkout from this Subversion repository: https://crawler4j.googlecode.com/svn/trunk/
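As a follow-up to the Politeness section above, here is a minimal sketch of a controller that tunes the delay through the setPolitenessDelay function mentioned there. It assumes the same CrawlController API as the sample above; the class name PoliteController and the 500 ms value are purely illustrative, and newer crawler4j releases may configure this setting differently.

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    // Illustrative sketch (not part of the project description): the same setup as
    // the Controller example above, but with a longer politeness delay so the
    // crawler puts less load on the target servers.
    public class PoliteController {
        public static void main(String[] args) throws Exception {
            CrawlController controller = new CrawlController("/data/crawl/root");
            // Raise the minimum wait between requests from the default 200 ms to 500 ms.
            controller.setPolitenessDelay(500);
            controller.addSeed("http://www.ics.uci.edu/");
            controller.start(MyCrawler.class, 10);
        }
    }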

2.17K lines of code

0 current contributors

about 6 years since last commit

3 users on Open Hub

Activity Not Available

Smart Cache Loader

  Analyzed 28 days ago

Smart Cache Loader is a highly configurable web batch downloader. If you have very specific needs to grab certain portions of a web site, this is the right tool for you. The program can also be used as a web crawler if you need to crawl defined parts of one or more sites.

5.33K lines of code

0 current contributors

about 7 years since last commit

2 users on Open Hub

Inactive

LinkChecker

  Analyzed 11 months ago

Check websites and HTML documents for broken links. Features include:
* recursive and multithreaded checking
* output as colored or plain text, HTML, SQL, CSV, XML, or a sitemap graph in various formats
* support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links
* restriction of link checking with regular expression filters for URLs
* proxy support
* username/password authorization for HTTP, FTP and Telnet

101 lines of code

0 current contributors

over 4 years since last commit

2 users on Open Hub

Activity Not Available

Spidr

  Analyzed 3 months ago

Spidr is a versatile Ruby web spidering library that can spider a single site, multiple domains, or certain links, and can crawl indefinitely. Spidr is designed to be fast and easy to use.

3.85K lines of code

5 current contributors

8 months since last commit

1 user on Open Hub

Activity Not Available

ldspider

  Analyzed about 1 year ago

The ldspider project aims to build a web crawling framework for the linked data web. The requirements and challenges of crawling the linked data web differ from those of regular web crawling, so this project offers a web crawler adapted to traversing and harvesting sources and instances from the linked data web. The project is a co-operation between Andreas Harth at AIFB and Juergen Umbrich and Aidan Hogan at DERI.

14.4K lines of code

0 current contributors

almost 3 years since last commit

1 user on Open Hub

Activity Not Available

Anemone

  Analyzed 30 days ago

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

2.11K lines of code

0 current contributors

about 5 years since last commit

1 user on Open Hub

Inactive

WenChuan

  Analyzed almost 3 years ago

WebChuan is a set of open source libraries and tools for fetching and parsing web pages from websites. It is written in Python, based on Twisted and lxml, and is inspired by GStreamer. WebChuan is designed to be the back-end of a web bot; it is easy to use, powerful, flexible, reusable and efficient.

628 lines of code

0 current contributors

over 8 years since last commit

1 user on Open Hub

Activity Not Available

slbcrawler

  Analyzed almost 3 years ago

SLBCrawler is a highly scalable distributed web crawler written in Erlang.

6.4K lines of code

0 current contributors

about 8 years since last commit

0 users on Open Hub

Activity Not Available