Tags : Browse Projects

Heritrix: Internet Archive Web Crawler

  Analyzed 25 days ago

The archive-crawler project is building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

335K lines of code

9 current contributors

26 days since last commit

10 users on Open Hub

Moderate Activity
4.33333
   
Tags webcrawler

Ex-Crawler

  Analyzed 7 months ago

The Ex-Crawler project is divided into three subprojects. The main part is the Ex-Crawler daemon server, a highly configurable, flexible (web) crawler written in Java. It comes with its own socket server for managing the server, its own user management, distributed grid / volunteer computing, and much more. Crawled information is stored in a database; MySQL, PostgreSQL and MSSQL are currently supported. The second part is a graphical (Java Swing) distributed grid / volunteer computing client, including PC idle detection and much more. The third part is a web search engine written in PHP. It comes with a CMS, multi-language detection and support, and templates using Smarty, plus an application framework partly forked from Joomla so that Joomla components can be adapted quickly.

72.1K lines of code

0 current contributors

over 6 years since last commit

3 users on Open Hub

Activity Not Available
5.0
 

Smart Cache Loader

  Analyzed 15 days ago

Smart Cache Loader is the most configurable web batch downloader in the world! If you have very specific needs to grab portions of a web site, this is the right tool for you. The program can also be used as a web crawler if you need to crawl defined parts of one or more web sites.

5.33K lines of code

0 current contributors

over 7 years since last commit

2 users on Open Hub

Inactive
4.0
   

LinkChecker

  Analyzed 16 days ago

Check websites and HTML documents for broken links.
* recursive and multithreaded checking
* output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats
* HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links support
* restriction of link checking with regular expression filters for URLs
* proxy support
* username/password authorization for HTTP, FTP and Telnet

101 lines of code

0 current contributors

almost 5 years since last commit

2 users on Open Hub

Inactive
3.0
   

WebChuan

  Analyzed about 3 years ago

WebChuan is a set of open source libraries and tools for fetching and parsing the pages of a website. It is written in Python, based on Twisted and lxml, and is inspired by GStreamer. WebChuan is designed to be the back end of a web bot. It is easy to use, powerful, flexible, reusable and efficient.

628 lines of code

0 current contributors

about 9 years since last commit

1 user on Open Hub

Activity Not Available
0.0
 

Spidr

  Analyzed 13 days ago

Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

3.85K lines of code

1 current contributor

8 months since last commit

1 user on Open Hub

Very Low Activity
0.0
 

Anemone

  Analyzed 22 days ago

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

2.11K lines of code

0 current contributors

over 5 years since last commit

1 user on Open Hub

Inactive
5.0
 

German Political Speeches Corpus-Builder

  Analyzed 22 days ago

Tools to crawl repositories of official German speeches in order to gather a corpus. More information to come. A complete version of the corpus, including a visualization tool, is available here: http://purl.org/corpus/german-speeches

1.08K lines of code

0 current contributors

about 4 years since last commit

0 users on Open Hub

Inactive
0.0
 

Microblog Explorer

  Analyzed 10 days ago

Crawls social networks (identi.ca, reddit) to gather internal and external links and identify the language of the linked content.

1.04K lines of code

0 current contributors

over 4 years since last commit

0 users on Open Hub

Inactive
0.0
 

mediacloud_backend

  Analyzed 22 days ago

MediaCloud backend repository

23K lines of code

0 current contributors

almost 2 years since last commit

0 users on Open Hub

Very Low Activity
0.0
 