Tags: Browse Projects

Select a tag to browse associated projects and drill deeper into the tag cloud.

Heritrix: Internet Archive Web Crawler

  Analyzed 4 months ago

The archive-crawler project is building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

335K lines of code

9 current contributors

4 months since last commit

10 users on Open Hub

Activity Not Available

Ex-Crawler

  Analyzed 29 days ago

The Ex-Crawler project is divided into three subprojects. The main part is the Ex-Crawler daemon server, a highly configurable, flexible web crawler written in Java; it comes with its own socket server for managing the server, its own user management, distributed grid / volunteer computing, and much more. Crawled information is stored in a database; MySQL, PostgreSQL and MSSQL are currently supported. The second part is the graphical (Java Swing) distributed grid / volunteer computing client, which includes PC idle detection and more. The third part is the web search engine, written in PHP; it comes with a CMS, multi-language detection and support, templates using Smarty, and an application framework partly forked from Joomla so that Joomla components can be adapted quickly.

72.1K lines of code

0 current contributors

about 6 years since last commit

3 users on Open Hub

Inactive

crawler4j

  Analyzed about 1 year ago

crawler4j is an open source Java crawler that provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes.

Sample Usage

First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded pages. The following is a sample implementation:

    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        // Skip URLs that point to static resources such as images, media files and archives.
        private final Pattern filters = Pattern.compile(
                ".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        public MyCrawler() {
        }

        // Decides whether the given URL should be crawled.
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            if (filters.matcher(href).matches()) {
                return false;
            }
            return href.startsWith("http://www.ics.uci.edu/");
        }

        // Called after the content of a URL has been downloaded successfully.
        public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            String url = page.getWebURL().getURL();
            String text = page.getText();
            ArrayList links = page.getURLs();
        }
    }

As the code above shows, there are two main functions that should be overridden:

* shouldVisit: decides whether the given URL should be crawled or not.
* visit: called after the content of a URL has been downloaded successfully. You can easily get the text, links, URL and docid of the downloaded page.

You should also implement a controller class that specifies the seeds of the crawl, the folder in which crawl data should be stored, and the number of concurrent threads:

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    public class Controller {
        public static void main(String[] args) throws Exception {
            // Folder in which intermediate crawl data is stored.
            CrawlController controller = new CrawlController("/data/crawl/root");
            // Seed URL from which crawling starts.
            controller.addSeed("http://www.ics.uci.edu/");
            // Start the crawl with 10 concurrent threads.
            controller.start(MyCrawler.class, 10);
        }
    }

Politeness

crawler4j is designed to be very efficient and can crawl domains very fast (for example, it has been able to crawl 200 Wikipedia pages per second). However, since this violates crawling policies and puts a huge load on servers (which might block you), crawler4j waits at least 200 milliseconds between requests by default since version 1.3. This parameter can be tuned with the "setPolitenessDelay" function on the controller.

Dependencies

The following libraries are used in the implementation of crawler4j. To make life easier, all of them are bundled in the "crawler4j-dependencies-lib.zip" package:

* Berkeley DB Java Edition 4.0.71 or higher
* fastutil 5.1.5
* DSI Utilities 1.0.10 or higher
* Apache HttpClient 4.0.1
* Apache Log4j 1.2.15
* Apache Commons Logging 1.1.1
* Apache Commons Codec 1.4

Source Code

The source code is available for checkout from this Subversion repository: https://crawler4j.googlecode.com/svn/trunk/
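As a follow-up to the Politeness section above, here is a minimal sketch of a controller that tunes the delay through the setPolitenessDelay function mentioned there. It assumes the same CrawlController API as the sample above; the class name PoliteController and the 500 ms value are purely illustrative, and newer crawler4j releases may configure this setting differently.

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    // Illustrative sketch (not part of the project description): the same setup as
    // the Controller example above, but with a longer politeness delay so the
    // crawler puts less load on the target servers.
    public class PoliteController {
        public static void main(String[] args) throws Exception {
            CrawlController controller = new CrawlController("/data/crawl/root");
            // Raise the minimum wait between requests from the default 200 ms to 500 ms.
            controller.setPolitenessDelay(500);
            controller.addSeed("http://www.ics.uci.edu/");
            controller.start(MyCrawler.class, 10);
        }
    }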

2.17K lines of code

0 current contributors

about 6 years since last commit

3 users on Open Hub

Activity Not Available

Smart Cache Loader

  Analyzed 28 days ago

Smart Cache Loader is a highly configurable web batch downloader. If you have very specific needs to grab certain portions of a web site, this is the right tool for you. The program can also be used as a web crawler if you need to crawl defined parts of one or more sites.

5.33K lines of code

0 current contributors

about 7 years since last commit

2 users on Open Hub

Inactive

LinkChecker

  Analyzed 11 months ago

Check websites and HTML documents for broken links. Features include:
* recursive and multithreaded checking
* output as colored or plain text, HTML, SQL, CSV, XML, or a sitemap graph in various formats
* support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links
* restriction of link checking with regular expression filters for URLs
* proxy support
* username/password authorization for HTTP, FTP and Telnet

101 lines of code

0 current contributors

over 4 years since last commit

2 users on Open Hub

Activity Not Available

Spidr

  Analyzed 3 months ago

Spidr is a versatile Ruby web spidering library that can spider a single site, multiple domains, or certain links, and can crawl indefinitely. Spidr is designed to be fast and easy to use.

3.85K lines of code

5 current contributors

8 months since last commit

1 user on Open Hub

Activity Not Available

ldspider

  Analyzed about 1 year ago

The ldspider project aims to build a web crawling framework for the linked data web. The requirements and challenges of crawling the linked data web differ from those of regular web crawling, so this project offers a web crawler adapted to traversing and harvesting sources and instances from the linked data web. The project is a co-operation between Andreas Harth at AIFB and Juergen Umbrich and Aidan Hogan at DERI.

14.4K lines of code

0 current contributors

almost 3 years since last commit

1 user on Open Hub

Activity Not Available

Anemone

  Analyzed 30 days ago

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

2.11K lines of code

0 current contributors

about 5 years since last commit

1 user on Open Hub

Inactive

WenChuan

  Analyzed almost 3 years ago

WebChuan is a set of open source libraries and tools for fetching and parsing web pages from websites. It is written in Python, based on Twisted and lxml, and is inspired by GStreamer. WebChuan is designed to be the back-end of a web bot; it is easy to use, powerful, flexible, reusable and efficient.

628 lines of code

0 current contributors

over 8 years since last commit

1 user on Open Hub

Activity Not Available

slbcrawler

  Analyzed almost 3 years ago

SLBCrawler is a highly scalable distributed web crawler written in Erlang.

6.4K lines of code

0 current contributors

about 8 years since last commit

0 users on Open Hub

Activity Not Available