Tags : Browse Projects

ht://Dig

  Analyzed 3 days ago

The ht://Dig system is a complete WWW indexing and searching system for a domain or intranet. This system is not meant to replace the need for internet-wide search systems like Lycos, Infoseek, Google, and AltaVista. Instead, it is meant to cover the search needs for a single company, campus, or even a particular sub-section of a Web site.

507K lines of code

0 current contributors

over 12 years since last commit

20 users on Open Hub

Inactive
3.4
   

YaCy

  Analyzed 17 days ago

YaCy is a P2P search engine for the WWW, including a crawler and an HTTP proxy.

256K lines of code

15 current contributors

17 days since last commit

12 users on Open Hub

High Activity
4.71429
   

Grub

  Analyzed 1 day ago

Grub Next Generation is a distributed web crawling system (clients/servers) that helps to build and maintain a free (as in freedom) index of the Web. At this moment we also have a very simple search engine (as a proof of concept).

40.6K lines of code

0 current contributors

over 6 years since last commit

7 users on Open Hub

Inactive
4.0
   
Licenses: BSD-3-Clause, GPL-3.0+

Apache ManifoldCF

Claimed by Apache Software Foundation. Analyzed 5 days ago

ManifoldCF is an effort to provide an open source framework for connecting source content repositories, like Microsoft SharePoint and EMC Documentum, to target repositories or indexes, such as Apache Solr. ManifoldCF also defines a security model for target repositories that permits them to enforce source-repository security policies.

0 lines of code

0 current contributors

0 since last commit

6 users on Open Hub

Activity Not Available
0.0
 
Mostly written in language not available
Licenses: Apache-2.0

OpenSearchServer

  Analyzed 1 day ago

OpenSearchServer is a powerful, enterprise-class search engine program. With the web user interface, the crawlers (web, file, database, ...) and its REST API, you will be able to quickly and easily integrate advanced full-text search capabilities into your application. OpenSearchServer runs on Windows and Linux/Unix/BSD. Features: multilingual lemmatization, spellcheck, stop words, synonyms, facets, filters, web crawler, database crawler, local and remote file system crawler, document indexing with OCR, REST with XML or JSON, and a SOAP API.

113K lines of code

2 current contributors

2 months since last commit

4 users on Open Hub

Low Activity
5.0
 

Serialist

  Analyzed about 6 years ago

Serialist crawls serial stories on the web (such as webcomics), and provides a web interface for users to navigate these serials, mark where they left off, and find out when new pages exist.

2.73K lines of code

2 current contributors

almost 7 years since last commit

3 users on Open Hub

Activity Not Available
5.0
 

Murloc

  Analyzed 5 days ago

Murloc is a website parser framework written in PHP. It features a plugin system that allows quick and easy development of new parsers. It provides several functions that are more or less commonly needed by parsers. Murloc comes with a bunch of handy plugins. As it is written in PHP, it should run on pretty much any platform, although some of its features are missing on Microsoft Windows systems.

0 lines of code

0 current contributors

0 since last commit

3 users on Open Hub

Activity Not Available
5.0
 
Mostly written in language not available
Licenses: BSD-2-Clause

Ex-Crawler

  Analyzed 4 days ago

The Ex-Crawler Project is divided into three subprojects. The main part is the Ex-Crawler daemon server, a highly configurable, flexible (Web-) crawler written in Java. It comes with its own socket server for managing the server, its own user management, distributed grid / volunteer computing, and much more. Crawled information is stored in a database; currently MySQL, PostgreSQL and MSSQL are supported. The second part is the graphical (Java Swing) distributed grid / volunteer computing client, including PC idle detection and much more. The third part is the web search engine, written in PHP. It comes with a CMS, multi-language detection and support, templates using Smarty, and an application framework partly forked from Joomla, so that Joomla components can be adapted quickly.

0 lines of code

0 current contributors

0 since last commit

3 users on Open Hub

Activity Not Available
5.0
 
Mostly written in language not available
Licenses: GPL-3.0+

crawler4j

  Analyzed almost 2 years ago

Crawler4j is an open source Java crawler which provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes!

Sample usage: First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:

    import java.util.ArrayList;
    import java.util.regex.Pattern;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {
        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
        public My ...
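
The crawler4j sample above is cut off before it shows how the filter pattern is used. As a rough, standalone illustration of the idea (deciding which URLs should be crawled), the sketch below applies the same regular expression with plain java.util.regex only. The class name UrlFilterSketch and the helper shouldCrawl are hypothetical and not part of crawler4j; lower-casing the URL before matching is also an assumption of this sketch.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Same extension pattern as in the sample above: it matches URLs that
    // end in a stylesheet, script, image, audio, video, or archive extension.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Hypothetical helper (not a crawler4j method): returns true when the
    // URL does NOT end in one of the filtered extensions.
    public static boolean shouldCrawl(String url) {
        // Assumption of this sketch: compare case-insensitively by lower-casing.
        String href = url.toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl("http://example.com/index.html")); // prints true
        System.out.println(shouldCrawl("http://example.com/logo.PNG"));   // prints false
    }
}
```

In a real crawler4j crawler, a check like this would presumably live inside the class that extends WebCrawler; the exact method it belongs in is not shown in the truncated sample.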

2.17K lines of code

0 current contributors

almost 7 years since last commit

3 users on Open Hub

Activity Not Available
5.0
 

Eclipse SMILA

Claimed by Eclipse Foundation. Analyzed about 3 years ago

The amount and diversity of information is growing exponentially, mainly in the area of unstructured data, like emails, text files, blogs, images etc. Poor data accessibility, user rights integration and the lack of semantic metadata are constraining factors for building next-generation enterprise search and other document-centric applications. Missing standards result in proprietary solutions with huge short- and long-term costs. SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, like connectors to most relevant data sources.

314K lines of code

5 current contributors

about 3 years since last commit

2 users on Open Hub

Activity Not Available
0.0
 