Tags : Browse Projects

Select a tag to browse associated projects and drill deeper into the tag cloud.

Apache Spark

Claimed by Apache Software Foundation. Analyzed 7 months ago.

Apache Spark is an open source cluster computing system that aims to make data analytics fast, both to run and to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, more rapidly than with disk-based systems like Hadoop. To make programming faster, Spark offers high-level APIs in Scala, Java and Python, letting you manipulate distributed datasets like local collections. You can also use Spark interactively to query big data from the Scala or Python shells. Spark integrates closely with Hadoop to run inside Hadoop clusters and can access any existing Hadoop data source.

1.4M lines of code

380 current contributors

7 months since last commit

55 users on Open Hub

Activity Not Available
5.0

Apache HBase

Claimed by Apache Software Foundation. Analyzed about 4 hours ago.

HBase is the Hadoop database. It is an open-source, distributed, column-oriented store modeled after the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. Try it if your plans for a data store run to big.

862K lines of code

146 current contributors

about 9 hours since last commit

30 users on Open Hub

Very High Activity
5.0

Apache Hive

Claimed by Apache Software Foundation. Analyzed 7 months ago.

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data and a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

1.72M lines of code

114 current contributors

8 months since last commit

24 users on Open Hub

Activity Not Available
5.0

Apache Mahout

Claimed by Apache Software Foundation. Analyzed 5 months ago.

Apache Mahout's goal is to build scalable machine learning libraries. By scalable we mean scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance for non-distributed algorithms as well.

144K lines of code

2 current contributors

11 months since last commit

24 users on Open Hub

Activity Not Available
3.6

Apache Accumulo

Claimed by Apache Software Foundation. Analyzed about 5 hours ago.

Apache Accumulo is a sorted, distributed key/value store based on Google's Bigtable design. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. It features a few novel improvements on the Bigtable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

454K lines of code

39 current contributors

2 days since last commit

24 users on Open Hub

High Activity
0.0

Apache Pig

Claimed by Apache Software Foundation. Analyzed about 5 hours ago.

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

359K lines of code

5 current contributors

6 days since last commit

11 users on Open Hub

Low Activity
5.0
Tags: hadoop, pig

Apache Flink

Claimed by Apache Software Foundation. Analyzed about 6 hours ago.

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale. Learn more about Flink at http://flink.apache.org/

1.12M lines of code

279 current contributors

6 months since last commit

9 users on Open Hub

Very High Activity
5.0

AppScale

Analyzed about 2 hours ago.

AppScale is an open-source implementation of the Google App Engine (GAE) cloud computing interface. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.

1.14M lines of code

11 current contributors

6 days since last commit

7 users on Open Hub

High Activity
5.0

Apache Impala

Claimed by Apache Software Foundation. Analyzed about 3 hours ago.

Apache Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

586K lines of code

64 current contributors

about 1 month since last commit

7 users on Open Hub

Very High Activity
5.0

Apache Avro

Claimed by Apache Software Foundation. Analyzed about 6 hours ago.

Avro is a serialization system.

184K lines of code

65 current contributors

5 days since last commit

6 users on Open Hub

High Activity
0.0