Tags: Browse Projects


Natural Language Toolkit (NLTK)

Analyzed 3 days ago

NLTK — the Natural Language Toolkit — is a suite of open-source Python modules, linguistic data, and documentation for research and development in natural language processing. It supports dozens of NLP tasks, with distributions for Windows, Mac OS X, and Linux.
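As a toy illustration of two of the tasks NLTK covers — tokenization and frequency counting — the following stdlib-only sketch does crudely what NLTK's `nltk.word_tokenize` and `nltk.FreqDist` do properly (the regex tokenizer here is an assumption for illustration, not NLTK's algorithm):

```python
import re
from collections import Counter

def tokenize(text):
    # Crude regex tokenizer: words and standalone punctuation marks.
    # NLTK's nltk.word_tokenize handles clitics, abbreviations, etc.
    return re.findall(r"\w+|[^\w\s]", text.lower())

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(text)
freq = Counter(tokens)      # analogous in spirit to nltk.FreqDist(tokens)
print(freq.most_common(2))  # 'the' is the most frequent token
```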

214K lines of code

67 current contributors

8 days since last commit

45 users on Open Hub

High Activity
Rating: 5.0

Treex - NLP Framework

Analyzed 4 days ago

Treex (formerly TectoMT) is a highly modular NLP software system implemented in the Perl programming language under Linux. It is primarily aimed at machine translation, making use of the ideas and technology created during the Prague Dependency Treebank project. It is also hoped to significantly facilitate and accelerate the development of software solutions for many other NLP tasks, thanks to the reusability of its numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.

224K lines of code

7 current contributors

2 months since last commit

4 users on Open Hub

Moderate Activity
Rating: 5.0

krdwrd

Analyzed 8 days ago

krdwrd aims to use the internet as a linguistic corpus: it provides tools and infrastructure for the acquisition, visual annotation, merging, and storage of web pages as parts of larger corpora, and develops a classification engine that learns to annotate pages automatically, along with visual tools for inspecting the results.

117K lines of code

0 current contributors

over 3 years since last commit

2 users on Open Hub

Inactive
Rating: 5.0

Open-Content Text Corpus

Analyzed 4 days ago

The OCTC hosts open-content texts, encoded in TEI P5 XML, for many languages, each in a separate subcorpus. Another part of the OCTC stores interlanguage alignment information. The project is intended to be an open platform for academic and research projects of various kinds (tool-, markup-, or language-documentation-oriented) and for collaboration on multilingual corpus encoding in general, and on the application of the TEI Guidelines for that purpose in particular. ("TEI" stands for the Text Encoding Initiative, http://www.tei-c.org/)

0 lines of code

0 current contributors

Time since last commit not available

2 users on Open Hub

Activity Not Available
Rating: 0.0
Primary language not available
Licenses: GPL-3.0+

TestEl

Analyzed over 3 years ago

TestEl is a Java-based learning analyzer for HTML (and possibly other) structured documents. It can be trained to detect structures in such documents, and it renders its hits as XML.

7.59K lines of code

0 current contributors

about 9 years since last commit

1 user on Open Hub

Activity Not Available
Rating: 0.0

LexAt Lexical/Corpus Statistics

Analyzed 4 days ago

The LexAt ("lexical attraction") package, also known as the RelEx Statistical Linguistics package, adds statistical algorithms to RelEx. Corpus statistics, including mutual information, are maintained in an SQL database and drawn on to enhance various RelEx functions, such as parse ranking, chunk ranking, and word-sense disambiguation (the Mihalcea algorithm).
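The mutual-information statistic the blurb mentions can be illustrated with pointwise mutual information over bigram counts. This is a minimal stdlib sketch of the general idea, not LexAt's actual code (which keeps its counts in an SQL database); the toy corpus and function names are assumptions:

```python
import math
from collections import Counter

def pmi(pair_counts, word_counts, total_pairs, total_words, w1, w2):
    # Pointwise mutual information: log2( p(w1,w2) / (p(w1) * p(w2)) ).
    # High values mean the pair co-occurs far more than chance predicts.
    p_pair = pair_counts[(w1, w2)] / total_pairs
    p1 = word_counts[w1] / total_words
    p2 = word_counts[w2] / total_words
    return math.log2(p_pair / (p1 * p2))

corpus = "new york is a big city and new york never sleeps".split()
word_counts = Counter(corpus)
pair_counts = Counter(zip(corpus, corpus[1:]))  # adjacent bigrams
score = pmi(pair_counts, word_counts, len(corpus) - 1, len(corpus),
            "new", "york")
print(f"PMI(new, york) = {score:.2f}")  # strongly associated pair
```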

9.59K lines of code

0 current contributors

about 8 years since last commit

1 user on Open Hub

Inactive
Rating: 0.0

CORSIS

Analyzed 4 days ago

CORSIS (formerly Tenka Text) is a performance-oriented, open-source library for corpus analysis. It uses typed assembly, task-specific compilers, and parallelization to deliver high performance with an elegant design. The project's demonstration GUI comes with Wordlister, an advanced, extremely fast graphical wordlist tool, and a regex concordance tool. CORSIS positions itself as the open-source answer to WordSmith Tools.

0 lines of code

0 current contributors

Time since last commit not available

1 user on Open Hub

Activity Not Available
Rating: 0.0
Primary language not available
Licenses: GPL-3.0+

Affisix

Analyzed 4 days ago

Affisix is a program for the automatic recognition of affixes. It takes a large set of words and, according to user settings, tries to determine which segments of those words are prefixes.
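As a rough sketch of what prefix induction involves — not Affisix's actual algorithm, which is an assumption-free illustration here — one naive heuristic counts a segment as a candidate prefix whenever stripping it from a word leaves a stem that is itself attested in the word list:

```python
from collections import Counter

def candidate_prefixes(words, min_len=2, max_len=4):
    # Naive affix induction: "un" is a candidate prefix of "undo"
    # because the remaining stem "do" also occurs as a word.
    vocab = set(words)
    counts = Counter()
    for word in words:
        for n in range(min_len, min(max_len, len(word) - 1) + 1):
            prefix, stem = word[:n], word[n:]
            if stem in vocab:
                counts[prefix] += 1
    return counts

words = ["do", "undo", "redo", "load", "unload", "reload", "tie", "untie"]
print(candidate_prefixes(words).most_common(2))
# "un" and "re" emerge as the most productive prefixes
```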

6.13K lines of code

0 current contributors

almost 6 years since last commit

1 user on Open Hub

Inactive
Rating: 4.0

Ruby LinkParser

Analyzed 4 days ago

A high-level Ruby interface to the CMU Link Grammar. This binding wraps the link-grammar shared library provided by the AbiWord project for its grammar checker.

2.42K lines of code

1 current contributor

about 1 month since last commit

1 user on Open Hub

Very Low Activity
Rating: 0.0

moses-for-mere-mortals

Analyzed over 1 year ago

This project offers a set of Bash scripts and Windows add-in executables that together create a basic translation-chain prototype capable of processing very large corpora. It uses Moses, a widely known statistical machine translation system. The goal is to help build a translation chain for the real world, but it should also enable a quick evaluation of Moses for actual translation work and guide users through their first steps with Moses. The scripts cover installation, the creation of representative test files, training, translation, scoring, and the transfer of trainings between people or between several Moses installations.

A Help/Short Tutorial (http://moses-for-mere-mortals.googlecode.com/files/Help.odt) and a demonstration corpus are available; the corpus is too small to do justice to the qualitative results that can be achieved with Moses, but it gives a realistic view of the relative duration of the steps involved.

Two Windows add-ins allow the creation of Moses input files from *.TMX translation memories (Extract_TMX_Corpus.exe) and the creation of *.TMX files from Moses output files (Moses2TMX.exe), creating a synergy between machine translation and translation memories.

The scripts were tested on Ubuntu 9.04 (64-bit). Documents used for corpus training should be perfectly aligned and saved in UTF-8 character encoding; documents to be translated should also be in UTF-8. Users are expected, perhaps after trying the provided demonstration corpus, to move straight to real corpora and get results with them. Though already tested and used in actual work, this should be considered a work in progress: to protect users not yet fully acquainted with Moses, the scripts try to avoid mistakes that would cost them dearly in time and/or results, but they do not completely insulate them (especially from the consequences of malformed corpus files).

11K lines of code

2 current contributors

about 2 years since last commit

1 user on Open Hub

Activity Not Available
Rating: 0.0