NLTK — the Natural Language Toolkit — is a suite of open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux.
Treex (formerly TectoMT) is a highly modular NLP software system implemented in the Perl programming language under Linux.
It is primarily aimed at machine translation, making use of the ideas and technology created during the Prague Dependency Treebank project. It is also intended to significantly facilitate and accelerate the development of software solutions for many other NLP tasks, especially through the reusability of its numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.
Use the internet as a linguistic corpus:
Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.
Develop a classification engine that learns to automatically annotate pages, provide visual tools for inspection of results.
This site offers a set of Bash scripts and Windows executable add-ins that, together, create a basic translation chain prototype capable of processing very large corpora. It uses Moses, a widely known statistical machine translation system.
The idea is to help build a translation chain for the real world, but it should also enable a quick evaluation of Moses for actual translation work and guide users in their first steps of using Moses. The scripts cover installation, the creation of representative test files, training, translation, scoring, and the transfer of trainings between persons or between several Moses installations.
A Help/Short Tutorial (http://moses-for-mere-mortals.googlecode.com/files/Help.odt) and a demonstration corpus (too small to do justice to the qualitative results that can be achieved with Moses, but capable of giving a realistic view of the relative duration of the steps involved) are available.
Two Windows add-ins allow the creation of Moses input files from *.TMX translation memories (Extract_TMX_Corpus.exe), as well as the creation of *.TMX files from Moses output files (Moses2TMX.exe). A synergy between machine translation and translation memories is therefore created.
The scripts were tested in Ubuntu 9.04 (64-bit version). Documents used for corpora training should be perfectly aligned and saved in UTF-8 character encoding. Documents to be translated should also be in UTF-8 format. One would expect the users of these scripts, perhaps after having tried the provided demonstration corpus, to immediately use and get results with the real corpora they are interested in.
Though already tested and used in actual work, this should be considered a work in progress. To protect users not yet completely acquainted with Moses, these scripts try to avoid mistakes that would cost them dearly in terms of time and/or results, but they do not completely insulate them (especially from the consequences of malformed corpora files).
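The alignment and UTF-8 requirements above can be checked before training. The following is a minimal sketch, not part of the scripts described here: the function name `check_parallel_corpus` and the file layout (one segment per line, one file per language) are assumptions made for illustration.

```python
from pathlib import Path


def check_parallel_corpus(source_path, target_path):
    """Return a list of problems found in a line-aligned corpus pair.

    Hypothetical helper: assumes one segment per line and one file
    per language, which the source does not specify in detail.
    """
    problems = []
    lines = {}
    for path in (source_path, target_path):
        raw = Path(path).read_bytes()
        try:
            raw.decode("utf-8")  # validation only; the result is discarded
        except UnicodeDecodeError as err:
            problems.append(f"{path}: not valid UTF-8 ({err.reason})")
            continue
        # Splitting the bytes breaks only on \n, \r and \r\n.
        lines[path] = raw.splitlines()

    if len(lines) == 2:
        src, tgt = lines[source_path], lines[target_path]
        if len(src) != len(tgt):
            problems.append(f"line count mismatch: {len(src)} vs {len(tgt)}")
        # Flag pairs where exactly one side is blank.
        for number, (s, t) in enumerate(zip(src, tgt), start=1):
            if bool(s.strip()) != bool(t.strip()):
                problems.append(f"line {number}: one side is empty")
    return problems
```

An empty list means that no problem of these kinds was detected. It does not prove that the segments are correct translations of each other.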
LexAt ("lexical attraction"), also known as the RelEx Statistical Linguistics package, adds statistical algorithms to RelEx. Corpus statistics, including mutual information, are maintained in an SQL database and drawn on to enhance various RelEx functions, such as parse ranking, chunk ranking, and word-sense disambiguation (the Mihalcea algorithm).
Affisix is a program for automatic recognition of affixes. It takes a large number of words and, according to the user's settings, tries to determine which segments of these words are prefixes.
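As an illustration of the kind of corpus statistic mentioned above, the sketch below computes pointwise mutual information for adjacent word pairs. It is not LexAt's implementation: the function name, the whitespace tokenization, and the in-memory counters (standing in for the SQL database) are all assumptions.

```python
import math
from collections import Counter


def pointwise_mutual_information(sentences):
    """Map each adjacent word pair to log2(p(x, y) / (p(x) * p(y))).

    Illustrative only: probabilities are plain relative frequencies
    over whitespace-separated tokens, with no smoothing.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (left, right), count in pair_counts.items():
        p_pair = count / total_pairs
        p_left = word_counts[left] / total_words
        p_right = word_counts[right] / total_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores
```

A higher score means that the two words occur together more often than their individual frequencies would predict. A parser could use such a signal to prefer one candidate linkage over another.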
A high-level interface to the CMU Link Grammar.
This binding wraps the link-grammar shared library provided by the AbiWord project for their grammar-checker.
CSniper (Corpus Sniper) is a tool that implements (i) a web-based multi-user scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotation-by-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.
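Inter-rater agreement can be measured in several ways, and the description above does not say which measure CSniper uses. As one common option for two annotators, the sketch below computes Cohen's kappa. The function name and the list-based input format are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Chosen here as an example measure; not confirmed to be the one
    the tool itself reports.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance. With more than two annotators, a measure such as Fleiss' kappa would be needed instead.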
CorpusCatcher is a corpus collection toolset. It can help you build language- or topic-specific corpora from publicly available web resources. Such corpora are useful for many purposes, especially as data for building spell checkers.