Projects tagged ‘corpus’

Natural Language Toolkit (NLTK)

Analyzed about 5 hours ago

NLTK, the Natural Language Toolkit, is a suite of open source Python modules, linguistic data, and documentation for research and development in natural language processing. It supports dozens of NLP tasks and is distributed for Windows, Mac OS X, and Linux.

235K lines of code

42 current contributors

10 days since last commit

45 users on Open Hub

Moderate Activity

0 Reviews


Mostly written in Python

Licenses: apache_2

Text Encoding Initiative

Analyzed 1 day ago

The TEI is an international and interdisciplinary community-based open standard used by research projects, libraries, museums, publishers, and academics to represent all kinds of literary and linguistic texts, using an encoding scheme that is maximally expressive and minimally obsolescent.

583K lines of code

14 current contributors

7 days since last commit

3 users on Open Hub

Moderate Activity

0 Reviews


Mostly written in XSL Transformation

Licenses: BSD-2-Clause, cc-by-3

Tags corpus corpus_linguistics digitalhumanities multilingual tei tei_c text_processing xml xml_schema xslt

krdwrd

Analyzed about 12 hours ago

Use the internet as a linguistic corpus: provide tools and infrastructure for acquisition, visual annotation, merging, and storage of web pages as parts of bigger corpora. Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspecting the results.

3.35K lines of code

1 current contributor

about 6 years since last commit

2 users on Open Hub

Inactive

0 Reviews


Mostly written in TeX/LaTeX

Licenses: gpl

Tags addon corpora corpus firefox linguistics machine_learning xul

Atomic (multi-level annotation)


Analyzed 1 day ago

Software for multi-level annotation of linguistic corpora

17K lines of code

0 current contributors

over 8 years since last commit

1 user on Open Hub

Inactive

0 Reviews


Mostly written in Java

Licenses: apache_2

Tags annotation corpus eclipsercp linguistics multi-layer science

Greenstone


No analysis available

Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed ...

0 lines of code

0 current contributors

0 since last commit

1 user on Open Hub

Activity Not Available

0 Reviews


Mostly written in language not available

Licenses: gpl

Tags c corpus cross-platform digitallibrary dublincore library metadata naturallanguages perl text textprocessing unesco 2 more...

IMS Open Corpus Workbench


Analyzed over 1 year ago

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

281K lines of code

2 current contributors

over 1 year since last commit

1 user on Open Hub

Activity Not Available

0 Reviews


Mostly written in PHP

Licenses: No declared licenses

Tags corpus index language linguistics query search xml

LexAt Lexical/Corpus Statistics


No analysis available

LexAt ("lexical attraction"), also known as the RelEx Statistical Linguistics package, adds statistical algorithms to RelEx. Corpus statistics, including mutual information, are maintained in an SQL database and drawn on to enhance various RelEx functions, such as parse ranking, chunk ranking, and word-sense disambiguation (Mihalcea algorithm).

0 lines of code

0 current contributors

0 since last commit

1 user on Open Hub

Activity Not Available

0 Reviews


Mostly written in language not available

Licenses: apache_2

Tags computational_linguistics corpora corpus corpus_linguistics database java linguistics natural_language natural_language_processing nlp opencog perl 1 more...

opencorpora


Analyzed 1 day ago

An engine for creating and annotating textual corpora

38.6K lines of code

3 current contributors

about 2 years since last commit

1 user on Open Hub

Inactive

0 Reviews


Mostly written in PHP

Licenses: gpl

Tags computational_linguistics corpora corpus corpus_linguistics crowdsourcing disambiguation linguistics natural-language-processing natural_language_processing nlp part_of_speech russian 1 more...

stem-search.vim


Analyzed about 19 hours ago

StmSrch is a reverse-stem searching script. It implements the Porter stemming algorithm, by Martin Porter. It also handles irregular verbs and noun pluralizations. This script can be useful for searching or scanning through corpus files. Each word input to the :StmSrch command will be stemmed ...

308 lines of code

0 current contributors

over 15 years since last commit

0 users on Open Hub

Inactive

0 Reviews


Mostly written in Vim Script

Licenses: mit

Tags corpus corpus_linguistics nlp porter search stem vim

Zeitcrawler


Analyzed 1 day ago

A specialized crawler for the German newspaper 'Die Zeit'. Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes, stripping the text of each article out of the HTML formatting and saving it into a raw text ...

1.64K lines of code

0 current contributors

over 11 years since last commit

0 users on Open Hub

Inactive

0 Reviews


Mostly written in Perl

Licenses: gpl3

Tags academic computational_linguistics corpus corpus_linguistics crawler digital_humanities natural_language_processing nlp perl unix webcrawler xml
