Corpus ID: 1090094

ASV Toolbox: a Modular Collection of Language Exploration Tools

@inproceedings{Biemann2008ASVTA,
  title={ASV Toolbox: a Modular Collection of Language Exploration Tools},
  author={Chris Biemann and Uwe Quasthoff and Gerhard Heyer and Florian Holz},
  booktitle={LREC},
  year={2008}
}
ASV Toolbox is a modular collection of tools for the exploration of written language data for both scientific and educational purposes. It includes modules that operate on word lists or texts and allow users to perform various linguistic annotation, classification and clustering tasks, including language detection, POS-tagging, base form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern…
Using Semantics for Granularities of Tokenization
TLDR
The methodology paves the way to automatic detection of lexical units beyond standard tokenization techniques without language-specific preprocessing steps such as POS tagging.
Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to identification of meaningful and useful language…
Data Selection with Cluster-Based Language Difference Models and Cynical Selection
TLDR
The recently proposed cynical data selection method is validated, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and more closely matches the sentence length of the task corpus.
Scalable Construction of High-Quality Web Corpora
TLDR
This article first focuses on web crawling and the pros and cons of the existing crawling strategies, and describes how the crawled data can be linguistically pre-processed in a parallelized way that allows the processing of web-scale input data.
Top-Level Domain Crawling for Producing Comprehensive Monolingual Corpora from the Web
TLDR
This paper proposes a general workflow for harvesting, cleaning and processing web data from entire top-level domains in order to produce high-quality monolingual corpora using the least amount of language-specific data.
Computational approaches to the comparison of regional variety corpora: prototyping a semi-automatic system for German
Regional varieties of pluri-centric languages such as German are generally very similar with respect to their structure and the linguistic phenomena that occur. The extraction of differences is thus…
Approaches to Automatic Text Structuring
TLDR
Two prototypes of text structuring systems are presented, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks, and the effect of senses on computing similarities is analyzed.
Text: now in 2D! A framework for lexical expansion with contextual similarity
A new metaphor of two-dimensional text for data-driven semantic modeling of natural language is proposed, which provides an entirely new angle on the representation of text: not only syntagmatic…
Domain-Specific Corpus Expansion with Focused Webcrawling
This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the…
SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines
TLDR
The SemRelData dataset is presented, which contains annotations of semantic relations between nominals in the context of one paragraph. It shows that knowledge bases not only have coverage gaps but also fail to account for semantic relations that are manifested only in particular contexts, yet still play an important role in text cohesion.

References

SHOWING 1-10 OF 39 REFERENCES
Stemming and Decompounding for German Text Retrieval
TLDR
The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance.
Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases
TLDR
The type of analysis used (surface grammatical analysis) is highlighted, as is the methodological approach adopted to adapt the rules (experimental approach).
Weka: Practical machine learning tools and techniques with Java implementations
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely…
Highlights: Language- and Domain-Independent Automatic Indexing Terms for Abstracting
TLDR
The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer, allowing operation in any language or domain with only trivial modification.
Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems
TLDR
The performance of Chinese Whispers is measured on Natural Language Processing (NLP) problems as diverse as language separation, acquisition of syntactic word classes and word sense disambiguation.
Unsupervised and Knowledge-Free Learning of Compound Splits and Periphrases
TLDR
An approach for knowledge-free and unsupervised recognition of compound nouns for languages that use one-word compounds, such as Germanic and Scandinavian languages, is presented, showing promising results above 80% precision for the splits and about half of the compounds periphrased correctly.
Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering
TLDR
An unsupervised part-of-speech (POS) tagging system that relies on graph clustering methods is presented; a Viterbi POS tagger is trained and then refined by a morphological component.
Unsupervised Learning of Naive Morphology with Genetic Algorithms
TLDR
An attempt to use minimal description length (MDL) as the only bias for deriving lexicons of morphemes from a raw list of words. MDL is used as the fitness function of a simple genetic algorithm.
NLTK: The Natural Language Toolkit
NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and…
Unsupervised Learning of Word Segmentation Rules with Genetic Algorithms and Inductive Logic Programming
TLDR
This article presents a combination of unsupervised and supervised learning techniques for generating word segmentation rules from a raw list of words. The rules produce segmentations that are linguistically meaningful and conform to a large degree to the annotation provided.