A word clustering approach to domain adaptation: Robust parsing of source and target domains

@article{Seddah2014AWC,
  title={A word clustering approach to domain adaptation: Robust parsing of source and target domains},
  author={Djam{\'e} Seddah and Marie Candito and Enrique Henestroza Anguiano},
  journal={J. Log. Comput.},
  year={2014},
  volume={24},
  pages={395--411}
}
We present a technique to improve out-of-domain statistical parsing by reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with target-domain data. We also investigate the impact of guiding out-of-domain parsing with predicted part-of-speech tags. We provide an evaluation for French, and obtain improvements in performance for both non-technical and technical target domains…
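The terminal-replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cluster identifiers, the toy word-to-cluster map, the lowercasing step, and the `C_UNK` fallback symbol are all assumptions made for illustration only. In practice the map would come from an unsupervised clustering tool run on a large raw corpus.

```python
from typing import Dict, List

# Hypothetical fallback symbol for words absent from the cluster map.
UNKNOWN_CLUSTER = "C_UNK"


def replace_with_clusters(tokens: List[str],
                          word_to_cluster: Dict[str, str]) -> List[str]:
    """Replace each token with its cluster symbol.

    Lowercasing before lookup is an assumption of this sketch, not
    something stated in the abstract.
    """
    return [word_to_cluster.get(tok.lower(), UNKNOWN_CLUSTER) for tok in tokens]


# Toy cluster map with made-up identifiers. Two words sharing a cluster
# become the same terminal symbol, which is how sparseness is reduced.
toy_clusters = {
    "le": "C_0110",
    "patient": "C_1011",
    "médicament": "C_1011",
    "prend": "C_0011",
}

sentence = ["Le", "patient", "prend", "le", "médicament", "prescrit"]
clustered = replace_with_clusters(sentence, toy_clusters)
```

Here `patient` and `médicament` map to the same symbol, so a grammar trained on the clustered text sees one frequent terminal in place of two rarer ones, and the unseen word `prescrit` falls back to the placeholder symbol.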
The Devil is in the Details: Parsing Unknown German Words
TLDR
It is demonstrated that methods that have improved results for other languages do not transfer directly to German, and that one can obtain better results using a simplistic model rather than a more generalized model for rare and unknown word handling.
Large-scale deep linguistic processing IN COLLABORATION WITH: Analyse Linguistique Profonde A Grande Echelle (ALPAGE)
The general aim of PARSEME is increasing and enhancing the ICT support of the European multilingual heritage. This aim is pursued via more detailed objectives: (1) to put multilingualism in focus of…
Multilingual discriminative shift reduce phrase structure parsing for the SPMRL 2014 shared task
TLDR
The design of a multilingual, lexicalized, discriminative shift-reduce phrase-structure parser used to parse the SPMRL 2014 shared task data set is described.
De l'étiquetage syntaxique pour les grammaires catégorielles de dépendances à l'analyse par transition dans le domaine de l'analyse en dépendances non-projective. (From syntactic tagging for categorial dependency grammars to transition-based parsing in the domain of non-projective dependency parsing)
This thesis falls within the field of dependency parsing. On the one hand, we study the impact of a statistical syntactic tagging method on a parser based on the…
Efficient Latent-variable Grammars : Learning and Inference
TLDR
A selection of images from around the world show the efforts towards in-situ awareness that has been implemented at a number of Wikimedia projects.

References

SHOWING 1-10 OF 49 REFERENCES
Simple Semi-supervised Dependency Parsing
TLDR
This work focuses on the problem of lexical representation, introducing features that incorporate word clusters derived from a large unannotated corpus, and shows that the cluster-based features yield substantial gains in performance across a wide range of conditions.
Automatic Domain Adaptation for Parsing
TLDR
The resulting system proposes linear combinations of parsing models trained on the source corpora and outperforms all non-oracle baselines, including the best domain-independent parsing model.
Improving generative statistical parsing with semi-supervised word clustering
TLDR
A semi-supervised method to improve statistical parsing performance is presented, combining lexicon-aided morphological clustering, which preserves tagging ambiguity, with unsupervised word clustering trained on a large unannotated corpus.
Parsing Word Clusters
TLDR
It is found that replacing word forms with clusters improves attachment performance for words that are originally either unknown or low-frequency, since these words are replaced by cluster symbols that tend to have higher frequencies.
Parsing Biomedical Literature
TLDR
It is shown how existing domain-specific lexical resources (part-of-speech tags, dictionary collocations, and named entities) may be leveraged to augment PTB training without requiring in-domain treebank data.
Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging
TLDR
It is found that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
Coupling an Annotated Corpus and a Morphosyntactic Lexicon for State-of-the-Art POS Tagging with Less Human Effort
TLDR
It is found that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
Enriching a French Treebank
TLDR
The current status of the French treebank is presented: it is fully annotated and disambiguated for parts of speech, inflectional morphology, compounds and lemmas, and syntactic constituents, and is now being enriched with functional information and used for parsing evaluation.
On Statistical Parsing of French with Supervised and Semi-Supervised Strategies
TLDR
This paper investigates how to best train a parser on the French Treebank, viewing the task as a trade-off between generalizability and interpretability, and compares a supervised lexicalized parsing algorithm with a semi-supervised unlexicalized algorithm along the lines of Crabbé and Candito (2008).
Reranking and Self-Training for Parser Adaptation
TLDR
The reranking parser described in Charniak and Johnson (2005) improves performance of the parser on Brown to 85.2%, and use of the self-training techniques described in McClosky et al. (2006) raises this to 87.8% (an error reduction of 28%), again without any use of labeled Brown data.