Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

@inproceedings{Nguyen2021TrankitAL,
  title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
  author={Minh Nguyen and Viet Dac Lai and Amir Pouran Ben Veyseh and Thien Huu Nguyen},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics},
  year={2021}
}
We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines over sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive… 
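
As a concrete illustration of the pipeline interface the abstract describes, here is a minimal usage sketch based on the toolkit's documented API; treat the exact output field names as assumptions that may differ by version:

```python
# Minimal sketch of Trankit's pipeline interface (per its documentation;
# exact field names may vary across versions).
from trankit import Pipeline

# Initialize a pretrained English pipeline; model weights are downloaded
# on first use.
p = Pipeline('english')

# Calling the pipeline on raw text runs the full stack: sentence
# segmentation, tokenization, POS/morphological tagging, lemmatization,
# and dependency parsing.
doc = p('Trankit segments, tags, and parses raw text in one call.')
for sentence in doc['sentences']:
    for token in sentence['tokens']:
        print(token['text'], token.get('upos'), token.get('deprel'))
```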

Citations

A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing

A new, freely available UD treebank of Hebrew, stratified across a range of topics selected from Hebrew Wikipedia, is presented, and the first cross-domain parsing experiments in Hebrew are conducted.

Punctuation Restoration

This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts, and shows that popular natural language processing toolkits are incapable of detecting sentence boundaries in non-punctuated transcripts of livestreaming videos.

SlovakBERT: Slovak Masked Language Model

A new Slovak masked language model called SlovakBERT is introduced, together with fine-tuned models for part-of-speech tagging, sentiment analysis, and semantic textual similarity.

Enhanced Universal Dependency Parsing with Automated Concatenation of Embeddings

This paper describes the system used in the submission to the IWPT 2021 Shared Task: a graph-based parser with Automated Concatenation of Embeddings (ACE), which automatically finds the best concatenation of embeddings for the task of parsing enhanced universal dependencies.
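
The core mechanic is simple even though the search is not: candidate embeddings for the same sentence are concatenated along the feature dimension and fed to the parser, and ACE automates the search over which subset to concatenate. A minimal illustrative fragment follows (not the authors' implementation; the shapes and embedding sources are hypothetical):

```python
import torch

# Hypothetical embedding outputs for one 8-token sentence, e.g. from
# BERT, XLM-R, and fastText (batch, seq_len, dim).
bert_out = torch.randn(1, 8, 768)
xlmr_out = torch.randn(1, 8, 1024)
fasttext_out = torch.randn(1, 8, 300)

# ACE searches over subsets of candidate embeddings; each candidate
# subset is concatenated along the feature dimension and scored by the
# downstream graph-based parser's dev performance.
selected = [bert_out, xlmr_out]             # one point in the search space
parser_input = torch.cat(selected, dim=-1)  # shape: (1, 8, 768 + 1024)
print(parser_input.shape)
```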

Benchmarking Pre-trained Language Models for Multilingual NER: TraSpaS at the BSNLP2021 Shared Task

TraSpaS, a submission to the third shared task on named entity recognition hosted as part of the Balto-Slavic Natural Language Processing (BSNLP) Workshop, is described; results show that the Trankit-based models outperformed those based on the other two toolkits, even when trained on smaller amounts of data.

Joint Extraction of Entities, Relations, and Events via Modeling Inter-Instance and Inter-Label Dependencies

Noise Contrastive Estimation is introduced to handle the intractable joint likelihood during model training, and Simulated Annealing is presented to find the globally optimal assignment of instance labels at decoding time.
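
For readers unfamiliar with that decoding step, the following is a generic simulated-annealing search over a discrete label assignment; it sketches only the general technique, with a caller-supplied scoring function standing in for the paper's joint model:

```python
import math
import random

def simulated_annealing(labels, candidates, score, steps=1000, t0=1.0, cooling=0.995):
    """Search for a high-scoring joint label assignment.

    labels:     initial assignment, one label per instance (list)
    candidates: candidate label set per instance (list of lists)
    score:      callable mapping a full assignment to a joint score
    """
    current, current_score = list(labels), score(labels)
    best, best_score = list(current), current_score
    t = t0
    for _ in range(steps):
        # Propose a local move: relabel one randomly chosen instance.
        i = random.randrange(len(current))
        proposal = list(current)
        proposal[i] = random.choice(candidates[i])
        delta = score(proposal) - current_score
        # Always accept improvements; accept worse moves with probability
        # exp(delta / t), which shrinks as the temperature cools.
        if delta >= 0 or random.random() < math.exp(delta / t):
            current, current_score = proposal, current_score + delta
            if current_score > best_score:
                best, best_score = list(current), current_score
        t *= cooling
    return best, best_score
```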

A French Corpus of Québec’s Parliamentary Debates

Parliamentary debates offer a window on political stances as well as a repository of linguistic and semantic knowledge; they provide insights into the reasoning behind laws and regulations that impact electors.

Midas Loop: A Prioritized Human-in-the-Loop Annotation for Large Scale Multilayer Data

A collaborative, version-controlled online annotation environment for multilayer corpus data is presented, with integrated provenance and confidence metadata for each piece of information at the document, sentence, token, and annotation levels.
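
A hypothetical sketch of what per-annotation provenance and confidence metadata can look like; the field names are illustrative, not Midas Loop's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative schema only: provenance ('source') and confidence are
# attached to each individual annotation, so a review queue can
# prioritize uncertain machine output (field names are hypothetical).
@dataclass
class Annotation:
    layer: str          # e.g. 'upos', 'deprel', 'entity'
    value: str
    source: str         # provenance: 'human' or a model/tool identifier
    confidence: float

@dataclass
class Token:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

tok = Token('Trankit', [Annotation('upos', 'PROPN', 'model:trankit', 0.62)])
# Surface the least trustworthy machine annotations for human review first.
needs_review = [a for a in tok.annotations
                if a.source != 'human' and a.confidence < 0.8]
print(needs_review)
```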

Everybody likes short sentences - A Data Analysis for the Text Complexity DE Challenge 2022

The modeling approach for this shared task utilizes off-the-shelf NLP tools for feature engineering and a Random Forest regression model, which identified text length as the most important feature.

The ParlaMint corpora of parliamentary proceedings

The ParlaMint corpora, containing transcriptions of the sessions of 17 European national parliaments totalling half a billion words, are presented; they are uniformly encoded, contain rich metadata, and are linguistically annotated following the Universal Dependencies formalism and with named entities.

References

Showing 10 of 43 references.

N-LTP: An Open-source Neural Chinese Language Technology Platform with Pretrained Models

N-LTP, an open-source Python toolkit for Chinese natural language processing, is introduced; it supports five basic tasks (Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and semantic dependency parsing) and is the first toolkit to support all fundamental Chinese NLP tasks.

Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge

A neural model named TwASP is proposed for joint CWS and POS tagging following the character-based sequence labeling paradigm, where a two-way attention mechanism incorporates both contextual features and their corresponding syntactic knowledge for each input character.

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.

BERT Meets Chinese Word Segmentation

Bidirectional Encoder Representations from Transformers (BERT), a new language representation model, has been proposed as a backbone model for many natural language tasks and has redefined the corresponding performance on the in-domain CWS task.

FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP

The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to “mix and match” various embeddings with little effort.
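
That "mix and match" interface can be shown in a few lines, based on FLAIR's documented API (pretrained models are downloaded on first use):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# Conceptually different embedding types share one embed() interface,
# so stacking them requires no embedding-specific engineering.
stacked = StackedEmbeddings([
    WordEmbeddings('glove'),            # classic static word embeddings
    FlairEmbeddings('news-forward'),    # contextual string embeddings
])

sentence = Sentence('FLAIR hides embedding-specific complexity.')
stacked.embed(sentence)
for token in sentence:
    print(token.text, token.embedding.shape)  # concatenated vector per token
```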

BERT Rediscovers the Classical NLP Pipeline

This work finds that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference.

75 Languages, 1 Model: Parsing Universal Dependencies Universally

It is found that fine-tuning a multilingual BERT self-attention model pretrained on 104 languages can meet or exceed state-of-the-art UPOS, UFeats, Lemmas, (and especially) UAS, and LAS scores, without requiring any recurrent or language-specific components.

Improving Cross-Lingual Transfer for Event Argument Extraction with Language-Universal Sentence Structures

This paper introduces two novel sources of language-independent information for cross-lingual event argument extraction (CEAE) models, based on semantic similarity and the universal dependency relations of word pairs in different languages, and proposes to use these two sources to produce shared sentence structures that bridge the gap between languages and improve the cross-lingual performance of CEAE models.

Cross-lingual Word Sense Disambiguation using mBERT Embeddings with Syntactic Dependencies

This project proposes concatenated embeddings, produced by generating dependency parse trees and encoding the relative relationships of words into the input embeddings, to investigate how syntactic information can be added to BERT embeddings to yield word embeddings that incorporate both semantics and syntax.

MorphoBERT: a Persian NER System with BERT and Morphological Analysis

This paper trains the BERT model on a large volume of Persian text to obtain a highly accurate representation of tokens, and applies a BiLSTM (bidirectional LSTM) over the vector representations to label tokens, informing the model with morphological information.