How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
- Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
- Linguistics, Computer Science · ACL
- 31 December 2020
Replacing the original multilingual tokenizer with a dedicated monolingual tokenizer is found to improve the downstream performance of the multilingual model on almost every task and language.
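As a rough illustration of the tokenizer-swapping setup this summary describes, the sketch below pairs a multilingual model with a monolingual tokenizer via Hugging Face Transformers. This is a minimal sketch, not the paper's code: the checkpoint names are illustrative assumptions, and the paper additionally retrains the model with the new tokenizer rather than just resizing embeddings.

```python
# Minimal sketch (assumption, not the paper's code): give a multilingual
# model a specialized monolingual tokenizer. Checkpoint names are examples.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
mono_tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

# The monolingual tokenizer's token ids no longer match mBERT's embedding
# rows, so the embedding matrix must be resized; the new rows are randomly
# initialized and have to be (re)trained before the model is usable.
model.resize_token_embeddings(len(mono_tokenizer))

batch = mono_tokenizer("Ein kurzer Beispielsatz.", return_tensors="pt")
outputs = model(**batch)
```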
Challenges and Strategies in Cross-Cultural NLP
Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to…
PuzzLing Machines: A Challenge on Learning From Small Data
This work introduces PuzzLing Machines, a challenge on learning from small data consisting of Rosetta Stone puzzles from Linguistic Olympiads for high school students, and shows that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on it.
Language Modelling with Pixels
- Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott
- Computer Science, Linguistics · ArXiv
- 14 July 2022
PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels; it is also more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
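The "text as images" input step is easy to picture with a toy rasterizer. The sketch below is a minimal illustration under assumptions: PIXEL's actual renderer, fonts, and patch layout differ and are not reproduced here.

```python
# Minimal sketch (assumption, not PIXEL's renderer): rasterize a string into
# a fixed-size grayscale array, the kind of pixel input such a model consumes.
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_text(text: str, height: int = 16, width: int = 256) -> np.ndarray:
    """Draw text in black on a white canvas; return a (height, width) array in [0, 1]."""
    img = Image.new("L", (width, height), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text, fill=0, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32) / 255.0

pixels = render_text("Language modelling with pixels")
print(pixels.shape)  # (16, 256)
```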