Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language
@article{Shmidman2022IntroducingBB,
  title   = {Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language},
  author  = {Avi Shmidman and Joshua Guedalia and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Eli Handel and Moshe Koppel},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2208.01875}
}
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel (BERT Embeddings for Rabbinic-Encoded Language). Whilst other PLMs exist for processing Hebrew texts (e.g., HeBERT, AlephBERT), they are all trained on Modern Hebrew texts, whose lexicographical, morphological, syntactic and orthographic norms diverge substantially from those of Rabbinic Hebrew. We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs. We…
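As a rough illustration of how a BERT-style PLM such as Berel is typically queried, the sketch below uses the HuggingFace fill-mask pipeline; the checkpoint name dicta-il/BEREL is an assumption about where the released model is hosted and should be replaced with the identifier given in the BEREL release, and the input sentence is a placeholder.

```python
# Minimal sketch (not from the paper): querying a BERT-style Rabbinic Hebrew
# model for a masked token. The checkpoint name below is an assumption.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dicta-il/BEREL")  # assumed checkpoint id

# Placeholder input: any Rabbinic Hebrew sentence with [MASK] in place of the
# homograph or omitted word to be predicted.
sentence = "... [MASK] ..."
for prediction in fill_mask(sentence)[:5]:
    # Each prediction carries the proposed token and its probability score.
    print(prediction["token_str"], round(prediction["score"], 3))
```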
One Citation
Style Classification of Rabbinic Literature for Detection of Lost Midrash Tanhuma Material
- Computer Science, NLP4DH
- 2022
This work proposes a system for classifying rabbinic literature by style, leveraging recently released pre-trained Transformer models for Hebrew, and demonstrates how the method can be applied to uncover lost material from the Midrash Tanhuma.
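A minimal sketch of the general approach described above, not the authors' actual pipeline: fine-tune a pre-trained Hebrew BERT as a sequence classifier over passages labelled by style. The checkpoint name, the two-label scheme, and the placeholder passages are illustrative assumptions.

```python
# Sketch: binary style classification with a pre-trained Hebrew BERT.
# Checkpoint name and labels are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dicta-il/BEREL"  # assumed checkpoint; any Hebrew BERT could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

passages = ["<passage 1>", "<passage 2>"]  # labelled training passages (placeholders)
labels = torch.tensor([0, 1])              # e.g. 0 = Tanhuma-like, 1 = other midrash

batch = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)    # forward pass returns the classification loss
outputs.loss.backward()                    # one illustrative training step
```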
References
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Computer Science, NAACL
- 2019
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
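The "one additional output layer" point can be made concrete with a short sketch: a task head is just a linear layer over the pre-trained encoder's pooled representation. The class name and the bert-base-uncased checkpoint are stand-ins, not anything prescribed by the paper.

```python
# Sketch: a downstream classifier = pre-trained BERT encoder + one linear head.
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels: int, checkpoint: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(checkpoint)                 # pre-trained weights
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)   # the single new layer

    def forward(self, input_ids, attention_mask=None):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.head(pooled)  # task logits
```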
AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With
- Computer Science, ArXiv
- 2021
AlephBERT is presented, a large pre-trained language model for Modern Hebrew, trained with a larger vocabulary and on a larger dataset than any previous Hebrew PLM, and made publicly available, providing a single point of entry for the development of Hebrew NLP applications.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Computer Science, ACL
- 2020
BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
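For context, the sketch below loads a publicly released BART summarization checkpoint through the HuggingFace pipeline API to show the encoder-decoder interface in practice; the input string is a placeholder, and the checkpoint choice is simply one well-known public fine-tune.

```python
# Sketch: BART as a pre-trained sequence-to-sequence model, used for summarization.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer("<long document text>", max_length=60, min_length=20)  # placeholder input
print(result[0]["summary_text"])
```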
What’s Wrong with Hebrew NLP? And How to Make it Right
- Computer Science, EMNLP
- 2019
The design and use of the ONLP suite, a joint morpho-syntactic infrastructure for processing Modern Hebrew texts, is described; it provides rich and expressive annotations that already serve diverse academic and commercial needs.
HuggingFace's Transformers: State-of-the-art Natural Language Processing
- Computer Science, ArXiv
- 2019
The Transformers library is an open-source library consisting of carefully engineered, state-of-the-art Transformer architectures under a unified API, together with a curated collection of pretrained models made by and available for the community.
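The "unified API" point in practice: the same from_pretrained calls load different architectures, with only the checkpoint name changing. A small sketch, using two published English checkpoints purely as examples.

```python
# Sketch: one loading interface across architectures in the Transformers library.
from transformers import AutoModel, AutoTokenizer

for checkpoint in ["bert-base-uncased", "roberta-base"]:  # published checkpoints
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    print(checkpoint, model.config.hidden_size)  # same API, different underlying model
```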
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Computer Science, J. Mach. Learn. Res.
- 2020
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
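A brief sketch of the text-to-text framing studied in that work: every task is expressed as string-in/string-out, selected by a task prefix. The t5-small checkpoint is used here only because it is the smallest published one.

```python
# Sketch: text-to-text transfer with a task prefix selecting the task.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The book is on the table.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```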
HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition
- Computer Science, INFORMS Journal on Data Science
- 2022
HeBERT and HebEMO are introduced: a transformer-based model for Modern Hebrew text that relies on a BERT (bidirectional encoder representations from transformers) architecture, and a tool that uses HeBERT to detect polarity and extract emotions from Hebrew user-generated content (UGC).
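A minimal sketch of polarity analysis with a HeBERT-based classifier; the checkpoint name avichr/heBERT_sentiment_analysis is an assumption based on the public release and should be verified against the HeBERT repository, and the input sentence is a placeholder.

```python
# Sketch: polarity (sentiment) classification with a HeBERT-based checkpoint.
# The model identifier is assumed, not taken from this page.
from transformers import pipeline

sentiment = pipeline("text-classification", model="avichr/heBERT_sentiment_analysis")
print(sentiment("<Modern Hebrew sentence>"))  # returns a label and score per input
```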
RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Computer Science, ArXiv
- 2019
It is found that BERT was significantly undertrained and, when trained properly, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Morphological Processing of Semitic Languages
- Linguistics, NLP of Semitic Languages, pages 43–66, Springer Berlin Heidelberg, Berlin, Heidelberg
- 2014
This chapter begins with a recapitulation of the challenges these phenomena pose for computational applications and discusses the approaches suggested in the past to cope with them.