• Corpus ID: 248069654

Homepage2Vec: Language-Agnostic Website Embedding and Classification

Sylvain Lugeon, Tiziano Piccardi, Robert West
Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than two million category-labeled websites in 92 languages collected from Curlie, the largest multilingual human-edited Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding… 

Ensemble approach for web page classification
An ensemble approach to web page classification is proposed: contextual representations are learned with pre-trained bidirectional BERT, and deep Inception modelling with residual connections is then applied to fine-tune on the target task by exploiting parallel multi-scale semantics.
Language-agnostic BERT Sentence Embedding
Introducing a pre-trained multilingual language model is shown to cut the amount of parallel training data required for good performance by 80%, and a model achieving 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba is released.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Exploiting link structure for web page genre identification
This study proposes a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links, and introduces a graph-based model called GenreSim, which selects an appropriate set of neighboring pages.
Website Classification Using Word Based Multiple N-Gram Models and Random Search Oriented Feature Parameters
This work introduces word-based multiple n-gram models for efficient feature extraction, together with a multinomial Naive Bayes classifier tuned by a Random Search hyperparameter-optimization pipeline that finds the best parameters for the URL features.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
A Heuristic Approach for Website Classification with Mixed Feature Extractors
  • Muyang Du, Yanni Han, Li Zhao
  • Computer Science
    2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)
  • 2018
Results show that the proposed classification scheme outperforms several widely used machine learning models on precision, recall, F1, and accuracy.
Enhanced hypertext categorization using hyperlinks
This work has developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained and its technique also adapts gracefully to the fraction of neighboring documents having known topics.
Fast webpage classification using URL features
This work demonstrates the usefulness of the uniform resource locator (URL) alone in performing web page classification and shows that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is introduced. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.