Pre-trained Language Model for Web-scale Retrieval in Baidu Search

  • Yiding Liu, Guan Huang, Jiaxiang Liu, Weixue Lu, Suqi Cheng, Yukun Li, Daiting Shi, Shuaiqiang Wang, Zhicong Cheng, Dawei Yin
  • Published 7 June 2021
  • Computer Science
  • Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically related candidates in the retrieval stage is a promising way to expose more high-quality results to end users. However, building and deploying effective retrieval models for semantic matching in a real search engine remains a non-trivial challenge. In this paper, we describe the retrieval system that we developed and deployed…

Intent-based Product Collections for E-commerce using Pretrained Language Models
  • Hiun Kim, Jisu Jeong, +7 authors Rak Yeong Kim
  • Computer Science
  • ArXiv
  • 2021
A pretrained language model (PLM) that leverages textual attributes of web-scale products is used to build intent-based product collections; it significantly outperforms the search-based baseline model for intent-based product matching in offline evaluations.
Pre-trained Language Model based Ranking in Baidu Search
A novel practice that cost-efficiently summarizes the web document and contextualizes the resulting summary with the query using a cheap yet powerful Pyramid-ERNIE architecture, combined with a human-anchored fine-tuning strategy that is tailored to the online ranking system and aims to stabilize the ranking signals across various online components.


Pre-training Tasks for Embedding-based Large-scale Retrieval
It is shown that the key ingredient in learning a strong embedding-based Transformer model is the set of pre-training tasks: with adequately designed paragraph-level pre-training tasks, Transformer models can remarkably improve over the widely used BM25 as well as over embedding models without Transformers.
Learning deep structured semantic models for web search using clickthrough data
A series of new latent semantic models with a deep structure is developed; the models project queries and documents into a common low-dimensional space, where the relevance of a document to a query is readily computed as the distance between them.
A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval
A new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents is proposed.
A Deep Architecture for Matching Short Texts
This paper proposes a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains and applies this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question.
A Deep Relevance Matching Model for Ad-hoc Retrieval
A novel deep relevance matching model (DRMM) for ad-hoc retrieval that employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.
Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning
It is shown that the DPSR model outperforms existing models, and that the DPSR system can retrieve more personalized and semantically relevant items, significantly improving users' search experience with a +1.29% conversion rate.
Semantic Modelling with Long-Short-Term Memory for Information Retrieval
Experimental evaluation on an IR task derived from Bing web search demonstrates the ability of the proposed method to address both lexical mismatch and long-term context modelling issues, thereby significantly outperforming existing state-of-the-art methods on the web document retrieval task.
Learning semantic representations using convolutional neural networks for web search
This paper presents a series of new latent semantic models based on a convolutional neural network that learn low-dimensional semantic vectors for search queries and Web documents and significantly outperform other semantic models in retrieval performance.
Clickthrough-based latent semantic models for web search
Two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR) are presented.
Learning to Match using Local and Distributed Representations of Text for Web Search
This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches them using learned distributed representations. Matching with distributed representations complements matching with traditional local representations.