Pre-trained Language Model for Web-scale Retrieval in Baidu Search

@inproceedings{Liu2021PretrainedLM,
  title={Pre-trained Language Model for Web-scale Retrieval in Baidu Search},
  author={Yiding Liu and Guan Huang and Jiaxiang Liu and Weixue Lu and Suqi Cheng and Yukun Li and Daiting Shi and Shuaiqiang Wang and Zhicong Cheng and Dawei Yin},
  booktitle={Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \& Data Mining},
  year={2021}
}
  • Published 7 June 2021
Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically related candidates in the retrieval stage is a promising way to expose more high-quality results to end users. However, building and deploying effective retrieval models for semantic matching in a real search engine remains a non-trivial challenge. In this paper, we describe the retrieval system that we developed and deployed… 
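The embedding-based semantic retrieval the abstract describes can be illustrated with a minimal dual-encoder sketch. All names and vectors below are hypothetical; a production system would use a trained PLM encoder and approximate nearest-neighbor search over a billion-scale index rather than an exhaustive scan.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, doc_vecs, k=2):
    # Score every document embedding against the query embedding
    # and keep the top-k candidates for the downstream ranking stage.
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy document embeddings standing in for PLM-encoded web pages.
docs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.0],
    "d3": [0.7, 0.6, 0.1],
}
print(retrieve([1.0, 0.0, 0.0], docs))  # ['d1', 'd3']
```

The brute-force scan is only for illustration: at web scale the sorted pass is replaced by an ANN index, but the scoring function (similarity between query and document embeddings in a shared space) is the same.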


Intent-based Product Collections for E-commerce using Pretrained Language Models
  • Hiun Kim, Jisu Jeong, +7 authors Rak Yeong Kim
  • Computer Science
  • 2021 International Conference on Data Mining Workshops (ICDMW)
  • 2021
TLDR
Online experimental results on the e-commerce platform show that the PLM-based method can construct collections of products with increased CTR, CVR, and order-diversity compared to expert-crafted collections.
Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval
  • Shitao Xiao, Zheng Liu, +9 authors Qi Zhang
  • Computer Science
  • 2022
TLDR
This work addresses the problem of massive-scale embedding-based retrieval with Bi-Granular Document Representation, where lightweight sparse embeddings are indexed and kept in memory for coarse-grained candidate search, and heavyweight dense embeddings are hosted on disk for fine-grained post verification.
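The bi-granular idea, cheap embeddings in memory for coarse candidate search and full dense embeddings for fine-grained verification, can be sketched with sign quantization as the lightweight representation. Sign quantization is an illustrative choice here, not necessarily the paper's exact scheme, and all data is hypothetical.

```python
import math

def quantize(vec):
    # Lightweight representation: keep only the sign of each dimension.
    return [1.0 if x >= 0 else -1.0 for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def bi_granular_search(query, dense_store, n_coarse=2, k=1):
    # Stage 1: coarse candidate search over cheap sign-quantized
    # embeddings (the kind that fit in memory at corpus scale).
    q_sign = quantize(query)
    coarse = sorted(dense_store,
                    key=lambda d: dot(q_sign, quantize(dense_store[d])),
                    reverse=True)[:n_coarse]
    # Stage 2: fine-grained verification with the full dense embeddings
    # (in the paper's setting these live on disk, fetched per candidate).
    return sorted(coarse,
                  key=lambda d: cosine(query, dense_store[d]),
                  reverse=True)[:k]

# Toy dense embeddings standing in for the on-disk store.
docs = {
    "a": [0.8, -0.3, 0.1],
    "b": [-0.5, 0.9, 0.2],
    "c": [0.6, -0.1, -0.7],
}
print(bi_granular_search([0.9, -0.2, 0.1], docs))  # ['a']
```

The point of the two stages is that the expensive dense similarity is computed only for the `n_coarse` survivors of the cheap in-memory pass, not for the whole corpus.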
Pre-trained Language Model based Ranking in Baidu Search
TLDR
A novel practice to cost-efficiently summarize the web document and contextualize the resultant summary content with the query using a cheap yet powerful Pyramid-ERNIE architecture and a human-anchored fine-tuning strategy tailored for the online ranking system, aiming to stabilize the ranking signals across various online components.

References

SHOWING 1-10 OF 77 REFERENCES
Pre-training Tasks for Embedding-based Large-scale Retrieval
TLDR
It is shown that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks, and that with adequately designed paragraph-level pre-training tasks, Transformer models can remarkably improve over the widely-used BM25 as well as embedding models without Transformers.
Learning deep structured semantic models for web search using clickthrough data
TLDR
A series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them are developed.
A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval
TLDR
A new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents is proposed.
A Deep Architecture for Matching Short Texts
TLDR
This paper proposes a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains and applies this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question.
A Deep Relevance Matching Model for Ad-hoc Retrieval
TLDR
A novel deep relevance matching model (DRMM) for ad-hoc retrieval that employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.
Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning
TLDR
It is shown that DPSR model outperforms existing models, and DPSR system can retrieve more personalized and semantically relevant items to significantly improve users' search experience by +1.29% conversion rate.
Semantic Modelling with Long-Short-Term Memory for Information Retrieval
TLDR
Experimental evaluation on an IR task derived from Bing web search demonstrates the ability of the proposed method to address both lexical mismatch and long-term context modelling issues, thereby significantly outperforming existing state-of-the-art methods for the web document retrieval task.
Learning semantic representations using convolutional neural networks for web search
TLDR
This paper presents a series of new latent semantic models based on a convolutional neural network to learn low-dimensional semantic vectors for search queries and Web documents that significantly outperforms other semantic models in retrieval performance.
Clickthrough-based latent semantic models for web search
TLDR
Two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR) are presented.
Learning to Match using Local and Distributed Representations of Text for Web Search
TLDR
This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation and another that matches them using learned distributed representations; matching with distributed representations complements matching with traditional local representations.