• Corpus ID: 219966011

Improving Query Safety at Pinterest

Authors: A. Mahabal, Yinrui Li, Rajat Raina, Daniel W. Sun, Revati Mahajan, Jure Leskovec
Query recommendation in search engines is a double-edged sword, with undeniable benefits but also the potential for harm. Identifying unsafe queries is necessary to protect users from inappropriate query suggestions. However, identifying them is non-trivial, both because of the linguistic diversity that results from large vocabularies, social-group-specific slang, and typos, and because the inappropriateness of a term depends on its context. Here we formulate the problem as query-set expansion, where we are… 

Producing Usable Taxonomies Cheaply and Rapidly at Pinterest Using Discovered Dynamic μ-Topics

Pincepts cover all areas of user interest and automatically adjust to the specificity of those interests. They are thus suitable for creating various kinds of taxonomies, allowing curators’ domain knowledge to be heavily leveraged and allowing further cost reduction.



Deep learning for detecting inappropriate content in text

  • Computer Science
    International Journal of Data Science and Analytics
  • 2017
A novel deep learning architecture called "Convolutional Bi-Directional LSTM (C-BiLSTM)" is proposed, which combines the strengths of both Convolutional Neural Networks (CNN) and Bi-directional LSTMs (BLSTM), and the results reveal a significant improvement over both pattern-based and other hand-crafted feature-based baselines.

Language-Independent Set Expansion of Named Entities Using the Web

This paper proposes a novel method for expanding sets of named entities that can be applied to semi-structured documents written in any markup language and in any human language and shows that this system is superior to Google Sets in terms of mean average precision.

SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble

This work focuses on corpus-based set expansion, which is a critical task in knowledge discovery and may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search.

Ex Machina: Personal Attacks Seen at Scale

A method that combines crowdsourcing and machine learning to analyze personal attacks at scale is developed and illustrated, and an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate is shown.

Semi-supervised learning with graphs

A series of novel semi-supervised learning approaches is presented, all arising from a graph representation in which labeled and unlabeled instances are represented as vertices and edges encode the similarity between instances.
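
To make the graph representation concrete, here is a minimal Python sketch of one generic technique in this family: iterative label propagation with clamped seeds. It is an illustration only, not the method of the listed work or of the Pinterest paper. The function name `propagate_labels`, the toy graph, and the seed scores are all hypothetical.

```python
def propagate_labels(graph, seeds, iterations=50):
    """Spread seed scores over a weighted, undirected graph.

    graph: dict mapping node -> {neighbor: edge_weight}; every neighbor
           must also appear as a key (assumed symmetric adjacency).
    seeds: dict mapping labeled node -> score in [0, 1].
    """
    # Unlabeled nodes start at 0.0; labeled nodes start at their seed score.
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                # Clamp labeled nodes so their known score never drifts.
                updated[node] = seeds[node]
                continue
            total_weight = sum(neighbors.values())
            if total_weight == 0:
                updated[node] = 0.0
                continue
            # Each unlabeled node takes the weighted mean of its neighbors.
            updated[node] = sum(
                weight * scores[neighbor] for neighbor, weight in neighbors.items()
            ) / total_weight
        scores = updated
    return scores


# Hypothetical chain a - b - c - d with unit edge weights.
toy_graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
toy_seeds = {"a": 1.0, "d": 0.0}  # "a" labeled positive, "d" labeled negative
propagated = propagate_labels(toy_graph, toy_seeds)
```

On this chain the unlabeled scores converge towards 2/3 for `b` and 1/3 for `c`, so nodes closer to the positive seed end up with higher scores.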

Detecting offensive tweets via topical feature discovery over a large scale twitter corpus

In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via

Transductive Label Augmentation for Improved Deep Network Learning

This paper starts from a small, curated labeled dataset and lets the labels propagate through a larger set of unlabeled data using graph transduction techniques. It shows that, by using known game-theoretic transductive processes, the authors can create larger and sufficiently accurate labeled datasets whose use results in better-trained neural networks.

Handling the Impact of Low Frequency Events on Co-occurrence based Measures of Word Similarity - A Case Study of Pointwise Mutual Information

This work proposes formulae and indicators that describe the behavior of variants of Pointwise Mutual Information in a precise way so that researchers and practitioners can make a more informed decision as to which measure to use in different scenarios.
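
For reference, standard Pointwise Mutual Information is PMI(a, b) = log2( P(a, b) / (P(a) P(b)) ). The Python sketch below estimates it from document-level co-occurrence counts and shows the well-known sensitivity to low-frequency events: a pair seen together only once can outscore a frequent pair. It covers plain PMI only, not the variants or indicators proposed in the listed work. The function name `pmi_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return PMI (base 2) for every word pair that co-occurs in a document."""
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc))  # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # pairs in sorted order
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy corpus: "cat"/"dog" co-occur often, "odd"/"rare" only once.
toy_docs = [
    ["cat", "dog"],
    ["cat", "dog"],
    ["cat", "fish"],
    ["rare", "odd"],
]
pmi = pmi_scores(toy_docs)
frequent_pair = pmi[("cat", "dog")]  # log2(4/3), about 0.415
rare_pair = pmi[("odd", "rare")]     # log2(4) = 2.0
```

The single co-occurrence of `odd` and `rare` yields a higher score than the repeated `cat` and `dog` pair, which is the kind of low-frequency effect that motivates corrected PMI variants.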

Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network

This paper introduces a new method based on a deep neural network combining convolutional and gated recurrent networks, which is able to capture both word sequence and order information in short texts, and sets a new benchmark by outperforming on 6 out of 7 datasets by between 1 and 13% in F1.

SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text

A dataset of 3144 tweets, selected based on the presence of colloquial slang related to smoking, is presented and analyzed. It paves the way for further analysis, such as understanding sentiment or style, which makes this dataset a vital contribution to both disease surveillance and tobacco use research.