Modern Information Retrieval - the concepts and technology behind search, Second edition
TLDR
This paper presents a meta-modelling architecture for search that automates the labor-intensive, and therefore time-consuming and expensive, process of manually cataloging and querying documents.
A new approach to text searching
We introduce a family of simple and fast algorithms for solving the classical string matching problem, string matching with don't care symbols and complement symbols, and multiple patterns.
Information Retrieval: Data Structures and Algorithms
TLDR
For programmers and students interested in parsing text and automated indexing, it's the first collection in book form of the basic data structures and algorithms that are critical to the storage and retrieval of documents.
FA*IR: A Fair Top-k Ranking Algorithm
TLDR
This work defines and solves the Fair Top-k Ranking problem, and presents an efficient algorithm, which is the first algorithm grounded in statistical tests that can mitigate biases in the representation of an under-represented group along a ranked list.
Design and Implementation of Relevance Assessments Using Crowdsourcing
TLDR
This work explores the design and execution of relevance judgments using Amazon Mechanical Turk as the crowdsourcing platform, introducing a methodology for crowdsourcing relevance assessments and the results of a series of experiments using TREC 8 with a fixed budget.
Predicting The Next App That You Are Going To Use
TLDR
This paper models the prediction of the next app as a classification problem and proposes an effective personalized method to solve it that takes full advantage of human-engineered features and automatically derived features.
Link analysis for Web spam detection
TLDR
After tenfold cross-validation, the best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
Improved query difficulty prediction for the web
TLDR
Improved Clarity is introduced, and it is demonstrated that it outperforms state-of-the-art predictors on three standard collections, including two large Web collections.
Using rank propagation and Probabilistic counting for Link-Based Spam Detection
TLDR
This paper proposes spam detection techniques that consider only the link structure of the Web, regardless of page contents, and compute statistics of the links in the vicinity of every Web page by applying rank propagation and probabilistic counting over the Web graph.
Searching the Future
TLDR
A new retrieval problem, future retrieval, is defined, which involves using news information to obtain possible future events and then searching for events related to the authors' current (or future) information needs, and includes time as a formal attribute of a document.