A survey on session detection methods in query logs and a proposal for future evaluation

Abstract

Search engine logs provide highly detailed insight into users' interactions. Hence, they are both extremely useful and sensitive. The datasets publicly available to scholars are, unfortunately, too few, too dated and too small. They are few because search engine companies are reluctant to release such data; they are dated because they were collected in the late 1990s or early 2000s; and they are small because they comprise data for at most one day and just a few hundred thousand users. Even worse, the large query log disclosed by AOL in 2006 caused more harm than good because of a major privacy flaw. In this paper the author provides an overview of the possible applications of query logs, the privacy concerns researchers must face when working on such datasets, and several ways in which query logs can be easily sanitized. One such measure consists of segmenting the logs into short topical sessions. Therefore, the author offers a comprehensive survey of session detection methods, as well as a thorough description of a new evaluation framework with performance results for each of the different methods. Additionally, a new, simple, yet outperforming session detection method is proposed. It is a heuristic technique that relies on a geometric interpretation of both the time gap between queries and the similarity between them in order to flag a topic shift.

Web search companies keep log files detailing users' interactions with the search engine. The information typically recorded in these query logs includes a unique identifier for the user or the session, the query string, a timestamp and, occasionally, the results page number and the URLs clicked (if any) for each query. The analysis of such logs can provide insight into searching behavior on the Web, which is of interest to search engine companies and also reveals the distinct features that differentiate Web information retrieval from classical IR.
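The idea of combining the time gap and the similarity between consecutive queries to flag a topic shift can be illustrated with a minimal sketch. The text above does not give the formulas, so everything below is an assumption made for illustration: the `LogEntry` fields, the character-trigram Jaccard similarity, the 24-hour horizon and the unit-circle decision rule are hypothetical choices, not necessarily the author's method.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot


@dataclass
class LogEntry:
    """One query-log record (hypothetical minimal schema)."""
    user_id: str
    query: str
    timestamp: datetime


def char_ngrams(text, n=3):
    """Set of character n-grams of a whitespace-normalised, lowercased string."""
    text = " ".join(text.lower().split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def lexical_similarity(a, b):
    """Jaccard overlap of character trigrams, in [0, 1] (assumed measure)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def temporal_closeness(gap, max_gap=timedelta(hours=24)):
    """1.0 for simultaneous queries, falling linearly to 0.0 at max_gap (assumed)."""
    return max(0.0, 1.0 - gap / max_gap)


def same_session(prev, curr):
    """Treat (closeness, similarity) as a point in the unit square and
    continue the session when it lies on or outside the unit circle."""
    t = temporal_closeness(curr.timestamp - prev.timestamp)
    s = lexical_similarity(prev.query, curr.query)
    return hypot(t, s) >= 1.0


def split_sessions(entries):
    """Group entries into per-user sessions, ordered by user and time."""
    sessions = []
    for entry in sorted(entries, key=lambda e: (e.user_id, e.timestamp)):
        last = sessions[-1][-1] if sessions else None
        if last is not None and last.user_id == entry.user_id and same_session(last, entry):
            sessions[-1].append(entry)
        else:
            sessions.append([entry])
    return sessions
```

Under these assumptions, two near-identical queries issued minutes apart stay in one session, while an unrelated query issued hours later opens a new one. A purely temporal cutoff would ignore the query text entirely; this sketch lets either signal compensate for the other.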
The first in-depth studies on query logs date back to the late 1990s (e.g. [27,33,68,69]). Such studies provided important details about Web searchers' behavior (e.g. query length, number of visited results, etc.). Nevertheless, query logs can not only be analyzed to understand users' activities but also mined to develop novel search-related applications such as query suggestion (e.g. [6,75,77]) or re-ranking of search results (e.g. [36,37]), among others. Nonetheless, the resources available to scholars working outside search engine …

DOI: 10.1016/j.ins.2009.01.026

Information Sciences

  • D Gayo-Avello
  • 2009
