Clustering and the Weekend Effect: Recommendations for the Use of Top Domain Lists in Security Research

Walter Rweyemamu, Tobias Lauinger, Christo Wilson, William K. Robertson, Engin Kirda
Passive and Active Network Measurement Conference

Top domain rankings (e.g., Alexa) are commonly used in security research, such as to survey security features or vulnerabilities of “relevant” websites. Due to their central role in selecting a sample of sites to study, an inappropriate choice or use of such domain rankings can introduce unwanted biases into research results. We quantify various characteristics of three top domain lists that have not been reported before. For example, the weekend effect in Alexa and Umbrella causes these… 

Getting Under Alexa's Umbrella: Infiltration Attacks Against Internet Top Domain Lists

It is demonstrated that it is feasible to infiltrate two domain rankings with very little effort, and it is suggested that researchers should refrain from using these domain rankings to model benign behaviour.

Evaluating the Long-term Effects of Parameters on the Characteristics of the Tranco Top Sites Ranking

The long-term properties of the Tranco ranking are analyzed, examining whether it contains a balanced set of domains and how Tranco's default parameters create a stable, robust, and comprehensive ranking.

Building an Open, Robust, and Stable Voting-Based Domain Top List

This paper systematically explores the construction of a domain top list from scratch using an extensive passive DNS dataset, and produces a voting-based domain ranking method which achieves better stability and manipulation resistance than existing top lists, while serving as an open and transparent ranking method that other researchers can use or adapt.

Toppling top lists: evaluating the accuracy of popular website lists

It is shown that most lists capture web popularity poorly, with the exception of the Chrome User Experience Report (CrUX) dataset, which is the most accurate top list compared to Cloudflare across all metrics.

Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services

This study empirically explores popular domain classification services, their methodologies, scalability limitations, label constellations, and their suitability to academic research as well as other practical applications such as content filtering, and concludes with actionable recommendations on their usage.

Prefix Top Lists: Gaining Insights with Prefixes from Domain-based Top Lists on DNS Deployment

It is shown that popular domains adhere to name server recommendations for IPv4, but IPv6 compliance is still lacking, and the concept of prefix top lists is presented, which ameliorate some of the shortcomings, while providing insights into the importance of addresses of domain-based top lists.

Analyzing the Web: Are Top Websites Lists a Good Choice for Research?

It is shown that top sites lists miss frequently visited websites and offer only little value for language-specific research, so a heuristic-driven alternative based on the Common Crawl host-level web graph is presented while also taking language-specific requirements into account.

Out of Sight, Out of Mind: Detecting Orphaned Web Pages at Internet-Scale

Overall, there is a clear hierarchy: Orphaned pages are the most vulnerable, followed by maintained pages on websites with orphans, with fully maintained sites being least vulnerable.

The web is still small after more than a decade

An empirical study to revisit web co-location using datasets collected from active DNS measurements shows that the web is still small and centralized to a handful of hosting providers, and analyses of popular block lists indicate that IP-based blocking does not cause severe collateral damage as previously thought.

Assessing the Privacy Benefits of Domain Name Encryption

This paper assesses the privacy benefits of DNS over HTTPS/TLS and Encrypted SNI by considering the relationship between hostnames and IP addresses and quantifies the privacy gain offered by ESNI using two different metrics, the k-anonymity degree due to co-hosting and the dynamics of IP address changes.



A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists

It is found that top lists generally overestimate results compared to the general population by a significant margin, often even an order of magnitude, and some top lists have surprising change characteristics, causing high day-to-day fluctuation and leading to result instability.

Rigging Research Results by Manipulating Top Websites Rankings

This work uncovers how both inherent properties of these rankings and their vulnerabilities to adversarial manipulation may affect the conclusions of security studies.

Structure and Stability of Internet Top Lists

The aptness of frequently used top lists for empirical Internet scans, including the stability, correlation, and potential biases of such lists, is investigated.

Taster's choice: a comparative analysis of spam feeds

This paper compares the contents of ten distinct contemporaneous feeds of spam-advertised domain names to document significant variations based on how such feeds are collected and show how these variations can produce differences in findings as a result.

EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis

This paper introduces EXPOSURE, a system that employs large-scale, passive DNS analysis techniques to detect domains that are involved in malicious activity, and uses 15 features that it extracts from the DNS traffic that allow it to characterize different properties of DNS names and the ways that they are queried.

Exposure: A Passive DNS Analysis Service to Detect and Report Malicious Domains

The Exposure system, designed to detect malicious domains in real time by applying 15 unique features grouped in four categories, is presented, and the results and lessons learned from 17 months of its operation are described.

Knowing your enemy: understanding and detecting malicious web advertising

A large-scale study analyzing ad-related Web traces crawled over a three-month period reveals the rampancy of malvertising: hundreds of top-ranking Web sites fell victim and leading ad networks such as DoubleClick were infiltrated.

Measuring HTTPS Adoption on the Web

This work gathers metrics to benchmark the status and progress of HTTPS adoption on the Web in 2017, and surveys server support for HTTPS among top and long-tail websites to gain insight into the current state of the HTTPS ecosystem.

Analysis of the HTTPS certificate ecosystem

A large-scale measurement study of the HTTPS certificate ecosystem (the public-key infrastructure that underlies nearly all secure web communications) is reported, uncovering practices that may put the security of the ecosystem at risk and identifying frequent configuration problems that lead to user-facing errors and potential vulnerabilities.

WarningBird: Detecting Suspicious URLs in Twitter Stream

This paper proposes WARNINGBIRD, a suspicious URL detection system for Twitter that considers correlated redirect chains of URLs in a number of tweets and trains a statistical classifier with features derived from correlated URLs and tweet context information.