Cobwebs from the Past and Present: Extracting Large Social Networks using Internet Archive Data

  • Miroslav Shaltev, Jan-Hendrik Zab, Philipp Kemkes, Stefan Siersdorfer, Sergej Zerr
  • Published 7 July 2016
  • Computer Science
  • Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
Social graph construction from various sources has been of interest to researchers due to its application potential and the broad range of technical challenges involved. The World Wide Web provides a huge amount of continuously updated data and information on a wide range of topics created by a variety of content providers, and makes the study of extracted people networks and their temporal evolution valuable for social as well as computer scientists. In this paper we present SocGraph - an… 

Concepts and tools for the effective and efficient use of web archives

This work presents a retrospective analysis of crawl metadata on the size, age and growth of a Web dataset, and proposes a programming framework for efficiently processing archival collections.

A Holistic View on Web Archives

Only all three views together provide the holistic view that is required to effectively work with web archives, which considers websites, pages or extracted facts as nodes in a graph.

Accessing web archives from different perspectives with potential synergies

A generic analysis schema is proposed that outlines a systematic way to study Web archives by approaching them from different zoom levels corresponding to the three presented views: user-, data- and graph-centric.



Who With Whom And How?: Extracting Large Social Networks Using Search Engines

Novel methodologies for query-based search engine mining for efficient extraction of social networks from large amounts of Web data are introduced, using patterns in phrase queries for retrieving entity connections, and employing a bootstrapping approach for iteratively expanding the pattern set.

Superficial Method for Extracting Social Network for Academics Using Web Snippets

This paper demonstrates the possibility of exploiting features in Web snippets returned by search engines for disambiguating entities and building relations among entities during the process of extracting social networks.

A system to extract social networks based on the processing of information obtained from Internet

An automatic system for extracting social networks is presented: software designed to generate social networks by exploiting information already available on the Internet through common search engines such as Google or Yahoo.

POLYPHONET: an advanced social network extraction system from the web

A social network extraction system called POLYPHONET is proposed, which employs several advanced techniques to extract relations of persons, detect groups of people, and obtain keywords for a person using Google.

Mining email social networks

This paper begins with a discussion of the infrastructure (including a novel use of Scientific Workflow software) and then discusses the approach to mining the email archives, and presents some preliminary results from the data analysis.

Building the Social Graph of the History of European Integration - A Pipeline for Humanist-Machine Interaction in the Digital Humanities

The approach taken by the History of Europe application is discussed, a demonstrator for the integration of human and machine computation that combines the power of face recognition technology with two distinctively different crowd-sourcing approaches to compute co-occurrences of persons in historical image sets.

Extracting Social Networks from Literary Fiction

The method involves character name chunking, quoted speech attribution and conversation detection given the set of quotes, which provides evidence that the majority of novels in this time period do not fit two characterizations provided by literary scholars.

Exploiting Web querying for Web People Search in WePS2

The experience of applying the WePS approaches developed in [20] in the context of the WePS-2 Clustering Task is described, which is based on extracting named entities from the web pages and then querying the web to collect co-occurrence statistics, which are used as additional similarity measures.

Spark: Cluster Computing with Working Sets

Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

The Stanford CoreNLP Natural Language Processing Toolkit

The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis, and it is suggested that its wide use follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.