Literally better: Analyzing and improving the quality of literals

@article{Beek2018LiterallyBA,
  title={Literally better: Analyzing and improving the quality of literals},
  author={Wouter Beek and Filip Ilievski and Jeremy Debattista and Stefan Schlobach and Jan Wielemaker},
  journal={Semantic Web},
  year={2018},
  volume={9},
  pages={131-150}
}
Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that… 
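To make the notion of literal quality concrete, here is a minimal Python sketch, not the paper's toolchain, that checks whether a literal's lexical form is valid for its declared XSD datatype; the datatype coverage is deliberately tiny and the xsd:-prefixed names are just illustrative strings.

```python
# Minimal illustration of datatype-level literal checking; not the
# toolchain the paper presents. Covers only three XSD datatypes.
from datetime import date

def is_valid_literal(lexical: str, datatype: str) -> bool:
    """Return True if the lexical form parses under the given datatype."""
    try:
        if datatype == "xsd:integer":
            int(lexical)
        elif datatype == "xsd:boolean":
            # XSD boolean lexical space: true, false, 1, 0
            return lexical in ("true", "false", "1", "0")
        elif datatype == "xsd:date":
            date.fromisoformat(lexical)  # accepts YYYY-MM-DD
        else:
            return True  # unknown datatype: no check applied
        return True
    except ValueError:
        return False

assert is_valid_literal("42", "xsd:integer")
assert not is_valid_literal("2018-13-40", "xsd:date")  # month 13 is invalid
```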

Citations

A Scalable Framework for Quality Assessment of RDF Datasets
TLDR
This paper presents DistQualityAssessment – an open source implementation of quality assessment of large RDF datasets that can scale out to a cluster of machines and is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark.
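As a hedged sketch of the general pattern (distributing a per-triple check over a cluster), the PySpark fragment below computes one trivial statistic, the share of triples whose object is a literal, over an N-Triples dump; it is not the DistQualityAssessment implementation, and the file name and crude line splitting are assumptions.

```python
# Sketch of a distributed per-triple metric with PySpark; not the
# DistQualityAssessment code. Assumes an N-Triples dump, one triple
# per line, where literal objects start with a double quote.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality-sketch").getOrCreate()
lines = spark.sparkContext.textFile("dump.nt")  # hypothetical path

def object_term(line: str) -> str:
    # Crude N-Triples split: subject, predicate, rest (object + " .").
    return line.split(None, 2)[2]

triples = lines.filter(lambda l: l.strip() and not l.startswith("#"))
literal_count = triples.filter(lambda l: object_term(l).startswith('"')).count()
print(f"share of literal objects: {literal_count / triples.count():.3f}")
```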
Efficient Distributed In-Memory Processing of RDF Datasets
TLDR
A novel approach for statistical calculations over large RDF datasets is described; it scales out to clusters of machines and is the first distributed in-memory approach to computing 32 different statistical criteria for RDF datasets using Apache Spark.
LOD-a-lot: A Single-File Enabler for Data Science
TLDR
There exists a wide collection of Data Science use cases that can be performed over such a LOD-a-lot file, which significantly reduces the cost and complexity of conducting Data Science.
Statistics about Data Shape Use in RDF Data
TLDR
Preliminary statistics about the use of SHACL core constraints in data shapes found on GitHub show that class, datatype and cardinality constraints are predominantly used, similar to the dominant use of domain and range in ontologies.
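For readers unfamiliar with these constraint types, the snippet below exercises a datatype and a cardinality constraint with rdflib and pySHACL; it is an illustrative toy, not the tooling used in the study.

```python
# Toy SHACL validation showing a datatype and a cardinality constraint;
# illustrative only, not the study's tooling.
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:age ;
                  sh:datatype xsd:integer ;  # datatype constraint
                  sh:maxCount 1 ] .          # cardinality constraint
""", format="turtle")

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ; ex:age "thirty-two" .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: ex:age is a string, not an xsd:integer
```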
Evaluating the quality of the LOD cloud: An empirical investigation
TLDR
In this quantitative empirical survey, 130 datasets are analysed using 27 Linked Data quality metrics, and Principal Component Analysis (PCA) is applied to identify the key quality indicators that give sufficient information about a dataset’s quality.
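The PCA step generalises as follows; this sketch uses scikit-learn on random placeholder scores, since the survey's 130 x 27 score matrix is not reproduced here.

```python
# Hedged sketch of the PCA step over a (datasets x metrics) score matrix;
# random data stands in for the survey's 130 x 27 matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.random((130, 27))  # placeholder metric scores in [0, 1]

pca = PCA(n_components=5)
pca.fit(scores)
# Components explaining most variance point to the key quality indicators.
print(pca.explained_variance_ratio_)
```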
Scalable Quality Assessment of Linked Data
TLDR
This thesis examines the challenges of detecting quality problems in linked datasets and of presenting quality results in a standardised, machine-readable and interoperable format that agents can make sense of, helping human consumers identify a dataset’s fitness for use.
Web Semantics: Science, Services and Agents on the World Wide Web
TLDR
The results of two within-group user-centred studies of two online bibliographic systems, a widely deployed OPAC and its counterpart linked-data-based system, datos.bne.es, show that users of the linked-data-based system required significantly less time and visited fewer pages to complete a typical search and retrieval activity.
LOD-a-lot - A Queryable Dump of the LOD Cloud
TLDR
LOD-a-lot democratizes access to the Linked Open Data (LOD) Cloud by serving more than 28 billion unique triples from 650K datasets over a single self-indexed file, enabling Web-scale repeatable experimentation and research even by standard laptops.
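Querying such a self-indexed file is a triple-pattern lookup; the sketch below uses the pyHDT bindings (the hdt package), with the local file name assumed.

```python
# Triple-pattern lookup over an HDT file such as LOD-a-lot, using the
# pyHDT bindings; the local file name is an assumption.
from hdt import HDTDocument

doc = HDTDocument("lod-a-lot.hdt")  # hypothetical local copy
# Empty strings act as wildcards in the triple pattern.
triples, cardinality = doc.search_triples(
    "", "http://www.w3.org/2000/01/rdf-schema#label", "")
print(f"{cardinality} matching triples")
for s, p, o in triples:
    print(s, o)
    break  # show only the first match
```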

References

Showing 1-10 of 43 references
Luzzu -- A Framework for Linked Data Quality Assessment
TLDR
Luzzu is a framework for Linked Data Quality Assessment based on an extensible interface for defining new quality metrics, an interoperable, ontology-driven back-end for representing quality metadata and quality problems that can be reused within different semantic frameworks, a scalable stream processor for data dumps and SPARQL endpoints, and a customisable ranking algorithm taking into account user-defined weights.
Test-driven evaluation of linked data quality
TLDR
This work presents a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development, and argues that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality.
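In this spirit, a test case can be as simple as a SPARQL query that selects violating resources; the toy below uses rdflib and is not the paper's framework.

```python
# Toy test case in the spirit of test-driven quality assessment;
# uses rdflib, not the paper's framework.
from rdflib import Graph

g = Graph().parse(data="""
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:bob ex:age "-5"^^xsd:integer .
""", format="turtle")

# A test is a SPARQL query whose results are the violations.
violations = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?s WHERE { ?s ex:age ?age . FILTER (?age < 0) }
""")
for row in violations:
    print(f"test failed for {row[0]}: negative age")
```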
LOD Laundromat: A Uniform Way of Publishing Other People's Dirty Data
TLDR
The LOD Laundromat is presented, which removes stains from data without any human intervention and makes very large amounts of LOD more easily available for further processing.
LOTUS: Adaptive Text Search for Big Linked Data
TLDR
The ease with which LOTUS enables text-based resource retrieval at an unprecedented scale is demonstrated in concrete and domain-specific scenarios, and the scalability of LOTUS with respect to the LOD Laundromat is evaluated.
Weaving the Pedantic Web
TLDR
This paper discusses common errors in RDF publishing, their consequences for applications, along with possible publisher-oriented approaches to improve the quality of structured, machine-readable and open data on the Web.
Quality assessment for Linked Data: A Survey
TLDR
A systematic review of approaches for assessing the quality of Linked Data, which unifies and formalizes commonly used terminology across papers related to data quality and provides a comprehensive list of 18 quality dimensions and 69 metrics.
Sieve: linked data quality assessment and fusion
TLDR
Sieve, a framework for flexibly expressing quality assessment and fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution.
What's up LOD Cloud? Observing The State of Linked Open Data Cloud Metadata
TLDR
Roomba is developed, a tool for validating, correcting and generating dataset metadata; it is shown that the automatic corrections made by Roomba increase the overall quality of dataset metadata, while highlighting the need for manual effort to supply some important missing information.
ClioPatria: A SWI-Prolog infrastructure for the Semantic Web
TLDR
ClioPatria is a comprehensive semantic web development framework based on SWI-Prolog; it extends the SWI-Prolog RDF core with a SPARQL and LOD server, an extensible web frontend to manage the server, browse the data and query it using SPARQL and Prolog, and a Git-based plugin manager.
Towards a vocabulary for data quality management in semantic web architectures
TLDR
This paper provides a conceptual model that allows the representation of data quality rules and other quality-related knowledge using the Resource Description Framework (RDF) and the Web Ontology Language (OWL).