Katrin Braunschweig

Following the Open Data trend, governments and public agencies have started making their data available to the public using web portals, web services or REST interfaces. Ideally, making this data available on the web would lead to more transparency, participation and innovation throughout society. However, just publishing the data on the web is not enough.
Of the structured data published on the web, for instance as datasets on Open Data platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes ...
Relational Web tables have become an important resource for applications such as factual search and entity augmentation. A major challenge for an automatic identification of relevant tables on the Web is the fact that many of these tables have missing or non-informative column labels. Research has focused largely on recovering the meaning of columns by ...
Following the Open Data trend, governments and public agencies have started making their data available on the Web and established platforms such as data.gov or data.un.org. These Open Data platforms provide a huge amount of data for various topics such as demographics, transport, finance or health in various data formats. One typical usage scenario for ...
Entity augmentation is a query type in which, given a set of entities and a large corpus of possible data sources, the values of a missing attribute are to be retrieved. State-of-the-art methods return a single result that, to cover all queried entities, is fused from a potentially large set of data sources. We argue that queries on large corpora of ...
In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research we developed the Dresden Web Tables Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. As the vast majority of HTML ...
Named entity extraction is an established research area in the field of information extraction. When tailored to a specific domain and with sufficient pre-labeled training data, state-of-the-art extraction algorithms have achieved near human performance. However, when presented with semi-structured data, informal text or unknown domains where training data ...
In a traditional relational database management system, queries can only be defined over attributes defined in the schema, but are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result will be a ...
Platforms for publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even just common publication standards. This results in inconsistent names for attributes of the same meaning, which ...
Government initiatives for more transparency and participation have led to an increasing amount of structured data on the web in recent years. Many of these datasets have great potential. For example, a situational analysis and meaningful visualization of the data can assist in pointing out social or economic issues and raising people’s awareness.