Robust and distributed web-scale near-dup document conflation in Microsoft Academic Service

@article{Wu2015RobustAD,
  title={Robust and distributed web-scale near-dup document conflation in {Microsoft Academic Service}},
  author={Chieh-Han Wu and Yang Song},
  journal={2015 IEEE International Conference on Big Data (Big Data)},
  year={2015},
  pages={2606-2611}
}
  • Published 29 October 2015
  • Computer Science
In modern web-scale applications that collect data from different sources, entity conflation is a challenging task due to various data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a GBDT model to determine whether two papers from different sources should be conflated to the same paper entity. In the online… 
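A minimal sketch of the offline component, assuming hand-crafted pairwise features and scikit-learn's GradientBoostingClassifier; the feature choices (title similarity, author overlap, year gap) and toy records below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: decide whether two paper records should be conflated by training
# a GBDT classifier on pairwise features (features here are assumptions).
from difflib import SequenceMatcher
from sklearn.ensemble import GradientBoostingClassifier

def pair_features(a, b):
    """Illustrative pairwise features for two paper records."""
    title_sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    authors_a, authors_b = set(a["authors"]), set(b["authors"])
    author_jaccard = len(authors_a & authors_b) / max(len(authors_a | authors_b), 1)
    year_gap = abs(a["year"] - b["year"])
    return [title_sim, author_jaccard, year_gap]

# Toy labeled pairs: 1 = same underlying paper, 0 = different papers.
p1 = {"title": "Near-dup document conflation", "authors": ["C. Wu", "Y. Song"], "year": 2015}
p2 = {"title": "Near-duplicate document conflation", "authors": ["C. Wu", "Y. Song"], "year": 2015}
p3 = {"title": "A density-based clustering algorithm", "authors": ["M. Ester"], "year": 1996}
pairs = [(p1, p2, 1), (p1, p3, 0), (p2, p3, 0)]

X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = GradientBoostingClassifier(n_estimators=50).fit(X, y)
print(model.predict_proba([pair_features(p1, p2)]))  # P(pair conflates)
```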
Data Curation and Quality Assurance for Machine Learning-based Cyber Intrusion Detection
TLDR
The research introduces a data quality perspective to help researchers and practitioners improve the performance of machine learning-based intrusion detection.
An Approach for Validating Quality of Datasets for Machine Learning
TLDR
An experimental study shows how the quality of datasets impacts the accuracy of machine learning models, and a novel technique based on metamorphic testing is proposed for validating a machine learning system together with its training and testing data.
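A rough illustration of the metamorphic-testing idea (the paper's concrete metamorphic relations are not shown in this summary, so the relation below is an assumption): shuffling the training set should not change a deterministic learner's predictions, and a violation signals a data or pipeline quality problem.

```python
# Sketch of one assumed metamorphic relation for an ML pipeline:
# permuting the rows of the training set should leave a deterministic
# model's predictions unchanged.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

base = DecisionTreeClassifier(random_state=0).fit(X, y)
perm = rng.permutation(len(X))
shuffled = DecisionTreeClassifier(random_state=0).fit(X[perm], y[perm])

agree = np.mean(base.predict(X) == shuffled.predict(X))
print(f"prediction agreement after shuffle: {agree:.0%}")  # relation predicts 100%
```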
A Machine Learning Based Framework for Verification and Validation of Massive Scale Image Data
TLDR
The design of the proposed big data verification and validation framework is described with CMA as the case study, and its effectiveness is demonstrated by verifying and validating the dataset, the software, and the algorithms in CMA.
Building a Deep Learning Classifier for Enhancing a Biomedical Big Data Service
TLDR
A deep learning classifier is described that is rigorously validated with synthetic data generated by a collection of scientific tools; it improves the effectiveness of data separation and supports the design of big data services with data quality improvement as an integral component.
Building an SVM Classifier for Automated Selection of Big Data
TLDR
A support vector machine (SVM) based approach is proposed for automated classification of big data, so that noisy data are separated from regular data into distinct categories; the performance of the SVM-based classification is compared with a deep learning-based classification of the same data set.
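A minimal sketch of the general setup on synthetic placeholder data (not the cited paper's features or dataset): a binary SVM trained to separate noisy records from regular ones.

```python
# Sketch: SVM separating "noisy" from "regular" records; the two Gaussian
# blobs below stand in for real feature vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
regular = rng.normal(0.0, 1.0, size=(100, 2))  # well-behaved records
noisy = rng.normal(4.0, 1.0, size=(100, 2))    # corrupted / outlier records
X = np.vstack([regular, noisy])
y = np.array([0] * 100 + [1] * 100)            # 1 = noisy

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.2, -0.1], [4.1, 3.8]]))  # expect [0 1]
```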
Data Evaluation and Enhancement for Quality Improvement of Machine Learning
TLDR
This article exposes the hidden quality problems in the datasets used to build a machine learning system for normalizing medical concepts in social media text, and proposes a data quality evaluation framework that includes quality criteria and their corresponding evaluation approaches.
A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning
Deep learning has been widely used for extracting value from big data. Like many other machine learning algorithms, deep learning requires significant training data. Experiments have shown both the…
Augmentation and evaluation of training data for deep learning
TLDR
This paper proposes a deep learning classifier for automatically separating good training data from noisy data and demonstrates the effectiveness of the proposed approach through an experimental investigation of automated classification of massive biomedical images.

References

Detecting near-duplicates for web crawling
TLDR
This work demonstrates that Charikar's fingerprinting technique is appropriate for near-duplicate detection and presents an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k.
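A sketch of the underlying mechanism, assuming 64-bit bag-of-tokens fingerprints; the paper's actual contribution, an index for quickly finding all stored fingerprints within Hamming distance k of a query, is omitted here.

```python
# Sketch: simhash-style fingerprints and the k-bit Hamming test.
import hashlib

def simhash(tokens, bits=64):
    """One +/-1 vote per token per bit position; the sign of the tally is the bit."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc_a = "robust and distributed web scale near dup document conflation".split()
doc_b = "robust and distributed web scale near duplicate document conflation".split()
doc_c = "a density based algorithm for discovering clusters with noise".split()

fa, fb, fc = simhash(doc_a), simhash(doc_b), simhash(doc_c)
# Near-duplicates should land far fewer bits apart than unrelated documents;
# pairs with hamming(...) <= k for small k (e.g. 3) are flagged as near-dups.
print(hamming(fa, fb), hamming(fa, fc))
```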
Effective string processing and matching for author disambiguation
TLDR
An effective name matching framework is proposed, with two implementations that treat Chinese and non-Chinese names separately because of their different naming conventions; post-processing, including merging the results of the two predictions, further boosts performance.
Identifying and Filtering Near-Duplicate Documents
TLDR
The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
TLDR
DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
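A minimal usage sketch with scikit-learn's DBSCAN (the API choice is this sketch's assumption; the reference describes the algorithm itself). eps is the neighborhood radius and min_samples the density threshold; the original paper fixes the latter and tunes only Eps, and points labeled -1 are noise.

```python
# Sketch: density-based clustering of toy 2-D points with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],   # dense cluster
                   [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],   # second cluster
                   [9.0, 0.0]])                          # isolated point
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # e.g. [ 0  0  0  1  1  1 -1]; -1 marks noise
```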
Similarity estimation techniques from rounding algorithms
TLDR
It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
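The random-hyperplane scheme at the heart of this reference fits in a few lines: the sign pattern of dot products with random hyperplanes is a locality-sensitive hash for cosine similarity, since two vectors disagree on a given bit with probability angle(u, v)/π. A minimal sketch:

```python
# Sketch: Charikar-style random-hyperplane hashing; the hash-bit
# disagreement rate estimates the angle between two vectors.
import numpy as np

rng = np.random.default_rng(0)
planes = rng.standard_normal((64, 3))  # 64 random hyperplanes in R^3

def lsh_bits(x):
    return (planes @ x) > 0  # one sign bit per hyperplane

u = np.array([1.0, 0.2, 0.0])
v = np.array([1.0, 0.3, 0.1])
est_angle = np.mean(lsh_bits(u) != lsh_bits(v)) * np.pi
true_angle = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(est_angle, true_angle)  # the estimate tracks the true angle
```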
Greedy function approximation: A gradient boosting machine.
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions…
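The stagewise additive view can be sketched directly: each stage fits a weak learner to the negative gradient of the loss at the current model (for squared loss, simply the residuals) and adds a shrunken copy of it. A minimal sketch with shallow regression trees as the weak learners:

```python
# Sketch: gradient boosting for regression with squared loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

pred = np.full_like(y, y.mean())    # F_0: best constant model
lr, trees = 0.1, []
for _ in range(100):                # F_m = F_{m-1} + lr * h_m
    residual = y - pred             # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * h.predict(X)
    trees.append(h)
print(np.mean((y - pred) ** 2))     # training MSE shrinks stage by stage
```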