Machine learning techniques for XML (co-)clustering by structure-constrained phrases

Abstract

A new method is proposed for clustering XML documents by structure-constrained phrases. It is implemented through three machine-learning approaches previously unexplored in the XML domain, namely non-negative matrix (tri-)factorization, co-clustering and automatic transactional clustering. A novel class of XML features approximately captures structure-constrained phrases as n-grams contextualized by root-to-leaf paths. Experiments over real-world benchmark XML corpora show that the effectiveness of the three approaches improves with contextualized n-grams of suitable length, which confirms the validity of the devised method from multiple clustering perspectives. Two of the approaches outperform several state-of-the-art competitors in effectiveness. The scalability of the three approaches is also investigated.
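To make the notion of n-grams contextualized by root-to-leaf paths concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, lowercasing, and that only the text of leaf elements is considered (mixed content is ignored); the function name `contextualized_ngrams` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def contextualized_ngrams(xml_text, n=2):
    """Count (root-to-leaf path, n-gram) pairs in one XML document.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the abstract does not specify the tokenization.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, path):
        path = path + [node.tag]
        children = list(node)
        if children:
            # Inner element: keep descending toward the leaves.
            for child in children:
                visit(child, path)
            return
        # Leaf element: slide a window of n tokens over its text and
        # pair each n-gram with the path from the root to this leaf.
        tokens = (node.text or "").lower().split()
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            features[("/".join(path), ngram)] += 1

    visit(root, [])
    return features


# Hypothetical sample document, used only for illustration.
doc = (
    "<paper><title>XML document clustering</title>"
    "<body><sec>XML document clustering by structure</sec></body></paper>"
)
features = contextualized_ngrams(doc, n=2)
for (path, ngram), count in sorted(features.items()):
    print(path, "|", ngram, "|", count)
```

In this sketch the same bigram occurring under two different paths yields two distinct features, which is what lets a clustering algorithm separate documents that share vocabulary but place it in different structural contexts. The resulting counts could then populate a document-by-feature matrix as input to a factorization or co-clustering method.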

DOI: 10.1007/s10791-017-9314-x


Cite this paper

@article{Costa2017MachineLT,
  title={Machine learning techniques for XML (co-)clustering by structure-constrained phrases},
  author={Gianni Costa and Riccardo Ortale},
  journal={Information Retrieval Journal},
  year={2017},
  pages={1-32}
}