Finding Co-occurring Text Phrases by Combining Sequence and Frequent Set Discovery

Abstract

A significant amount of data resides in loosely structured text collections. The concept of text mining has recently been introduced in order to utilize these resources in data mining driven decision making. In our approach, we consider finding multi-term text phrases that tend to co-occur in the documents of a document collection. We combine and further develop two techniques, finding frequent sequences and finding frequent sets, and discuss their suitability for text mining. The process presented in this paper contains two major phases. In the first phase, maximal frequent sequences are extracted from documents, i.e., sequences of words that are frequent in the document collection and that are not contained in any longer frequent sequence. A sequence is considered frequent if it appears in at least σ documents, where σ is a given frequency threshold. For instance, we may require the sequences to occur in at least 10 documents. In the second phase, co-occurrences of the maximal frequent sequences are found by discovering frequent sets of the sequences, i.e., sets of sequences that tend to co-occur in several documents. We have implemented the methods and experimented with a news collection. The experiments reveal many characteristics of textual data, which affect the further development and application of the methods.
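
The two phases described in the abstract can be illustrated with a minimal brute-force sketch. It is not the authors' implementation. It assumes that a sequence is a contiguous run of words (the abstract does not say whether gaps between words are allowed), that documents are already tokenized, and that sequence length and set size are capped; the function names and the `max_len` and `max_size` parameters are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def frequent_ngrams(docs, threshold, max_len=5):
    """Map each word n-gram to the ids of the documents containing it,
    keeping only n-grams that occur in at least `threshold` documents."""
    support = {}
    for doc_id, words in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                support.setdefault(tuple(words[i:i + n]), set()).add(doc_id)
    return {g: ids for g, ids in support.items() if len(ids) >= threshold}


def maximal_frequent_sequences(docs, threshold, max_len=5):
    """Phase 1: keep frequent sequences not contained in a longer frequent one."""
    frequent = frequent_ngrams(docs, threshold, max_len)
    return {
        g: ids
        for g, ids in frequent.items()
        if not any(len(h) > len(g) and contains(h, g) for h in frequent)
    }


def frequent_sets(seq_docs, threshold, max_size=3):
    """Phase 2: find sets of sequences that share at least `threshold` documents.
    Exhaustive enumeration, suitable only for small inputs."""
    result = {}
    items = sorted(seq_docs)
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            shared = set.intersection(*(seq_docs[s] for s in combo))
            if len(shared) >= threshold:
                result[combo] = shared
    return result


if __name__ == "__main__":
    # Toy documents, made up for this illustration.
    docs = [
        "the stock market fell as interest rates rose".split(),
        "interest rates rose and the stock market fell".split(),
        "the stock market rallied".split(),
    ]
    phrases = maximal_frequent_sequences(docs, threshold=2)
    for phrase_set, doc_ids in frequent_sets(phrases, threshold=2).items():
        print([" ".join(p) for p in phrase_set], sorted(doc_ids))
```

In this toy input, "the stock market" occurs in all three documents but is not maximal, because the longer sequence "the stock market fell" is also frequent at a threshold of 2. Enumerating every n-gram and every combination grows quickly with collection size, so a sketch like this is only meant to make the definitions concrete.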

Cite this paper

@inproceedings{AhonenMyka1999FindingCT, title={Finding Co-occurring Text Phrases by Combining Sequence and Frequent Set Discovery}, author={Helena Ahonen-Myka and Oskari Heinonen and Mika Klemettinen}, year={1999} }