Several real-world applications need to effectively manage and reason about large amounts of inherently uncertain data. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of purposes, including motion prediction and human behavior modeling. Such probabilistic data analyses…
We present a thorough investigation of using machine learning to construct effective personalized anti-spam filters. The investigation covers four learning algorithms (Naive Bayes, Flexible Bayes, LogitBoost, and Support Vector Machines) and four datasets, constructed from the mailboxes of different users. We discuss the model and search biases of the…
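As a rough illustration of the Naive Bayes approach evaluated in this line of work, here is a minimal multinomial Naive Bayes spam classifier with Laplace smoothing. This is a toy sketch with made-up training messages, not the filter from the papers above:

```python
from collections import Counter
import math

def train_nb(docs):
    """Count word and class frequencies from (tokens, label) training pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(tokens, word_counts, class_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word|class),
    using add-one (Laplace) smoothing for unseen words."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy "mailboxes": token lists with spam/ham labels.
train = [
    (["win", "cash", "now"], "spam"),
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["lunch", "tomorrow"], "ham"),
]
model = train_nb(train)
print(classify(["win", "cash"], *model))  # spam
```

A per-user filter in this spirit would simply be trained on that user's own mailbox, which is what makes the resulting filter personalized.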
We present Filtron, a prototype anti-spam filter that integrates the main empirical conclusions of our comprehensive analysis of using machine learning to construct effective personalized anti-spam filters. Filtron is based on experimental results over several design parameters on four publicly available benchmark corpora. After describing Filtron's…
Rule-based information extraction is a process by which structured objects are extracted from text based on user-defined rules. The compositional nature of rule-based information extraction also allows rules to be expressed over previously extracted objects. Such extraction is inherently uncertain, due to the varying precision associated with the rules used…
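The idea of rules with varying precision can be sketched as follows. Each rule pairs a pattern with an estimated precision, and every extracted object carries that precision as a confidence score. The rule set and precision values here are hypothetical, purely for illustration:

```python
import re

# Hypothetical rules: each maps a label to (regex, estimated precision).
# The precision reflects that rule-based extraction is inherently uncertain.
RULES = {
    "zip":   (re.compile(r"\b\d{5}\b"), 0.90),
    "email": (re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"), 0.95),
}

def extract(text):
    """Apply each rule to the text, attaching the rule's precision
    as the confidence of every match it produces."""
    results = []
    for label, (pattern, precision) in RULES.items():
        for m in pattern.finditer(text):
            results.append((label, m.group(), precision))
    return results

print(extract("Contact alice@example.com, zip 94305."))
```

A compositional rule could then be expressed over these outputs, e.g. combining a previously extracted person span with a nearby email match, with the combined confidence derived from the component rules' precisions.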
Unstructured text represents a large fraction of the world's data. It often contains snippets of structured information (e.g., people's names and zip codes). Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE…
The wide deployment of wireless sensor and RFID (Radio Frequency IDentification) devices is one of the key enablers for next-generation pervasive computing applications, including large-scale environmental monitoring and control, context-aware computing, and "smart digital homes". Sensory readings are inherently unreliable and typically exhibit strong…
Triangle counting is an important problem in graph mining. The clustering coefficient and the transitivity ratio, two commonly used measures, quantify the triangle density of a graph, capturing the fact that friends of friends tend to be friends themselves. Furthermore, several successful graph-mining applications rely on the number of triangles…
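A minimal sketch of exact triangle counting and the transitivity ratio on a small undirected graph (adjacency-dict representation chosen here for illustration; the papers above concern far larger graphs where exact enumeration is too costly):

```python
from itertools import combinations

def count_triangles(adj):
    """Count triangles by checking, for each node, which pairs of its
    neighbors are themselves adjacent. Each triangle is found once per
    vertex, hence the division by 3."""
    triangles = 0
    for u, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                triangles += 1
    return triangles // 3

def transitivity(adj):
    """Transitivity ratio: 3 * (#triangles) / (#connected triples),
    where a connected triple (wedge) is a node with a pair of neighbors."""
    wedges = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return 3 * count_triangles(adj) / wedges if wedges else 0.0

# Toy undirected graph: one triangle (a, b, c) plus a pendant node d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
print(count_triangles(adj))  # 1
print(transitivity(adj))     # 0.6
```

This exact count is quadratic in the neighborhood sizes; the research above targets approximate counting that scales to massive graphs.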
Full-text documents represent a large fraction of the world's data. Although not structured per se, they often contain snippets of structured information within them: e.g., names, addresses, and document titles. Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two…