Constantine D. Spyropoulos

Learn More
The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect(More)
It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (“spam”). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization,(More)
We investigate the performance of two machine learning algorithms in the context of antispam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian(More)
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the(More)
We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial email, or “spam”, floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we(More)
This paper is a survey of recent work in the field of web usage mining for the benefitof research on the personalization of Web-based information services. The essence of personalization is the adaptability of information systems to the needs of their users. This issue is becoming increasingly important on the Web, as non-expert users are overwhelmed by the(More)
In this paper we examine the acquisition of user stereotypes and communities automatically from users' data. Stereotypes are built using supervised learning (C4.5) on personal data extracted from a set of questionnaires answered by the users of a news filtering system. Particular emphasis is given to the characteristic features of the task of learning(More)
In this paper we present a new computationally efficient algorithm for inducing context-free grammars that is able to learn from positive sample sentences. This new algorithm uses simplicity as a criterion for directing inference, and the search process of the new algorithm has been optimised by utilising the results of a theoretical analysis regarding the(More)
This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed in order to aid both researchers in natural language processing, as well as companies that produce language engineering systems for the end-user. Ellogon provides a powerful TIPSTER-based infrastructure for managing, storing and(More)
This paper presents a method that assists in maintaining a rule-based named-entity recognition and classification system. The underlying idea is to use a separate system, constructed with the use of machine learning, to monitor the performance of the rule-based system. The training data for the second system is generated with the use of the rule-based(More)