Xuan Hieu Phan

Learn More
This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from large-scale data collections. The main motivation of this work is that many classification tasks working with short segments of text & Web, such as search snippets, forum & chat messages,(More)
Word segmentation for Vietnamese, like for most Asian languages, is an important task which has a significant impact on higher language processing levels. However, it has received little attention of the community due to the lack of a common annotated corpus for evaluation and comparison. Also, most previous studies focused on unsupervised-statistical(More)
This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads(More)
Extracting data on the Web is an important information extraction task. Most existing approaches rely on wrappers which require human knowledge and user interaction during extraction. This paper proposes the use of conditional models as an alternative solution to this task. Deriving the strength of conditional models like maximum entropy and maximum entropy(More)
This paper presents an online algorithm for dependency parsing problems. We propose an adaptation of the passive and aggressive online learning algorithm to the dependency parsing domain. We evaluate the proposed algorithms on the 2007 CONLL Shared Task, and report errors analysis. Experimental results show that the system score is better than the average(More)
Discriminative sequential learning models like Conditional Random Fields (CRFs) have achieved significant success in several areas such as natural language processing or information extraction. Their key advantage is the ability to capture various non--independent and overlapping features of inputs. However, several unexpected pitfalls have a negative(More)
Image annotation is a process of finding appropriate semantic labels for images in order to obtain a more convenient way for indexing and searching images on the Web. This article proposes a novel method for image annotation based on combining feature-word distributions, which map from visual space to word space, and word-topic distributions, which form a(More)
We present a learning framework for structured support vector models in which boosting and bagging methods are used to construct ensemble models. We also propose a selection method which is based on a switching model among a set of outputs of individual classifiers when dealing with natural language parsing problems. The switching model uses subtrees mined(More)