Naftali Tishby

We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multi-scale structure which appears in many natural sequences, particularly in language, handwriting and speech. We seek a systematic unsupervised …
We analyze the "query by committee" algorithm, a method for filtering informative queries from a random stream of inputs. We show that if the two-member committee algorithm achieves information gain with a positive lower bound, then the prediction error decreases exponentially with the number of queries. We show, in particular, that this exponential decrease …
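
A minimal sketch of the two-member committee filter described above, under simplifying assumptions: the hypotheses are linear separators, and sample_consistent_pair and oracle are hypothetical helpers standing in for sampling from the version space and for the labeling source.

```python
import numpy as np

def disagrees(h1, h2, x):
    """Two-member committee: the input is informative only if the hypotheses disagree."""
    return np.sign(h1 @ x) != np.sign(h2 @ x)

def query_by_committee(stream, sample_consistent_pair, oracle, budget):
    """Filter informative queries from a random stream of unlabeled inputs.

    sample_consistent_pair(labeled) is assumed to return two hypotheses drawn
    at random from those consistent with the labeled examples seen so far;
    oracle(x) returns the true label. Both are illustrative placeholders.
    """
    labeled = []
    for x in stream:
        if len(labeled) >= budget:
            break
        h1, h2 = sample_consistent_pair(labeled)
        if disagrees(h1, h2, x):            # committee splits: ask for a label
            labeled.append((x, oracle(x)))  # labels are paid for only here
    return labeled
```
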
We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the …
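
A minimal sketch of prediction with a variable-memory model of the kind described above, assuming the learned model is stored as a mapping from suffixes (contexts) to next-symbol distributions; the learning step itself is omitted and the names are illustrative.

```python
def predict_next(model, history, max_order=5):
    """Variable-memory Markov prediction: condition on the longest suffix
    of the history that the model contains.

    model: dict mapping context strings to {symbol: probability} dicts;
    assumed to always contain the empty context ''.
    """
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[-k:] if k > 0 else ''
        if context in model:
            return model[context]
    return model['']

# Toy model over {a, b}: after seeing 'ab' the next symbol is usually 'a'.
model = {'': {'a': 0.5, 'b': 0.5}, 'b': {'a': 0.7, 'b': 0.3},
         'ab': {'a': 0.9, 'b': 0.1}}
print(predict_next(model, 'aab'))   # -> {'a': 0.9, 'b': 0.1}
```
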
We introduce a novel distributional clustering algorithm that explicitly maximizes the mutual information per cluster between the data and given categories. This algorithm can be considered a bottom-up, hard version of the recently introduced "Information Bottleneck Method". We relate the mutual information between clusters and categories to the Bayesian …
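
A minimal sketch of the quantity being maximized, the mutual information between cluster assignments and categories, assuming it is estimated from a contingency table of co-occurrence counts (the function name is illustrative).

```python
import numpy as np

def mutual_information(counts):
    """I(cluster; category) estimated from a clusters-by-categories count table."""
    p = counts / counts.sum()               # joint p(t, c)
    pt = p.sum(axis=1, keepdims=True)       # marginal p(t)
    pc = p.sum(axis=0, keepdims=True)       # marginal p(c)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pt @ pc)[nz])).sum())

# Two clusters, two categories, perfectly aligned -> 1 bit.
print(mutual_information(np.array([[50, 0], [0, 50]])))  # 1.0
```
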
Feature selection is the task of choosing, out of a given set of features, a small subset that captures the relevant properties of the data. In the context of supervised classification problems, relevance is determined by the labels given on the training data. A good choice of features is key to building compact and accurate classifiers. In this paper we …
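
A minimal sketch of label-driven feature scoring under a simple assumption: discrete features are ranked by their individual mutual information with the class labels. This is a generic baseline for illustration, not the specific criterion of the paper; all names are hypothetical.

```python
from collections import Counter
import math

def mi_with_labels(feature_values, labels):
    """Mutual information between one discrete feature and the class labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    pf = Counter(feature_values)
    pl = Counter(labels)
    return sum((c / n) * math.log2((c / n) / ((pf[f] / n) * (pl[l] / n)))
               for (f, l), c in joint.items())

def select_features(columns, labels, k):
    """Keep the k feature columns that individually say the most about the labels."""
    ranked = sorted(columns, key=lambda name: mi_with_labels(columns[name], labels),
                    reverse=True)
    return ranked[:k]
```
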
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the …
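
A minimal sketch of how the first clustering step feeds the second: once words are grouped into clusters, each document can be represented by its distribution over word clusters. The function and argument names are illustrative, assuming p(x, y) is given as a documents-by-words count matrix.

```python
import numpy as np

def document_cluster_profiles(counts, word_to_cluster, n_clusters):
    """Represent each document by its distribution over word clusters.

    counts: documents-by-words matrix of co-occurrence counts.
    word_to_cluster: array giving the cluster index of each word.
    """
    profiles = np.zeros((counts.shape[0], n_clusters))
    for w, t in enumerate(word_to_cluster):
        profiles[:, t] += counts[:, w]        # pool counts of co-clustered words
    row_sums = profiles.sum(axis=1, keepdims=True)
    return profiles / np.where(row_sums == 0, 1, row_sums)  # normalize per document
```
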
We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information, with time and space complexity typically linear in the data size. …
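
A minimal sketch of a sequential reassignment pass of the kind described above: each element is taken out of its cluster and placed where a given objective is highest. Here score is a hypothetical placeholder for the objective (assumed to return its value with element i placed in cluster t), not the paper's exact IB functional.

```python
import random

def sequential_pass(elements, assignment, n_clusters, score):
    """One sequential pass: visit elements in random order and move each to the
    cluster that maximizes the objective; since the current cluster is among the
    candidates, the objective never decreases, so repeated passes converge to a
    local maximum.
    """
    changed = False
    for i in random.sample(range(len(elements)), len(elements)):
        best = max(range(n_clusters),
                   key=lambda t: score(elements, assignment, i, t))
        if best != assignment[i]:
            assignment[i] = best
            changed = True
    return changed
```
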
We define predictive information I_pred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: I_pred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite …
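
One common way to write the definition above, assuming symmetric observation windows of duration T for the past and the future (the subscripted symbols are notation introduced here for illustration):

```latex
I_{\mathrm{pred}}(T)
  \;=\; I\!\left(X_{\mathrm{past}}; X_{\mathrm{future}}\right)
  \;=\; \left\langle \log_2
        \frac{P\!\left(x_{\mathrm{past}}, x_{\mathrm{future}}\right)}
             {P\!\left(x_{\mathrm{past}}\right)\,P\!\left(x_{\mathrm{future}}\right)}
        \right\rangle ,
```

where the average is taken over the joint distribution of the past and future windows.
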