General type-token distribution

  • Shohei Hidaka
  • Published 2 May 2013
  • Mathematics
  • Biometrika
We consider the problem of estimating the number of types in a corpus using the number of types observed in a sample of tokens from that corpus. We derive exact and asymptotic distributions for the number of observed types, conditioned on the number of tokens and the latent type distribution. We use the asymptotic distributions to derive an estimator of the latent number of types and validate this estimator numerically. 
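As a rough illustration of the quantity the abstract studies, the sketch below computes the expected number of observed types for a given number of tokens and a known type distribution, and checks it by simulation. It assumes tokens are drawn independently from a fixed distribution, in which case type i appears at least once in N tokens with probability 1 - (1 - p_i)^N. The function names `expected_types` and `simulate_types` and the Zipf-like example distribution are illustrative assumptions; this is not the paper's exact or asymptotic distribution, nor its estimator.

```python
import random


def expected_types(probs, n_tokens):
    """Expected number of distinct types among n_tokens i.i.d. draws.

    Assumption: tokens are sampled independently from the fixed
    distribution `probs`, so type i is seen with probability
    1 - (1 - p_i) ** n_tokens.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def simulate_types(probs, n_tokens, n_runs=2000, seed=0):
    """Monte Carlo estimate of the mean number of observed types."""
    rng = random.Random(seed)
    types = range(len(probs))
    total = 0
    for _ in range(n_runs):
        sample = rng.choices(types, weights=probs, k=n_tokens)
        total += len(set(sample))  # count distinct types in this sample
    return total / n_runs


def zipf_probs(n_types):
    """Hypothetical Zipf-like type distribution, p_i proportional to 1/i."""
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]
```

Calling `expected_types(zipf_probs(50), 100)` and `simulate_types(zipf_probs(50), 100)` should give closely matching values. Both fall below the latent 50 types, which is the gap an estimator of the latent number of types has to close.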

Estimating the latent number of types in growing corpora with reduced cost-accuracy trade-off.
  • S. Hidaka
  • Psychology, Medicine
  • Journal of child language
  • 2016
This study proposes a novel technique to estimate the latent number of words from a series of words uttered by children, and proposes a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy compared to existing methods.
Type-token models: a comparative study
The type (V) – token (N) relationship has been studied for almost a century and a number of models have been developed to examine this relationship, but comparative studies have been rare.
Modelling Population Size Using Horvitz-Thompson Approach Based on the Zero-Truncated Poisson Lindley Distribution
The simulation results show that the Horvitz-Thompson estimator based on the zero-truncated Poisson Lindley distribution for modelling the population size provides a good fit when compared to the zero-truncated Poisson distribution.
Statistical methods for biodiversity assessment
This thesis focuses on statistical methods for estimating the number of species, which is a natural index for measuring biodiversity. Both parametric and nonparametric approaches are investigated for…
Leveraging mutual exclusivity for faster cross-situational word learning: A theoretical analysis
The 39th Annual Meeting of the Cognitive Science Society (CogSci) (London, UK, 26-29 July 2017) aims to advance the understanding of why language impairment is a major cause of disability in people with autism.
Quantifying temporal trends in biodiversity, and how they vary spatially
Guppies inhabit streams in Trinidad, and habitats can be categorised into high and low predation areas. Experimental transplants of guppies from high to low predation streams were performed in 2008…


This paper considers certain stochastic models for token and type counts in literary texts. Elaborating on some models of Gani, it is shown that reasonable fits can be obtained to some data of Yule…
Good-Turing Frequency Estimation Without Tears
The Simple Good–Turing estimator is defined, which is straightforward to use and performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
How useful is the logarithmic type/token ratio?
What has been described as ‘one of the most remarkable (facts) in quantitative linguistics’ is the constancy of the logarithmic type/token ratio. If V denotes vocabulary and N text length, then log…
Sampling-Based Estimation of the Number of Distinct Values of an Attribute
This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications.
Sampling from Dirichlet partitions: estimating the number of species
The Dirichlet partition of an interval can be viewed as the generalization of several classical models in ecological statistics. We recall the unordered Ewens sampling formulae (ESF) from finite…
On the relation between the type–token and species-area problems
The species-area problem in biology and the type-token problem in literary studies are analogues of one another but have nearly disjoint literatures. Here their relationship is treated, a critique of…
How Variable May a Constant be? Measures of Lexical Richness in Perspective
The results suggest that the empirical trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
Estimating the Number of Species: A Review
How many kinds are there? Suppose that a population is partitioned into C classes. In many situations interest focuses not on estimation of the relative sizes of the classes, but on estimation of C…
Measures of Lexical Richness
Lexical richness is about the quality of vocabulary in a language sample. For some, this is equated with the variety of lexis, while for others it is a multidimensional concept.
Nonparametric estimation of the number of classes in a population
Efron's method (1981, 1982) is applied to the construction of confidence intervals based on bootstrap distributions.