The predictability of letters in written English

@article{Schrmann2007ThePO,
  title={The predictability of letters in written English},
  author={Thomas Sch{\"u}rmann and Peter Grassberger},
  journal={ArXiv},
  year={2007},
  volume={abs/0710.4516}
}
We show that the predictability of letters in written English texts depends strongly on their position in the word. The first letters are usually the hardest to predict. This agrees with the intuitive notion that words are well-defined subunits in written languages, with much weaker correlations across these units than within them. In particular, the average entropy of a letter deep inside a word is roughly 4–5 times smaller than the entropy of the first letter.
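A quick way to see the effect described in the abstract is a plug-in estimate of the conditional entropy of the k-th letter of a word given the letters before it. The sketch below is a minimal version of that idea, not the paper's own estimator; the word list `words.txt` (one lowercase word per line) and the cutoff of six positions are assumptions, and the plug-in estimate is biased low where contexts are sparse.

```python
from collections import Counter, defaultdict
from math import log2

MAX_POS = 6  # assumed cutoff: look at the first six letter positions only

def positional_entropies(words, max_pos=MAX_POS):
    # ctx_counts[k][prefix][letter]: how often `letter` appears at position k
    # after the word-initial `prefix` (the first k letters of the word)
    ctx_counts = [defaultdict(Counter) for _ in range(max_pos)]
    for w in words:
        for k, ch in enumerate(w[:max_pos]):
            ctx_counts[k][w[:k]][ch] += 1
    entropies = []
    for k in range(max_pos):
        total = sum(sum(c.values()) for c in ctx_counts[k].values())
        h = 0.0
        for counter in ctx_counts[k].values():
            n = sum(counter.values())
            # plug-in estimate of H(letter | prefix), weighted by P(prefix)
            h_ctx = -sum((c / n) * log2(c / n) for c in counter.values())
            h += (n / total) * h_ctx
        entropies.append(h)
    return entropies

if __name__ == "__main__":
    with open("words.txt") as f:  # assumed corpus: one word per line
        words = [w for w in (line.strip() for line in f) if w.isalpha()]
    for k, h in enumerate(positional_entropies(words), start=1):
        print(f"position {k}: ~{h:.2f} bits/letter")
```

Because the context here is restricted to the word-initial prefix, a steep drop of the estimate with position reflects exactly the strong within-word correlations the paper describes.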
4 Citations

The “handedness” of language: Directional symmetry breaking of sign usage in words
This study shows that the occurrence probability distributions of signs at the left and right ends of words are distinctly heterogeneous, and uses this asymmetry to infer the direction of writing in undeciphered inscriptions, in agreement with the archaeological evidence.
Complexity-entropy analysis at different levels of organisation in written language
It is shown that complexity-entropy analysis can also be carried out at the different levels of organization of a text, and that the approach is general enough to analyze the same balance in other complex messages, such as DNA, where a hierarchy of organizational levels is known to exist.
Text Prediction in Web-based Text-Processing
The author presents an inverted approach to text prediction that might increase prediction accuracy, together with a longitudinal study to further understand the significant results.

References

Prediction and entropy of printed English
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known.
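The guessing method above converts guess statistics into entropy bounds: if q_i is the fraction of symbols the subject guesses correctly on the i-th attempt, Shannon's bounds are Σ_i i(q_i − q_{i+1}) log2 i ≤ H ≤ −Σ_i q_i log2 q_i. A minimal sketch in Python, with made-up guess frequencies standing in for measured ones:

```python
from math import log2

def shannon_bounds(q):
    # q[i] = fraction of symbols guessed correctly on attempt i+1;
    # Shannon assumes q is non-increasing and sums to 1.
    q = list(q) + [0.0]  # pad so q[i+1] exists for the last rank
    upper = -sum(p * log2(p) for p in q if p > 0)
    lower = sum((i + 1) * (q[i] - q[i + 1]) * log2(i + 1)
                for i in range(len(q) - 1))
    return lower, upper

# illustrative (invented) guess statistics, not Shannon's measured values
demo = [0.79, 0.08, 0.03, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
lo, hi = shannon_bounds(demo)
print(f"{lo:.2f} <= H <= {hi:.2f} bits/letter")
```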
A convergent gambling estimate of the entropy of English
In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text; the gambling estimate proposed here converges to the entropy by the Shannon-McMillan-Breiman theorem.
Entropy estimation of symbol sequences.
Algorithms for estimating the Shannon entropy h of finite symbol sequences with long-range correlations are considered, and a scaling law is proposed for extrapolation from finite sample lengths.
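The block-entropy route these algorithms refine can be sketched directly: estimate H_n over blocks of length n, then difference, since h_n = H_{n+1} − H_n decreases toward the entropy rate h; the extrapolation from finite samples is the hard part the paper addresses. A naive plug-in version, assuming a plain-text file `corpus.txt` and a small block range (it needs rapidly more data as n grows):

```python
from collections import Counter
from math import log2

def block_entropy(text, n):
    # plug-in estimate of H_n over all length-n blocks of the text
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

text = open("corpus.txt").read().lower()  # assumed plain-text corpus
H = [block_entropy(text, n) for n in range(1, 7)]  # H[0] is H_1, etc.
for n in range(1, len(H)):
    # the conditional entropies h_n should decay toward the rate h
    print(f"h_{n} = H_{n + 1} - H_{n} = {H[n] - H[n - 1]:.3f} bits/letter")
```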
A universal finite memory source
It is shown that this universal source incorporates any minimal data-generating tree machine in an asymptotically optimal manner in the following sense: the negative logarithm of the probability it assigns to any long typical sequence, generated by any tree machine, approaches that assigned by the tree machine at the best possible rate.
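As a toy illustration of the finite-memory idea (not Rissanen's algorithm itself), an order-k Markov model can assign sequential probabilities whose negative logarithm gives a code length in bits per symbol; the order, the Laplace smoothing, and the alphabet below are all assumptions made for the sketch:

```python
from collections import Counter, defaultdict
from math import log2

class FiniteMemorySource:
    """Order-k Markov model with Laplace-smoothed sequential probabilities."""

    def __init__(self, k, alphabet):
        self.k, self.alphabet = k, alphabet
        self.counts = defaultdict(Counter)  # suffix context -> letter counts

    def prob(self, context, symbol):
        c = self.counts[context[-self.k:]]
        n = sum(c.values())
        return (c[symbol] + 1) / (n + len(self.alphabet))  # Laplace smoothing

    def update(self, context, symbol):
        self.counts[context[-self.k:]][symbol] += 1

src = FiniteMemorySource(k=3, alphabet="abcdefghijklmnopqrstuvwxyz ")
text = "the predictability of letters depends on their position in the word"
bits = 0.0
for i, ch in enumerate(text):
    bits -= log2(src.prob(text[:i], ch))  # code length of the next symbol
    src.update(text[:i], ch)
print(f"{bits / len(text):.2f} bits/symbol on this toy text")
```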
"J."
however (for it was the literal soul of the life of the Redeemer, John xv. io), is the peculiar token of fellowship with the Redeemer. That love to God (what is meant here is not God’s love to men)
A Mathematical Theory of Communication
It is proved that a positive data rate can be achieved with arbitrarily small error probability, and that there is an upper bound on the achievable data rate, above which no encoding scheme can attain an arbitrarily small error probability.
Text Compression
A collection of mixed English texts of newspapers (provided as ASCII-text by
Collected Works provided as ASCII-text by Project Gutenberg Etext
Evaluation of the entropy of a language by an improved prediction method with application to printed Hebrew (1994)