The frequency of letters in text has often been studied for use in cryptography, and in frequency analysis in particular. No exact figure is possible, as each person writes slightly differently; however, an approximate ordering of English letters by frequency of use is ETAOIN SHRDL UCMFG YPWBV KXJQZ.
Letter frequencies, like word frequencies, vary both by writer and by subject. One cannot talk about x-rays without using frequent Xs, and cannot use any letter if it is broken on one's keyboard. Letter, bigram, trigram and word frequencies can be used as evidence for or against authorship of long texts, as can measures such as average word and sentence length. Everyone writes differently: Hemingway is not Faulkner, and so on. A precise average can only be obtained by analyzing a large mass of representative text.
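As a rough illustration of how such an analysis might be carried out, the sketch below counts the relative frequency of each letter in a piece of text. The function name `letter_frequencies` and the sample sentence are illustrative choices, not part of the original article, and the sketch assumes only the 26 unaccented letters a-z are of interest.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return the relative frequency of each letter a-z in text.

    Case is ignored and anything that is not a-z is skipped.
    Returns an empty dict if the text contains no letters.
    """
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


# Hypothetical sample input; a real analysis would use a much larger corpus.
sample = "One cannot talk about x-rays without using frequent Xs."
freqs = letter_frequencies(sample)
for letter, freq in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{letter}: {freq:.4f}")
```

On a short sample like this the ranking will differ noticeably from the usual ordering, which is the point made above: stable figures need a large and representative body of text.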
Relative frequencies of letters
Letter (alphabetical) | Frequency | Letter (by frequency) | Frequency |
a | 0.08167 | e | 0.12702 |
b | 0.01492 | t | 0.09056 |
c | 0.02782 | a | 0.08167 |
d | 0.04253 | o | 0.07507 |
e | 0.12702 | i | 0.06966 |
f | 0.02228 | n | 0.06749 |
g | 0.02015 | s | 0.06327 |
h | 0.06094 | h | 0.06094 |
i | 0.06966 | r | 0.05987 |
j | 0.00153 | d | 0.04253 |
k | 0.00772 | l | 0.04025 |
l | 0.04025 | c | 0.02782 |
m | 0.02406 | u | 0.02758 |
n | 0.06749 | m | 0.02406 |
o | 0.07507 | w | 0.02360 |
p | 0.01929 | f | 0.02228 |
q | 0.00095 | g | 0.02015 |
r | 0.05987 | y | 0.01974 |
s | 0.06327 | p | 0.01929 |
t | 0.09056 | b | 0.01492 |
u | 0.02758 | v | 0.00978 |
v | 0.00978 | k | 0.00772 |
w | 0.02360 | j | 0.00153 |
x | 0.00150 | x | 0.00150 |
y | 0.01974 | q | 0.00095 |
z | 0.00074 | z | 0.00074 |
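One way a table like this is used in frequency analysis is to score candidate decryptions of a simple substitution cipher. The sketch below is an illustration only, not a method described in the article: it assumes a Caesar (fixed-shift) cipher and uses a chi-squared statistic to pick the shift whose output best matches the frequencies listed above. The names `ENGLISH_FREQ`, `chi_squared`, `shift_text` and `guess_caesar_shift` are hypothetical.

```python
import string

# Relative frequencies copied from the table above.
ENGLISH_FREQ = {
    "a": 0.08167, "b": 0.01492, "c": 0.02782, "d": 0.04253, "e": 0.12702,
    "f": 0.02228, "g": 0.02015, "h": 0.06094, "i": 0.06966, "j": 0.00153,
    "k": 0.00772, "l": 0.04025, "m": 0.02406, "n": 0.06749, "o": 0.07507,
    "p": 0.01929, "q": 0.00095, "r": 0.05987, "s": 0.06327, "t": 0.09056,
    "u": 0.02758, "v": 0.00978, "w": 0.02360, "x": 0.00150, "y": 0.01974,
    "z": 0.00074,
}


def chi_squared(text):
    """Score how far the letter counts in text are from the table (lower is closer)."""
    letters = [ch for ch in text.lower() if ch in ENGLISH_FREQ]
    total = len(letters)
    if total == 0:
        return float("inf")
    score = 0.0
    for letter, share in ENGLISH_FREQ.items():
        expected = share * total
        observed = letters.count(letter)
        score += (observed - expected) ** 2 / expected
    return score


def shift_text(text, shift):
    """Shift every letter a-z by `shift` places (text is lowercased first)."""
    out = []
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return "".join(out)


def guess_caesar_shift(ciphertext):
    """Return the shift (0-25) whose decryption looks most like English."""
    return min(range(26), key=lambda s: chi_squared(shift_text(ciphertext, -s)))


plaintext = ("the letters of a long english text tend to settle into a stable "
             "order with e and t at the top and the rare letters near the bottom")
ciphertext = shift_text(plaintext, 3)
shift = guess_caesar_shift(ciphertext)
print(shift, shift_text(ciphertext, -shift))
```

The guess becomes less reliable as the ciphertext gets shorter or more unusual in vocabulary, for the reasons given in the introduction.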
Top 10 letters at the beginning of words
Letter | Frequency |
t | 0.1594 |
a | 0.155 |
i | 0.0823 |
s | 0.0775 |
o | 0.0712 |
c | 0.0597 |
m | 0.0426 |
f | 0.0408 |
p | 0.040 |
w | 0.0382 |
Top 10 letters at the end of words
Letter | Frequency |
e | 0.1917 |
s | 0.1435 |
d | 0.0923 |
t | 0.0864 |
n | 0.0786 |
y | 0.0730 |
r | 0.0693 |
o | 0.0467 |
l | 0.0456 |
f | 0.0408 |
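Tables of this kind can be approximated by counting the first and last letter of every word. The following is a minimal sketch under stated assumptions: a "word" is taken to be any unbroken run of the letters a-z, so a contraction such as "don't" is split in two, and the function name `edge_letter_frequencies` is hypothetical. The article does not say how its own figures were produced.

```python
import re
from collections import Counter


def edge_letter_frequencies(text, top=10):
    """Return (first, last): the `top` most common word-initial and
    word-final letters, each as a list of (letter, share of words) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return [], []
    n = len(words)
    first = Counter(word[0] for word in words)
    last = Counter(word[-1] for word in words)
    top_first = [(letter, count / n) for letter, count in first.most_common(top)]
    top_last = [(letter, count / n) for letter, count in last.most_common(top)]
    return top_first, top_last


# Hypothetical sample input; use a large corpus for meaningful figures.
first, last = edge_letter_frequencies("Everyone writes differently and so on")
print("first letters:", first)
print("last letters:", last)
```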
Most common bigrams (in order)
th, he, in, en, nt, re, er, an, ti, es, on, at, se, nd, or, ar, al, te, co, de, to, ra, et, ed, it, sa, em, ro.
Most common trigrams (in order)
the, and, tha, ent, ing, ion, tio, for, nde, has, nce, edt, tis, oft, sth, men
Results from Project Gutenberg
Analysis of 9,481 English works (3.98 GiB) from Project Gutenberg (the extracted contents of the 2003 PG DVD, plain text files only, minus the human genome project, non-English works, and duplicates in 7-bit-clean encoding), after stripping off the common boilerplate text present in every file so as not to skew results, yielded the following frequencies of letters, bigrams, trigrams, and quadrigrams:
Letters
Of 3,104,375,038 letters scanned:
1. e (390395169, 12.575645%)
2. t (282039486, 9.085226%)
3. a (248362256, 8.000395%)
4. o (235661502, 7.591270%)
5. i (214822972, 6.920007%)
6. n (214319386, 6.903785%)
7. s (196844692, 6.340880%)
8. h (193607737, 6.236609%)
9. r (184990759, 5.959034%)
10. d (134044565, 4.317924%)
11. l (125951672, 4.057231%)
12. u (88219598, 2.841783%)
13. c (79962026, 2.575785%)
14. m (79502870, 2.560994%)
15. f (72967175, 2.350463%)
16. w (69069021, 2.224893%)
17. g (61549736, 1.982677%)
18. y (59010696, 1.900888%)
19. p (55746578, 1.795742%)
20. b (47673928, 1.535701%)
21. v (30476191, 0.981717%)
22. k (22969448, 0.739906%)
23. x (5574077, 0.179556%)
24. j (4507165, 0.145188%)
25. q (3649838, 0.117571%)
26. z (2456495, 0.079130%)
Bigrams
Of 2,383,373,483 bigrams scanned:
1. th (92535489, 3.882543%)
2. he (87741289, 3.681391%)
3. in (54433847, 2.283899%)
4. er (51910883, 2.178042%)
5. an (51015163, 2.140460%)
6. re (41694599, 1.749394%)
7. nd (37466077, 1.571977%)
8. on (33802063, 1.418244%)
9. en (32967758, 1.383239%)
10. at (31830493, 1.335523%)
11. ou (30637892, 1.285484%)
12. ed (30406590, 1.275779%)
13. ha (30381856, 1.274742%)
14. to (27877259, 1.169655%)
15. or (27434858, 1.151094%)
16. it (27048699, 1.134891%)
17. is (26452510, 1.109877%)
18. hi (26033632, 1.092302%)
19. es (26033602, 1.092301%)
20. ng (25106109, 1.053385%)
Trigrams
Of 1,699,542,842 trigrams scanned:
1. the (59623899, 3.508232%)
2. and (27088636, 1.593878%)
3. ing (19494469, 1.147042%)
4. her (13977786, 0.822444%)
5. hat (11059185, 0.650715%)
6. his (10141992, 0.596748%)
7. tha (10088372, 0.593593%)
8. ere (9527535, 0.560594%)
9. for (9438784, 0.555372%)
10. ent (9020688, 0.530771%)
11. ion (8607405, 0.506454%)
12. ter (7836576, 0.461099%)
13. was (7826182, 0.460487%)
14. you (7430619, 0.437213%)
15. ith (7329285, 0.431250%)
16. ver (7320472, 0.430732%)
17. all (7184955, 0.422758%)
18. wit (6752112, 0.397290%)
19. thi (6709729, 0.394796%)
20. tio (6425262, 0.378058%)
Quadrigrams
Of 1,144,085,293 quadrigrams scanned:
1. that (8709261, 0.761242%)
2. ther (6916008, 0.604501%)
3. with (6565513, 0.573866%)
4. tion (6314428, 0.551919%)
5. here (4285164, 0.374549%)
6. ould (4232202, 0.369920%)
7. ight (3540253, 0.309440%)
8. have (3324067, 0.290544%)
9. hich (3252540, 0.284292%)
10. whic (3247213, 0.283826%)
11. this (3161481, 0.276333%)
12. thin (3093756, 0.270413%)
13. they (3002324, 0.262421%)
14. atio (3001919, 0.262386%)
15. ever (2982572, 0.260695%)
16. from (2958372, 0.258580%)
17. ough (2899649, 0.253447%)
18. were (2643859, 0.231089%)
19. hing (2630750, 0.229944%)
20. ment (2555284, 0.223347%)
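Rankings in the same "n-gram (count, percentage)" form can be produced with a short counting routine. The sketch below is not the program used for the Project Gutenberg analysis, whose method is not described here. It assumes that n-grams are counted inside words only and never span a space or punctuation mark, and that a word is any unbroken run of the letters a-z. The function names `ngram_counts` and `top_ngrams` are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive letters that lies inside a single word."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def top_ngrams(text, n, limit=20):
    """Return the `limit` most common n-grams as (ngram, count, percentage) tuples."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return [(gram, count, 100.0 * count / total)
            for gram, count in counts.most_common(limit)]


# Hypothetical sample input; the figures above come from a far larger corpus.
sample = "Everyone writes differently, and the thing that matters is the sample."
for rank, (gram, count, pct) in enumerate(top_ngrams(sample, 2, limit=5), start=1):
    print(f"{rank}. {gram} ({count}, {pct:.6f}%)")
```

Calling `top_ngrams(text, 3)` or `top_ngrams(text, 4)` gives the trigram and quadrigram rankings in the same way.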
This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Letter frequency".
http://www.cryptograms.org