Five letter words in English
I was going to make a little post about Wordle, but I got sidetracked exploring five-letter words. At the same time, I had a bit of fun with regular expressions and some simple scripting with ZSH.
The start was to obtain lists of 5-letter words. One is available at the Stanford GraphBase site; the file sgb-words.txt contains "the 5757 five-letter words of English". Others are available through English-language wordlists, such as those in the Linux directory /usr/share/dict/, where there are two such files, one for British and one for American English.
So we can start by gathering all these different lists of words, and also sorting Knuth's file so that it is in alphabetical order.
Here's how (assuming that we're in a writable directory that will contain these files):
grep -E '^[a-z]{5}$' /usr/share/dict/american-english > ./usa5.txt
grep -E '^[a-z]{5}$' /usr/share/dict/british-english > ./brit5.txt
sort sgb-words.txt > ./knuth5.txt
The regular expressions in the first two lines simply ask for words of exactly five letters made from lower-case characters. This eliminates proper names and words with apostrophes.
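As a quick sanity check on the pattern, we can feed it a few made-up test words (these aren't from the dictionary files, just illustrations); only the plain lower-case five-letter word gets through:

$ printf '%s\n' Aaron about "isn't" cat | grep -E '^[a-z]{5}$'
about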
Now let's see how many words each file contains:
$ wc -l *5.txt

 5300 brit5.txt
 5757 knuth5.txt
 5150 usa5.txt
16207 total
Note the nice use of ZSH's powerful globbing features - one area in which it is more powerful than BASH.
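For example, with the EXTENDED_GLOB option set, a ~ exclusion pattern lets us count everything except Knuth's file (the totals shown here simply follow from the counts above):

$ setopt extended_glob
$ wc -l *5.txt~knuth*
 5300 brit5.txt
 5150 usa5.txt
10450 total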
Now there are too many words to see exactly the differences between them, but to start let's list the words in brit5.txt which are not in usa5.txt, and also the words in usa5.txt which are not in brit5.txt:
$ grep -f usa5.txt -v brit5.txt > brit-usa_d.txt
$ grep -f brit5.txt -v usa5.txt > usa-brit_d.txt
I'm using a debased version of set-difference notation for the output file names, so that brit-usa_d.txt contains those words in brit5.txt which are not in usa5.txt. I've added a _d suffix to make a handle for globbing:
$ wc -l *_d.txt

 188 brit-usa_d.txt
  38 usa-brit_d.txt
 226 total
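As an aside, since these word lists are already in sorted order, the standard comm utility would give the same set differences directly (run the files through sort first if in any doubt about their ordering):

$ comm -23 brit5.txt usa5.txt > brit-usa_d.txt    # lines only in brit5.txt
$ comm -13 brit5.txt usa5.txt > usa-brit_d.txt    # lines only in usa5.txt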
And now we can look at the contents of these files, using the handy Linux `column` command to print the output with fewer lines:
$ cat usa-brit_d.txt | column

arbor chili fagot feces honor miter niter rigor savor vapor
ardor color favor fiber humor molds ocher ruble slier vigor
armor dolor fayer furor labor moldy odors rumor tumor
calks edema fecal grays liter molts plows saber valor
Notice, as you may expect, that this file contains American spellings: "rigor" instead of British "rigour", "liter" instead of British "litre", and so on. However, the other difference file contains only a few spelling differences, along with quite a lot of words not in the American wordlist:
$ cat brit-usa_d.txt | column

abaci blent croci flyer hollo liras nitre pupas slily togae wrapt
aeons blest curst fogey homie litre nosey rajas slyer topis wrier
agism blubs dados fondu hooka loxes ochre ranis snuck torah yacks
ameer bocce deers frier horsy lupin odour recta spacy torsi yocks
amirs bocci dicky gamey huzza macks oecus relit spelt tsars yogin
amnia boney didos gaols idyls maths panty ropey spick tyres yucks
ampul bosun ditzy gayly ikons matts papaw sabre spilt tzars yuppy
amuck briar djinn gipsy imbed mavin pease saree stogy ulnas zombi
appal brusk drily gismo indue metre penes sheik styes vacua
aquae bunko enrol gnawn jehad miaow pigmy sherd swops veldt
arses burqa enure greys jinns micra pilau shlep synch vitas
aunty caddy eying gybed junky mitre pilaw shoed tabus vizir
aurae calfs eyrie gybes kabob momma pinky shoon tempi vizor
baddy calif faery hadji kebob mould podgy shorn thymi welch
bassi celli fayre hallo kerbs moult podia shtik tikes whirr
baulk chapt fezes hanky kiddy mynah pricy siree tipis whizz
beaux clipt fibre heros kopek narks prise situp tiros wizes
bided coney fiord hoagy leapt netts pryer skyed toffy wooly
Of course, some of these words are spelled with a different number of letters in American English: for example the British "djinn" is the American "jinn"; the British "saree" is the American "sari".
Now of course we want to see how the Knuth file differs, as it's the file with the largest number of words:
$ grep -f usa5.txt -v knuth5.txt > knuth-usa_d.txt
$ grep -f brit5.txt -v knuth5.txt > knuth-brit_d.txt

$ wc -l knuth*_d.txt

 895 knuth-brit_d.txt
 980 knuth-usa_d.txt
1875 total
Remarkably enough, there are also words in both the original files which are not in Knuth's list:
$ grep -f knuth5.txt -v usa5.txt > usa-knuth_d.txt
$ grep -f knuth5.txt -v brit5.txt > brit-knuth_d.txt

$ wc -l *knuth_d.txt

 438 brit-knuth_d.txt
 373 usa-knuth_d.txt
 811 total
So maybe our best bet would be to concatenate all the files, and take all the words, leaving out any duplicates. Something like this:
$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -u > allu5.txt
$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -d > alld5.txt
$ cat allu5.txt alld5.txt | sort > all5.txt
The first line finds all the words which are unique - that is, that appear only once in the concatenated file, and the second line finds all the words which are repeated. These two lists are disjoint, and so may then be concatenated to form a master list, which can be found to contain 6196 words.
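In fact sort can do the de-duplication by itself, so the three commands above should collapse into a single one producing the same 6196-word file:

$ sort -u usa5.txt brit5.txt knuth5.txt > all5.txt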
Surely this file is complete? Well, the English language is a great collector
of words, and every year we find new words being used, many from other languages
and cultures. Here are some words that are not in the all5.txt
file:
Australian words: galah, dunny, smoko, durry, bogan, chook (there are almost certainly others)
Indian words: crore, lakhs, dosai, iddli, baati, chaat, kheer, kofta, kulfi, rasam, poori (the first two are numbers, the others are foods)
Scots words: canty, curch, flang, kythe, plack, routh, saugh, teugh, wadna - these are all used by Burns in his poems, which are written in English (admittedly a dialectal form of it).
New words: qango, fubar, crunk, noobs, vlogs, rando, vaper (the first two are excellent acronyms; the others are new words)
As with the Australian words, none of these lists are exhaustive; the full list of five-letter English words not in the file all5.txt would probably run into many hundreds, maybe even thousands.
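To check whether any particular word has made it into the list, a fixed-string, whole-line grep does the job; for example "galah", mentioned above, gets a count of zero:

$ grep -cxF galah all5.txt
0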
A note on word structures
I was curious about the numbers of vowels and consonants in words. To start, here's a little Julia function which encodes the positions of consonants in a word as an integer between 0 and 31. For example, take the word "drive". We can encode this as [1,1,0,1,0], where the 1's are at the positions of the consonants. These can then be considered as binary digits, representing the number 26.
julia> function cvs(word)
           vs = indexin(collect(word),collect("aeiou"))   # vowel positions, or nothing for consonants
           vs2 = replace(x -> isnothing(x) ? 1 : 0,vs)    # 1 for each consonant, 0 for each vowel
           return(sum(vs2 .* [16,8,4,2,1]))               # read the pattern as a 5-bit binary number
       end
Now we simply walk through the words in all5.txt, determining their values as we go and keeping a running total:
julia> wds = readlines(open("all5.txt"))
julia> cv = zeros(Int16,32)
julia> for w in wds
           c = cvs(w)
           cv[c+1] = cv[c+1]+1
       end

julia> hcat(0:31,cv)
32×2 Matrix{Int64}:
  0     0
  1     0
  2     2
  3     1
  4     4
  5    48
  6    10
  7    19
  8     2
  9    61
 10    96
 11   156
 12    24
 13   262
 14    21
 15    24
 16     1
 17    34
 18   105
 19   585
 20    97
 21  1514
 22   432
 23  1158
 24     5
 25   301
 26   249
 27   832
 28    16
 29    96
 30    15
 31    26
We see that the most common patterns are 21 = 10101 (the consonant-vowel-consonant-vowel-consonant shape of a word like "tiger") and 23 = 10111. But what about some of the smaller values?
julia> for w in wds
           if cvs(w) == 24
               println(w)
           end
       end

stoae
wheee
whooo
xviii
xxiii
Yes, there are some Roman numerals hanging about, and they should probably be removed. And here's one more pattern, 30 = 11110:
julia> for w in wds
           if cvs(w) == 30
               println(w)
           end
       end

chyme
clxvi
cycle
hydra
hydro
lycra
phyla
rhyme
schmo
schwa
style
styli
thyme
thymi
xxxvi
Again a few Roman numerals; these will need to be removed. One way to find them all is by using regular expressions again:
$ grep -E '[xlcvi]{5}' all5.txt
civic
civil
clvii
clxii
clxiv
clxix
clxvi
lxvii
villi
xcvii
xviii
xxiii
xxvii
xxxii
xxxiv
xxxix
xxxvi
and we see that we have three English words ("civic", "civil", and "villi"), and the rest are Roman numerals. These can be deleted.
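One way to do the removal, sticking with grep (the file names roman5.txt and all5-clean.txt below are just placeholders):

$ grep -E '^[xlcvi]{5}$' all5.txt | grep -vxF -e civic -e civil -e villi > roman5.txt
$ grep -vxFf roman5.txt all5.txt > all5-clean.txt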