Five letter words in English
I was going to make a little post about Wordle, but I got sidetracked exploring five-letter words. Along the way, I had a bit of fun with regular expressions and some simple scripting with ZSH.
The first step was to obtain some lists of five-letter words. One is available at the
Stanford Graphbase site;
the file sgb-words.txt
contains "the 5757 five-letter words of English". Others can be extracted from
general English-language wordlists, such as those in the Linux directory
/usr/share/dict/
, which contains files for both British and American English.
So we can start by gathering all these different lists of words, and also sorting Knuth's file so that it is in alphabetical order.
Here's how (assuming that we're in a writable directory that will contain these files):
grep -E '^[a-z]{5}$' /usr/share/dict/american-english > ./usa5.txt
grep -E '^[a-z]{5}$' /usr/share/dict/british-english > ./brit5.txt
sort sgb-words.txt > ./knuth5.txt
The regular expressions in the first two lines simply ask for words of exactly five letters made from lower-case characters. This eliminates proper names and words with apostrophes.
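For instance, piping a few sample entries through the same filter (these particular words are just for illustration) keeps only the all-lowercase five-letter ones:
$ print -l Aaron "baby's" abbey abbeys | grep -E '^[a-z]{5}$'
abbey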
Now let's see how many words each file contains:
$ wc -l *5.txt
5300 brit5.txt
5757 knuth5.txt
5150 usa5.txt
16207 total
Note the nice use of ZSH's powerful globbing features - one area in which it is more powerful than BASH.
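As a small aside (just a sketch; nothing below depends on it), zsh glob qualifiers can refine a pattern further - for instance (.) restricts the match to plain files and om orders them by modification time, newest first:
$ wc -l *5.txt(.om)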
Now there are too many words to compare the lists directly, so to start let's list
the words in brit5.txt
which are not in usa5.txt
, and also
the words in usa5.txt
which are not in brit5.txt
:
$ grep -f usa5.txt -v brit5.txt > brit-usa_d.txt
$ grep -f brit5.txt -v usa5.txt > usa-brit_d.txt
I'm using a loose version of set-difference notation for the output file names, so that
brit-usa_d.txt
contains those words in brit5.txt
which are not in usa5.txt
.
I've added a _d
suffix to give a handle for globbing:
$ wc -l *_d.txt
188 brit-usa_d.txt
38 usa-brit_d.txt
226 total
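Incidentally, since all these word lists are sorted, the same set differences can be computed with comm (a sketch; run the files through sort first if needed, and the output names here are just illustrative):
$ comm -23 brit5.txt usa5.txt > brit-usa_comm.txt
$ comm -13 brit5.txt usa5.txt > usa-brit_comm.txt
Here comm -23 prints the lines found only in the first file, and comm -13 the lines found only in the second.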
And now we can look at the contents of these files, using the handy Linux `column` command to print the output with fewer lines:
$ cat usa-brit_d.txt | column
arbor chili fagot feces honor miter niter rigor savor vapor
ardor color favor fiber humor molds ocher ruble slier vigor
armor dolor fayer furor labor moldy odors rumor tumor
calks edema fecal grays liter molts plows saber valor
Notice, as you might expect, that this file contains American spellings: "rigor" instead of British "rigour", "liter" instead of British "litre", and so on. The other difference file, however, contains only a few spelling differences, and quite a lot of words simply not in the American wordlist:
$ cat brit-usa_d.txt | column
abaci blent croci flyer hollo liras nitre pupas slily togae wrapt
aeons blest curst fogey homie litre nosey rajas slyer topis wrier
agism blubs dados fondu hooka loxes ochre ranis snuck torah yacks
ameer bocce deers frier horsy lupin odour recta spacy torsi yocks
amirs bocci dicky gamey huzza macks oecus relit spelt tsars yogin
amnia boney didos gaols idyls maths panty ropey spick tyres yucks
ampul bosun ditzy gayly ikons matts papaw sabre spilt tzars yuppy
amuck briar djinn gipsy imbed mavin pease saree stogy ulnas zombi
appal brusk drily gismo indue metre penes sheik styes vacua
aquae bunko enrol gnawn jehad miaow pigmy sherd swops veldt
arses burqa enure greys jinns micra pilau shlep synch vitas
aunty caddy eying gybed junky mitre pilaw shoed tabus vizir
aurae calfs eyrie gybes kabob momma pinky shoon tempi vizor
baddy calif faery hadji kebob mould podgy shorn thymi welch
bassi celli fayre hallo kerbs moult podia shtik tikes whirr
baulk chapt fezes hanky kiddy mynah pricy siree tipis whizz
beaux clipt fibre heros kopek narks prise situp tiros wizes
bided coney fiord hoagy leapt netts pryer skyed toffy wooly
Of course, some of these words are spelled with a different number of letters in American English: for example the British "djinn" is the American "jinn"; the British "saree" is the American "sari".
Now of course we want to see how the Knuth file differs, as it's the file with the largest number of words:
$ grep -f usa5.txt -v knuth5.txt > knuth-usa_d.txt
$ grep -f brit5.txt -v knuth5.txt > knuth-brit_d.txt
$ wc -l knuth*_d.txt
895 knuth-brit_d.txt
980 knuth-usa_d.txt
1875 total
Remarkably enough, there are also words in both the original files which are not in Knuth's list:
$ grep -f knuth5.txt -v usa5.txt > usa-knuth_d.txt
$ grep -f knuth5.txt -v brit5.txt > brit-knuth_d.txt
$ wc -l *knuth_d.txt
438 brit-knuth_d.txt
373 usa-knuth_d.txt
811 total
So maybe our best bet would be to concatenate all the files and take all the words, leaving out any duplicates. Something like this:
$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -u > allu5.txt
$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -d > alld5.txt
$ cat allu5.txt alld5.txt | sort > all5.txt
The first line finds all the words which are unique - that is, which appear only once in the concatenated file - and the second line finds all the words which are repeated (each listed just once). These two lists are disjoint, and so may be concatenated to form a master list, which turns out to contain 6196 words.
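Equivalently (a shorter sketch of the same idea), sort -u removes the duplicates in one step, and wc -l confirms the count:
$ cat usa5.txt brit5.txt knuth5.txt | sort -u > all5.txt
$ wc -l all5.txt
6196 all5.txt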
Surely this file is complete? Well, the English language is a great collector
of words, and every year we find new words being used, many from other languages
and cultures. Here are some words that are not in the all5.txt
file:
Australian words: galah, dunny, smoko, durry, bogan, chook (there are almost certainly others)
Indian words: crore, lakhs, dosai, iddli, baati, chaat, kheer, kofta, kulfi, rasam, poori (the first two are numbers, the others are foods)
Scots words: canty, curch, flang, kythe, plack, routh, saugh, teugh, wadna - these are all used by Burns in his poems, which are written in English (admittedly a dialect form of it).
New words: qango, fubar, crunk, noobs, vlogs, rando, vaper (the first two are excellent acronyms; the others are new words)
As with the Australian words, none of these lists is exhaustive; the full list
of five-letter English words not in the file all5.txt
would probably run into
many hundreds, maybe even thousands.
A note on word structures
I was curious about the numbers of vowels and consonants in words. To start, here's a little Julia function which encodes the positions of consonants as an integer between 0 and 31. For example, take the word "drive". We can encode this as [1,1,0,1,0], where the 1's are at the positions of the consonants. Reading these as the binary digits 11010 gives the number 26.
julia> function cvs(word)
vs = indexin(collect(word),collect("aeiou"))
vs2 = replace(x -> isnothing(x) ? 1 : 0,vs)
return(sum(vs2 .* [16,8,4,2,1]))
end
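A quick check against the example above:
julia> cvs("drive")
26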
Now we simply walk through the words in all5.txt
, determining their values as
we go and keeping a count of how often each value occurs:
julia> wds = readlines(open("all5.txt"))
julia> cv = zeros(Int16,32)
julia> for w in wds
c = cvs(w)
cv[c+1] = cv[c+1]+1
end
julia> hcat(0:31,cv)
32×2 Matrix{Int64}:
0 0
1 0
2 2
3 1
4 4
5 48
6 10
7 19
8 2
9 61
10 96
11 156
12 24
13 262
14 21
15 24
16 1
17 34
18 105
19 585
20 97
21 1514
22 432
23 1158
24 5
25 301
26 249
27 832
28 16
29 96
30 15
31 26
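Each of these codes can be read back as a consonant/vowel pattern by writing it in binary; here's a small sketch (the helper name patt is just made up for illustration) which turns a code into a string of c's and v's, checked against "drive" from before:
julia> patt(n) = map(d -> d == '1' ? 'c' : 'v', string(n, base=2, pad=5))
patt (generic function with 1 method)
julia> patt(26)
"ccvcv"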
We see that the most common patterns are 21 = 10101, and 23 = 10111. But what about some of the smaller values?
julia> for w in wds
if cvs(w) == 24
println(w)
end
end
stoae
wheee
whooo
xviii
xxiii
Yes, there are some Roman numerals hanging about, and they should probably be removed. Here's one more value, 30 = 11110:
julia> for w in wds
if cvs(w) == 30
println(w)
end
end
chyme
clxvi
cycle
hydra
hydro
lycra
phyla
rhyme
schmo
schwa
style
styli
thyme
thymi
xxxvi
Again there are a few Roman numerals, which will need to be removed by hand. One way to find them all is to use regular expressions again:
$ grep -E '[xlcvi]{5}' all5.txt
civic
civil
clvii
clxii
clxiv
clxix
clxvi
lxvii
villi
xcvii
xviii
xxiii
xxvii
xxxii
xxxiv
xxxix
xxxvi
and we see that we have three English words (civic, civil, villi), and that the rest are Roman numerals. These can be deleted.
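To finish the job, one way (a sketch; romans.txt and clean5.txt are just illustrative names) is to collect the numerals into their own file and then exclude exact whole-line matches against it:
$ grep -E '[xlcvi]{5}' all5.txt | grep -vE '^(civic|civil|villi)$' > romans.txt
$ grep -v -x -f romans.txt all5.txt > clean5.txt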