Five letter words in English

I was going to make a little post about Wordle, but I got sidetracked exploring five-letter words. Along the way, I had a bit of fun with regular expressions and some simple scripting with ZSH.

The first step was to obtain lists of 5-letter words. One is available at Donald Knuth's Stanford GraphBase site; the file sgb-words.txt contains "the 5757 five-letter words of English". Others are available through English-language wordlists, such as those in the Linux directory /usr/share/dict/, which has two such files for British and American English.

So we can start by gathering all these different lists of words, and also sorting Knuth's file so that it is in alphabetical order.

Here's how (assuming that we're in a writable directory that will contain these files):

grep -E '^[a-z]{5}$' /usr/share/dict/american-english > ./usa5.txt
grep -E '^[a-z]{5}$' /usr/share/dict/british-english > ./brit5.txt
sort sgb-words.txt > ./knuth5.txt

The regular expressions in the first two lines simply ask for words of exactly five letters made from lower-case characters. This eliminates proper names and words with apostrophes.
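
For instance, a quick check with a few sample words of my own shows what the pattern keeps and rejects:

$ printf 'abbey\nAaron\nabbeys\n' | grep -E '^[a-z]{5}$'
abbey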

Now let's see how many words each file contains:

$ wc -l *5.txt

 5300 brit5.txt
 5757 knuth5.txt
 5150 usa5.txt
16207 total

Note the handy use of the shell's globbing here - ZSH's globbing is one area in which it is more powerful than BASH, although this simple pattern works in both.

There are now too many words to compare the lists by eye, but to start, let's list the words in brit5.txt which are not in usa5.txt, and also the words in usa5.txt which are not in brit5.txt:

$ grep -f usa5.txt -v brit5.txt > brit-usa_d.txt
$ grep -f brit5.txt -v usa5.txt > usa-brit_d.txt

I'm using a debased version of set-difference notation for the output file names, so that brit-usa_d.txt contains those words in brit5.txt which are not in usa5.txt. I've added a _d suffix to make a handle for globbing:

$ wc -l *_d.txt

 188 brit-usa_d.txt
  38 usa-brit_d.txt
 226 total
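
As an aside, since all these files are sorted, the comm utility could produce the same set differences directly; here's a sketch equivalent to the grep commands above:

$ comm -23 brit5.txt usa5.txt > brit-usa_d.txt
$ comm -13 brit5.txt usa5.txt > usa-brit_d.txt

The -23 flag suppresses the lines unique to the second file and the lines common to both, leaving just those unique to the first; -13 does the opposite.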

And now we can look at the contents of these files, using the handy Linux `column` command to print the output with fewer lines:

$ cat usa-brit_d.txt | column

arbor	 chili	fagot	feces	honor	miter	niter	rigor	savor	vapor
ardor	 color	favor	fiber	humor	molds	ocher	ruble	slier	vigor
armor	 dolor	fayer	furor	labor	moldy	odors	rumor	tumor
calks	 edema	fecal	grays	liter	molts	plows	saber	valor

Notice, as you might expect, that this file contains American spellings: "rigor" instead of British "rigour", "liter" instead of British "litre", and so on. The other difference file, however, contains only a few spelling differences, and quite a lot of words which are simply not in the American wordlist:

$ cat brit-usa_d.txt | column

abaci	 blent	croci	flyer	hollo	liras	nitre	pupas	slily	togae	wrapt
aeons	 blest	curst	fogey	homie	litre	nosey	rajas	slyer	topis	wrier
agism	 blubs	dados	fondu	hooka	loxes	ochre	ranis	snuck	torah	yacks
ameer	 bocce	deers	frier	horsy	lupin	odour	recta	spacy	torsi	yocks
amirs	 bocci	dicky	gamey	huzza	macks	oecus	relit	spelt	tsars	yogin
amnia	 boney	didos	gaols	idyls	maths	panty	ropey	spick	tyres	yucks
ampul	 bosun	ditzy	gayly	ikons	matts	papaw	sabre	spilt	tzars	yuppy
amuck	 briar	djinn	gipsy	imbed	mavin	pease	saree	stogy	ulnas	zombi
appal	 brusk	drily	gismo	indue	metre	penes	sheik	styes	vacua
aquae	 bunko	enrol	gnawn	jehad	miaow	pigmy	sherd	swops	veldt
arses	 burqa	enure	greys	jinns	micra	pilau	shlep	synch	vitas
aunty	 caddy	eying	gybed	junky	mitre	pilaw	shoed	tabus	vizir
aurae	 calfs	eyrie	gybes	kabob	momma	pinky	shoon	tempi	vizor
baddy	 calif	faery	hadji	kebob	mould	podgy	shorn	thymi	welch
bassi	 celli	fayre	hallo	kerbs	moult	podia	shtik	tikes	whirr
baulk	 chapt	fezes	hanky	kiddy	mynah	pricy	siree	tipis	whizz
beaux	 clipt	fibre	heros	kopek	narks	prise	situp	tiros	wizes
bided	 coney	fiord	hoagy	leapt	netts	pryer	skyed	toffy	wooly

Of course, some of these words are spelled with a different number of letters in American English: for example the British "djinn" is the American "jinn"; the British "saree" is the American "sari".

Now of course we want to see how the Knuth file differs, as it's the file with the largest number of words:

$ grep -f usa5.txt -v knuth5.txt > knuth-usa_d.txt
$ grep -f brit5.txt -v knuth5.txt > knuth-brit_d.txt

$ wc -l knuth*_d.txt

  895 knuth-brit_d.txt
  980 knuth-usa_d.txt
 1875 total

Remarkably enough, there are also words in both the original files which are not in Knuth's list:

$ grep -f knuth5.txt -v usa5.txt > usa-knuth_d.txt
$ grep -f knuth5.txt -v brit5.txt > brit-knuth_d.txt

$ wc -l *knuth_d.txt

 438 brit-knuth_d.txt
 373 usa-knuth_d.txt
 811 total

So maybe our best bet would be to concatenate all the files and take all the words, leaving out any duplicates. Something like this:

$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -u > allu5.txt
$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -d > alld5.txt
$ cat allu5.txt alld5.txt | sort > all5.txt

The first line finds all the words which are unique - that is, which appear only once in the concatenated file - and the second line finds all the words which are repeated. These two lists are disjoint, and so may be concatenated to form a master list, which turns out to contain 6196 words.
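
In fact, since every line of the concatenated file is either unique or repeated, the three commands above can be collapsed into one:

$ cat usa5.txt brit5.txt knuth5.txt | sort -u > all5.txt

The -u flag keeps a single copy of each distinct line, which is exactly the union of the three lists.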

Surely this file is complete? Well, the English language is a great collector of words, and every year we find new words being used, many from other languages and cultures. Here are some words that are not in the all5.txt file:

Australian words: galah, dunny, smoko, durry, bogan, chook (there are almost certainly others)

Indian words: crore, lakhs, dosai, iddli, baati, chaat, kheer, kofta, kulfi, rasam, poori (the first two are numbers, the others are foods)

Scots words: canty, curch, flang, kythe, plack, routh, saugh, teugh, wadna - these are all used by Burns in his poems, which are written in English (admittedly a dialectal form of it).

New words: qango, fubar, crunk, noobs, vlogs, rando, vaper (the first two are excellent acronyms; the others are new words)

As with the Australian words, none of these lists is exhaustive; the full list of five-letter English words not in all5.txt would probably run into many hundreds, maybe even thousands.

A note on word structures

I was curious about the numbers of vowels and consonants in words. To start, here's a little Julia function which encodes the positions of the consonants in a word as an integer between 0 and 31. For example, take the word "drive". We can encode this as [1,1,0,1,0], where the 1's are at the positions of the consonants. These can then be read as binary digits representing the number 26.

julia> function cvs(word)
           # index of each letter within "aeiou" (nothing for a consonant)
           vs = indexin(collect(word),collect("aeiou"))
           # map each consonant to 1 and each vowel to 0
           vs2 = replace(x -> isnothing(x) ? 1 : 0,vs)
           # read the five 0/1 digits as a binary number
           return(sum(vs2 .* [16,8,4,2,1]))
       end
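
A quick check on our example word:

julia> cvs("drive")
26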

Now we simply walk through the words in all5.txt, computing their values as we go and keeping a tally of each value:

julia> wds = readlines(open("all5.txt"))
julia> cv = zeros(Int16,32)
julia> for w in wds
           c = cvs(w)
           cv[c+1] = cv[c+1]+1   # c+1, as Julia arrays are 1-indexed
       end
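
As a sanity check, the tallies should sum to the number of words in the master list:

julia> sum(cv)
6196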

julia> hcat(0:31,cv)
32×2 Matrix{Int64}:
  0     0
  1     0
  2     2
  3     1
  4     4
  5    48
  6    10
  7    19
  8     2
  9    61
 10    96
 11   156
 12    24
 13   262
 14    21
 15    24
 16     1
 17    34
 18   105
 19   585
 20    97
 21  1514
 22   432
 23  1158
 24     5
 25   301
 26   249
 27   832
 28    16
 29    96
 30    15
 31    26
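
To read these codes back as consonant/vowel patterns, here's a small helper of my own (writing C for a consonant and V for a vowel):

julia> pattern(n) = join((b == '1' ? 'C' : 'V') for b in string(n, base=2, pad=5))
julia> pattern(21)
"CVCVC"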

We see that the most common patterns are 21 = 10101 (CVCVC) and 23 = 10111 (CVCCC). But what about some of the smaller values?

julia> for w in wds
           if cvs(w) == 24
               println(w)
           end
       end

stoae
wheee
whooo
xviii
xxiii
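
As an aside, the same list can be produced with a one-liner:

julia> filter(w -> cvs(w) == 24, wds)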

Yes, there are some Roman numerals hanging about, and they should probably be removed. And one more value, 30 = 11110:

julia> for w in wds
           if cvs(w) == 30
               println(w)
           end
       end

chyme
clxvi
cycle
hydra
hydro
lycra
phyla
rhyme
schmo
schwa
style
styli
thyme
thymi
xxxvi

Again a few Roman numerals. These may need to be removed by hand; one way to find them all is by using regular expressions again:

$ grep -E '[xlcvi]{5}' all5.txt
civic
civil
clvii
clxii
clxiv
clxix
clxvi
lxvii
villi
xcvii
xviii
xxiii
xxvii
xxxii
xxxiv
xxxix
xxxvi

and we see that we have 3 English words (civic, civil, and villi); the rest are Roman numerals, which can be deleted.
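
One way to finish the job (a sketch; the file names numerals.txt and clean5.txt are my own) is to collect the numerals, leaving out the three genuine words, and then strip them from the master list:

$ grep -E '^[xlcvi]{5}$' all5.txt | grep -vE '^(civic|civil|villi)$' > numerals.txt
$ grep -vxFf numerals.txt all5.txt > clean5.txt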