Five letter words in English

I was going to make a little post about Wordle, but I got sidetracked exploring five-letter words. Along the way I had a bit of fun with regular expressions and some simple scripting with ZSH.

The first step was to obtain lists of 5-letter words. One is available at the Stanford GraphBase site; the file sgb-words.txt contains "the 5757 five-letter words of English". Others can be extracted from English-language wordlists, such as those in the Linux directory /usr/share/dict/, which has files for both British and American English.

So we can start by gathering all these different lists of words, and also sorting Knuth's file so that it is in alphabetical order.

Here's how (assuming that we're in a writable directory that will contain these files):

grep -E '^[a-z]{5}$' /usr/share/dict/american-english > ./usa5.txt
grep -E '^[a-z]{5}$' /usr/share/dict/british-english > ./brit5.txt
sort sgb-words.txt > ./knuth5.txt

The regular expressions in the first two lines simply ask for words of exactly five letters made from lower-case characters. This eliminates proper names and words with apostrophes.
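
As a quick sanity check (the sample words here are just illustrative), only the all-lower-case five-letter word survives the filter:

$ printf "Aaron\nisn't\napple\n" | grep -E '^[a-z]{5}$'
apple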

Now let's see how many words each file contains:

$ wc -l *5.txt

 5300 brit5.txt
 5757 knuth5.txt
 5150 usa5.txt
16207 total

Note the nice use of ZSH's powerful globbing features - one area in which it is more powerful than BASH.
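
For example (just a couple of illustrative globs), ZSH gives us recursive matching and glob qualifiers without any extra options:

$ print -l **/*5.txt        # recursive matching (BASH needs globstar for this)
$ print -l *5.txt(.om)      # glob qualifier: regular files only, newest first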

Now there are too many words to compare the lists by eye, so to start let's list the words in brit5.txt which are not in usa5.txt, and also the words in usa5.txt which are not in brit5.txt:

$ grep -f usa5.txt -v brit5.txt > brit-usa_d.txt
$ grep -f brit5.txt -v usa5.txt > usa-brit_d.txt

I'm using a debased version of set-difference notation for the output filenames, so that brit-usa_d.txt contains those words in brit5.txt which are not in usa5.txt. I've added a _d suffix to give a handle for globbing:

$ wc -l *_d.txt

 188 brit-usa_d.txt
  38 usa-brit_d.txt
 226 total

And now we can look at the contents of these files, using the handy Linux `column` command to print the output with fewer lines:

$ cat usa-brit_d.txt | column

arbor	 chili	fagot	feces	honor	miter	niter	rigor	savor	vapor
ardor	 color	favor	fiber	humor	molds	ocher	ruble	slier	vigor
armor	 dolor	fayer	furor	labor	moldy	odors	rumor	tumor
calks	 edema	fecal	grays	liter	molts	plows	saber	valor

Notice, as you may expect, that this file contains American spellings: "rigor" instead of British "rigour", "liter" instead of British "litre", and so on. The other difference file, however, contains only a few spelling differences, and quite a lot of words that are simply not in the American wordlist:

$ cat brit-usa_d.txt | column

abaci	 blent	croci	flyer	hollo	liras	nitre	pupas	slily	togae	wrapt
aeons	 blest	curst	fogey	homie	litre	nosey	rajas	slyer	topis	wrier
agism	 blubs	dados	fondu	hooka	loxes	ochre	ranis	snuck	torah	yacks
ameer	 bocce	deers	frier	horsy	lupin	odour	recta	spacy	torsi	yocks
amirs	 bocci	dicky	gamey	huzza	macks	oecus	relit	spelt	tsars	yogin
amnia	 boney	didos	gaols	idyls	maths	panty	ropey	spick	tyres	yucks
ampul	 bosun	ditzy	gayly	ikons	matts	papaw	sabre	spilt	tzars	yuppy
amuck	 briar	djinn	gipsy	imbed	mavin	pease	saree	stogy	ulnas	zombi
appal	 brusk	drily	gismo	indue	metre	penes	sheik	styes	vacua
aquae	 bunko	enrol	gnawn	jehad	miaow	pigmy	sherd	swops	veldt
arses	 burqa	enure	greys	jinns	micra	pilau	shlep	synch	vitas
aunty	 caddy	eying	gybed	junky	mitre	pilaw	shoed	tabus	vizir
aurae	 calfs	eyrie	gybes	kabob	momma	pinky	shoon	tempi	vizor
baddy	 calif	faery	hadji	kebob	mould	podgy	shorn	thymi	welch
bassi	 celli	fayre	hallo	kerbs	moult	podia	shtik	tikes	whirr
baulk	 chapt	fezes	hanky	kiddy	mynah	pricy	siree	tipis	whizz
beaux	 clipt	fibre	heros	kopek	narks	prise	situp	tiros	wizes
bided	 coney	fiord	hoagy	leapt	netts	pryer	skyed	toffy	wooly

Of course, some of these words are spelled with a different number of letters in American English: for example the British "djinn" is the American "jinn"; the British "saree" is the American "sari".

Now of course we want to see how the Knuth file differs, as it's the file with the largest number of words:

$ grep -f usa5.txt -v knuth5.txt > knuth-usa_d.txt
$ grep -f brit5.txt -v knuth5.txt > knuth-brit_d.txt

$ wc -l knuth*_d.txt

  895 knuth-brit_d.txt
  980 knuth-usa_d.txt
 1875 total

Remarkably enough, there are also words in each of the original files which are not in Knuth's list:

$ grep -f knuth5.txt -v usa5.txt > usa-knuth_d.txt
$ grep -f knuth5.txt -v brit5.txt > brit-knuth_d.txt

$ wc -l *knuth_d.txt

 438 brit-knuth_d.txt
 373 usa-knuth_d.txt
 811 total

So maybe our best bet would be to concatenate all the files and take all the words, leaving out any duplicates. Something like this:

$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -u > allu5.txt
$ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -d > alld5.txt
$ cat allu5.txt alld5.txt | sort > all5.txt

The first line finds all the words which are unique - that is, which appear only once in the concatenated file - and the second line finds all the words which are repeated. These two lists are disjoint, and so may be concatenated to form a master list, which turns out to contain 6196 words.
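
Equivalently, since all we want is one copy of each word, sort -u on the concatenation produces the same 6196-word file in a single step:

$ cat usa5.txt brit5.txt knuth5.txt | sort -u > all5.txt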

Surely this file is complete? Well, the English language is a great collector of words, and every year we find new words being used, many from other languages and cultures. Here are some words that are not in the all5.txt file:

Australian words: galah, dunny, smoko, durry, bogan, chook (there are almost certainly others)

Indian words: crore, lakhs, dosai, iddli, baati, chaat, kheer, kofta, kulfi, rasam, poori (the first two are numbers, the others are foods)

Scots words: canty, curch, flang, kythe, plack, routh, saugh, teugh, wadna - these are all used by Burns in his poems, which are written in English (admittedly a dialectal form of it).

New words: qango, fubar, crunk, noobs, vlogs, rando, vaper (the first two are excellent acronyms; the others are new words)

As with the Australian words, none of these lists is exhaustive; the full list of five-letter English words not in all5.txt would probably run into the hundreds, maybe even thousands.
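
Any candidate word can be checked against the master list quickly enough; for example (using a few of the words above):

$ for w in galah crore canty crunk; do grep -qx "$w" all5.txt || echo "$w is missing"; done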

A note on word structures

I was curious about the numbers of vowels and consonants in words. To start, here's a little Julia function which encodes the positions of consonants as an integer between 0 and 31. For example, take the word "drive". We can encode this as [1,1,0,1,0] where the 1's are at the positions of the consonants. This pattern can then be read as the binary digits of the number 26.

julia> function cvs(word)
           # position of each letter in "aeiou", or nothing for a consonant
           vs = indexin(collect(word),collect("aeiou"))
           # 1 for each consonant, 0 for each vowel
           vs2 = replace(x -> isnothing(x) ? 1 : 0,vs)
           # read the 0/1 pattern as a 5-bit binary number
           return(sum(vs2 .* [16,8,4,2,1]))
       end
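
A quick check on our example word:

julia> cvs("drive")
26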

Now we simply walk through the words in all5.txt, computing their values as we go and keeping a tally of how many words produce each value:

julia> wds = readlines(open("all5.txt"))
julia> cv = zeros(Int16,32)
julia> for w in wds
           c = cvs(w)
           cv[c+1] = cv[c+1]+1
       end

julia> hcat(0:31,cv)
32×2 Matrix{Int64}:
  0     0
  1     0
  2     2
  3     1
  4     4
  5    48
  6    10
  7    19
  8     2
  9    61
 10    96
 11   156
 12    24
 13   262
 14    21
 15    24
 16     1
 17    34
 18   105
 19   585
 20    97
 21  1514
 22   432
 23  1158
 24     5
 25   301
 26   249
 27   832
 28    16
 29    96
 30    15
 31    26

We see that the most common patterns are 21 = 10101, and 23 = 10111. But what about some of the smaller values?
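
A throwaway helper (cvpattern is just an ad-hoc name) makes these numbers easier to read as consonant/vowel patterns - for instance, the rare value 24 turns out to be the shape CCVVV:

julia> cvpattern(n) = join(d == 1 ? 'C' : 'V' for d in reverse(digits(n, base=2, pad=5)))
cvpattern (generic function with 1 method)

julia> cvpattern(21), cvpattern(24)
("CVCVC", "CCVVV")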

julia> for w in wds
           if cvs(w) == 24
               println(w)
           end
       end

stoae
wheee
whooo
xviii
xxiii

Yes, there are some Roman numerals hanging about, and probably they should be removed. Let's look at one more value, 30 = 11110:

julia> for w in wds
           if cvs(w) == 30
               println(w)
           end
       end

chyme
clxvi
cycle
hydra
hydro
lycra
phyla
rhyme
schmo
schwa
style
styli
thyme
thymi
xxxvi

Again a few Roman numerals. These may need to be removed by hand. One way to track them all down is to use regular expressions again:

$ grep -E '[xlcvi]{5}' all5.txt
civic
civil
clvii
clxii
clxiv
clxix
clxvi
lxvii
villi
xcvii
xviii
xxiii
xxvii
xxxii
xxxiv
xxxix
xxxvi
and we see that we have three English words (civic, civil, villi); the rest are Roman numerals. These can be deleted.
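
One way to finish the job (the filename all5-clean.txt is just a placeholder) is to collect the numerals, excluding the three genuine words, and strip them from the master list with whole-line matching:

$ grep -E '^[xlcvi]{5}$' all5.txt | grep -vE '^(civic|civil|villi)$' > numerals.txt
$ grep -vxf numerals.txt all5.txt > all5-clean.txt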