# Five letter words in English


I was going to make a little post about Wordle, but I got sidetracked exploring five-letter words. At the same time, I had a bit of fun with regular expressions and some simple scripting with ZSH.

The first step was to obtain lists of five-letter words. One is available at the site of Donald Knuth's Stanford GraphBase; the file sgb-words.txt contains "the 5757 five-letter words of English". Others are available through English-language wordlists, such as those in the Linux directory /usr/share/dict/. There are two such files, one each for British and American English.

So we can start by gathering all these different lists of words, and also sorting Knuth's file so that it is in alphabetical order.

Here's how (assuming that we're in a writable directory that will contain these files):

    grep -E '^[a-z]{5}$' /usr/share/dict/american-english > ./usa5.txt
    grep -E '^[a-z]{5}$' /usr/share/dict/british-english > ./brit5.txt
    sort sgb-words.txt > ./knuth5.txt


The regular expressions in the first two lines simply ask for words of exactly five letters made from lower-case characters. This eliminates proper names and words with apostrophes.
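To see the filter at work without touching the system dictionaries, here is a minimal sketch on a handful of made-up sample words. The sample list and the variable names are assumptions for illustration only. Setting `LC_ALL=C` is an extra precaution not in the commands above: in some locales the range `[a-z]` can match more than the 26 plain lower-case letters.

```shell
# Sample input: a mix of words that should and should not pass the filter.
words=$(printf '%s\n' apple Paris "can't" cat fibre tables crane)

# Keep only lines made of exactly five lower-case ASCII letters.
# LC_ALL=C pins the meaning of [a-z] to the plain ASCII range.
five=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[a-z]{5}$')

printf '%s\n' "$five"
# prints:
# apple
# fibre
# crane
```

"Paris" is rejected because of its capital letter, "can't" because of the apostrophe, and "cat" and "tables" because of their length.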

Now let's see how many words each file contains:

    $ wc -l *5.txt

      5300 brit5.txt
      5757 knuth5.txt
      5150 usa5.txt
     16207 total

Note the nice use of ZSH's powerful globbing features, one area in which it is more powerful than BASH. Now there are too many words to see exactly the differences between them, but to start let's list the words in brit5.txt which are not in usa5.txt, and also the words in usa5.txt which are not in brit5.txt:

    $ grep -f usa5.txt -v brit5.txt > brit-usa_d.txt
    $ grep -f brit5.txt -v usa5.txt > usa-brit_d.txt

The output file names use a debased version of set-difference notation, so that brit-usa_d.txt holds those words in brit5.txt which are not in usa5.txt. I've added a _d to make a handle for globbing:

    $ wc -l *_d.txt

     188 brit-usa_d.txt
      38 usa-brit_d.txt
     226 total
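A note on the `grep -f ... -v` idiom: each line of the pattern file is treated as a regular expression and matched anywhere in a line. That is harmless here only because every line in every file is exactly five lower-case letters. As a sketch of a more defensive variant, the example below adds `-F` (fixed strings) and `-x` (whole-line match), and cross-checks the result with `comm`, which needs sorted input. The tiny sample files are made up for illustration.

```shell
# Work in a throwaway directory so no real word lists are touched.
tmp=$(mktemp -d)
printf '%s\n' fiber liter tumor > "$tmp/usa.txt"
printf '%s\n' fibre litre tumor > "$tmp/brit.txt"

# -F: patterns are fixed strings, -x: match whole lines only, -v: invert.
brit_only=$(grep -Fxv -f "$tmp/usa.txt" "$tmp/brit.txt")

# comm -23 prints the lines found only in the first file (inputs must be sorted).
brit_only_comm=$(comm -23 "$tmp/brit.txt" "$tmp/usa.txt")

printf '%s\n' "$brit_only"
# prints:
# fibre
# litre

rm -r "$tmp"
```

Both approaches give the same set difference on this sample; `comm` avoids loading one file as a pattern list, at the cost of requiring sorted input.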


And now we can look at the contents of these files, using the handy Linux column command to print the output with fewer lines:

    $ cat usa-brit_d.txt | column

    arbor  chili  fagot  feces  honor  miter  niter  rigor  savor  vapor
    ardor  color  favor  fiber  humor  molds  ocher  ruble  slier  vigor
    armor  dolor  fayer  furor  labor  moldy  odors  rumor  tumor
    calks  edema  fecal  grays  liter  molts  plows  saber  valor

Notice, as you may expect, that this file contains American spellings: "rigor" instead of British "rigour", "liter" instead of British "litre", and so on. However, the other difference file contains only a few spelling differences, and quite a lot of words not in the American wordlist:

    $ cat brit-usa_d.txt | column

    abaci  blent  croci  flyer  hollo  liras  nitre  pupas  slily  togae  wrapt
    aeons  blest  curst  fogey  homie  litre  nosey  rajas  slyer  topis  wrier
    agism  blubs  dados  fondu  hooka  loxes  ochre  ranis  snuck  torah  yacks
    ameer  bocce  deers  frier  horsy  lupin  odour  recta  spacy  torsi  yocks
    amirs  bocci  dicky  gamey  huzza  macks  oecus  relit  spelt  tsars  yogin
    amnia  boney  didos  gaols  idyls  maths  panty  ropey  spick  tyres  yucks
    ampul  bosun  ditzy  gayly  ikons  matts  papaw  sabre  spilt  tzars  yuppy
    amuck  briar  djinn  gipsy  imbed  mavin  pease  saree  stogy  ulnas  zombi
    appal  brusk  drily  gismo  indue  metre  penes  sheik  styes  vacua
    aquae  bunko  enrol  gnawn  jehad  miaow  pigmy  sherd  swops  veldt
    arses  burqa  enure  greys  jinns  micra  pilau  shlep  synch  vitas
    aunty  caddy  eying  gybed  junky  mitre  pilaw  shoed  tabus  vizir
    aurae  calfs  eyrie  gybes  kabob  momma  pinky  shoon  tempi  vizor
    bassi  celli  fayre  hallo  kerbs  moult  podia  shtik  tikes  whirr
    baulk  chapt  fezes  hanky  kiddy  mynah  pricy  siree  tipis  whizz
    beaux  clipt  fibre  heros  kopek  narks  prise  situp  tiros  wizes
    bided  coney  fiord  hoagy  leapt  netts  pryer  skyed  toffy  wooly


Of course, some of these words are spelled with a different number of letters in American English: for example the British "djinn" is the American "jinn"; the British "saree" is the American "sari".

Now of course we want to see how the Knuth file differs, as it's the file with the largest number of words:

    $ grep -f usa5.txt -v knuth5.txt > knuth-usa_d.txt
    $ grep -f brit5.txt -v knuth5.txt > knuth-brit_d.txt

    $ wc -l knuth*_d.txt

      895 knuth-brit_d.txt
      980 knuth-usa_d.txt
     1875 total

Remarkably enough, there are also words in both the original files which are not in Knuth's list:

    $ grep -f knuth5.txt -v usa5.txt > usa-knuth_d.txt
    $ grep -f knuth5.txt -v brit5.txt > brit-knuth_d.txt

    $ wc -l *knuth_d.txt

     438 brit-knuth_d.txt
     373 usa-knuth_d.txt
     811 total
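The six difference files built so far all follow one naming pattern, so they could also be produced in a single loop. The following is only a sketch on made-up sample data in a temporary directory; it uses `grep -Fxv` (fixed-string, whole-line matching) rather than the plain `grep -f ... -v` above, and the variable names are hypothetical.

```shell
# Tiny stand-ins for the three word lists, in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' fiber tumor > "$tmp/usa5.txt"
printf '%s\n' fibre tumor > "$tmp/brit5.txt"
printf '%s\n' fiber fibre tumor zebra > "$tmp/knuth5.txt"

for a in usa brit knuth; do
  for b in usa brit knuth; do
    [ "$a" = "$b" ] && continue
    # Words in list a that are absent from list b.
    # grep exits non-zero when nothing is printed, hence the "|| true".
    grep -Fxv -f "$tmp/${b}5.txt" "$tmp/${a}5.txt" > "$tmp/${a}-${b}_d.txt" || true
  done
done

# Count the difference files and look at one of them.
n_files=$(ls "$tmp" | grep -c '_d\.txt$')
knuth_not_usa=$(cat "$tmp/knuth-usa_d.txt")
printf '%s difference files\n' "$n_files"
printf '%s\n' "$knuth_not_usa"

rm -r "$tmp"
```

On the real files the same loop would write brit-usa_d.txt, usa-brit_d.txt and the four Knuth differences in one pass.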


So maybe our best bet would be to concatenate all the files and take all the words, leaving out any duplicates. Something like this:

    $ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -u > allu5.txt
    $ cat usa5.txt brit5.txt knuth5.txt | sort | uniq -d > alld5.txt
    $ cat allu5.txt alld5.txt | sort > all5.txt

The first line finds all the words which are unique, that is, which appear only once in the concatenated file, and the second line finds all the words which are repeated (printing each of them once). These two lists are disjoint, and so may then be concatenated to form a master list, which can be found to contain 6196 words.

Surely this file is complete? Well, the English language is a great collector of words, and every year we find new words being used, many from other languages and cultures. Here are some words that are not in the all5.txt file:

- Australian words: galah, dunny, smoko, durry, bogan, chook (there are almost certainly others)
- Indian words: crore, lakhs, dosai, iddli, baati, chaat, kheer, kofta, kulfi, rasam, poori (the first two are numbers, the others are foods)
- Scots words: canty, curch, flang, kythe, plack, routh, saugh, teugh, wadna; these are all used by Burns in his poems, which are written in English (admittedly a dialectal form of it)
- New words: qango, fubar, crunk, noobs, vlogs, rando, vaper (the first two are excellent acronyms; the others are new words)

As with the Australian words, none of these lists is exhaustive; the full list of five-letter English words not in the file all5.txt would probably run into the many hundreds, maybe even thousands.

## A note on word structures

I was curious about the numbers of vowels and consonants in words. To start, here's a little Julia function which encodes the positions of consonants as an integer between 0 and 31 (any letter other than a, e, i, o, u counts as a consonant, including y). For example, take the word "drive". We can encode this as [1,1,0,1,0], where the 1's are at the positions of the consonants. Then this can be considered as binary digits representing the number 26.

    julia> function cvs(word)
               vs = indexin(collect(word),collect("aeiou"))
               vs2 = replace(x -> isnothing(x) ? 1 : 0,vs)
               return(sum(vs2 .* [16,8,4,2,1]))
           end

Now we simply walk through the words in all5.txt, determining their values as we go and keeping a running total for each value:

    julia> wds = readlines(open("all5.txt"))
    julia> cv = zeros(Int16,32)
    julia> for w in wds
               c = cvs(w)
               cv[c+1] = cv[c+1]+1
           end

    julia> hcat(0:31,cv)
    32×2 Matrix{Int64}:
      0     0
      1     0
      2     2
      3     1
      4     4
      5    48
      6    10
      7    19
      8     2
      9    61
     10    96
     11   156
     12    24
     13   262
     14    21
     15    24
     16     1
     17    34
     18   105
     19   585
     20    97
     21  1514
     22   432
     23  1158
     24     5
     25   301
     26   249
     27   832
     28    16
     29    96
     30    15
     31    26

We see that the most common patterns are 21 = 10101, and 23 = 10111. But what about some of the smaller values?

    julia> for w in wds
               if cvs(w) == 24
                   println(w)
               end
           end

    stoae
    wheee
    whooo
    xviii
    xxiii

Yes, there are some Roman numerals hanging about, and probably they should be removed. And one more, 30 = 11110:

    julia> for w in wds
               if cvs(w) == 30
                   println(w)
               end
           end

    chyme
    clxvi
    cycle
    hydra
    hydro
    lycra
    phyla
    rhyme
    schmo
    schwa
    style
    styli
    thyme
    thymi
    xxxvi

Again a few Roman numerals. These may need to be removed by hand. One way to find them is by using regular expressions again:

    $ grep -E '[xlcvi]{5}' all5.txt
    civic
    civil
    clvii
    clxii
    clxiv
    clxix
    clxvi
    lxvii
    villi
    xcvii
    xviii
    xxiii
    xxvii
    xxxii
    xxxiv
    xxxix
    xxxvi


We see that three of these are English words (civic, civil and villi) and the rest are Roman numerals, which can be deleted.
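The deletion itself can also be scripted. The sketch below is one possible way, not necessarily how the file was actually cleaned: it runs on a small made-up sample rather than the real all5.txt, and the keep-list file name `keep.txt` is hypothetical. It collects every line made only of the letters x, l, c, v and i, sets aside the three genuine English words, and removes what is left from the list.

```shell
# Small stand-in for all5.txt, in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' civic civil clxvi crane villi xviii > "$tmp/all5.txt"

# Genuine English words that happen to use only the letters x, l, c, v, i.
printf '%s\n' civic civil villi > "$tmp/keep.txt"

# Roman-numeral candidates: lines matching the pattern, minus the keep list.
grep -E '^[xlcvi]{5}$' "$tmp/all5.txt" | grep -Fxv -f "$tmp/keep.txt" > "$tmp/roman.txt"

# Remove the candidates from the word list.
cleaned=$(grep -Fxv -f "$tmp/roman.txt" "$tmp/all5.txt")
printf '%s\n' "$cleaned"
# prints:
# civic
# civil
# crane
# villi

rm -r "$tmp"
```

Keeping the exceptions in a separate file makes the manual judgement explicit: the regular expression alone cannot tell "civil" from "clxvi".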