Word ladders with Julia


Lewis Carroll's game of Doublets

Such a long time since my last post! Well, that's the working life for you.

Anyway, recently I was reading about Lewis Carroll - always one of my favourite people - and was reminded of his word game "Doublets", in which one word is turned into another by changing one letter at a time, each new word being English.

You can read Carroll's original description here. Note his last sentence:

"It is, perhaps, needless to state that it is de rigueur that the links should be English words, such as might be used in good society."

Carroll, it seemed, frowned on slang words, or "low" words - very much in keeping with his personality and with his social and professional positions. One of his examples was

"Change WHEAT into BREAD"

which has an answer: WHEAT CHEAT CHEAP CHEEP CREEP CREED BREED BREAD.

Clearly we would want the chain of words to be as short as possible, a lower bound being the Hamming distance between the two words: the number of places in which their letters differ. This distance is 3 for WHEAT and BREAD, so at least 3 changes are needed. But in practice chains are longer, like the one above: English simply doesn't contain all possible words of 5 letters, and so we can't have, for example:

WHEAT WHEAD WREAD BREAD
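Just to make the Hamming distance concrete, here's a throwaway one-liner in Julia (a sketch for ASCII words of equal length; later we'll use the StringDistances package instead):

# Count the positions at which two equal-length (ASCII) words differ.
hamming(a, b) = count(i -> a[i] != b[i], eachindex(a))

hamming("wheat", "bread")   # 3: positions 1, 2 and 5 differ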

This form of word puzzle, so simple and addictive, has been resurrected many times, often under such names as "word ladders", or "laddergrams".

Obtaining word lists

Every computer system will have a spell-check list on it; on a Linux system these are usually found under /usr/share/dict. My system, running Arch Linux, has these lists:

$ wc -l /usr/share/dict/*
  123115 /usr/share/dict/american-english
  127466 /usr/share/dict/british-english
  189057 /usr/share/dict/catalan
   54763 /usr/share/dict/cracklib-small
   88328 /usr/share/dict/finnish
  221377 /usr/share/dict/french
  304736 /usr/share/dict/german
   92034 /usr/share/dict/italian
   76258 /usr/share/dict/ogerman
   56329 /usr/share/dict/spanish
  123115 /usr/share/dict/usa

These all come from the package words, which is a collection of spell-check dictionaries.

Although the sizes of the English word lists may seem impressive, there are bigger lists available. One of the best is at SCOWL (Spell Checker Oriented Word Lists). You can download the compressed SCOWL file; when uncompressed, it contains a directory called final. In this directory, the largest files are those of the form english-words.*, and here's how big they are:

$ wc -l english-words.*
   4373 english-words.10
   7951 english-words.20
  36103 english-words.35
   6391 english-words.40
  23796 english-words.50
   6233 english-words.55
  13438 english-words.60
  33270 english-words.70
 139209 english-words.80
 219489 english-words.95
 490253 total

These lists contain increasingly abstruse and unusual words. Thus english-words.10 contains words that most adept speakers of English would know:

$ shuf -n 10 english-words.10
errors
hints
green
connections
still
mouth
category's
pi
won's
varied

At the other end, english-words.95 consists of unusual, obscure words unlikely to be in a common vocabulary; many of them are seldom used, have very specific meanings, or are technical:

$ shuf -n 10 english-words.95
deutschemark's
disingenious
retanner
advancer's
shlimazl
unpontifical
nonrequirement
peccancy's
photozinco
nonuniting

This is the list which contains some splendid biological terms: "bdelloid", which refers to a class of microscopic water animals called rotifers; "ctenizidae", a small family of spiders (but the list does not contain "ctenidae", the much larger family of wandering spiders, which includes the infamous Brazilian wandering spider); and "cnidocyst", which forms the stinging mechanism in the cells of the tentacles of jellyfish.

Putting the lists 10, 20, 35 together makes a "small" list; adding in 40 and 50 gives a "medium" list; next include 55, 60, 70 for "large"; 80 for "huge"; and 95 for "insane". The author of SCOWL claims that 60 is the right level for spell checking, and says of the 95 list: "The 95 contains just about every English word in existence and then some. Many of the words at the 95 level will probably not be considered valid English words by most people." (This is not true. I've discovered some words not in any list. One such is "buckleys", as in "He hasn't got buckleys chance", sometimes spelled with a capital B, and meaning "He hasn't got a chance"; that is, no chance at all. "Buckley" is only in the list of proper names, but given this usage it should be at least in the Australian words list. Which it isn't.)

You will also see from above that some words contain apostrophes: these words will need to be weeded out. Here's how to make a list and clean it up:

$ cat english-words.10 english-words.20 english-words.35 > english-words-small.txt
$ grep -v "'" english-words-small.txt > small.wds
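If you prefer to stay in Julia, here's a sketch of the same construction, assuming the cumulative level groupings described above (each list extends the previous one, and words containing apostrophes are dropped):

# Build the five cumulative lists from the SCOWL level files.
groups = ["small"  => [10, 20, 35],
          "medium" => [40, 50],
          "large"  => [55, 60, 70],
          "huge"   => [80],
          "insane" => [95]]
words = String[]
for (name, levels) in groups
    for lvl in levels
        # keep only words without apostrophes
        append!(words, filter(w -> !('\'' in w), readlines("english-words.$lvl")))
    end
    write("$name.wds", join(words, '\n') * "\n")
end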

This approach will yield five lists with the following numbers of words:

$ wc -l *.wds
 234563 huge.wds
 414365 insane.wds
 107729 large.wds
  59577 medium.wds
  38013 small.wds
 854247 total

These can be read into Julia. The lists are unsorted, but that won't be an issue for our use of them; you can certainly include a sort along the way.
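For example, to read the medium list and sort it in one step:

# Read the word list; the sort is optional (the rest of this post
# works with the unsorted list).
words = sort(readlines("medium.wds"))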

Another list is available at https://github.com/dwyl/english-words which claims to have "over 466k English words". However, this list is not as carefully curated as is SCOWL.

Finally, note that unlike the original SCOWL lists, our lists are not disjoint: each list includes its predecessor, so that insane.wds contains all of the words in all of the lists.

Using graph theory

The computation of word ladders can easily be managed using the tools of graph theory, with vertices being the words, and two vertices being adjacent if their Hamming distance is 1. Finding a word ladder then reduces to finding a shortest path between two vertices.

There is a problem though, as Donald Knuth discovered when he made the first computerized study of this puzzle, described in his 1994 book The Stanford GraphBase. This page, you'll notice, contains "the 5757 five-letter words of English". However, deciding what is and what isn't an English word can be tricky: American versus English spellings, dialect words, newly created words and so on. I touched on this in an earlier post.

Knuth also found that there were 671 words which were connected to no others; he called these "aloof words", with ALOOF, of course, being one of them.

Using Julia

Although Python has its mature and powerful NetworkX package for graph theory and network analysis, Python is too slow for this application: we are looking at very large graphs of many thousands of vertices, and computing the edges is a non-trivial task. So our choice of language is Julia. Julia's graph theory packages are in a bit of a state of flux: an old package Graphs.jl is unmaintained, as is LightGraphs.jl. However, this latter package has received a new lease of life under the unfortunately confusing name Graphs.jl, and is designed to be "functionally equivalent" to LightGraphs.jl.

This is the package I'll be using.

Setting up the graph

We'll use the medium.wds dictionary since it's relatively small, and we'll look at five-letter words. Using six-letter words or a larger list will then be a simple matter of changing a few parameters.

We start by opening the list in Julia:

julia> f = open("medium.wds", "r")
julia> words = readlines(f)
julia> length(words)

       59577

Now we can easily extract the five letter words, and set up the graph, first of all loading the Graphs package. We also need the StringDistances package to find the Hamming distance.

julia> using Graphs, StringDistances
julia> words5 = filter(x -> length(x) == 5, words)
julia> w5 = length(words5)

       4388

julia> G5 = Graph()
julia> add_vertices!(G5, w5);

Now the edges:

julia> for i in 1:w5
           for j in i+1:w5
               if Hamming()(words5[i], words5[j]) == 1
                   add_edge!(G5, i, j)
               end
           end
       end
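As an aside: this double loop performs on the order of w5²/2 string comparisons, which is fine at this size but slows down noticeably on the bigger lists. A standard trick, sketched below (not how we'll proceed here), is to bucket words by the patterns obtained by wildcarding one letter position; two distinct words are at Hamming distance 1 exactly when they share such a pattern.

# A sketch of a faster edge construction: "cheat" yields the patterns
# "_heat", "c_eat", "ch_at", "che_t", "chea_"; any two distinct words
# sharing a pattern differ in exactly that one position.
buckets = Dict{String, Vector{Int}}()
for (i, w) in enumerate(words5)
    for p in 1:5
        pat = string(w[1:p-1], '_', w[p+1:end])
        push!(get!(buckets, pat, Int[]), i)
    end
end
for ids in values(buckets)
    for a in 1:length(ids), b in a+1:length(ids)
        add_edge!(G5, ids[a], ids[b])
    end
end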

Note that there is a Julia package MetaGraphs.jl which allows you to attach labels to vertices and edges. However, it's just as easy to use the vertex numbers as indices into the list of words.

We can't use the graph G5 directly, as it is not a connected graph (remember Donald Knuth's "aloof" words?). We'll do two things: find the aloof words, and choose from G5 the largest connected component. First the aloof words:

julia> aloofs = map(x -> words5[x], findall(iszero, degree(G5)))
julia> show(aloofs)

I won't include this list, as it's too long - it contains 616 words. But if you display it, you'll see some surprises: who'd have thought that such innocuous words as "opera", "salad" or "wagon" were aloof? But they most certainly are, at least within this set of words.

And now for the connected component:

julia> CC = connected_components(G5)
julia> CL = map(length, CC)   # size of each component
julia> mx, idx = findmax(CL)
julia> C5, vmap = induced_subgraph(G5, CC[idx])

You will find that the largest connected component C5 has 3315 vertices, and that the value of idx above is 3. Here vmap is a vector of length 3315 mapping each vertex of C5 back to its index in the original graph: if the original vertices were numbered 1, 2, up to 4388, then vmap is a sublist of 3315 of those numbers, and each entry can be used as an index into the words5 list.
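As a quick sanity check of this mapping (using the names above), vertex k of C5 carries the word words5[vmap[k]]:

# The word attached to, say, vertex 1 of the component:
words5[vmap[1]]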

Now we can write a little function to produce word ladders; here using the Julia function a_star to find a shortest path:

julia> subwords = words5[vmap]
julia> function ladder(w1, w2)
           i = findfirst(x -> words5[x] == w1, vmap)
           j = findfirst(x -> words5[x] == w2, vmap)
           P = a_star(C5, i, j)
           verts = append!([i], map(dst, P))
           subwords[verts]
       end

And of course try it out:

julia> show(ladder("wheat", "bread"))

       ["wheat", "cheat", "cleat", "bleat", "bleak", "break", "bread"]

Note that a word ladder can only exist when both words have indices in the chosen largest connected component. For example:

julia> show(ladder("roses", "scent"))

       ["roses", "ruses", "rusts", "rests", "tests", "teats", "seats", "scats",
        "scans", "scant", "scent"]

but

julia> show(ladder("chair", "table"))

will produce an error, as neither of those words has an index in the largest connected component. In fact "chair" sits in a small connected component of 3 words, and "table" is in another connected component of 5 words.
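Incidentally, if you'd rather get a clean failure than an error, here's a sketch of a more defensive variant (safe_ladder is my own hypothetical name, reusing the variables above) which returns nothing when either word is missing from the component:

# Return nothing instead of erroring when either word is missing
# from the largest connected component.
function safe_ladder(w1, w2)
    i = findfirst(x -> words5[x] == w1, vmap)
    j = findfirst(x -> words5[x] == w2, vmap)
    (i === nothing || j === nothing) && return nothing
    P = a_star(C5, i, j)
    subwords[append!([i], map(dst, P))]
end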

The longest possible ladder

We have seen that the length of a word ladder will often exceed the Hamming distance between the start and ending words. But what is the maximum length of such a ladder?

Here the function we need computes the eccentricities of the graph: for each vertex, find a shortest path to every other vertex; the length of the longest of these shortest paths is the eccentricity of that vertex.
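As a toy illustration, quite separate from our word graph: in a path on four vertices, the two end vertices have eccentricity 3 and the middle two have eccentricity 2:

# Eccentricity of each vertex of the path graph 1 - 2 - 3 - 4.
using Graphs
eccentricity(path_graph(4))   # [3, 2, 2, 3]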

julia> eccs = eccentricity(C5)

This command will take a noticeable amount of time - only a few seconds, it's true, but it is far from instantaneous. The intense computation needed here is one of the reasons that I prefer Julia to Python for this experiment.

Now we can use the eccentricities to find a longest ladder. Since this is an undirected graph, there must be at least two vertices with equal largest eccentricities: both endpoints of a path realizing the diameter have eccentricity equal to the diameter. (Strictly speaking, two vertices of maximum eccentricity need not be at maximum distance from each other, so it's worth checking the length of the ladder we obtain; see the check below.)

julia> mx = maximum(eccs)
julia> inds = findall(x -> x == mx, eccs)
julia> ld = ladder(subwords[inds[1]], subwords[inds[2]])
julia> print(join(ld, ",  "), "\nwhich has length ", length(ld))

aloud,  cloud,  clout,  flout,  float,  bloat,  bleat,  cleat,  cleft,  clefs,  clews,
slews,  sleds,  seeds,  feeds,  feels,  fuels,  furls,  curls,  curly,  curvy,  curve,
carve,  calve,  valve,  value,  vague,  vogue,  rogue
which has length 29

Who'd have guessed?
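Given the caveat above, it's worth confirming that this ladder really does realize the diameter: a ladder of mx edges contains mx + 1 words. A quick check, using the names above:

# A diametral ladder should have exactly mx + 1 words.
length(ld) == mx + 1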

Results from other lists

With the 12478 five-letter words from the "large" list, there are 748 aloof words, and the longest ladder is

rayon, racon, bacon, baron, boron, boson, bosom, besom, besot, beset, beret, buret, curet, cures, curds, surds, suras, auras, arras, arias, arils, anils, anile, anole, anode, abode, abide, amide, amine, amino, amigo

which has length 31. We might in general expect a shorter "longest ladder" from a larger list: the "large" list has far more words, hence greater connectivity, which would lead in many cases to shorter paths.

The "huge" and "insane" lists have longest ladders of length 28 and respectively:

essay, assay, asway, alway, allay, alley, agley, aglet, ablet, abled, ailed, aired, sired, sered, seres, serrs, sears, scars, scary, snary, unary, unarm, inarm, inerm, inert, inept, inapt, unapt

and

entry, entsy, antsy, artsy, artly, aptly, apply, apple, ample, amole, amoke, smoke, smote, smite, suite, quite, quote, quott, quoit, qubit, oubit, orbit, orbic, urbic, ureic, ureid, ursid

For six-letter words, here are longest ladders in each of the lists:

Small, length 32:

steady, steamy, steams, steals, steels, steers, sheers, cheers, cheeks, checks, chicks, clicks, slicks, slices, spices, spites, smites, smiles, smiled, sailed, tailed, tabled, tabbed, dabbed, dubbed, rubbed, rubber, rubier, rubies, rabies, babies, babied

Medium, length 44:

trusty, trusts, crusts, crests, chests, cheats, cleats, bleats, bloats, floats, flouts, flours, floors, floods, bloods, broods, brooks, crooks, crocks, cracks, cranks, cranes, crates, crated, coated, boated, bolted, belted, belied, belies, bevies, levies, levees, levers, lovers, hovers, hovels, hotels, motels, models, modals, morals, morays, forays

Large, length 43:

uneasy, unease, urease, crease, creese, cheese, cheesy, cheeky, cheeks, creeks, breeks, breeds, breads, broads, broods, bloods, bloops, sloops, stoops, strops, strips, stripe, strive, shrive, shrine, serine, ferine, feline, reline, repine, rapine, raping, raring, haring, hiring, siring, spring, sprint, splint, spline, saline, salina, saliva

Huge, length 63:

aneath, sneath, smeath, smeeth, smeech, sleech, fleech, flench, clench, clunch, clutch, crutch, crotch, crouch, grouch, grough, trough, though, shough, slough, slouch, smouch, smooch, smooth, smoots, smolts, smalts, spalts, spaits, spains, spaing, spring, sprint, splint, spline, upline, uplink, unlink, unkink, unkind, unkend, unkent, unsent, unseat, unseal, unseel, unseen, unsewn, unsews, unmews, enmews, emmews, emmers, embers, embars, embark, imbark, impark, impart, import, impost, impose, impone

Insane, length 49:

ambach, ambash, ambush, embush, embusk, embulk, embull, emball, empall, emparl, empark, embark, embars, embers, emmers, emmews, enmews, endews, enders, eiders, ciders, coders, cooers, cooees, cooeed, coomed, cromed, cromes, crimes, crises, crisis, crisic, critic, iritic, iridic, imidic, amidic, amidin, amidon, amydon, amylon, amylin, amyrin, amarin, asarin, asaron, usaron, uzaron, uzarin

The length of time taken for the computations increases with the size of the lists and, to a certain extent, with the length of the words. On my somewhat mature laptop (a Lenovo X1 Carbon, 3rd generation), computing the eccentricities for six-letter words and the "insane" list took over 6 minutes.
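If you want to time such computations yourself, Julia's @time macro is the simplest tool. For example (C6 here is a hypothetical name for the six-letter analogue of C5, built in exactly the same way):

@time eccs6 = eccentricity(C6)   # prints elapsed time and allocation figures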

A previous experiment with Mathematica

You can see this (with some nice graph diagrams) at https://blog.wolfram.com/2012/01/11/the-longest-word-ladder-puzzle-ever/ from about 10 years ago. This experiment used Mathematica's English language list, about which I can find nothing other than it exists. However, in the comments, somebody shows that Mathematica 8 had 92518 words in its English dictionary. And this number is dwarfed by Finnish, with 728498 words. But that might just as much be an artifact of Finnish lexicography. Deciding whether or nor a word is English is very hard indeed, and any lexicographer will decide on whether two words need separate definitions, or whether one can be defined within the other, so to speak - that is, under the same headword. MOUSE, MOUSY, MOUSELIKE - do these require three definitions, or two, or just one?