Word ladders with Julia
Lewis Carroll's game of Doublets
Such a long time since my last post! Well, that's the working life for you.
Anyway, recently I was reading about Lewis Carroll - always one of my favourite people - and was reminded of his word game "Doublets" in which one word is turned into another by changing one letter at a time, each new word being English.
You can read Carroll's original description here. Note his last sentence:
"It is, perhaps, needless to state that it is de rigueur that the links should be English words, such as might be used in good society."
Carroll, it seemed, frowned on slang words, or "low" words - very much in keeping with his personality and with his social and professional positions. One of his examples was
"Change WHEAT into BREAD"
which has an answer: WHEAT CHEAT CHEAP CHEEP CREEP CREED BREED BREAD.
Clearly we would want the chain of words to be as short as possible, with a lower bound being the Hamming distance between the words: the number of places in which their letters differ. This distance is 3 for WHEAT and BREAD, so the minimum number of changes is 3. But in practice chains are longer, like the one above: English simply doesn't contain all possible words of 5 letters, and so we can't have, for example:
WHEAT WHEAD WREAD BREAD
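To make that lower bound concrete, here is a minimal Julia sketch of the Hamming distance between two words of equal length (the hamming function is my own, purely for illustration):
julia> hamming(a, b) = count(((x, y),) -> x != y, zip(a, b))   # count positions where the letters differ
julia> hamming("wheat", "bread")

  3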
This form of word puzzle, so simple and addictive, has been resurrected many times, often under such names as "word ladders", or "laddergrams".
Obtaining word lists
Every computer system will have a spell-check list on it; on a Linux system these are usually found under /usr/share/dict. My system, running Arch Linux, has these lists:
$ wc -l /usr/share/dict/*
  123115 /usr/share/dict/american-english
  127466 /usr/share/dict/british-english
  189057 /usr/share/dict/catalan
   54763 /usr/share/dict/cracklib-small
   88328 /usr/share/dict/finnish
  221377 /usr/share/dict/french
  304736 /usr/share/dict/german
   92034 /usr/share/dict/italian
   76258 /usr/share/dict/ogerman
   56329 /usr/share/dict/spanish
  123115 /usr/share/dict/usa
These all come from the package words, which is a collection of spell-check dictionaries.
Although the sizes of the English word lists may seem impressive, there are bigger lists available. One of the best is at SCOWL (Spell Checker Oriented Word Lists). You can download the compressed SCOWL file, and when uncompressed you'll find it contains a directory called final. In this directory, the largest files are those of the sort english-words.*, and here's how big they are:
$ wc -l english-words.*
    4373 english-words.10
    7951 english-words.20
   36103 english-words.35
    6391 english-words.40
   23796 english-words.50
    6233 english-words.55
   13438 english-words.60
   33270 english-words.70
  139209 english-words.80
  219489 english-words.95
  490253 total
These lists contain increasingly abstruse and unusual words. Thus english-words.10 contains words that most adept speakers of English would know:
$ shuf -n 10 english-words.10
errors
hints
green
connections
still
mouth
category's
pi
won's
varied
At the other end, english-words.95 consists of unusual, obscure words unlikely to be in a common vocabulary; many of them are seldom used, have very specific meanings, or are technical:
$ shuf -n 10 english-words.95
deutschemark's
disingenious
retanner
advancer's
shlimazl
unpontifical
nonrequirement
peccancy's
photozinco
nonuniting
This is the list which contains some splendid biological terms: "bdelloid", a class of microscopic water animals called rotifers; "ctenizidae", a small family of spiders (but the list does not contain "ctenidae", the much larger family of wandering spiders, which includes the infamous Brazilian wandering spider); and "cnidocyst", which forms the stinging mechanism in the cells of the tentacles of jellyfish.
Putting the lists 10, 20, and 35 together makes a "small" list; adding in 40 and 50 gives a "medium" list; next include 55, 60, and 70 for "large"; add 80 for "huge", and 95 for "insane". The author of SCOWL claims that 60 is the right level for spell checking, while: "The 95 contains just about every English word in existence and then some. Many of the words at the 95 level will probably not be considered valid English words by most people." (This is not true. I've discovered some words not in any list. One such is "buckleys", as in "He hasn't got buckleys chance", sometimes spelled with a capital B, and meaning "He hasn't got a chance"; that is, no chance at all. "Buckley" is only in the list of proper names, but given this usage it should be at least in the Australian words list. Which it isn't.)
You will also see from above that some words contain apostrophes: these words will need to be weeded out. Here's how to make a list and clean it up:
$ cat english-words.10 english-words.20 english-words.35 > english-words-small.txt
$ grep -v "'" english-words-small.txt > small.wds
This approach will yield five lists with the following numbers of words:
$ wc -l *.wds
  234563 huge.wds
  414365 insane.wds
  107729 large.wds
   59577 medium.wds
   38013 small.wds
  854247 total
These can be read into Julia. The lists are unsorted, but that won't be an issue for our use of them; you can certainly include a sort along the way.
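For example, reading the smallest list back in and sorting it along the way is a one-liner (a minimal sketch, assuming the small.wds file created above):
julia> words = sort(readlines("small.wds"))   # read the word list and sort it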
Another list is available at https://github.com/dwyl/english-words which claims to have "over 466k English words". However, this list is not as carefully curated as is SCOWL.
Finally, note that unlike the original SCOWL lists, our lists are not disjoint: each list includes its predecessor, so that insane.wds contains all of the words in all of the lists.
Using graph theory
The computation of word ladders can easily be managed using the tools of graph theory, with vertices being the words, and two vertices being adjacent if their Hamming distance is 1. Then finding a word ladder is simply a matter of finding a shortest path between two vertices.
There is a problem though, as Donald Knuth discovered when he first computerized this puzzle, an account of which is given in his 1994 book The Stanford GraphBase. His list, you'll notice, contains "the 5757 five-letter words of English". However, deciding what is and what isn't an English word can be tricky: American versus English spellings, dialect words, newly created words and so on. I touched on this in an earlier post.
Knuth also found that there were 671 words which were connected to no others; he called these "aloof words", with ALOOF, of course, being one of them.
Using Julia
Although Python has its mature and powerful NetworkX package for graph theory and network analysis, Python is too slow for this application: we are looking at very large graphs of many thousands of vertices, and computing the edges is a non-trivial task. So our choice of language is Julia. Julia's graph theory packages are in a bit of a state of flux: an old package Graphs.jl is unmaintained, as is LightGraphs.jl. However, this latter package is receiving a new lease of life under the unfortunately confusing name of Graphs.jl, and is designed to be "functionally equivalent" to LightGraphs.jl. This is the package I'll be using.
Setting up the graph
We'll use the medium.wds dictionary since it's relatively small, and we'll look at five-letter words. Using six-letter words or a larger list will then be a simple matter of changing a few parameters.
We start by opening the list in Julia:
julia> f = open("medium.wds", "r")
julia> words = readlines(f)
julia> length(words)

  59577
Now we can easily extract the five-letter words and set up the graph, first loading the Graphs package. We also need the StringDistances package to find the Hamming distance.
julia> using Graphs, StringDistances
julia> words5 = filter(x -> length(x) == 5, words)
julia> w5 = length(words5)

  4388

julia> G5 = Graph()
julia> add_vertices!(G5, w5);
Now the edges:
julia> for i in 1:w5
           for j in i+1:w5
               if Hamming()(words5[i], words5[j]) == 1
                   add_edge!(G5, i, j)
               end
           end
       end
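For much larger word lists the all-pairs loop above becomes slow, since it compares every pair of words. A common alternative, sketched below, is to bucket the words by "wildcard" patterns such as "wh_at": two distinct words are at Hamming distance 1 exactly when they share one of these patterns. This is my own sketch, not part of the original construction, and it assumes plain ASCII words of five letters.
julia> buckets = Dict{String, Vector{Int}}()
julia> for (i, w) in enumerate(words5)
           for p in 1:5
               pat = string(w[1:p-1], '_', w[p+1:end])   # e.g. "wheat" -> "wh_at" when p == 3
               push!(get!(buckets, pat, Int[]), i)
           end
       end
julia> for idxs in values(buckets)
           for a in 1:length(idxs), b in a+1:length(idxs)
               add_edge!(G5, idxs[a], idxs[b])           # duplicate edges are simply ignored
           end
       end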
Note that there is a Julia package MetaGraphs.jl which allows you to attach labels to vertices and edges. However, it's just as easy to use the vertex numbers as indices into the list of words.
We can't use the graph G5 directly, as it is not a connected graph (remember Donald Knuth's "aloof" words?). We'll do two things: find the aloof words, and choose from G5 the largest connected component. First, the aloof words:
julia> aloofs = map(x -> words5[x], findall(iszero, degree(G5)))
julia> show(aloofs)
I won't include this list, as it's too long - it contains 616 words. But if you display it, you'll see some surprises: who'd have thought that such innocuous words as "opera", "salad" or "wagon" were aloof? But they most certainly are, at least within this set of words.
And now for the connected component:
julia> CC = connected_components(G5)
julia> CL = map(length, CC)  # size of each component
julia> mx, idx = findmax(CL)
julia> C5, vmap = induced_subgraph(G5, CC[idx])
You will find that the largest connected component C5 has 3315 vertices, and that the value of idx above is 3. Here vmap is a list of length 3315 giving the indices into the original vertex set: if the original vertices are numbered 1, 2, up to 4388, then vmap is a sublist of those numbers of length 3315, and we can use those numbers as indices into the words5 list.
Now we can write a little function to produce word ladders, using the Graphs.jl function a_star to find a shortest path:
julia> subwords = words5[vmap]
julia> function ladder(w1, w2)
           i = findfirst(x -> words5[x] == w1, vmap)
           j = findfirst(x -> words5[x] == w2, vmap)
           P = a_star(C5, i, j)
           verts = append!([i], map(dst, P))
           subwords[verts]
       end
And of course try it out:
1Julia> show(ladder("wheat","bread"))
2
3 ["wheat", "cheat", "cleat", "bleat", "bleak", "break", "bread"]
Note that our ladder function only works when both words lie in the chosen largest connected component. For example:
1Julia> show(ladder("roses","scent"))
2
3 ["roses", "ruses", "rusts", "rests", "tests", "teats", "seats", "scats",
4 "scans", "scant", "scent"]
but
1Julia> show(ladder("chair","table"))
will produce an error, as neither of those words has an index in the largest connected component. In fact "chair" sits in a small connected component of 3 words, and "table" is in another connected component of 5 words.
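To avoid the error you could use a guarded variant, which checks membership first and returns nothing when either word lies outside the component. This is just a sketch of my own (the name safe_ladder is not from the post), reusing subwords, C5, a_star and dst from above:
julia> function safe_ladder(w1, w2)
           i = findfirst(isequal(w1), subwords)
           j = findfirst(isequal(w2), subwords)
           (i === nothing || j === nothing) && return nothing   # word not in the component
           P = a_star(C5, i, j)
           subwords[append!([i], map(dst, P))]
       end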
The longest possible ladder
We have seen that the length of a word ladder will often exceed the Hamming distance between the start and ending words. But what is the maximum length of such a ladder?
Here the tool of choice is the eccentricity: for each vertex, find the shortest path to every other vertex; the length of the longest such path is the eccentricity of that vertex.
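As a quick sketch of the definition in Graphs.jl terms, the eccentricity of a single vertex can be computed from its BFS distances (the helper name ecc_of is my own):
julia> ecc_of(v) = maximum(gdistances(C5, v))   # longest shortest path from vertex v
Graphs.jl computes this for every vertex at once with eccentricity: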
julia> eccs = eccentricity(C5)
This command will take a noticeable amount of time - only a few seconds, it's true, but it is far from instantaneous. The intense computation needed here is one of the reasons that I prefer Julia to Python for this experiment.
Now we can use the eccentricities to find the longest path. Since this is an undirected graph, there must be at least two vertices with the largest eccentricity: the two endpoints of any longest shortest path.
julia> mx = maximum(eccs)
julia> inds = findall(x -> x == mx, eccs)
julia> ld = ladder(subwords[inds[1]], subwords[inds[2]])
julia> print(join(ld,", "),"\nwhich has length ",length(ld))

aloud, cloud, clout, flout, float, bloat, bleat, cleat, cleft, clefs, clews,
slews, sleds, seeds, feeds, feels, fuels, furls, curls, curly, curvy, curve,
carve, calve, valve, value, vague, vogue, rogue
which has length 29
Who'd have guessed?
Results from other lists
With the 12478 five-letter words from the "large" list, there are 748 aloof words, and the longest ladder is
which has length 31. We might in general expect a shorter "longest ladder" than when using a smaller list: the "large" list has far more words, hence greater connectivity, which would lead in many cases to shorter paths.
The "huge" and "insane" lists have longest ladders of length 28 and respectively:
and
For six-letter words, here are longest ladders in each of the lists:
Small, length 32:
Medium, length 44:
Large, length 43:
Huge, length 63:
Insane, length 49:
The length of time taken for the computations increases with the size of the lists and, to a certain extent, with the length of the words. On my somewhat mature laptop (Lenovo X1 Carbon 3rd Generation), computing the eccentricities for six-letter words and the "insane" list took over 6 minutes.
A previous experiment with Mathematica
You can see this (with some nice graph diagrams) at https://blog.wolfram.com/2012/01/11/the-longest-word-ladder-puzzle-ever/ from about 10 years ago. This experiment used Mathematica's English language list, about which I can find nothing other than that it exists. However, in the comments, somebody shows that Mathematica 8 had 92518 words in its English dictionary. And this number is dwarfed by Finnish, with 728498 words. But that might just as much be an artifact of Finnish lexicography. Deciding whether or not a word is English is very hard indeed, and any lexicographer will decide whether two words need separate definitions, or whether one can be defined within the other, so to speak - that is, under the same headword. MOUSE, MOUSY, MOUSELIKE - do these require three definitions, or two, or just one?