Automatically finding Codenames clues with GloVe vectors

by James Somers

Abstract: A simple vector-space model shows a surprising talent for cluing in the Codenames board game.

What is Codenames?

Codenames is a Czech board game by Vlaada Chvátil where the goal is to say a one-word clue to your teammates in order to get them to choose correctly from the words laid out on the table. The real game is played on a 5x5 board, but here is a typical situation faced by a clue-giver:

[Image: a 3x3 Codenames board of nine word cards.]

The three blue words are the target words—that's what you want your teammates to guess. The black word is the bomb; if your teammates say that one, they instantly lose the game. The tan words are neutral or perhaps belong to your opponent.

Your task is to come up with a single word that connects HAM, BEIJING, and IRON, while avoiding the others. (There are rules about which kinds of clues are allowable: usually it has to be a single word; proper nouns are optionally allowed.)

The game is interesting because it requires you to connect far-flung concepts precisely enough that other people can re-create your associations. It can be delightful, and frustrating, to see your friends' minds leap from idea to idea—often going places you never intended.

Can you think of a clue for the board above?

What are GloVe vectors?

"Word vectors" attempt to quantify meaning by plotting words in a high-dimensional space; words that are semantically related end up close to each other in the space.

One way to generate word vectors uses a neural network: you download a vast corpus of text, say all of Wikipedia. Then, you read the text through a small moving window, considering maybe ten words at a time—nine "context" words and one target word. Your goal is to predict the target from the context: you rejigger the weights of the network such that, based on the nine context words, it assigns a high probability to the tenth. At the heart of this neural network is a big matrix which has a column vector for each word; in the training process, you're essentially nudging these vectors around. After training across the entire corpus, the vectors come to embody the semantics latent in the patterns of word usage.
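To make that procedure concrete, here's a minimal sketch using the gensim library, whose Word2Vec class learns vectors from exactly this kind of moving-window prediction task. (This is word2vec-style training, not the method used to build GloVe, and the tiny corpus is just a stand-in for "all of Wikipedia.")

# Sketch of window-based vector training (word2vec CBOW via gensim),
# not the actual GloVe procedure. Assumes gensim >= 4.0 is installed.
from gensim.models import Word2Vec

corpus = [  # stand-in corpus; in practice, billions of words of text
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "wok", "is", "used", "in", "chinese", "cooking"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # number of context words on each side of the target
    sg=0,             # 0 = CBOW: predict the target word from its context
    min_count=1,
)

vector = model.wv["cat"]                     # a 300-dimensional numpy array
print(model.wv.most_similar("cat", topn=3))  # nearest neighbors in the space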

It's a computationally intense procedure. Luckily, Stanford has published a data set of pre-trained vectors, the Global Vectors for Word Representation, or GloVe for short. (It uses a fancier method than the one described above.) The data is just a list of words, each followed by 300 numbers; each number is a coordinate of that word's vector in a 300-dimensional space. The GloVe vectors we'll be using were trained on 42 billion words' worth of text drawn from the Common Crawl.

Codenames as an AI problem

Codenames seems like a good Turing test: to come up with a clue, you need to not only understand the many shades of meaning each word can take on—"PAN," for instance, can be a piece of kitchenware, a way of criticizing, or a prefix meaning "all"—you also seem to need a model of the world. You connect "NARWHAL" to "NET" because you know that narwhals might be caught in nets. You connect "GRENADE" to "PALM" because you know that grenades are held in your hand; when you think of the two words together, you might even mentally simulate a throw. All this seems difficult for a computer to do.

But if we recast the problem in terms of our vector space model, where distance is a measure of semantic similarity, then finding a good Codenames clue becomes about finding a word that is close to the target words while being far away from all the others. That sounds a little simpler.

Getting the data

Download the vectors:

$ wget http://nlp.stanford.edu/data/glove.42B.300d.zip
$ unzip glove.42B.300d.zip

The words are sorted by the number of times they appear in the original corpus. Each word has a list of 300 coordinates associated with it. Here are the word vectors for was, or, and your:

In [1]:
with open("./glove.42B.300d.txt", 'r') as glove:
    lines = [next(glove) for _ in range(100)]  # read the first 100 lines
[(l.split(" ")[0], ", ".join(l.split(" ")[1:20]) + "...") for l in lines[30:33]]
Out[1]:
[('was',
  '-0.0422, -0.00044414, 0.052895, -0.051688, 0.013487, -0.79987, -3.6616, 0.47123, 0.014875, -0.58004, -0.050214, -0.25385, -0.22905, -0.56836, 0.013797, 0.23938, -0.28826, -0.04298, 0.2424...'),
 ('or',
  '0.23333, -0.30334, -0.44491, -0.030651, 0.15669, 0.16303, -4.3474, 0.75635, -0.20263, -0.30256, 0.95183, -0.41293, 0.065988, -0.27925, -0.33301, 0.028757, -0.48017, -0.087209, 0.33913...'),
 ('your',
  '0.31261, -0.21024, -0.29676, 0.042702, -0.010114, -0.057844, -4.7176, 0.52637, -0.080199, -0.54652, 0.1178, 0.034668, 0.56859, 0.070415, -0.013684, -0.049383, 0.20602, -0.048774, 0.13903...')]

There are more than a million words in this file, which makes processing slow. So we'll write the top 50,000 words to a separate file:

$ head -n 50000 glove.42B.300d.txt > top_50000.txt

Now we're ready to do some analysis.

Finding words that are close to one another

We'll import some common libraries for numerical analysis:

In [2]:
import numpy as np
import pandas as pd
from scipy import spatial

Then, we'll create a map from words to their "embeddings", i.e., their 300-dimensional vector representations:

In [3]:
embeddings = {}
with open("./top_50000.txt", 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings[word] = vector

We can see which words are close to one another by computing their cosine distance—one minus the cosine of the angle between the two vectors:

$$distance(x, y) = 1 - cos(x, y) = 1 - \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}$$
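Written out directly in numpy, that's just a couple of lines (a quick sketch; below we'll rely on scipy's built-in spatial.distance.cosine, which computes the same thing):

import numpy as np

def cosine_distance(x, y):
    # 1 - cos(x, y) = 1 - (x . y) / (||x|| * ||y||)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# e.g. cosine_distance(embeddings["magic"], embeddings["magical"]) should match
# spatial.distance.cosine(embeddings["magic"], embeddings["magical"])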

With a quick look at some neighboring words, we can see that the distance metric works pretty well:

In [4]:
def distance(word, reference):
    return spatial.distance.cosine(embeddings[word], embeddings[reference])

def closest_words(reference):
    return sorted(embeddings.keys(), key=lambda w: distance(w, reference))

[(w, ", ".join(closest_words(w)[1:10]) + "...") for w in ["magic", "sport", "scuba", "sock"]]
Out[4]:
[('magic',
  'magical, spells, wizard, spell, trick, magician, enchanted, mystical, wonder...'),
 ('sport',
  'sports, sporting, racing, football, soccer, rugby, cycling, tennis, golf...'),
 ('scuba',
  'diving, dive, snorkeling, divers, diver, snorkel, kayaking, snorkelling, windsurfing...'),
 ('sock',
  'socks, yarn, shoe, knit, knitted, knitting, fetish, stocking, toe...')]

The model

We can express the Codenames problem as taking a set of "target" words and a set of "bad" words, then trying to find candidate words that are close to the targets and far from the bad words.

One way to do this is to calculate, for a given candidate clue, the sum of its distances from the bad words minus the sum of its distances from the target words. (When the target distances are smaller, it means the candidate is better.) That is, for each word $w$ in our dictionary we want to compute:

$$goodness(w) = \sum_{b \in bad}{distance(\pmb{v_w}, \pmb{v_b})} - c \cdot \sum_{t \in targets}{distance(\pmb{v_w}, \pmb{v_t})}$$

Then we pick the words with the highest values—say, the top 250 of them. (The constant $c>1$ expresses the fact that closeness to the target words is more important than farness from the bad words; the code below uses $c=4.0$.)

The trouble is that a candidate that is close to one or two of the targets but far from the third can still score well—despite being a bad clue for that very reason. So, we sort our subset of 250 good candidates by the following:

$$minimax(w) = \min_{b \in bad}{distance(\pmb{v_w}, \pmb{v_b})} - \max_{t \in targets}{distance(\pmb{v_w}, \pmb{v_t})}$$

That is, we're looking to minimize the maximum distance from the targets, and maximize the minimum distance from the bad words. This is all pretty easy to express in code:

In [5]:
def goodness(word, answers, bad):
    # Sum of distances to the bad words, minus a weighted sum of distances
    # to the targets (the constant c = 4.0 from the formula above).
    if word in answers + bad: return -999
    return sum([distance(word, b) for b in bad]) - 4.0 * sum([distance(word, a) for a in answers])

def minimax(word, answers, bad):
    # Minimum distance to any bad word, minus maximum distance to any target.
    if word in answers + bad: return -999
    return min([distance(word, b) for b in bad]) - max([distance(word, a) for a in answers])

def candidates(answers, bad, size=100):
    # Take the 250 best words by goodness, then re-rank them by minimax
    # and format the top `size` as "rank. word (minimax score)".
    best = sorted(embeddings.keys(), key=lambda w: -1 * goodness(w, answers, bad))
    res = [(str(i + 1), "{0:.2f}".format(minimax(w, answers, bad)), w)
           for i, w in enumerate(sorted(best[:250], key=lambda w: -1 * minimax(w, answers, bad))[:size])]
    return [(". ".join([c[0], c[2]]) + " (" + c[1] + ")") for c in res]

Experimental setup

I've been playing lots of Codenames with my friends and have gathered some data along the way. Each "experiment" involves 16 players: four cluers are randomly assigned to the same 3x3 board, like the one above, and each gives a clue independently to three receivers of their own. (The receivers don't see the colors on the board, obviously.)

The scoring works as follows (a short code sketch follows the list):

  • Just like in the real game, when you guess an incorrect square, you're penalized. Here, you stop earning points.
  • You get 1 point for the first correct answer, 2 points for the second, and 3 points for the third.
  • Thus, scores for a round can be 0, 1, 3, or 6 points.
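
In code, that scoring rule might look something like this (a rough sketch for illustration; round_score isn't part of the experiment's tooling, and the bomb word, which instantly loses the game, isn't modeled):

def round_score(guesses, targets):
    # 1 point for the first correct guess, 2 for the second, 3 for the third;
    # stop earning as soon as a guess is incorrect.
    score = 0
    for i, guess in enumerate(guesses):
        if guess not in targets:
            break
        score += i + 1
    return score

round_score(["HAM", "BEIJING", "IRON"], {"HAM", "BEIJING", "IRON"})  # 6
round_score(["HAM", "BEAR", "CAT"], {"HAM", "BEIJING", "IRON"})      # 1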

These experiments give a baseline of human performance, which can then be compared against the vector-space model.

Results

1. The best clue emerges

For instance, with the board above, we had the following clues and results:

 
Cluer 1: PIG
Receiver 4: HAM, BEAR, CAT (1 pt)
Receiver 5: HAM, BEIJING, IRON (6 pts)
Receiver 6: HAM, IRON, BEIJING (6 pts)

Cluer 2: CAIDAO (Chinese for vegetable cleaver)
Receiver 7: BEIJING, BEAR, CAT (1 pt)
Receiver 8: BEIJING, IRON, FALL (0 pts)
Receiver 9: WITCH, AMBULANCE, NOTE (0 pts)

Cluer 3: COMMODITIES
Receiver 10: IRON, HAM, BEAR (3 pts)
Receiver 11: IRON, NOTE, HAM (1 pt)
Receiver 12: IRON, HAM, FALL (0 pts)

Cluer 4: WOK
Receiver 1: BEIJING, IRON, HAM (6 pts)
Receiver 2: IRON, BEIJING, HAM (6 pts)
Receiver 3: BEIJING, HAM, IRON (6 pts)

Clearly "WOK" was the best clue. "CAIDAO" might have been a good clue except that none of the receivers understood what it meant. "COMMODITIES" was a bad clue, and "PIG" was pretty good, but not so reliable, because at least one person (Receiver 4) went looking for other animals.

Let's see what the computer comes up with. We'll print the first 100 candidates using the function above. The number in parens is the minimax score that we're sorting by:

In [6]:
import pandas as pd
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    # Collect the iterable into rows of n items, padding the last row.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

from IPython.display import HTML

def tabulate(data):
    data = list(grouper(10, data))
    return HTML(pd.DataFrame(data).to_html(index=False, header=False))
In [7]:
answers = ["iron", "ham", "beijing"]
bad = ["fall", "witch", "note", "cat", "bear", "ambulance"]

tabulate(candidates(answers, bad))
Out[7]:
1. tong (0.10) 2. wok (0.10) 3. guan (0.07) 4. kitchenware (0.07) 5. nippon (0.06) 6. torino (0.03) 7. thanh (0.03) 8. jian (0.03) 9. bao (0.03) 10. jia (0.02)
11. omelette (0.02) 12. sheng (0.01) 13. ge (0.01) 14. savoy (0.01) 15. lyon (0.01) 16. shu (0.01) 17. intercontinental (0.01) 18. buffet (0.01) 19. stir-fry (0.01) 20. badminton (0.00)
21. guo (0.00) 22. steamed (0.00) 23. dijon (-0.01) 24. xin (-0.01) 25. kowloon (-0.01) 26. hua (-0.01) 27. cantonese (-0.01) 28. sichuan (-0.01) 29. peking (-0.01) 30. tofu (-0.01)
31. yi (-0.01) 32. vietnamese (-0.01) 33. cuisine (-0.01) 34. xian (-0.02) 35. pan (-0.02) 36. yan (-0.02) 37. molybdenum (-0.02) 38. tenderloin (-0.02) 39. tottenham (-0.02) 40. swiss (-0.02)
41. xiang (-0.02) 42. hai (-0.02) 43. shen (-0.02) 44. turkish (-0.03) 45. dong (-0.03) 46. cookware (-0.03) 47. xu (-0.03) 48. mongolian (-0.03) 49. hock (-0.03) 50. noodles (-0.03)
51. wembley (-0.03) 52. hu (-0.04) 53. cn (-0.04) 54. omelet (-0.04) 55. selenium (-0.04) 56. yunnan (-0.04) 57. pork (-0.04) 58. jing (-0.04) 59. ware (-0.04) 60. seafood (-0.04)
61. wei (-0.04) 62. restaurant (-0.04) 63. newcastle (-0.05) 64. qingdao (-0.05) 65. jiang (-0.05) 66. barbecue (-0.05) 67. gong (-0.05) 68. ore (-0.05) 69. hongkong (-0.05) 70. minh (-0.05)
71. tian (-0.05) 72. corned (-0.05) 73. jiangsu (-0.05) 74. arsenal (-0.05) 75. manganese (-0.05) 76. glaze (-0.06) 77. palace (-0.06) 78. thai (-0.06) 79. barbeque (-0.06) 80. panini (-0.06)
81. beef (-0.06) 82. dishes (-0.06) 83. zhao (-0.06) 84. rice (-0.06) 85. cheng (-0.06) 86. brussel (-0.06) 87. qatar (-0.07) 88. pho (-0.07) 89. shandong (-0.07) 90. premier (-0.07)
91. tianjin (-0.07) 92. yong (-0.07) 93. manchester (-0.07) 94. ceramics (-0.08) 95. cooking (-0.08) 96. tai (-0.08) 97. tang (-0.08) 98. ming (-0.08) 99. ping (-0.08) 100. lu (-0.08)

I find these results pretty striking. The model here is simple geometry; it relies entirely on the meaning baked into the GloVe vectors. But wok appears! wok is basically a perfect clue—everyone was impressed with the friend who came up with it and upset they hadn't thought of it themselves—and here it is in the #2 spot, out of 50,000 candidates. tong (#1) might work well, though I don't quite understand the connection to "Beijing," and jian (#8), a word I hadn't heard before, fits decently well: it is a kind of Chinese sword. stir-fry (#19) and sichuan (#28) seem to evoke Chinese cooking. Notably, all of these clues are vastly better than "COMMODITIES," which is the one I came up with.

Of course, there's plenty of garbage (molybdenum (#37) (??), qatar (#87) (!?)), and many of the candidates are over-indexed to one or two of the targets at the expense of others. hock (#49), for instance, doesn't have anything to do with "Iron" or "Beijing," and omelette (#11), although connected to "Ham" and "Iron," is unrelated to "Beijing."

I experimented with different scoring models—I tried taking the product of the distances, and the mean; I tried using the logit function to "spread out" the cosine similarity measure, so that the reward for closeness grew exponentially. And I played with the constant $c$. But so far, the model above gives the best overall performance across the largest number of scenarios.
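
Roughly, those alternative scoring rules look like this (illustrative sketches, not the exact code I tried):

import numpy as np
from scipy.special import logit

def goodness_product(word, answers, bad):
    # product of distances instead of a sum
    if word in answers + bad: return -999
    return np.prod([distance(word, b) for b in bad]) - np.prod([distance(word, a) for a in answers])

def goodness_mean(word, answers, bad):
    # mean of distances instead of a sum
    if word in answers + bad: return -999
    return np.mean([distance(word, b) for b in bad]) - np.mean([distance(word, a) for a in answers])

def goodness_logit(word, answers, bad, c=4.0):
    # "spread out" the similarities: logit(s) grows without bound as s -> 1,
    # so the reward for being very close to a target grows much faster than linearly
    if word in answers + bad: return -999
    sim = lambda u, v: np.clip(1.0 - distance(u, v), 1e-6, 1 - 1e-6)
    return c * sum(logit(sim(word, a)) for a in answers) - sum(logit(sim(word, b)) for b in bad)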

(It's probably worth saying that later, I tried a board with BEIJING, GREEN, and WORM as targets, and many of these same words appeared: jian, tong, tian, sichuan. Same if GREEN were changed to LAPTOP, but not when changed to DEER. So perhaps "Beijing" alone had conjured them up, and to some extent, the model got lucky.)

2. A harder board—and superhuman performance?

This was a doozy. The targets were CHURCH, CAT, and ATLANTIS; here's how the cluers and their receivers did:


Cluer 1: MALTA
Receiver 1: ATLANTIS, FAIR, HOSPITAL (0 pts)
Receiver 2: CHURCH, ATLANTIS, CAT (6 pts)
Receiver 3: ATLANTIS, CAT, BUCK (3 pts)

Cluer 2: MYSTIC
Receiver 4: ATLANTIS, AZTEC, CHURCH (1 pt)
Receiver 5: ATLANTIS, EYE, AZTEC (1 pt)
Receiver 6: AZTEC, EYE, CHURCH (0 pts)

Cluer 3: ASLAN
Receiver 7: ATLANTIS, CHURCH, AZTEC (3 pts)
Receiver 8: CAT, CHURCH, FAIR (0 pts)
Receiver 9: CAT, CHURCH, FAIR (0 pts)

Cluer 4: MYSTICAL
Receiver 10: ATLANTIS, AZTEC, EYE (1 pt)
Receiver 11: ATLANTIS, AZTEC, CHURCH (1 pt)
Receiver 12: CHURCH, AZTEC, ATLANTIS (1 pt)

Only a single player managed to guess all three correctly, via the clue "MALTA." Much to my surprise, that clue appeared 12th on the model's list:

In [8]:
answers = ["church", "cat", "atlantis"]
bad = ["fair", "eye", "aztec", "buck", "pin", "hospital"]

tabulate(candidates(answers, bad))
Out[8]:
1. ark (0.04) 2. archangel (0.02) 3. darwin (0.01) 4. evolution (0.00) 5. graveyard (0.00) 6. destiny (0.00) 7. islands (-0.00) 8. bahamas (-0.01) 9. cave (-0.02) 10. pool (-0.02)
11. mystery (-0.02) 12. malta (-0.02) 13. sanctuary (-0.02) 14. island (-0.02) 15. paradise (-0.02) 16. ii (-0.03) 17. lighthouse (-0.03) 18. series (-0.04) 19. caribbean (-0.04) 20. cove (-0.04)
21. swimming (-0.04) 22. lost (-0.05) 23. universe (-0.05) 24. tower (-0.05) 25. kingdom (-0.05) 26. isle (-0.05) 27. stargate (-0.05) 28. mysteries (-0.06) 29. destroyed (-0.06) 30. journey (-0.06)
31. magnificent (-0.06) 32. ship (-0.06) 33. advent (-0.06) 34. discovery (-0.06) 35. castle (-0.06) 36. ocean (-0.06) 37. beach (-0.06) 38. views (-0.06) 39. adventures (-0.06) 40. secret (-0.07)
41. scripture (-0.07) 42. space (-0.07) 43. voyage (-0.07) 44. club (-0.07) 45. built (-0.07) 46. escape (-0.08) 47. sermon (-0.08) 48. eden (-0.08) 49. sea (-0.08) 50. build (-0.08)
51. stories (-0.08) 52. treasure (-0.08) 53. queen (-0.08) 54. souls (-0.08) 55. king (-0.08) 56. story (-0.08) 57. vatican (-0.08) 58. quest (-0.08) 59. heaven (-0.08) 60. supper (-0.08)
61. planet (-0.09) 62. grand (-0.09) 63. liturgy (-0.09) 64. apostles (-0.09) 65. boat (-0.09) 66. adventure (-0.09) 67. theology (-0.09) 68. century (-0.09) 69. bible (-0.09) 70. earth (-0.10)
71. biblical (-0.10) 72. rescue (-0.10) 73. ascension (-0.10) 74. lord (-0.10) 75. neptune (-0.10) 76. moon (-0.11) 77. apostolic (-0.11) 78. augustine (-0.11) 79. christianity (-0.11) 80. rock (-0.11)
81. revelation (-0.12) 82. easter (-0.12) 83. prophecy (-0.12) 84. towers (-0.12) 85. apollo (-0.12) 86. divine (-0.13) 87. stone (-0.13) 88. turtle (-0.13) 89. resurrection (-0.13) 90. greek (-0.13)
91. ghost (-0.14) 92. believers (-0.14) 93. reformed (-0.14) 94. jesus (-0.14) 95. temple (-0.14) 96. christ (-0.14) 97. orthodox (-0.14) 98. abandoned (-0.14) 99. sacred (-0.14) 100. god (-0.14)

Perhaps more surprising is the model's top pick, ark. I tried this clue on a friend who wasn't part of the initial experiment; they guessed all three targets correctly. Indeed ark might be a strictly better clue than "MALTA." (I like how it connects both to "Church" and to "Cat," and actually also to "Atlantis"—boat, island...—though it has a little interference with "Buck," which is also an animal that might end up on Noah's Ark.)

Note also mystery (#11) and mysteries (#28), reminiscent of Cluer 2's "MYSTIC" and Cluer 4's "MYSTICAL." aslan didn't have a chance of appearing since it didn't make the original cutoff for inclusion in the dictionary (it's about the 57,000th word).

As before, much of the list seems kind of useless. Clearly the program is noisy. But it's capable of generating clues that are sometimes as good as, if not better than, what a person could come up with.

For instance, I remember that early on, someone came up with a brilliant clue for SOCK, LUCK, and ATLANTIS, a board which had stumped everyone else. The clue was "Lost." Sure enough, the model discovers that clue, at #24. Good program!

3. Failures of imagination

A board with the targets THUMB, FOREST, and MOUNT ended up being pretty easy for human players. The best clue—chosen independently by three people—was "GREEN," and six players got perfect scores from it. But the computer can't seem to see it:

In [9]:
answers = ["thumb", "mount", "forest"]
bad = ["pin", "tag", "hood", "princess", "police", "ball"]

tabulate(candidates(answers, bad))
Out[9]:
1. ridges (0.02) 2. ecosystems (0.02) 3. elevation (0.01) 4. foliage (0.01) 5. ponderosa (0.00) 6. alps (0.00) 7. plantations (-0.00) 8. sits (-0.00) 9. forested (-0.00) 10. fork (-0.01)
11. dunes (-0.01) 12. grassland (-0.01) 13. grasslands (-0.01) 14. grazing (-0.01) 15. ecological (-0.01) 16. palm (-0.01) 17. etna (-0.01) 18. beside (-0.02) 19. volcano (-0.02) 20. plateau (-0.02)
21. vineyards (-0.02) 22. lower (-0.02) 23. yellowstone (-0.02) 24. pines (-0.02) 25. cherry (-0.02) 26. preserve (-0.02) 27. landscape (-0.02) 28. hiking (-0.02) 29. butte (-0.03) 30. nestled (-0.03)
31. yosemite (-0.03) 32. volcanic (-0.03) 33. situated (-0.03) 34. pike (-0.03) 35. plants (-0.04) 36. plantation (-0.04) 37. glacier (-0.04) 38. campground (-0.04) 39. landscapes (-0.04) 40. mammoth (-0.04)
41. palms (-0.04) 42. peaks (-0.04) 43. vegetation (-0.04) 44. trees (-0.04) 45. adjacent (-0.05) 46. spruce (-0.05) 47. cypress (-0.05) 48. shoreline (-0.05) 49. beneath (-0.05) 50. terrain (-0.05)
51. arbor (-0.05) 52. walnut (-0.05) 53. upper (-0.05) 54. ranch (-0.05) 55. atop (-0.05) 56. wooded (-0.05) 57. located (-0.05) 58. bend (-0.06) 59. middle (-0.06) 60. overlooking (-0.06)
61. acres (-0.06) 62. wyoming (-0.06) 63. slope (-0.06) 64. beech (-0.06) 65. drive (-0.06) 66. reservoir (-0.06) 67. falls (-0.06) 68. shasta (-0.06) 69. monument (-0.06) 70. large (-0.06)
71. oaks (-0.06) 72. beaver (-0.06) 73. ecosystem (-0.06) 74. moved (-0.06) 75. peninsula (-0.06) 76. sight (-0.06) 77. hike (-0.07) 78. basin (-0.07) 79. climbing (-0.07) 80. lakes (-0.07)
81. scenery (-0.07) 82. timber (-0.07) 83. mature (-0.07) 84. small (-0.07) 85. climb (-0.07) 86. lick (-0.07) 87. farms (-0.07) 88. scenic (-0.07) 89. remote (-0.07) 90. brook (-0.07)
91. peak (-0.07) 92. nature (-0.07) 93. habitat (-0.08) 94. valleys (-0.08) 95. wilderness (-0.08) 96. montana (-0.08) 97. trails (-0.08) 98. maple (-0.08) 99. tree (-0.08) 100. rd (-0.08)

ridges, the top clue, might work (the connection to "THUMB" is via the ridges on your fingerprint, I think) but when I tested it on someone, they replied with "mount, hood, forest."

There was a similar misfire with a BOND, PIRATE, BUGLE board. The winning clue was "GOLD," but the computer didn't come up with it. Its clues seem pretty weak—over-indexed to one or two targets—with the exception maybe of "corps" (#41) and "cadets" (#75):

In [10]:
answers = ["bond", "pirate", "bugle"]
bad = ["alien", "poison", "seal", "dice", "link", "spell"]

tabulate(candidates(answers, bad))
Out[10]:
1. rallies (0.06) 2. frock (0.05) 3. j.p. (0.03) 4. haircuts (0.02) 5. keegan (0.02) 6. davy (0.01) 7. raiser (0.01) 8. delilah (0.01) 9. mayfair (0.01) 10. cavalry (0.00)
11. bohemian (0.00) 12. rye (0.00) 13. quartet (-0.00) 14. militia (-0.00) 15. rifles (-0.00) 16. nightly (-0.01) 17. clarinet (-0.01) 18. tabloid (-0.01) 19. minnie (-0.02) 20. cheerleading (-0.02)
21. homecoming (-0.02) 22. bonanza (-0.02) 23. brigade (-0.02) 24. cay (-0.03) 25. jubilee (-0.03) 26. rum (-0.03) 27. bn (-0.03) 28. marching (-0.03) 29. fireman (-0.03) 30. ruff (-0.04)
31. seaman (-0.04) 32. polly (-0.04) 33. damsel (-0.04) 34. garrison (-0.04) 35. beaufort (-0.04) 36. goldeneye (-0.04) 37. wildcat (-0.04) 38. percussion (-0.04) 39. shipwreck (-0.05) 40. braid (-0.05)
41. corps (-0.05) 42. troop (-0.05) 43. regiment (-0.05) 44. leighton (-0.05) 45. boogie (-0.05) 46. vigilante (-0.06) 47. mardi (-0.06) 48. hank (-0.06) 49. rag (-0.06) 50. greenwich (-0.06)
51. carribean (-0.06) 52. putnam (-0.06) 53. tunes (-0.07) 54. knights (-0.07) 55. maid (-0.07) 56. patriotic (-0.07) 57. arabian (-0.07) 58. charleston (-0.07) 59. templeton (-0.07) 60. piggy (-0.07)
61. beads (-0.07) 62. sequin (-0.08) 63. phillips (-0.08) 64. cowboy (-0.08) 65. gras (-0.08) 66. flute (-0.08) 67. connery (-0.08) 68. parade (-0.08) 69. jolly (-0.08) 70. doo (-0.08)
71. bail (-0.08) 72. venetian (-0.08) 73. tavern (-0.08) 74. beaded (-0.08) 75. cadets (-0.08) 76. vfw (-0.09) 77. squad (-0.09) 78. skyfall (-0.09) 79. barnyard (-0.09) 80. moody (-0.09)
81. drill (-0.09) 82. nicholson (-0.10) 83. corp (-0.10) 84. us$ (-0.10) 85. outfits (-0.10) 86. beading (-0.10) 87. chesapeake (-0.10) 88. treasuries (-0.10) 89. bahamas (-0.10) 90. goldman (-0.10)
91. bullion (-0.10) 92. carnival (-0.10) 93. hooker (-0.10) 94. phillip (-0.10) 95. flotilla (-0.10) 96. pony (-0.10) 97. drum (-0.11) 98. ransom (-0.11) 99. bells (-0.11) 100. antique (-0.11)

It's hard to know what's happening here. Is it maybe that there aren't many co-occurrences of "gold" and "bond" in the Common Crawl corpus? Look at the distance between those two vectors:

In [11]:
distance("gold", "bond")
Out[11]:
0.625728577375412

For reference, let's consider a word that's close to "gold":

In [12]:
distance("gold", "plated")
Out[12]:
0.3826848268508911

...and one that bears really no relation (that I can see):

In [13]:
distance("gold", "mouse")
Out[13]:
0.7329027354717255

So "bond" is almost as far away from "gold" as "mouse" is. More surprisingly, "bugle"—an instrument that is often gold-colored—is even farther away, suggesting that the two words don't appear around each other, or even in similar contexts:

In [14]:
distance("gold", "bugle")
Out[14]:
0.7875212728977203

We humans can use our imaginations to connect words—and in many cases this turns out to be far more powerful than a measure of conceptual distance based on co-occurrence in a large corpus.

Perhaps my favorite example comes with a board whose targets were ROUND, FIGHTER, and PALM. The model's best effort is ufc (#23); it seems preoccupied with MMA and boxing-related words:

In [15]:
answers = ["fighter", "round", "palm"]
bad = ["ivory", "knife", "point", "helicopter", "novel", "diamond"]

tabulate(candidates(answers, bad))
Out[15]:
1. brazilian (0.00) 2. silva (-0.01) 3. arcade (-0.02) 4. featured (-0.02) 5. 22nd (-0.02) 6. wellington (-0.03) 7. 26th (-0.03) 8. 23rd (-0.03) 9. pacquiao (-0.03) 10. 24th (-0.03)
11. boxing (-0.04) 12. 27th (-0.04) 13. ace (-0.04) 14. pro (-0.05) 15. 25th (-0.05) 16. champ (-0.05) 17. vegas (-0.05) 18. brazil (-0.06) 19. scheduled (-0.06) 20. showdown (-0.06)
21. 16th (-0.06) 22. 9th (-0.06) 23. ufc (-0.06) 24. 17th (-0.06) 25. 14th (-0.07) 26. wrestling (-0.07) 27. racing (-0.07) 28. ultimate (-0.08) 29. 15th (-0.08) 30. picks (-0.08)
31. 5th (-0.08) 32. street (-0.08) 33. 13th (-0.08) 34. middleweight (-0.08) 35. 8th (-0.08) 36. football (-0.08) 37. tennis (-0.08) 38. 12th (-0.08) 39. ass (-0.09) 40. vs (-0.09)
41. 7th (-0.09) 42. 6th (-0.09) 43. mma (-0.09) 44. compete (-0.09) 45. antonio (-0.10) 46. june (-0.10) 47. phoenix (-0.10) 48. 4th (-0.10) 49. champions (-0.10) 50. 11th (-0.10)
51. saturday (-0.11) 52. vs. (-0.11) 53. champion (-0.11) 54. championship (-0.11) 55. beating (-0.11) 56. july (-0.11) 57. championships (-0.11) 58. kicks (-0.12) 59. super (-0.12) 60. march (-0.12)
61. training (-0.12) 62. cup (-0.12) 63. diego (-0.12) 64. winner (-0.12) 65. heavyweight (-0.12) 66. winners (-0.12) 67. live (-0.12) 68. competition (-0.12) 69. cool (-0.12) 70. tournament (-0.12)
71. fights (-0.12) 72. 10th (-0.12) 73. popular (-0.12) 74. air (-0.12) 75. melbourne (-0.13) 76. world (-0.13) 77. friday (-0.13) 78. 3rd (-0.13) 79. apple (-0.14) 80. matches (-0.14)
81. kick (-0.14) 82. rounds (-0.14) 83. ready (-0.15) 84. sunday (-0.15) 85. games (-0.15) 86. knockout (-0.15) 87. huge (-0.15) 88. 200 (-0.15) 89. bet (-0.15) 90. wins (-0.15)
91. ball (-0.15) 92. beat (-0.15) 93. against (-0.15) 94. player (-0.16) 95. foot (-0.16) 96. biggest (-0.16) 97. tampa (-0.16) 98. miami (-0.16) 99. fight (-0.16) 100. calendar (-0.16)

One of the human cluers, though, came up with "GRENADE." In vector terms, this word ends up being pretty far from all of the targets:

In [16]:
distance("grenade", "fighter")
Out[16]:
0.7277005612850189
In [17]:
distance("grenade", "round")
Out[17]:
0.7292716205120087
In [18]:
distance("grenade", "palm")
Out[18]:
0.8634674698114395

The last two of these are especially interesting. We humans know that a grenade is round (more or less)—but of course our computer model doesn't. It doesn't know anything. It only considers the raw token grenade, and only "understands" it in relation to other tokens. And apparently that token doesn't appear in contexts involving words like round all that often...

Same, too, with palm. When we think of grenades, one of the things that immediately springs to mind is the fact that they're hand-held—particularly if that idea is primed by the presence of the word "PALM." This speaks to the richness of our mental models: it's not just words in there. By contrast, the only chance our dumb model has of seeing this association is if lots of texts happened to talk about palms, or hands, or fingers, in the same breath as grenades. Apparently that doesn't happen too often either.

4. Compu-trash

It's worth showing an example where the computer falls flat on its face. Consider a board whose targets were ROBIN, SCREEN, and SERVER. Here's what the humans came up with:


Cluer 1: ALFRED
Receiver 1: ROBIN, SERVER, PLATE (3 pts)
Receiver 2: SERVER, ROBIN, PLATE (3 pts)
Receiver 3: ROBIN, SERVER, PLATE (3 pts)

Cluer 2: NET
Receiver 4: SERVER, SCREEN, POLE (3 pts)
Receiver 5: SERVER, POLE, SCREEN (1 pt)
Receiver 6: SCREEN, SERVER, POLE (3 pts)

Cluer 3: BATCOMPUTER
Receiver 7: SCREEN, ROBIN, SERVER (6 pts)
Receiver 8: SCREEN, SERVER, ROBIN (6 pts)
Receiver 9: ROBIN, SCREEN, SERVER (6 pts)

Cluer 4: TWITTER
Receiver 10: SCREEN, ROBIN, SERVER (6 pts)
Receiver 11: SCREEN, ROBIN, SERVER (6 pts)
Receiver 12: ROBIN, SCREEN, SERVER (6 pts)

There was much debate about whether "BATCOMPUTER" was even legitimate, but indeed we were allowing proper nouns and Wikipedia has Batcomputer spelled as one word. Clearly, though, "TWITTER" is the best clue, associating as it does to computer stuff ("screen," "server") and to birds ("robin").

How does the model compare?

In [19]:
answers = ["robin", "screen", "server"]
bad = ["pole", "plate", "ground", "pupil", "iron", "novel"]

tabulate(candidates(answers, bad))
Out[19]:
1. email (0.11) 2. yahoo (0.09) 3. chat (0.07) 4. mysql (0.07) 5. facebook (0.07) 6. e-mail (0.07) 7. msn (0.07) 8. dell (0.06) 9. name (0.06) 10. manager (0.06)
11. wizard (0.06) 12. saved (0.05) 13. sql (0.05) 14. ftp (0.05) 15. hello (0.04) 16. frontpage (0.04) 17. css (0.04) 18. password (0.03) 19. youtube (0.03) 20. skype (0.03)
21. mac (0.03) 22. host (0.03) 23. download (0.03) 24. addresses (0.02) 25. html (0.02) 26. http (0.02) 27. admin (0.02) 28. deleted (0.02) 29. preview (0.02) 30. oracle (0.02)
31. javascript (0.02) 32. streaming (0.02) 33. page (0.02) 34. itunes (0.02) 35. message (0.02) 36. guest (0.01) 37. offline (0.01) 38. ms (0.01) 39. editor (0.01) 40. backup (0.01)
41. vista (0.01) 42. homepage (0.01) 43. dns (0.01) 44. tutorial (0.01) 45. 2003 (0.00) 46. desktops (0.00) 47. bug (0.00) 48. updated (0.00) 49. google (0.00) 50. 2000 (0.00)
51. webpage (0.00) 52. send (-0.00) 53. login (-0.00) 54. mail (-0.00) 55. cgi (-0.01) 56. query (-0.01) 57. vm (-0.01) 58. update (-0.01) 59. flash (-0.01) 60. screenshot (-0.01)
61. networking (-0.01) 62. hosting (-0.01) 63. please (-0.02) 64. version (-0.02) 65. icons (-0.02) 66. graphics (-0.02) 67. dvd (-0.02) 68. website (-0.02) 69. video (-0.02) 70. wait (-0.02)
71. list (-0.02) 72. xp (-0.02) 73. info (-0.02) 74. queue (-0.02) 75. files (-0.02) 76. edit (-0.02) 77. unix (-0.02) 78. talk (-0.03) 79. iis (-0.03) 80. hosted (-0.03)
81. address (-0.03) 82. hosts (-0.03) 83. icon (-0.03) 84. player (-0.03) 85. shared (-0.03) 86. check (-0.03) 87. database (-0.03) 88. virtual (-0.03) 89. search (-0.04) 90. hd (-0.04)
91. logon (-0.04) 92. reboot (-0.04) 93. thanks (-0.04) 94. bios (-0.04) 95. messages (-0.04) 96. updates (-0.04) 97. phone (-0.04) 98. ubuntu (-0.04) 99. xml (-0.04) 100. ip (-0.04)

It's terrible! The over-indexing problem has basically spoiled the results. It's as if "screen" and "server" combined have so much mass that we get trapped in a gravity well far away from "robin."

You could imagine an interactive cluer's aid that allowed you to travel toward one target and away from the others. Indeed, a version of the model that arbitrarily weights "robin" as two or three times more important than "screen" and "server" ends up with slightly more interesting clues like "webmaster" (perhaps a person named Robin?), but still doesn't deliver "twitter." (Changing the constant $c$ above from 4.0 to 3.5 brings "twitter" into the 7th position—perhaps by increasing the universe of possible clues?—though at the expense of worse overall performance with other boards.)
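
Here's roughly what such a per-target weighting might look like (a sketch; the weights dictionary and the weighted_minimax name are illustrative, not the exact variant described above):

def weighted_minimax(word, answers, bad, weights):
    # Scale each target's distance by its weight, so a heavily weighted target
    # that the clue is far from drags the score down more.
    if word in answers + bad:
        return -999
    return (min(distance(word, b) for b in bad)
            - max(weights.get(a, 1.0) * distance(word, a) for a in answers))

# e.g. make closeness to "robin" count three times as much as the others:
# weights = {"robin": 3.0}
# sorted(embeddings.keys(), key=lambda w: -weighted_minimax(w, answers, bad, weights))[:20]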

Conclusion

A simple vector space model using cosine similarities can dig up human-level clues at least some of the time.

There's an over-indexing problem: words that happen to be very close to one or two of the targets will rank highly even when they're far away from the third. Minimizing the maximum distance from any target helps mitigate but doesn't entirely solve this problem.

There are some triplets that humans can cleverly connect with words that are rarely used in similar contexts, but which make sense when you think about them. Since the computer doesn't think, it doesn't generate those clues.

In general, the model's rankings are a little noisy—the 11th result is often no better than its 91st—but at a coarser level, it sorts its candidates remarkably well. If you're willing to do a little sifting, the top 100 or so results can include surprisingly good clues.

I wasn't expecting that. I thought the vector space model was a neat way of describing the Codenames problem, but I had little faith that I'd be able to write an actually useful program with it. It's strange, almost magical, that so much meaning can be baked into a list of coordinates.

Download the Jupyter Notebook

Github gist: https://gist.github.com/jsomers/1bb5e197dec221714df250e72265a301

Acknowledgements

Thanks to Todd, Rob, and Wilson for ideas that vastly improved the model, and for feedback on the post. Any remaining dumbness is mine.

A helpful post that got me started: https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db

Mobile-friendly Jupyter CSS taken from nbviewer.