Show HN: Experiments in AI-generation of crosswords (opens in new tab)

(abstractnonsense.com)

38 pointsabstractbill1y ago20 comments

Hi HN, I've been experimenting on-and-off over the years trying to automatically generate crosswords [1]. Recently I've been feeling like my results are good enough that I want to share them and see what other people think. I'm not trying to claim that these could appear in, say, the NYT in their current state, but honestly the velocity of progress makes me feel like I will inevitably be able to automatically generate NYT-quality crosswords within just a year or so.

A write-up is here: https://abstractnonsense.com/crosswords.html

And you can play the crosswords here: https://crosswordracing.com (They should work well on both desktop and mobile, and there's a leader-board for each crossword if you want to leave your name when you solve one).

[1]: Just in case anyone is interested, my very first attempt at this problem was way back in 2006! I used multiple wordlists (e.g. list of British monarchs, with reign dates), and wrote little functions to generate clues from each list (e.g. "British monarch who ruled from {date1} to {date2}"). Even with randomized synonym substitution and similar tricks, this approach was too labor-intensive, and the results too robotic, for it to work well. Can't complain though, that project led to me getting hired as the first engineer at Justin.TV!

20 comments

vunderba1y ago

Not bad.

As someone who has dabbled in AI generated crosswords I found that providing samples of "good crossword clues" (which I curated from historical NYT monday puzzles) as part of the LLM context helped tremendously in generating better clues.

There was also a Show HN for a generative AI crossword puzzle system a few months ago so I'll include what I mentioned there:

Part of the deep satisfaction in solving a crossword puzzle is the specificity of the answer. It's far more gratifying to answer a question with something like "Hawking" then to answer with "scientist", or answering with "mandelbrot" versus "shape".

So ideally, you want to lean towards "specificity" wherever possible, and use "generics" as filler.

Link:

https://news.ycombinator.com/item?id=41879754

abstractbillOP1y ago

Thanks. Yes, specificity of solutions seems like a good metric to optimize for.

In some of my crosswords I get clues that are specific in clever ways (e.g. one of these has "Extreme, not camping" which I thought was really strange until I found the answer "intense" and was very impressed by that level of wordplay from an LLM!)

korymath1y ago

Great post.

Funny, I just posted this to X

2025 GenAI challenge

Create a 5x5 crossword puzzle with two distinct solutions. Each clue must work for both solutions. Do not use the same word in both solutions. No black squares.

I try with each new model that lands. Still can’t get it.

alberto_balsam1y ago

Do you know if there is a solution to this by humans? I'd be interested in seeing it.

korymath1y ago

I've not found a solution at any NxN size made by human or machine.

quuxplusone1y ago

You might get a little closer by tweaking the prompt — you're asking the LLM to "figure out" that the first step is to create two 5x5 word squares with no repeated words, and then the second step is to solve ten requests of the form, "Give me a crossword-style clue that could be reasonably solved by either the words OPERA or the word TENET" (for each of the ten word-pairs in your square-pairs). However, LLMs are based on tokens, and thus fundamentally don't "understand" that words are made out of letters — that's why we have memes about their inability to count the number of "r"s in "strawberry" and so on. So we shouldn't expect an LLM to be able to perform Step 1 at all, really. And Step 2 requires wordplay and/or lateral thinking, which LLMs are again bad at. (They can easily do "Give me a crossword-style clue that could be solved by the word OPERA," because there are databases of such things on the web which form part of every LLM's dataset. But there's no such database for double-solution clues.)

Generating a 5x5 word square (with different words across and down, so not of the "Sator Arepo" variety) is already really hard for a human. I plugged the Wordle target word list into https://github.com/Quuxplusone/xword/blob/master/src/xword-f... to get a bunch of plausible squares like this:

    SCALD
    POLAR
    ARTSY
    CEASE
    ERROR

But you want two word squares that can plausibly be clued together, which is (not impossible, but) difficult if matching entries aren't the same part of speech. For example, cluing "POLAR" together with "ARTSY" (both adjectives) seems likely more doable than cluing "POLAR" together with "LASSO" (noun or verb).

Anyway, here's my attempt at a human solution, using the grid above — and another grid, which I'll challenge you to find from these clues. Hint: All but two of the ten pairs match, part-of-speech-wise.

    1A. Remove the outer layer of, perhaps  
    2A. Region on a globe  
    3A. Like some movie theaters  
    4A. Command to a lawbreaker  
    5A. Rhyme for Tom Lehrer?  
    1D. ____yard (sometime sci-fi setting)  
    2D. It goes something like this: Ꮎ  
    3D. Feature of liturgy, often  
    4D. It's vacuous, in a sense  
    5D. Fino, vis-a-vis Pedro Ximénez

echelon1y ago

That's algorithmically hard.

Ask the LLM to generate a program to solve the problem.

korymath1y ago

I've tried that, as recently as today with latest Gemini, Claude, and o1 ... none have been successful.

abstractbillOP1y ago

Thanks!

That's a wonderfully hard problem, I'd love to see it get solved.

super7ramp1y ago

Hi, thank you for the write-up.

> Once we have a grid, we try to fill it with words! I use simple backtracking search for that, with a timeout to stop the search on grids that are likely impossible to fill. In practice it's easy to generate a new filled grid from scratch about once every two minutes.

Have you explored other search techniques?

> After the grid is full of words, we use an LLM to generate some clues. I've iterated over many models and prompts for this.

Could you share the prompts and the models you tried?

Shameless plug: I've been interested in crossword generation for a while as well and made that toy: https://github.com/super7ramp/croiseur. No grid generation but automatic filling and clue generation. Clues are not really good, currently using gpt-4o-mini.

furyofantares1y ago

I've tried to get o1 to generate Xordle puzzles.

Warning: post contains a spoiler for a recent Xordle.

Xordle is Wordle with two target words that share no letters in common. Additionally, there is a "free clue" given at the start, and all three words are thematically linked. It's not always a straightforward link, for example a recent puzzle had the starter word 'grief' and targets 'empty' and 'chair'. All puzzles today are selected from user submissions.

o1 is the first model that's been able to solve Xordles reliably, or to generate valid puzzles at all. It's well-known that these things are massively handicapped for this type of task due to tokenization.

But since o1 can in fact achieve it, I wanted to see if I could get it to make puzzles that are at all satisfying. Instead it makes very bland puzzles, with straightforward connections and extremely broad themes.

Prompting can swing the pendulum too far in the other direction, to puzzles where the connection is contrived and impossible to see even after it's solved. As I've often experienced with LLMs, being able to hit either side of a target with prompting does not necessarily mean you can get it to land in the middle, and in fact I have had no success in doing so with this task.

This is one of the most basic examples I know of lack of creativity or "taste" to an LLM. It is a little hard for a human to generate two 5-letter words with no overlap, but it is extremely easy for a human to look for a thematic connection among 2-3 words and say if it's satisfying. But so far I've been totally unable to make the LLM make satisfying puzzles.

edit: Nothin' like making a claim about LLMs to get one up off one's ass and try to prove it wrong immediately. I'm getting some much better results with better examples now.

IanCal1y ago

Have you tried using an llm to say whether the puzzles are good or not?

abstractbillOP1y ago

Great observation, yeah, I've had very similar experiences with prompting, exactly as you said -- one direction giving very bland literal clues, and the opposite direction giving clues that are a stretch even when you know the answer!

corlinpalmer1y ago

Awesome! I have also dabbled in AI-generated crosswords, but I was more fascinated with the concept of generating the most efficient layout of an X-by-X grid from a given word set. It's a surprisingly difficult optimization problem because the combinatorics are insane. Here's an example output trying to find the most efficient layout of common Linux terminal commands:

    W   P     G   
    H I S T O R Y 
    E         O   
    R   T   Y U M 
  L E S S     P   
    I   O   C A T 
  U S E R A D D   
  L     T R   D C

Of course this is a pretty small grid and it gets more difficult with size. I've thought about making a competition from this sort of challenge. Would anyone be interested?

abstractbillOP1y ago

Yes! That a really fun problem too -- it feels like it should be tractable but it's insanely hard. If you do start some kind of competition around it, let me know, I'd be interested.

gowld1y ago

The "American" grids aren't American. An American grid almost always has 2 answers (both directions) per square.

abstractbillOP1y ago

Oh that's really interesting thanks! That would actually be an easy constraint to add too.

quuxplusone1y ago

American-style crossword construction has a number of constraints, some bendable, some not.

- Every cell must be "keyed," i.e., part of a word Across and a word Down. Unkeyed cells are strictly forbidden.

- No word may be less than 3 letters. Two-letter words are strictly forbidden.

- The grid must be rotationally symmetric. (But this rule can be broken for fun. Bilaterally symmetric grids are relatively common these days. Totally asymmetric grids are very rare and always in service of some kind of fun — see https://www.xwordinfo.com/Thumbs?select=symmetry )

- No more than one-sixth of the squares can be black. (But this rule can be broken, usually either to make the puzzle less challenging by shortening the average word length, or to make the creator's life easier in order to achieve some other feat.)

- If a single black square is bordered on two adjoining sides by other black squares, then it could be turned white without destroying the other properties of the grid. Such black squares are called "cheaters" and are frowned upon. (Though they might serve a purpose, e.g. to fit a specific theme entry's length.)

dgreensp1y ago

I found this article a bit disappointing.

The link at the bottom doesn’t work.

The grids shown do not follow the well-known rules of (American) crosswords: every square is part of two words of three or more letters each.

Coming up with a pattern of black squares, and writing good clues, are two parts of making a crossword puzzle that are IMO fun and benefit from a human touch, and are not overly difficult. There are also databases of past clues used in crossword puzzles (eg every NY Times clue ever, and various crossword dictionaries) for reference and possible training. If you don’t care about originality (or copyright) and want quality clues, you can just pull clues from these. If you do care about all those things, you can surface the list of clues used in the past to the human constructor and let them write the final clue. Or you can try to perfect LLM clue-writing. In my experience, LLMs are terrible at clues. Like sometimes if I try to give it feedback about a clue, it will just work the feedback into the clue… it’s a little hard to describe without an example, but basically it doesn’t seem to understand the requirements of a clue and the process of a solver looking at a clue and trying to come up with an answer.

Coming up with an interlocking set of fun, high-quality words and phrases is the hard part. I agree that LLM wordlist curation is a great idea, and I started playing around with that once.

Beyond that, I don’t think LLMs can help with grid construction, which is a more classic combinatorial problem.

abstractbillOP1y ago

> The link at the bottom doesn’t work.

Can you clarify which link is broken and how? What browser and OS?

> In my experience, LLMs are terrible at clues.

That hasn't been my experience. Without good prompting they give you clues that are too bland and literal, but it is quite possible to get them to give you clues with interesting and creative wordplay. I wish it was easier to get clues like that more consistently, but it's certainly doable. I still believe within a year it'll be easy.

j / k navigate · click thread line to collapse

20 comments

vunderba1y ago

Not bad.

There was also a Show HN for a generative AI crossword puzzle system a few months ago so I'll include what I mentioned there:

So ideally, you want to lean towards "specificity" wherever possible, and use "generics" as filler.

Link:

https://news.ycombinator.com/item?id=41879754

abstractbillOP1y ago

Thanks. Yes, specificity of solutions seems like a good metric to optimize for.

korymath1y ago

Great post.

Funny, I just posted this to X

2025 GenAI challenge

Create a 5x5 crossword puzzle with two distinct solutions. Each clue must work for both solutions. Do not use the same word in both solutions. No black squares.

I try with each new model that lands. Still can’t get it.

alberto_balsam1y ago

Do you know if there is a solution to this by humans? I'd be interested in seeing it.

korymath1y ago

I've not found a solution at any NxN size made by human or machine.

quuxplusone1y ago

    SCALD
    POLAR
    ARTSY
    CEASE
    ERROR

    1A. Remove the outer layer of, perhaps  
    2A. Region on a globe  
    3A. Like some movie theaters  
    4A. Command to a lawbreaker  
    5A. Rhyme for Tom Lehrer?  
    1D. ____yard (sometime sci-fi setting)  
    2D. It goes something like this: Ꮎ  
    3D. Feature of liturgy, often  
    4D. It's vacuous, in a sense  
    5D. Fino, vis-a-vis Pedro Ximénez

echelon1y ago

That's algorithmically hard.

Ask the LLM to generate a program to solve the problem.

korymath1y ago

I've tried that, as recently as today with latest Gemini, Claude, and o1 ... none have been successful.

abstractbillOP1y ago

Thanks!

That's a wonderfully hard problem, I'd love to see it get solved.

super7ramp1y ago

Hi, thank you for the write-up.

Have you explored other search techniques?

> After the grid is full of words, we use an LLM to generate some clues. I've iterated over many models and prompts for this.

Could you share the prompts and the models you tried?

furyofantares1y ago

I've tried to get o1 to generate Xordle puzzles.

Warning: post contains a spoiler for a recent Xordle.

edit: Nothin' like making a claim about LLMs to get one up off one's ass and try to prove it wrong immediately. I'm getting some much better results with better examples now.

IanCal1y ago

Have you tried using an llm to say whether the puzzles are good or not?

abstractbillOP1y ago

corlinpalmer1y ago

    W   P     G   
    H I S T O R Y 
    E         O   
    R   T   Y U M 
  L E S S     P   
    I   O   C A T 
  U S E R A D D   
  L     T R   D C

Of course this is a pretty small grid and it gets more difficult with size. I've thought about making a competition from this sort of challenge. Would anyone be interested?

abstractbillOP1y ago

Yes! That a really fun problem too -- it feels like it should be tractable but it's insanely hard. If you do start some kind of competition around it, let me know, I'd be interested.

gowld1y ago

The "American" grids aren't American. An American grid almost always has 2 answers (both directions) per square.

abstractbillOP1y ago

Oh that's really interesting thanks! That would actually be an easy constraint to add too.

quuxplusone1y ago

American-style crossword construction has a number of constraints, some bendable, some not.

- Every cell must be "keyed," i.e., part of a word Across and a word Down. Unkeyed cells are strictly forbidden.

- No word may be less than 3 letters. Two-letter words are strictly forbidden.

dgreensp1y ago

I found this article a bit disappointing.

The link at the bottom doesn’t work.

The grids shown do not follow the well-known rules of (American) crosswords: every square is part of two words of three or more letters each.

Coming up with an interlocking set of fun, high-quality words and phrases is the hard part. I agree that LLM wordlist curation is a great idea, and I started playing around with that once.

Beyond that, I don’t think LLMs can help with grid construction, which is a more classic combinatorial problem.

abstractbillOP1y ago

> The link at the bottom doesn’t work.

Can you clarify which link is broken and how? What browser and OS?

> In my experience, LLMs are terrible at clues.

j / k navigate · click thread line to collapse