The computer scientists took as a target an Eastern European chess journal that printed move-by-move reports of tournament chess matches. They incorporated a crude chess engine into the recognition step, estimating the likelihood of candidate next moves and combining that with the OCR engine's likelihood estimate that the printed characters were particular glyphs. Despite the very low quality of the printing, the journal had very high quality editing, so the source material was self-consistent. Completely illegible characters could mostly be filled in as the sensible game moves that the rules allowed. It took hundreds of hours of human review time to find a single OCR mistake from this process!
This is how humans recognize text, though. For the most part, we don't try to read languages we don't understand. Denying a computer access to context is like asking a human to transcribe a language they don't understand.
This stands out to me as improbable. Not that the error rate could be that low, but that they actually had humans spend hundreds of hours checking the accuracy of difficult character recognition. How did that happen?
http://doc.cat-v.org/bell_labs/reading_chess/reading_chess.p...
It doesn't actually quantify the human proofreading time. I might have recalled incorrectly; I heard about this in the late 1990s as a war story from another OCR researcher.
If size and performance are a focus, just store them in a normal sorted table with compression (e.g. LevelDB, or MySQL using RocksDB).
This means all these small documents can be compressed with repetition between games and not just within each game.
And probably much much faster and simpler etc.
Basically, the size taken by the database should be at the same kind of level as you get by just having a text file with one line per game, and gzipping the whole thing. I'd expect it to be an order of magnitude smaller than per-document compression.
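As a rough sketch of that difference (with made-up move strings, not lichess's actual data), compressing many small game records as one stream beats compressing each one separately:

```python
import gzip
import random

# Toy illustration: 1,000 short "games" drawn from a shared vocabulary.
# (Invented data; a real PGN corpus shares far more structure than this.)
random.seed(0)
moves = ["e4", "d4", "Nf3", "c5", "e5", "Nc6", "Bb5", "O-O", "Qxd5", "Rfe1"]
games = [" ".join(random.choices(moves, k=60)).encode() for _ in range(1000)]

# Per-document compression: gzip each game on its own.
per_game = sum(len(gzip.compress(g)) for g in games)

# Whole-file compression: one gzip stream over everything, so repetition
# between games gets exploited, not just repetition within each game.
whole = len(gzip.compress(b"\n".join(games)))

print(per_game, whole)  # the whole-file stream is substantially smaller
```

The gap only widens on real data, where games share openings and tags, and with a larger-window codec such as zstd in place of gzip.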
On the other hand, lichess supports alternative chess modes and starting boards that may or may not be reachable from the standard initial configuration and legal moves, so this won't work for those use cases.
The compression level achieved at the page level is much higher than compressing rows individually, because there is lots of repetition between rows. It also speeds up I/O, among other benefits. It is almost always a win, which is why most modern database storage engines do it, and do it by default.
Consider a CSV file, and compare it to the same data stored as JSON objects, one row per line. The uncompressed JSON file is going to be much bigger, as the column names are repeated in every line. But both files gzip to much the same size, because all those keys are repeated again and again, and the two files have basically the same entropy.
On the other hand, compressing each line in the file individually would be a poor choice, giving relatively poor gains.
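A quick way to check this, using an invented toy table and only the standard library:

```python
import gzip
import json

# Hypothetical tabular data: the same content as CSV lines vs JSON objects.
rows = [{"white": f"player{i}", "black": f"player{i + 1}",
         "result": "1-0", "moves": 40 + i % 30} for i in range(5000)]

csv_bytes = "\n".join(
    f"{r['white']},{r['black']},{r['result']},{r['moves']}" for r in rows
).encode()
json_bytes = "\n".join(json.dumps(r) for r in rows).encode()

gz_csv = len(gzip.compress(csv_bytes))
gz_json = len(gzip.compress(json_bytes))

# Raw JSON is several times bigger (keys repeated on every row), but the
# gzipped sizes land in the same ballpark.
print(len(csv_bytes), len(json_bytes), gz_csv, gz_json)
```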
There have been database engines that did row-level compression, but they performed poorly, and I know of nobody who used, e.g., InnoDB compression.
1) First, as you mentioned, the limited symbol table for expressing all the moves.
2) But not every move is possible in a given position; you just need a way to order all the legal moves and then transmit the index of the one that was played.
3) Even better: not all of those moves are equally likely. You can run a chess-playing engine (for a fixed number of cycles) that ranks all the legal moves and reserves shorter symbols for the best ones (as judged within that time budget). Players tend to select higher-ranked moves, so that's another regularity to exploit.
Of course, 3) comes at the cost of compression/extraction time.
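Idea 2) can be sketched without a real move generator. Assuming a chess library has already enumerated and ordered a position's legal moves, the played move's rank fits in ceil(log2(n)) bits:

```python
import math

def encode_move(rank, n_legal):
    """Encode the rank-th of n_legal legal moves as a fixed-width bit string."""
    width = max(1, math.ceil(math.log2(n_legal)))
    return format(rank, f"0{width}b")

def decode_move(bits):
    """Recover the rank; the decoder re-derives n_legal from the position."""
    return int(bits, 2)

# A typical middlegame position has roughly 30-40 legal moves (~5-6 bits),
# versus 12 bits for a naive from-square/to-square encoding.
bits = encode_move(rank=7, n_legal=35)
print(bits, decode_move(bits))
```

Note the decoder needs no extra information: replaying the game so far yields the same position, hence the same ordered legal-move list and the same width.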
Otherwise an interesting article about injecting domain knowledge to improve beyond what's possible with a normal compression algorithm.
Interesting question: If you just generated the bit string that corresponded to taking the ‘most obvious move’ according to their heuristics, what game would that play out? In a way, that would be the ‘most obvious chess game’, or perhaps the ‘least creative chess game’...
In theory, a better version would use a more sophisticated AI to rank the next moves, and even choose how aggressively to Huffman-code the possible options.
In a sense this would measure the information content of a chess game (at least relative to the ‘knowledge’ of that AI). I wonder what the variance in the number of bits needed to encode various games would tell you about the play style of different players, or the nature of the game, or even the relationship between information theory and creativity....
I wouldn't be surprised if the statistics for the first few moves were significantly different to the moves deep into the game.
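A minimal sketch of the Huffman idea, with invented move weights standing in for an engine's rankings (not any engine's real output):

```python
import heapq
import itertools

def huffman_code(weights):
    """Build a prefix-free code; higher-weight symbols get shorter bit strings."""
    tiebreak = itertools.count()  # keeps heapq from ever comparing the dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least likely subtrees, prefixing their codes.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Made-up rankings for one position: the top candidate gets a 1-bit code.
codes = huffman_code({"Nf3": 50, "e4": 25, "d4": 15, "c4": 10})
print(codes)
```

If the engine's predictions match real play, the average game costs far fewer bits than a uniform code over legal moves; an arithmetic coder would squeeze out a bit more still.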
Algorithmically cool, but quite a lot of work to save <$500/yr in hardware for a handful of 1TB SSDs.
Converting the data to the new format cost more than upgrading disks.
If the rate of growth in disk usage is at or below the rate of growth in mid-tier SSDs, then yes, it's $500/yr. If you are growing faster than that, then an improvement might be saving you $350/yr + $150/yr², and a wall in the future might be pushed out for years.
With your Cloud provider, it might be hard to get more disk, memory, or CPU without paying for more of the other two. Also, many real organizations and vendor agreements are full of the most stupid artificial constraints. Just because someone else can build a 1PB SSD storage array doesn't mean that any of my coworkers will be allowed to build one any time soon. CAPEX austerity measures are among the most frustrating penny-wise pound-foolish policies we have to deal with.
If you also store move times, there’s not much of a win to this.