In fact, I think this smart merge technique would have been
faster (and exact, with no approximations!) than what McIlroy 1982 describes on 1975..1980 hardware,
in spite of not fitting in memory. The end of that article says it took about 65..77 seconds of CPU on a PDP-11/70 to check a 5500-word document.
Meanwhile, the approach of the three files below (which would need only tiny adaptations to 1975 C) should have taken about 4 seconds at 200 kB/sec of IO time (see below). This kind of technique was also quite common in the 1960s database community. So, it's a bit weird that the Bell Labs guys didn't know about it, and it shows off the C/Unix system just as well. Here is an example implementation:
delta.c:
#include <stdio.h> /* stdin->out: delta encode sorted lines */
#include <string.h>
struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS; /* 1-byte header (GCC); 4-bit fields cap words at ~16 chars */
int main(int ac, char **av) {
char word[64], prev[64];
int nPrev = 0;
while (fgets(&word[0], sizeof word, stdin)) {
int n = strlen(word), nPfx;
int m = n < nPrev ? n : nPrev;
for (nPfx = 0; nPfx < m; nPfx++)
if (prev[nPfx] != word[nPfx]) break;
nPS.nPfx = nPfx; /* bytes shared with prev line */
nPS.nSfx = n - nPfx - 1; /* -1 drops fgets' trailing '\n' */
fwrite(&nPS, 1, 1, stdout);
fwrite(&word[nPfx], 1, n - nPfx - 1, stdout);
nPrev = n - 1; /* remember word sans '\n' */
memcpy(prev, word, n - 1);
}
return 0;
}
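Since the stream format matters for the size math later, here is a hypothetical in-memory round trip of the same encoding (function names are mine, not part of the three files; unlike delta.c it works on plain NUL-terminated words, so there is no newline bookkeeping):

```c
/* Sketch of delta.c's stream: per word, a 1-byte packed (nPfx,nSfx)
   header followed by nSfx suffix bytes. Assumes sorted, unique,
   <= 15-char words, as the rest of the pipeline guarantees. */
#include <string.h>

struct hdr { unsigned nPfx:4, nSfx:4; } __attribute__((packed));

/* Encode nw sorted words into buf; returns bytes written. */
int delta_enc(const char **words, int nw, unsigned char *buf) {
    const char *prev = ""; int o = 0;
    for (int i = 0; i < nw; i++) {
        int n = (int)strlen(words[i]), p = 0;
        while (prev[p] && words[i][p] == prev[p]) p++; /* shared prefix */
        struct hdr h = { p, n - p };
        memcpy(&buf[o], &h, sizeof h); o += sizeof h;  /* 1 byte, packed */
        memcpy(&buf[o], words[i] + p, n - p); o += n - p;
        prev = words[i];
    }
    return o;
}

/* Decode nb bytes of stream into out as newline-separated words. */
int delta_dec(const unsigned char *buf, int nb, char *out) {
    char word[64]; int i = 0, o = 0;
    while (i < nb) {
        struct hdr h;
        memcpy(&h, &buf[i], sizeof h); i += sizeof h;
        memcpy(&word[h.nPfx], &buf[i], h.nSfx); i += h.nSfx; /* keep prefix */
        memcpy(&out[o], word, h.nPfx + h.nSfx); o += h.nPfx + h.nSfx;
        out[o++] = '\n';
    }
    return o;
}
```

Round-tripping {"cat","cats","dog"} costs 1+3, 1+1, and 1+3 bytes respectively (10 total) and decodes back to the original three lines.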
words.c:
#include <stdio.h> /* Emit all words in stdin, 1 to a line */
#include <ctype.h> /*~tr -cs A-Za-z \\n|tr A-Z a-z|grep -vE .{16,} */
int main(int ac, char **av) {
char wd[15];
int n = 0, c; /* n = chars accumulated; must start at 0 */
#define PUT fwrite_unlocked(wd,1,n,stdout); putchar_unlocked('\n')
while ((c = getchar_unlocked()) >= 0) { /* accum word chs */
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
if (n < sizeof wd) /* make smarter & .. */
wd[n++] = tolower(c); /*..contraction aware. */
else
fprintf(stderr, "too long: %15.15s...\n", wd);
} else if (n > 0) { /* non-word c & have data */
PUT; n = 0; /* put & reset */
}
}
if (n > 0) { PUT; } /* put any final word */
return 0;
}
notIn.c:
#include <stdio.h> /* Emit stdin words not in /tmp/dict */
#include <string.h> /* Both sources must be delta-encoded */
struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS;
int main(int ac, char **av) {
FILE *fI = stdin, *fD = fopen("/tmp/dict", "r");
char wI[16], wD[16]; /* Word buffers; then lens */
int nI=0, nD=0, oI, oD; /* oI,oD: flag last read was Ok */
if (!fD) { perror("/tmp/dict"); return 1; }
#define GET(T) /* update [wno]##T with its next word */ \
if ((o##T = fread(&nPS, 1, sizeof nPS, f##T) == sizeof nPS)) { \
n##T = nPS.nPfx + nPS.nSfx; \
o##T = fread(&w##T[nPS.nPfx], 1, nPS.nSfx, f##T) == nPS.nSfx; }
#define PUT { fwrite(wI, 1, nI, stdout); putchar('\n'); GET(I) }
GET(I); GET(D); /* set theory "not in": ordered merge impl */
while (oI && oD) {
int c = memcmp(wI, wD, nI < nD ? nI : nD);
if (c == 0) c = nI - nD;
if (c < 0) PUT /* I<D: emit&advance I */
else if (c == 0){GET(I);GET(D);} /* I==D: advance both */
else GET(D); /* I>D: advance D */
}
while (oI) PUT /* flush tail out */
return 0;
} /* (echo a;cat $SOWPODS) | delta > /tmp/dict # DE UN-abridged dict
words < Input | sort -u | delta | notIn # Extract,DE,Check */
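The heart of notIn.c is just an ordered-merge set difference; stripped of the delta decoding, the loop looks like this on plain sorted string arrays (a sketch of the same logic, not one of the three files):

```c
/* Ordered-merge "not in": copy into out every string of a[0..na)
   absent from b[0..nb). Both inputs sorted and duplicate-free. */
#include <string.h>

int not_in(const char **a, int na, const char **b, int nb,
           const char **out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        int c = strcmp(a[i], b[j]);
        if (c < 0) out[k++] = a[i++];  /* a[i] < b[j]: emit & advance a */
        else if (c == 0) { i++; j++; } /* equal: advance both */
        else j++;                      /* a[i] > b[j]: advance b */
    }
    while (i < na) out[k++] = a[i++];  /* flush tail of a */
    return k;
}
```

Both inputs are scanned once, so checking a document costs one sequential pass over the dictionary, which is what makes the IO-time estimate below the whole story.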
To back up my 4.2 seconds at 200 kB/sec: any SOWPODS dictionary you can find compresses to about 837 KiB with that delta.c, and 837/200 = 4.185 seconds.
If the rejoinder is "that SOWPODS stuff had/has licensing trouble", then no problemo - just use whatever in-house dictionary they used plus their stemming / auto-suffix junk to synthesize an exact dictionary. Then you can correct it as you go, et voila. In fact, if you wanted to go even faster than this 15..20X and make accuracy-perf trade-offs, you could probably shrink IO by generating the delta-encoded stream directly from the suffix rules.
I'd recommend staying exact, though. In this case, it seems a bad algo idea led them to be both inaccurate and slower, and the writing was on the wall that hardware was getting faster. And honestly, my SOWPODS dictionary that seems 15X faster may well have better coverage than what they had, which means an at-the-time apples-to-apples comparison might have been 20..25X faster.
As a kind of data-compression / next-optimizations side note, the 267751-word SOWPODS I compressed to 857065 bytes this way can only be Zstd -19'd down to about 588040 bytes. It's all mono-case, and with simply an array-of-5-bit-codes encoding you could honestly get that 857065 down to (857065-267751)*5 + 267751*10 bits = 703010 bytes, less than 1.2X bigger than Zstd -19, but only 1.21X better than the more naive encoding above. So, you know, simple delta encoding works like gangbusters on a sorted spelling dictionary, has a very simple set membership algo (as above), and seems like it was at least an order of magnitude faster than what they actually did instead. I'd be a bit surprised if no one pointed this out at the time.
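As a quick sanity check of that arithmetic, here is a throwaway helper (not part of the pipeline; it assumes 5 bits per suffix letter and 10 header bits per word, which is what the 703010 figure implies):

```c
/* Bytes needed if suffix letters take 5 bits each and every word pays
   a 10-bit header, vs. delta.c's 8-bit chars plus 8-bit header. */
long repacked_bytes(long nWords, long nDeltaBytes) {
    long suffixBytes = nDeltaBytes - nWords; /* delta.c: 1 hdr byte/word */
    long bits = suffixBytes * 5 + nWords * 10;
    return (bits + 7) / 8;                   /* round up to whole bytes */
}
```

Feeding it the SOWPODS numbers, repacked_bytes(267751, 857065) gives 703010, matching the estimate above.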