Meanwhile, I just wrote a custom tokeniser for my fan control experiment.
It features such amusements as:
- Tokens representing the current time of day and day of week, with half-hour granularity. [14:30][Monday], as the debugger reports.
- An entirely separate set of numeric tokens for CPU usage and such, on a logarithmic scale, plus tokens for digit position, measured from the right.
- A hardcoded text tokeniser for executable paths. [/nix/store](..cut..)/bin/executable name. I didn't feel like using the usual approach, so I built a Huffman compressor to generate the tokens for arbitrary text, because why not.
- Tokens representing program state - "just started", "long-running", etc.
- Tokens representing the fact that the following text is from `tail -f ~/.bash_history`.
- Start-of-segment tokens for each of the above, and also for GPU and CPU core complex power usage.
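For concreteness, here's a minimal sketch of what the first two token families might look like in Python. The function names, token-id layout, and log base are all hypothetical, invented for illustration; the post doesn't specify them.

```python
import math
from datetime import datetime

HALF_HOURS = 48  # 24 hours at half-hour granularity

def time_tokens(dt: datetime) -> tuple[int, int]:
    """Map a timestamp to (time-of-day token, day-of-week token).

    Hypothetical id layout: 48 half-hour slots, then 7 day tokens.
    """
    slot = dt.hour * 2 + (1 if dt.minute >= 30 else 0)  # 0..47
    return slot, HALF_HOURS + dt.weekday()              # weekday(): 0 = Monday

def log_bucket(value: float, base: float = 2.0) -> int:
    """Bucket a reading (CPU usage and such) on a logarithmic scale."""
    return math.floor(math.log(value, base)) if value >= 1 else 0

def digit_tokens(n: int) -> list[tuple[int, int]]:
    """Emit (digit, position) pairs, with position measured from the right."""
    return [(int(d), pos) for pos, d in enumerate(reversed(str(n)))]
```

So 14:30 on a Monday lands in half-hour slot 29 with day token 48, and a reading of 305 becomes (5, 0), (0, 1), (3, 2).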
It's not that many tokens in total, and the input is structured data, so why not represent it as such? The text tokeniser still needed sixty-five thousand tokens on its own, though.
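The Huffman side could be built along these lines. This is a toy reconstruction, not the actual implementation: it derives per-character codes from a sample corpus, whereas the real tokeniser presumably trains on real paths and maps codewords into its vocabulary.

```python
import heapq
from collections import Counter

def huffman_code(corpus: str) -> dict[str, str]:
    """Build a prefix-free per-character code from sample text."""
    freq = Counter(corpus)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tiebreak, {char: codeword-so-far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merge the two rarest subtrees, extending their codewords.
        merged = {ch: "0" + c for ch, c in left.items()}
        merged |= {ch: "1" + c for ch, c in right.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(text: str, code: dict[str, str]) -> str:
    """Concatenate codewords; prefix-freeness keeps this decodable."""
    return "".join(code[ch] for ch in text)
```

Frequent characters (like `/` in paths) come out with shorter codewords than rare ones, which is the whole appeal for compressing arbitrary text into a compact token stream.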