It's like my previous project
https://news.ycombinator.com/item?id=33755016 but on much more data. I rewrote everything in C++ and tried to minimize allocations, etc., but there are still some things to optimize. The current algorithm is just the cosine delta method on word/character {1,2,3}-grams as described in
https://academic.oup.com/dsh/article/32/suppl_2/ii4/3865676 combined with a few other features from the Writeprint paper.
I have no academic or professional linguistic background, so it's definitely not state of the art, but it works reasonably well. From what I can tell from the literature, the fancier methods are slightly more accurate but have significantly worse performance. The beauty of the cosine delta is that I can put the vectorized representation of each author into something like hnswlib (InnerProductSpace) or Milvus and get really fast comparisons.
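For anyone curious, here's a rough Python sketch of the core idea (not my actual C++ code; function names and the toy corpus are made up for illustration): count character {1,2,3}-grams, z-score the relative frequencies across the corpus, and L2-normalize. Once the vectors are unit length, cosine similarity is just an inner product, which is exactly why an inner-product index like hnswlib works.

```python
from collections import Counter
import numpy as np

def char_ngrams(text, ns=(1, 2, 3)):
    # Count character n-grams for n in {1, 2, 3}.
    counts = Counter()
    for n in ns:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def delta_vectors(texts):
    # One vector per text: z-scored relative n-gram frequencies,
    # L2-normalized so that inner product == cosine similarity.
    counts = [char_ngrams(t) for t in texts]
    vocab = sorted(set().union(*counts))
    freq = np.array([[c[g] for g in vocab] for c in counts], dtype=float)
    freq /= freq.sum(axis=1, keepdims=True)                       # relative frequencies
    z = (freq - freq.mean(axis=0)) / (freq.std(axis=0) + 1e-12)   # per-feature z-scores
    return z / np.linalg.norm(z, axis=1, keepdims=True)           # unit length

# Toy corpus: the first two texts should come out more similar to each other.
vecs = delta_vectors(["the quick brown fox", "the quick brown dog", "lorem ipsum dolor"])
sims = vecs @ vecs.T  # all pairwise cosine deltas at once
```

In practice you'd feed those same unit vectors into hnswlib with `space='ip'` (or Milvus with an IP metric) instead of doing the full pairwise matrix.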