I think this blog post almost hits on the key point in the middle: in my opinion it is important to test (all of) the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are the same (i.e. a regular sentence ending in a period). So if a segmenter simply split at every period, it would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-ru...
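To make the point concrete, here is a minimal sketch of that strawman baseline (the examples are mine, not from any corpus):

```python
import re

def naive_split(text):
    # The strawman baseline: split at every period followed by whitespace.
    return [s for s in re.split(r'(?<=\.)\s+', text) if s]

# Looks fine on a "regular" sentence pair:
print(naive_split("It rained. We stayed inside."))
# -> ['It rained.', 'We stayed inside.']

# One abbreviation is enough to break it:
print(naive_split("Dr. Smith arrived. We began."))
# -> ['Dr.', 'Smith arrived.', 'We began.']
```

On a corpus that is mostly sentences of the first kind, this baseline scores high despite being obviously wrong on the second kind, which is exactly why edge-case coverage matters more than raw accuracy.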
Most methods are based on some sort of simple feature templates and machine learning, so they should generalize relatively well to a wide variety of languages, IMO.
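A rough sketch of what such feature templates can look like for a "does this period end a sentence?" classifier; the feature names and the abbreviation-length cutoff here are illustrative, not taken from any particular paper:

```python
def boundary_features(text, i):
    # Features for deciding whether the '.' at index i ends a sentence.
    before = text[:i].split()
    after = text[i + 1:].split()
    prev_tok = before[-1] if before else ''
    next_tok = after[0] if after else ''
    return {
        'prev_token': prev_tok.lower(),
        'prev_is_short': len(prev_tok) <= 2,       # short tokens are often abbreviations
        'next_is_capitalized': next_tok[:1].isupper(),
        'next_token': next_tok.lower(),
    }

print(boundary_features("Dr. Smith arrived. We began.", 2))
```

Because the features are mostly surface-level (token identity, length, capitalization), retargeting such a classifier to a new language is largely a matter of retraining on labeled data rather than rewriting rules.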
Also see the CLDR sentence break suppressions: http://unicode.org/cldr/trac/browser/tags/release-27-0-1/com...
If your rules handle an edge case that the above don't, it would probably be worth suggesting improvements to the Unicode rules or the locale-specific ones.
Have you tried evaluating your splitter on other data, e.g. the "typically used corpora" you mention? The reported quality looks too optimistic: 98% / 100% means you made your code work on your own examples, but with only a standardized test set you can't check:
* how broad the coverage is (there are other edge cases in the real world, and it may be impossible to cover them all);
* that the splitter doesn't make mistakes on real-world "regular" sentences (the 80-90% of sentences that are "the same").
The example set looks very good, and it seems like a useful way to compare sentence splitters. But it is not fair to report evaluation metrics on the same examples you used to develop your splitter.
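For held-out comparison, one common way to score a splitter is boundary-level precision/recall against gold-standard sentences. A minimal sketch, assuming both segmentations cover the same raw text (the function names are my own):

```python
def boundary_offsets(sentences):
    # Represent a segmentation as the set of character offsets where sentences end.
    offsets, pos = set(), 0
    for s in sentences:
        pos += len(s)
        offsets.add(pos)
    return offsets

def boundary_scores(gold, pred):
    # gold and pred are lists of sentences concatenating to the same string.
    g, p = boundary_offsets(gold), boundary_offsets(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["Dr. Smith arrived.", " We began."]
pred = ["Dr.", " Smith arrived.", " We began."]
print(boundary_scores(gold, pred))
```

Scoring boundaries rather than whole sentences makes partial credit explicit: the over-eager split after "Dr." costs precision but not recall.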
1) Most segmentation research papers come from universities, which have access to the Penn Treebank data (the WSJ and Brown corpora). For everyone else, the cost of that data is $1,700: https://catalog.ldc.upenn.edu/LDC99T42
2) The Brown corpus is available for free in NLTK (http://www.nltk.org/nltk_data/). However, it is the tagged version of the corpus. I've contacted the researchers behind all of the top segmentation libraries but never received an answer to any of the following questions:
a) I’m assuming you preprocessed the text by removing the tags. Is this correct? Or did you use the untagged version, and if so do you have a link to that as I only found the tagged version in the NLTK data?
b) When removing the tags did you also remove each carriage return and newline so the text was one long string, each sentence separated by just one whitespace?
c) The download contains 100+ files. Did you analyze each individually? Or did you create one combined file? If you created a combined file how did you space each individual file within the larger file? Also, if you combined them what order did you combine them in?
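For what it's worth, stripping the Brown-style word/TAG annotations is mechanically simple; what the questions above show is that the surrounding choices still matter. A sketch of one plausible preprocessing step (not necessarily what any of those papers did):

```python
def strip_brown_tags(tagged_line):
    # Brown-style tokens look like "word/TAG"; keep only the word part.
    # Rejoining with single spaces (question b) is itself a choice that
    # changes the input a segmenter sees, e.g. periods become " ." tokens.
    return ' '.join(tok.rsplit('/', 1)[0] for tok in tagged_line.split())

print(strip_brown_tags("The/at Fulton/np-tl County/nn-tl jury/nn said/vbd ./."))
# -> 'The Fulton County jury said .'
```

Note the detached final period in the output: whether you reattach it to the preceding word can by itself change a segmenter's measured accuracy, which is why unanswered questions (a)-(c) make the published numbers hard to reproduce.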
So sure, all of these papers use the same data, but we have no idea whether they are actually using that data in the same way, as none of the papers release their code and tests, or describe the steps they used to preprocess the corpus.
To test broader coverage of my library, I added the full text of Alice in Wonderland: https://github.com/diasks2/pragmatic_segmenter/blob/master/s.... A grad student from Stanford kindly offered to test my library on the WSJ corpus a few months ago, but I'm still waiting to hear back on that.
http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-b... (link to GitHub repo in post)