I think this blog post almost hits on the key point in the middle: in my opinion it is important to test (all of) the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are the same (i.e. a regular sentence ending in a period). So if a segmenter simply split at every period, it would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-ru...
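To make the point concrete, here is a minimal sketch of that strawman baseline (the examples are mine, not from any corpus):

```python
import re

def naive_split(text):
    # The strawman baseline: split at every period followed by whitespace.
    return [s for s in re.split(r'(?<=\.)\s+', text) if s]

# Looks fine on a "regular" sentence pair:
print(naive_split("It rained. We stayed inside."))
# -> ['It rained.', 'We stayed inside.']

# One abbreviation is enough to break it:
print(naive_split("Dr. Smith arrived. We began."))
# -> ['Dr.', 'Smith arrived.', 'We began.']
```

On a corpus that is mostly sentences of the first kind, this baseline scores high despite being obviously wrong on the second kind, which is exactly why edge-case coverage matters more than raw accuracy.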
Most methods are based on some sort of simple feature templates and machine learning, so they should generalize relatively well to a wide variety of languages, IMO.
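A rough sketch of what such feature templates can look like for a "does this period end a sentence?" classifier; the feature names and the abbreviation-length cutoff here are illustrative, not taken from any particular paper:

```python
def boundary_features(text, i):
    # Features for deciding whether the '.' at index i ends a sentence.
    before = text[:i].split()
    after = text[i + 1:].split()
    prev_tok = before[-1] if before else ''
    next_tok = after[0] if after else ''
    return {
        'prev_token': prev_tok.lower(),
        'prev_is_short': len(prev_tok) <= 2,       # short tokens are often abbreviations
        'next_is_capitalized': next_tok[:1].isupper(),
        'next_token': next_tok.lower(),
    }

print(boundary_features("Dr. Smith arrived. We began.", 2))
```

Because the features are mostly surface-level (token identity, length, capitalization), retargeting such a classifier to a new language is largely a matter of retraining on labeled data rather than rewriting rules.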
Also see the CLDR sentence break suppressions: http://unicode.org/cldr/trac/browser/tags/release-27-0-1/com...
If your rules handle an edge case that the above don't, it would probably be worth suggesting improvements to the Unicode rules or the locale-specific ones.
Have you tried evaluating your splitter on other data, e.g. the "typically used corpora" you mention? The reported quality looks too optimistic: 98% / 100% means you made your code work on your own examples, but with only a standardized test set you can't check:
* how broad the coverage is (there are other edge cases in the real world, and it may be impossible to cover them all);
* that the splitter doesn't make mistakes on real-world "regular" sentences (the 80-90% of sentences that are "the same").
The example set looks very good, and it seems like a useful way to compare sentence splitters. But it is not fair to report evaluation metrics on the same examples you used to develop your splitter.
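For held-out comparison, one common way to score a splitter is boundary-level precision/recall against gold-standard sentences. A minimal sketch, assuming both segmentations cover the same raw text (the function names are my own):

```python
def boundary_offsets(sentences):
    # Represent a segmentation as the set of character offsets where sentences end.
    offsets, pos = set(), 0
    for s in sentences:
        pos += len(s)
        offsets.add(pos)
    return offsets

def boundary_scores(gold, pred):
    # gold and pred are lists of sentences concatenating to the same string.
    g, p = boundary_offsets(gold), boundary_offsets(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["Dr. Smith arrived.", " We began."]
pred = ["Dr.", " Smith arrived.", " We began."]
print(boundary_scores(gold, pred))
```

Scoring boundaries rather than whole sentences makes partial credit explicit: the over-eager split after "Dr." costs precision but not recall.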
1) Most segmentation research papers come from universities, which have access to the Penn Treebank data (the WSJ and Brown corpora). For everyone else, the cost of that data is $1,700: https://catalog.ldc.upenn.edu/LDC99T42
2) The Brown corpus is available for free in NLTK (http://www.nltk.org/nltk_data/). However, it is the tagged version of the corpus. I've contacted the researchers behind all of the top segmentation libraries but never received an answer to any of the following questions:
a) I’m assuming you preprocessed the text by removing the tags. Is this correct? Or did you use the untagged version, and if so do you have a link to that as I only found the tagged version in the NLTK data?
b) When removing the tags did you also remove each carriage return and newline so the text was one long string, each sentence separated by just one whitespace?
c) The download contains 100+ files. Did you analyze each individually? Or did you create one combined file? If you created a combined file how did you space each individual file within the larger file? Also, if you combined them what order did you combine them in?
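For what it's worth, stripping the Brown-style word/TAG annotations is mechanically simple; what the questions above show is that the surrounding choices still matter. A sketch of one plausible preprocessing step (not necessarily what any of those papers did):

```python
def strip_brown_tags(tagged_line):
    # Brown-style tokens look like "word/TAG"; keep only the word part.
    # Rejoining with single spaces (question b) is itself a choice that
    # changes the input a segmenter sees, e.g. periods become " ." tokens.
    return ' '.join(tok.rsplit('/', 1)[0] for tok in tagged_line.split())

print(strip_brown_tags("The/at Fulton/np-tl County/nn-tl jury/nn said/vbd ./."))
# -> 'The Fulton County jury said .'
```

Note the detached final period in the output: whether you reattach it to the preceding word can by itself change a segmenter's measured accuracy, which is why unanswered questions (a)-(c) make the published numbers hard to reproduce.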
So sure, all of these papers use the same data, but we have no idea whether they are actually using that data in the same way, as none of the papers release their code and tests, or describe the steps they used to preprocess the corpus.
To test broader coverage of my library, I added the full text of Alice in Wonderland: https://github.com/diasks2/pragmatic_segmenter/blob/master/s.... A grad student from Stanford kindly offered to test my library on the WSJ corpus a few months ago, but I'm still waiting to hear back on that.
http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-b... (link to GitHub repo in post)