As for different sources reporting on the same thing, that's a bit harder.
There is a subtler variation on this: two different URLs resolving to the same article. A site may publish:
- http://foo.com/date/bar/some-inane-tech-article
- http://foo.com/date/some-inane-tech-article
Both URLs are unique, but they point to the same document. A quick example might be an article and its printer-friendly version.
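One rough way to catch this case is to compare the article text itself rather than the URL. The sketch below assumes the article body has already been fetched and extracted as plain text (that part is not shown), and `text_fingerprint` is just a name made up for illustration. It only catches copies whose text is identical after whitespace and case normalization, so a print view with extra boilerplate would slip through.

```python
import hashlib
import re


def text_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text after simple normalization."""
    # Collapse runs of whitespace and ignore case, so trivial layout
    # differences between two copies do not change the digest.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two hypothetical copies of the same article with different line wrapping.
page_a = "Some Inane Tech Article\n\nFirst paragraph of the article."
page_b = "Some inane tech article   First paragraph\nof the article."
same = text_fingerprint(page_a) == text_fingerprint(page_b)
```

Submissions whose fingerprints match could then be flagged as probable dupes even when their URLs differ.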
Here is a live example I just spotted:
original ~ http://www.paulgraham.com/ycombinator.html ~ post ~ http://news.ycombinator.com/item?id=133430
dupe ~ http://paulgraham.com/ycombinator.html ~ post ~ http://news.ycombinator.com/item?id=134775
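The pair above differs only by the leading `www.`, so a small normalization step before the dupe check would catch it. This is a minimal sketch, not how the site actually does it. `canonical_host_url` is a hypothetical helper, and it assumes that the `www` and bare hosts serve the same content, which is common but not guaranteed.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_host_url(url: str) -> str:
    """Lowercase the host and drop a leading 'www.' label."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # .hostname is already lowercased
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if there is one (any user:password@ part is dropped).
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


original = canonical_host_url("http://www.paulgraham.com/ycombinator.html")
dupe = canonical_host_url("http://paulgraham.com/ycombinator.html")
is_dupe = original == dupe
```

Comparing the canonical forms instead of the raw strings would treat the two submissions as the same link.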
http://foo.com/bar?aritrary-var=arbitrary-val
to just
that would help because a lot of people end up posting links with ?source=newsletter or &sessionid=asdf1234ilikepie and so on
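A sketch of that kind of cleanup, assuming a blocklist approach: dropping the whole query string would break links where the parameter matters (news.ycombinator.com/item?id=133430, for instance), so this only removes parameter names known to be noise. The names in `NOISE_PARAMS` are taken from the examples above and the set is hypothetical; a real list would need curating.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blocklist, seeded from the two examples in this comment.
NOISE_PARAMS = {"source", "sessionid"}


def strip_noise_params(url: str, noise=NOISE_PARAMS) -> str:
    """Remove query parameters whose names are in the blocklist."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in noise
    ]
    # urlencode re-encodes the surviving pairs, so unusual escaping in the
    # original query may come out written slightly differently.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


cleaned = strip_noise_params("http://foo.com/bar?id=7&sessionid=asdf1234ilikepie")
```

Parameters that are not on the blocklist, such as `id`, pass through untouched.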
Google does some pretty cool probability stuff (at least, I think that's how they do it) to figure out which articles are the same for news.google.com. Something like that would be really cool on news.yc.
I know, the source is open.... but I'm clearly busy ;)
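For what it's worth, here is a toy version of "are these two articles about the same text". It is not a claim about what Google actually does. It uses word shingles and Jaccard similarity, a standard near-duplicate technique that I am swapping in for illustration. The function names and the shingle size `k=3` are arbitrary choices.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word windows (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared shingles divided by total distinct shingles, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


score = jaccard_similarity(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
)
```

Two stories scoring above some threshold could be grouped as the same article. The threshold would have to be tuned by hand, and differently worded reports of the same event would still need something smarter.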