
ygjb 1y ago

Sure. Just on the training front: building and maintaining a broad corpus of properly managed training data, with metadata that provides attribution (for example, whether content is known to be human-generated rather than model-generated, and what the source is for datasets such as weather data, census data, etc.) and that also captures any licensing encumbrances, so that consumers of the training data can be confident they can use it without risk of legal challenge.

Much of this is already available to private sector entities, but having a publicly funded organization responsible for curating and publishing it would let new entrants quickly and easily get a foundation without having to scrape the internet again, especially given how rapidly model-generated content is being published.
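
To make the metadata idea a bit more concrete, here is a rough sketch of what one record in such a corpus might carry. Everything in it is an assumption for illustration: the `CorpusRecord` fields, the `Origin` values, and the `PERMISSIVE_LICENSES` allow-list are made up, not taken from any existing dataset or standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    """Hypothetical attribution flag: who or what produced the content."""
    HUMAN = "human"
    MODEL = "model"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CorpusRecord:
    """One item of training data plus its provenance (illustrative fields)."""
    text: str
    source: str                # e.g. "census bureau", "weather service"
    origin: Origin             # human-generated vs model-generated
    license_id: Optional[str]  # licence identifier, None if unknown


# Assumed allow-list; a real curator would decide this with legal review.
PERMISSIVE_LICENSES = {"CC0-1.0", "CC-BY-4.0"}


def usable_for_training(record: CorpusRecord) -> bool:
    """Keep only known-human content with a known, permissive licence."""
    return (
        record.origin is Origin.HUMAN
        and record.license_id in PERMISSIVE_LICENSES
    )


records = [
    CorpusRecord("Population by district", "census bureau", Origin.HUMAN, "CC0-1.0"),
    CorpusRecord("Generated blog post", "web scrape", Origin.MODEL, "CC-BY-4.0"),
    CorpusRecord("Forum comment", "web scrape", Origin.HUMAN, None),
]

clean = [r for r in records if usable_for_training(r)]
```

With metadata like this published alongside the data, a new entrant could filter on attribution and licensing up front instead of trying to reconstruct both after a scrape.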


mnahkies 1y ago

I think the EPC (energy performance certificate) dataset in the UK is a nice example of this. Anyone can download a full dataset of EPC data from https://epc.opendatacommunities.org/

Admittedly it hasn't been cleaned all that much, so you still need to put a bit of effort into that (newer certificates tend to be better quality), but it's very low friction overall. I'd love to see them do this with more datasets.

randomdata 1y ago

If the public is going to go to all the trouble of doing something, why would that public not make it clear that there is no legal threat to using any data available?

The public is incredibly lazy, though. Don't expect them to do anything until their hand is forced, which doesn't bode well for the action producing a desirable outcome.
