tremarley
1y ago
Ebooks are 1-2 MB each, max. 81.7 TB is a lot of books, roughly 41-82 million.
weberer
1y ago
The article says they got datasets from Anna's Archive. It was most likely the scihub/libgen torrent, which is 96.0 TB right now and contains 92,872,581 files. That's about 1 megabyte per file.
https://annas-archive.org/datasets
southernplaces7
1y ago
Where does one find these torrent datasets? Did they download the books in bits and pieces or as a single huge multi-TB file?
thunkingdeep
1y ago
I’ve got pirated books that are 70-80 MB, I think because of the illustrations. Guess it depends on the book.
mateus1
1y ago
I don’t think they’re using picture-heavy books for LLM training, no?
RIMR
1y ago
Just because the LLMs are trained on text doesn't mean that images weren't a part of what they downloaded.
You clean up the data after you acquire it, not before.
littlestymaar
1y ago
Even if they didn't use the illustrations (which isn't clear, given multimodal models), they'd still make use of the text in the books.
WithinReason
1y ago
Presumably they didn't create the torrent.
moralestapia
1y ago
Yes they do; there are multimodal models.
rbanffy
1y ago
I don't think they need to be selective. It's not like Meta can run out of storage.
mnsu
1y ago
For multi-modal models, why not? They would probably be some of the best data.
hulitu
1y ago
Why not? Do you think that AI doesn't enjoy porn? /s
squigz
1y ago
It could be anywhere from a few million to a hundred million books.
https://annas-archive.org/datasets