Show HN: Efficient Data Formats for GPT

(nikas.praninskas.com)

6 points · nikaspran · 3y ago · 4 comments


I ran a small test comparing different data serialization formats for use with GPT models (and possibly other LLMs). The test is obviously very limited, but it was striking how much of a difference switching from JSON to something like YAML could make.

I wonder if we might also see LLM-specific data serialisation formats in the future, designed to tokenize as efficiently as possible and to enhance the generative capability of the models.
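To make the JSON vs. YAML comparison concrete, here is a rough sketch of how one could measure it. Everything in it is an assumption for illustration: `to_simple_yaml` is a hypothetical, minimal YAML-like emitter (not a real YAML library), and `rough_token_count` is a crude stand-in for a real model tokenizer, so the numbers only hint at the trend and are not actual GPT token counts.

```python
import json
import re


def to_simple_yaml(value, indent=0):
    """Emit a minimal YAML-like block form for nested dicts/lists of plain scalars.

    Hypothetical helper for illustration only; it does no quoting or escaping.
    """
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.append(to_simple_yaml(item, indent + 1))
            else:
                lines.append(f"{pad}{key}: {item}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")
                lines.append(to_simple_yaml(item, indent + 1))
            else:
                lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{value}")
    return "\n".join(lines)


def rough_token_count(text):
    """Crude proxy: each run of word characters and each punctuation mark is one token.

    A real comparison should use the tokenizer of the target model instead.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


# Made-up sample data, not taken from the original test.
sample = {"users": [{"name": "Ada", "age": 36}, {"name": "Linus", "age": 54}]}

json_text = json.dumps(sample, indent=2)
yaml_text = to_simple_yaml(sample)

print("JSON rough tokens:", rough_token_count(json_text))
print("YAML rough tokens:", rough_token_count(yaml_text))
```

With this proxy, the quotes, braces, and commas in JSON each count as separate tokens, which is why the YAML-like form comes out smaller for the same data. Swapping in a real tokenizer would change the absolute numbers.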

mmaia · 3y ago

Thanks for sharing this. I'm currently using JSON/JSON Schema and will consider switching to YAML.

emrah · 3y ago

Why does the serialization need to be text-based? Why not binary formats? Or use SQLite or another DB for storage and retrieval? That might also help with not having to read all the data into memory at once (although it would be slower to run).

nikaspran (OP) · 3y ago

I don't think it needs to be text-based specifically, but given that LLMs are usually trained primarily on text (at least currently), I'm not sure they'd be able to generate binary directly in any meaningful way.

As for using DBs, that's certainly an option (e.g. LangChain and such), but at some point you still need to bring the data into the context, so I'd say it's still interesting to consider what would be an efficient way to represent that data as text.
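As a rough illustration of that last point, here is a sketch that pulls only the relevant rows out of SQLite and renders them as compact text for a prompt. The table, the data, and the `rows_as_prompt_text` helper are all made up for this example, and the CSV-like output is just one possible text representation, not a recommendation.

```python
import sqlite3


def rows_as_prompt_text(conn, query, params=()):
    """Run a query and render the result as a compact header + rows block.

    Hypothetical helper: values are joined with commas and not escaped,
    so it only suits simple data without commas or newlines.
    """
    cur = conn.execute(query, params)
    header = [col[0] for col in cur.description]
    lines = [",".join(header)]
    for row in cur:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)


# In-memory database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ada", 36), ("Linus", 54), ("Grace", 45)],
)

# Only the rows that matter for the question end up in the context text.
context = rows_as_prompt_text(
    conn,
    "SELECT name, age FROM users WHERE age > ? ORDER BY name",
    (40,),
)
print(context)
```

The database handles storage and filtering, but whatever it returns still has to be serialized into text before the model sees it, so the choice of text format matters either way.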
