- I can't really train an LLM myself. That's a huge lift.
- I can use an off-the-shelf model like GPT-3.5-Turbo and improve it through their Fine-Tuning API, which takes example prompt/response pairs. But that's not a great interface for incorporating a big block of semi-structured data.
- I could use RAG (Retrieval-Augmented Generation): essentially a lookup step that finds the relevant passages in my textual dataset, loads them into the context window, and uses them for generation. But not all of my data lends itself cleanly to RAG.
- I can use the OpenAI API to generate embeddings from my dataset, but I don't know how to then use those embeddings to augment a model, or to power useful search and/or generation.
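For context, here's my rough understanding of how the retrieval step would work, though I'm not sure it's the right approach: embed each chunk of the dataset once, embed the query at ask time, and paste the top-scoring chunks into the prompt. A toy sketch, where `embed()` is a bag-of-words stand-in for a real embeddings API call (e.g. one of OpenAI's text-embedding models) and the chunk strings are made up:

```python
import math

def embed(text, vocab):
    # Toy bag-of-words vector; a stand-in for a real embeddings API call.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by similarity to the query, return the top k.
    vocab = sorted({w for c in chunks for w in c.lower().split()})
    q_vec = embed(query, vocab)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c, vocab)),
                    reverse=True)
    return ranked[:k]

# Made-up chunks from a hypothetical dataset.
chunks = [
    "Invoices are archived nightly to cold storage.",
    "The billing service retries failed payments three times.",
    "Search indexes are rebuilt every Sunday.",
]
query = "how are failed payments handled"
context = retrieve(query, chunks, k=1)
# The retrieved chunk(s) then go into the prompt ahead of the question.
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
```

With real embeddings you'd precompute and store the chunk vectors (in a vector DB or even a flat file) instead of re-embedding per query. What I can't figure out is whether this kind of retrieval is the only practical use of the embeddings, or whether they can feed into the model more directly.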
How are you guys plugging your large textual datasets into LLMs? Any advice would be much appreciated.