For simple LLM tasks, don't bother using this tool. It won't do much for you.
If you have a more complicated task (e.g. knowledge database lookups, chain-of-thought reasoning, multi-hop lookups...) then DSPy offers two things: a clean class-based representation of your workflow, and a way to *solve* for the best prompt structure for your problem.
To me, the last part is the most interesting because it promises to eliminate tedious prompt engineering. All you need is a set of examples to "train" your prompts on.
I was not particularly impressed by the tutorial notebook. I’m not sure I believe that automatic prompt generation is nearly as easy as it sounds.
What task did you try it on?
It shows you how it takes some ~25 Pythonic lines of code to make GPT-3.5 retrieval accuracy go from the 26-36% range to 60%.
Not a bad deal when you apply it to your own problem?
You give DSPy (1) your free-form code with declarative calls to LMs, (2) a few inputs [labels optional], and (3) some validation metric [e.g., sanity checks].
It simulates your code on the inputs. When there's an LM call, it will make one or more simple zero-shot calls that respect your declarative signature. Think of this like a more general form of "function calling" if you will. It's just trying out things to see what passes your validation logic, but it's a highly-constrained search process.
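That search loop can be sketched in plain Python. This is an illustration of the idea only, not DSPy's actual API: the LM call is stubbed with a lookup table, and the "signature" is just the function's declared inputs and outputs.

```python
# Toy sketch: try a few zero-shot LM calls that respect a declared
# signature (question -> answer), and keep only the executions that
# pass the user's validation metric.

def fake_lm(question, temperature):
    # Stand-in for a zero-shot LM call; returns a candidate answer.
    candidates = {0.0: "Paris", 0.7: "paris", 1.0: "London"}
    return candidates[temperature]

def validate(example, prediction):
    # The user-supplied metric: here, a simple sanity check.
    return prediction.strip().lower() == example["answer"].lower()

example = {"question": "Capital of France?", "answer": "Paris"}

good_traces = []
for temp in (0.0, 0.7, 1.0):           # try a few zero-shot variants
    pred = fake_lm(example["question"], temperature=temp)
    if validate(example, pred):        # keep only passing executions
        good_traces.append((example["question"], pred))

print(good_traces)
# [('Capital of France?', 'Paris'), ('Capital of France?', 'paris')]
```

The point is that the signature constrains *what* each call produces and the metric constrains *which* executions count as good, so the search space stays small.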
The constraints enforced by the signature (per LM call) and the validation metric allow the compiler [with some metaprogramming tricks] to gather "good" and "bad" examples of execution for every step in which your code calls an LM, even when you have no labels for that step, because you're just exploring different pipelines. (Who has time to label each step?)
For now, we throw away the bad examples. The good examples become potential demonstrations. The compiler can now do an optimization process to find the best combination of these automatically bootstrapped demonstrations in the prompts. Maybe the best on average, maybe (in principle) the predicted best for a specific input. There's no magic here, it's just optimizing your metric.
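A toy sketch of that optimization step. Everything here is made up for illustration (it is not DSPy's real API, and the metric is faked with a lookup table); in practice the search is smarter than brute force, but the objective is the same: pick the combination of bootstrapped demonstrations that maximizes your metric.

```python
from itertools import combinations

demos = ["demo_a", "demo_b", "demo_c"]   # bootstrapped "good" examples

def metric_score(demo_subset):
    # Stand-in for: build a prompt containing these demonstrations,
    # run the pipeline over a dev set, count validated outputs.
    fake_scores = {
        frozenset(): 1,
        frozenset({"demo_a"}): 2,
        frozenset({"demo_b"}): 1,
        frozenset({"demo_c"}): 1,
        frozenset({"demo_a", "demo_b"}): 3,
        frozenset({"demo_a", "demo_c"}): 2,
        frozenset({"demo_b", "demo_c"}): 2,
        frozenset({"demo_a", "demo_b", "demo_c"}): 2,
    }
    return fake_scores[frozenset(demo_subset)]

# Exhaustively score every subset of demonstrations and keep the best.
best = max(
    (subset for r in range(len(demos) + 1)
            for subset in combinations(demos, r)),
    key=metric_score,
)
print(best)  # ('demo_a', 'demo_b')
```

Note that more demonstrations is not automatically better: in this fake table, adding `demo_c` to the winning pair makes the score drop.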
The same bootstrapping logic lends itself (with more internal metaprogramming tricks, which you don't need to worry about) to finetuning models for your LM calls, instead of prompting.
In practice, this works really well because even tiny LMs can do powerful things when they see a few well-selected examples.
I got frustrated in the same way with "black box prompting": every library hides prompts/chains in layers of libraries... while it should have been declarative.
EdgeChains lets you specify your prompt and chain in jsonnet. This is why I think generative AI needs declarative orchestration, unlike previous generations. https://github.com/arakoodev/edgechains#why-do-you-need-decl...
func = llm('function that sorts two input argument lists')
where llm calls OpenAI or a local LLM (cached for later use). This way you don't lose the benefits of a coding interface (as opposed to generating all your code with OpenAI, and the maintainability mess that can come with that). And you get readability through the prompt, etc. (I mean, this project is sort of in that direction already.) It's basically like writing code with a framework that is 'filled out' automatically by a coworker. Worth being creative with ideas like this at least.
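A minimal sketch of that interface. The code-generating model is replaced by a stub lookup here (a real version would call an API, exec the returned code, and verify it before trusting it); the cache ensures the model is only consulted once per spec.

```python
# Hypothetical llm(spec) factory: returns a callable implementing the
# natural-language spec, caching the result for later use.

_CACHE = {}

def llm(spec):
    if spec in _CACHE:                 # reuse previously "generated" code
        return _CACHE[spec]
    # Stub standing in for "ask the model to write this function".
    stub_implementations = {
        "function that sorts two input argument lists":
            lambda xs, ys: (sorted(xs), sorted(ys)),
    }
    func = stub_implementations[spec]
    _CACHE[spec] = func
    return func

func = llm("function that sorts two input argument lists")
print(func([3, 1, 2], [9, 7]))  # ([1, 2, 3], [7, 9])
```

The second call to `llm()` with the same spec returns the identical cached callable, which is what keeps the coding interface deterministic after the first generation.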
> It may be illuminating to try to imagine what would have happened if, right from the start our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that history would, in a sense, have repeated itself, and that computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system.
https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...
I’m joking of course but I do think LLM’s will become part of the programming language lexer of some kind. If not already being looked into.
I think once we get past "make the LLM generate #$%!#@ JSON" we will start seeing a lot more of this, since we will then be able to constrain the code-paths that are followed.
I could absolutely see LLM-powered operators being introduced at some point, that we use just like an if statement.
Imagine "if (input like LLM(greeting))" or "while (LLM(input) like 'interested')"
essentially distilling unstructured inputs into a canonical form to be used in control flows:
switch(LLM(input)):
    case request
    case question
    case query
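That dispatch pattern can be sketched today with a classifier standing in for the LLM (the labels and keyword heuristics below are made up for illustration; a real version would be an LLM call constrained to emit exactly one of the labels):

```python
# Distill unstructured input into a canonical label, then let ordinary
# control flow switch on it.

def llm_classify(text):
    # Stand-in for an LLM constrained to output one of three labels.
    lowered = text.lower()
    if "please" in lowered or "can you" in lowered:
        return "request"
    if lowered.endswith("?"):
        return "question"
    return "query"

def handle(text):
    handlers = {
        "request": lambda: "routing to task executor",
        "question": lambda: "routing to QA pipeline",
        "query": lambda: "routing to search",
    }
    return handlers[llm_classify(text)]()

print(handle("Can you book a table for two"))  # routing to task executor
print(handle("What is DSPy?"))                 # routing to QA pipeline
```

Because the classifier can only emit one of the known labels, the code paths downstream stay fully enumerable, which is exactly the "constrain the code-paths" point above.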
If a new language is the correct answer for working with LLMs, then I would prefer a real language spec, compiler/interpreter, debugger, and language server before considering adoption. That's a lot of work, of course, which is why I'm apprehensive about how it would grow.
I think Colang from NVIDIA seems the "closest" so far, and it's pretty specifically focused on chatbots, which is too bad.
Which mistakes, specifically, do you mean they make?
This seems like it could be particularly useful in restricting the domain of the LLM. I.e. ensuring it outputs specific values and few-shot training these steps.
I think there may be a really powerful integration here with LMQL. Has there been any thinking about doing so? LMQL seems quite powerful at constraining values and limiting the number of tokens used. This could be a great option for some steps, e.g. a categorization ("is this task type X or type Y") before continuing in the logic flow.
https://colab.research.google.com/github/stanfordnlp/dspy/bl...
I'm having trouble understanding the value provided here.
A prompt is a string. Why abstract that string away from me like this?
My instinct is that this will make it harder, not easier, for me to understand what's going on and make necessary changes.
We want developers to iterate quickly on system designs: How should we break down the task? Where do we call LMs? What should they do?
---
If you can guess the right prompts right away for each LLM, tweak them well for any complex pipeline, and rarely have to change the pipeline (and hence all prompts in it), then you probably won't need this.
That said, it turns out that (a) prompts that work well are very specific to particular LMs, large & especially small ones, (b) prompts that work well change significantly when you tweak your pipeline or your data, and (c) prompts that work well may be long and time-consuming to find.
Oh, and often the prompt that works well changes for different inputs. Thinking in terms of strings is a glaring anti-pattern.
https://github.com/stanfordnlp/dspy#5a-dspy-vs-thin-wrappers...