For simple LLM tasks, don't bother using this tool. It won't do much for you.
If you have a more complicated task (e.g. knowledge database lookups, chain-of-thought reasoning, multi-hop lookups...) then DSPy offers two things: a clean class-based representation of your workflow, and a way to *solve* for the best prompt structure for your problem.
To me, the last part is the most interesting because it promises to eliminate tedious prompt engineering. All you need is a set of examples to "train" your prompts on.
I was not particularly impressed by the tutorial notebook. I’m not sure I believe that automatic prompt generation is nearly as easy as it sounds.
What task did you try it on?
It shows you how it takes some ~25 Pythonic lines of code to make GPT-3.5 retrieval accuracy go from the 26-36% range to 60%.
Not a bad deal when you apply it to your own problem?
You give DSPy (1) your free-form code with declarative calls to LMs, (2) a few inputs [labels optional], and (3) some validation metric [e.g., sanity checks].
It simulates your code on the inputs. When there's an LM call, it will make one or more simple zero-shot calls that respect your declarative signature. Think of this like a more general form of "function calling" if you will. It's just trying out things to see what passes your validation logic, but it's a highly-constrained search process.
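That search loop can be sketched in plain Python. This is an illustration of the idea only, not DSPy's actual API: the LM call is stubbed with a lookup table, and the "signature" is just the function's declared inputs and outputs.

```python
# Toy sketch: try a few zero-shot LM calls that respect a declared
# signature (question -> answer), and keep only the executions that
# pass the user's validation metric.

def fake_lm(question, temperature):
    # Stand-in for a zero-shot LM call; returns a candidate answer.
    candidates = {0.0: "Paris", 0.7: "paris", 1.0: "London"}
    return candidates[temperature]

def validate(example, prediction):
    # The user-supplied metric: here, a simple sanity check.
    return prediction.strip().lower() == example["answer"].lower()

example = {"question": "Capital of France?", "answer": "Paris"}

good_traces = []
for temp in (0.0, 0.7, 1.0):           # try a few zero-shot variants
    pred = fake_lm(example["question"], temperature=temp)
    if validate(example, pred):        # keep only passing executions
        good_traces.append((example["question"], pred))

print(good_traces)
# [('Capital of France?', 'Paris'), ('Capital of France?', 'paris')]
```

The point is that the signature constrains *what* each call produces and the metric constrains *which* executions count as good, so the search space stays small.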
The constraints enforced by the signature (per LM call) and the validation metric allow the compiler [with some metaprogramming tricks] to gather "good" and "bad" examples of execution for every step in which your code calls an LM, even when you have no labels for that step, because you're just exploring different pipelines. (Who has time to label each step?)
For now, we throw away the bad examples. The good examples become potential demonstrations. The compiler can now do an optimization process to find the best combination of these automatically bootstrapped demonstrations in the prompts. Maybe the best on average, maybe (in principle) the predicted best for a specific input. There's no magic here, it's just optimizing your metric.
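A toy sketch of that optimization step. Everything here is made up for illustration (it is not DSPy's real API, and the metric is faked with a lookup table); in practice the search is smarter than brute force, but the objective is the same: pick the combination of bootstrapped demonstrations that maximizes your metric.

```python
from itertools import combinations

demos = ["demo_a", "demo_b", "demo_c"]   # bootstrapped "good" examples

def metric_score(demo_subset):
    # Stand-in for: build a prompt containing these demonstrations,
    # run the pipeline over a dev set, count validated outputs.
    fake_scores = {
        frozenset(): 1,
        frozenset({"demo_a"}): 2,
        frozenset({"demo_b"}): 1,
        frozenset({"demo_c"}): 1,
        frozenset({"demo_a", "demo_b"}): 3,
        frozenset({"demo_a", "demo_c"}): 2,
        frozenset({"demo_b", "demo_c"}): 2,
        frozenset({"demo_a", "demo_b", "demo_c"}): 2,
    }
    return fake_scores[frozenset(demo_subset)]

# Exhaustively score every subset of demonstrations and keep the best.
best = max(
    (subset for r in range(len(demos) + 1)
            for subset in combinations(demos, r)),
    key=metric_score,
)
print(best)  # ('demo_a', 'demo_b')
```

Note that more demonstrations is not automatically better: in this fake table, adding `demo_c` to the winning pair makes the score drop.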
The same bootstrapping logic lends itself (with more internal metaprogramming tricks, which you don't need to worry about) to finetuning models for your LM calls, instead of prompting.
In practice, this works really well because even tiny LMs can do powerful things when they see a few well-selected examples.
I got frustrated in the same way with "black box prompting": every library hides prompts/chains in layers of libraries... while it should have been declarative.
EdgeChains lets you specify your prompt and chain in jsonnet. This is why I think generative AI needs declarative orchestration, unlike previous generations. https://github.com/arakoodev/edgechains#why-do-you-need-decl...
func = llm('function that sorts two input argument lists')
where llm calls OpenAI or a local LLM (cached for later use). This way you don't lose the benefits of a coding interface (as opposed to generating all your code with OpenAI, and the maintainability mess that can come with that). And you get readability through the prompt, etc. (I mean, this project is sort of in that direction already.) It's basically like writing code with a framework that is 'filled out' automatically by a coworker. Worth being creative with ideas like this at least.
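A minimal sketch of that interface. The code-generating model is replaced by a stub lookup here (a real version would call an API, exec the returned code, and verify it before trusting it); the cache ensures the model is only consulted once per spec.

```python
# Hypothetical llm(spec) factory: returns a callable implementing the
# natural-language spec, caching the result for later use.

_CACHE = {}

def llm(spec):
    if spec in _CACHE:                 # reuse previously "generated" code
        return _CACHE[spec]
    # Stub standing in for "ask the model to write this function".
    stub_implementations = {
        "function that sorts two input argument lists":
            lambda xs, ys: (sorted(xs), sorted(ys)),
    }
    func = stub_implementations[spec]
    _CACHE[spec] = func
    return func

func = llm("function that sorts two input argument lists")
print(func([3, 1, 2], [9, 7]))  # ([1, 2, 3], [7, 9])
```

The second call to `llm()` with the same spec returns the identical cached callable, which is what keeps the coding interface deterministic after the first generation.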
> It may be illuminating to try to imagine what would have happened if, right from the start our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that history would, in a sense, have repeated itself, and that computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system.
https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...
I’m joking of course but I do think LLM’s will become part of the programming language lexer of some kind. If not already being looked into.
I think once we get past "make the LLM generate #$%!#@ JSON" we will start seeing a lot more of this, since we will then be able to constrain the code-paths that are followed.
I could absolutely see LLM-powered operators being introduced at some point, that we use just like an if statement.
Imagine "if (input like LLM(greeting))" or "while (LLM(input) like 'interested')"
essentially distilling unstructured inputs into a canonical form to be used in control flows:
switch(LLM(input)):
    case request
    case question
    case query
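That dispatch pattern can be sketched today with a classifier standing in for the LLM (the labels and keyword heuristics below are made up for illustration; a real version would be an LLM call constrained to emit exactly one of the labels):

```python
# Distill unstructured input into a canonical label, then let ordinary
# control flow switch on it.

def llm_classify(text):
    # Stand-in for an LLM constrained to output one of three labels.
    lowered = text.lower()
    if "please" in lowered or "can you" in lowered:
        return "request"
    if lowered.endswith("?"):
        return "question"
    return "query"

def handle(text):
    handlers = {
        "request": lambda: "routing to task executor",
        "question": lambda: "routing to QA pipeline",
        "query": lambda: "routing to search",
    }
    return handlers[llm_classify(text)]()

print(handle("Can you book a table for two"))  # routing to task executor
print(handle("What is DSPy?"))                 # routing to QA pipeline
```

Because the classifier can only emit one of the known labels, the code paths downstream stay fully enumerable, which is exactly the "constrain the code-paths" point above.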
If a new language is the correct answer for working with LLMs, then I would prefer a real language spec, compiler/interpreter, debugger, and language server before considering adoption. That's a lot of work, of course, which is why I'm apprehensive about how it would grow.
I think Colang from NVIDIA seems the "closest" so far, and it's pretty specifically focused on chatbots, which is too bad.
Which mistakes, specifically, do you mean they make?
This seems like it could be particularly useful in restricting the domain of the LLM. I.e. ensuring it outputs specific values and few-shot training these steps.
I think there may be a really powerful integration here with LMQL. Has there been any thinking about doing so? LMQL seems quite powerful at constraining values and limiting the number of tokens used. This could be a great option for some steps, e.g. a categorization ("is this task type X or type Y") before continuing in the logic flow.
https://colab.research.google.com/github/stanfordnlp/dspy/bl...
I'm having trouble understanding the value provided here.
A prompt is a string. Why abstract that string away from me like this?
My instinct is that this will make it harder, not easier, for me to understand what's going on and make necessary changes.
We want developers to iterate quickly on system designs: How should we break down the task? Where do we call LMs? What should they do?
---
If you can guess the right prompts right away for each LLM, tweak them well for any complex pipeline, and rarely have to change the pipeline (and hence all prompts in it), then you probably won't need this.
That said, it turns out that (a) prompts that work well are very specific to particular LMs, large & especially small ones, (b) prompts that work well change significantly when you tweak your pipeline or your data, and (c) prompts that work well may be long and time-consuming to find.
Oh, and often the prompt that works well changes for different inputs. Thinking in terms of strings is a glaring anti-pattern.
https://github.com/stanfordnlp/dspy#5a-dspy-vs-thin-wrappers...