Velvet proxies OpenAI calls and stores the requests and responses in your PostgreSQL database. That way, you can analyze logs with SQL (instead of a clunky UI). You can also set headers to add caching and metadata (for analysis).
Backstory: We started by building some more general AI data tools (like a text-to-SQL editor). We were frustrated by the lack of basic LLM infrastructure, so we ended up pivoting to focus on the tooling we wanted. So many existing apps, like Helicone, were hard for power users to work with. We just wanted a database.
Scale: We’ve already warehoused 50m requests for customers, and have optimized the platform for scale and latency. We’ve built the proxy on Cloudflare Workers, and latency is nominal. We’ve built some “yak shaving” features that were really complex such as decomposing OpenAI Batch API requests so you can track each log individually. One of our early customers (https://usefind.ai/) makes millions of OpenAI requests per day, up to 1500 requests per second.
Vision: We’re trying to build development tools that have as little UI as possible, that can be controlled entirely with headers and code. We also want to blend cloud and on-prem for the best of both worlds — allowing for both automatic updates and complete data ownership.
Here are some things you can do with Velvet logs:
- Observe requests, responses, and latency
- Analyze costs by metadata, such as user ID
- Track batch progress and speed
- Evaluate model changes
- Export datasets for fine-tuning of gpt-4o-mini
(this video shows how to do each of those: https://www.youtube.com/watch?v=KaFkRi5ESi8)
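To make the "analyze costs by metadata" item concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `llm_logs` table, its columns, and the cost figures are made up and are not Velvet's actual schema, and SQLite stands in for PostgreSQL only so the snippet runs without a server.

```python
import sqlite3

# Hypothetical schema: one row per logged request. The real log table may look different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_logs (user_id TEXT, model TEXT, total_tokens INTEGER, cost_usd REAL)"
)
# Made-up sample rows; the cost values are illustrative, not real pricing.
conn.executemany(
    "INSERT INTO llm_logs VALUES (?, ?, ?, ?)",
    [
        ("alice", "gpt-4o-mini", 1200, 0.25),
        ("alice", "gpt-4o-mini", 800, 0.25),
        ("bob", "gpt-4o", 500, 1.0),
    ],
)


def cost_by_user(connection):
    """Return (user_id, request_count, total_cost) rows, most expensive user first."""
    query = """
        SELECT user_id, COUNT(*) AS requests, SUM(cost_usd) AS total_cost
        FROM llm_logs
        GROUP BY user_id
        ORDER BY total_cost DESC
    """
    return connection.execute(query).fetchall()


rows = cost_by_user(conn)
for user_id, requests, total_cost in rows:
    print(user_id, requests, total_cost)
```

The same GROUP BY would presumably work against a PostgreSQL table, with the metadata pulled out of whatever column it is actually stored in.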
--
To see how it works, try chatting with our demo app that you can use without logging in: https://www.usevelvet.com/sandbox
Setting up your own proxy is 2 lines of code and takes ~5 mins.
Try it out and let us know what you think!
https://github.com/upstash/semantic-cache
"Semantic Cache is a tool for caching natural text based on semantic similarity. It's ideal for any task that involves querying or retrieving information based on meaning, such as natural language classification or caching AI responses. Two pieces of text can be similar but not identical (e.g., "great places to check out in Spain" vs. "best places to visit in Spain"). Traditional caching doesn't recognize this semantic similarity and misses opportunities for reuse."
1. great places to check out in Spain
2. great places to check out in northern Spain
Logically the two are not the same, and they could in fact be very different despite their semantic similarity. Your users will be frustrated and will hate you for it. If an LLM validates the two as being the same, then it's fine, but not otherwise.
I'm speculating here, but I wonder if you could use a two stage pipeline for cache retrieval (kinda like the distance search + reranker model technique used by lots of RAG pipelines). Maybe it would be possible to fine-tune a custom reranker model to only output True if 2 queries are semantically equivalent rather than just similar. So the hypothetical model would output True for "how to change the oil" vs. "how to replace the oil" but would output False in your Spain example. In this case you'd do distance based retrieval first using the normal vector DB techniques, and then use your custom reranker to validate that the potential cache hits are actual hits
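A rough sketch of what that two-stage lookup could look like. This is a toy under stated assumptions: `toy_embed` (bag of words) stands in for a real embedding model, `toy_is_equivalent` (a tiny synonym table) stands in for the hypothetical fine-tuned reranker, and the 0.75 threshold is arbitrary. None of these names come from an existing library.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Stand-in for the hypothetical reranker: map synonyms to one word, then
# require the two queries to contain exactly the same words.
SYNONYMS = {"replace": "change"}


def toy_is_equivalent(query_a, query_b):
    def normalize(text):
        return {SYNONYMS.get(word, word) for word in text.lower().split()}
    return normalize(query_a) == normalize(query_b)


class TwoStageCache:
    def __init__(self, embed, is_equivalent, threshold=0.75):
        self.embed = embed
        self.is_equivalent = is_equivalent
        self.threshold = threshold
        self.entries = []  # list of (query, vector, response)

    def put(self, query, response):
        self.entries.append((query, self.embed(query), response))

    def get(self, query):
        query_vector = self.embed(query)
        # Stage 1: distance-based retrieval, best candidates first.
        candidates = sorted(
            ((cosine(query_vector, vector), cached_query, response)
             for cached_query, vector, response in self.entries),
            key=lambda candidate: candidate[0],
            reverse=True,
        )
        for score, cached_query, response in candidates:
            if score < self.threshold:
                break  # everything after this is even less similar
            # Stage 2: only accept a hit the equivalence check confirms.
            if self.is_equivalent(query, cached_query):
                return response
        return None


cache = TwoStageCache(toy_embed, toy_is_equivalent)
cache.put("how to change the oil", "oil answer")
cache.put("great places to check out in Spain", "spain answer")

oil_hit = cache.get("how to replace the oil")
spain_hit = cache.get("great places to check out in northern Spain")
```

In this toy the northern Spain query passes stage 1 (high similarity) but is rejected in stage 2, so it falls through to a cache miss instead of returning the wrong answer.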
Shameless plug (FOSS): https://github.com/jankovicsandras/plpgsql_bm25 Okapi BM25 search implemented in PL/pgSQL for Postgres.
Even in the simplest of applications where all you’re doing is passing “last user query” + “retrieved articles” into OpenAI (and nothing else that differs between users, like previous queries or user data that may be necessary to answer), this will be a bad experience in many cases.
Queries A and B may have similar embeddings (similar topic) and it may be correct to retrieve the same articles for context (which you could cache), but they can still be different questions with different correct answers.
> for queries that are sufficiently similar
Started with an nginx proxy with rules to cache based on URL/params. Wanted more control over it and explored the Lua/Redis APIs, and opted to build an app to be a little smarter about what I wanted. Extra EC2 cost is negligible compared to cache savings.
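For anyone curious what exact-match caching on URL/params boils down to, here is a minimal sketch. It is not the app described above; `cache_key`, `ResponseCache`, and `fake_upstream` are hypothetical names, and a real version would add expiry and persistent storage (e.g. Redis).

```python
import hashlib
import json


def cache_key(url, params):
    """Build a stable key: same URL and params give the same key, regardless of dict order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{url}\n{canonical}".encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory exact-match cache in front of an upstream fetch function."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url, params):
        key = cache_key(url, params)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        response = self.fetch(url, params)  # only pay for the upstream call on a miss
        self.store[key] = response
        return response


upstream_calls = []


def fake_upstream(url, params):
    """Stand-in for the real API call; records how often it is invoked."""
    upstream_calls.append((url, params))
    return {"echo": params["prompt"]}


cache = ResponseCache(fake_upstream)
first = cache.get("/v1/chat", {"model": "m", "prompt": "hello"})
second = cache.get("/v1/chat", {"prompt": "hello", "model": "m"})  # same params, different order
```

Sorting the keys before hashing is the main design choice here: it keeps semantically identical requests from missing the cache just because the client serialized them differently.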
Have you had thoughts on how you might integrate data from an upstream RAG pipeline, say as a part of a distributed trace, to aid in debugging the core "am I talking to the LLM the right way" use case?
From personal experience, they're all pretty simple to install and use. Then mileage varies in analyzing and taking action on the logs. Does Velvet offer something the others do not?
For my client projects, I've been leaning towards open source platforms like Arize so clients have the option of pulling it inhouse if needed. Most often for HIPAA requirements.
RAG support would be great to add to Velvet. Specifically pgvector and pinecone traces. But maybe Velvet already supports it and I missed it in the quick read of the docs.
We warehouse logs directly to your DB, so you can do whatever you want with the data. Build company ops on top of the DB, run your own evals, join with other tables, hash data, etc.
We’re focusing on backend eng workflows so it’s simple to run continuous monitoring, evals, and fine-tuning with any model. Our interface will focus on surfacing data and analytics to PMs and researchers.
For pgvector/pinecone RAG traces - you can start by including meta tags in the header. Those values will be queryable in the JSON object.
Curious to learn more though - feel free to email me at emma@usevelvet.com.
I believe proxy-based implementations like Velvet are excellent for getting started and solve for the immediate debugging use case; simply changing the base path of the OpenAI SDK makes things really simple (the other solutions mentioned typically require a few more minutes to set up).
At Langfuse (similarly to the other solutions mentioned above), we prioritize asynchronous and batched logging, which is often preferred for its scalability and zero impact on uptime and latency. We have developed numerous integrations (for OpenAI specifically, an SDK wrapper), and you can also use our SDKs and Decorators to integrate with any LLM.
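To illustrate the general idea of asynchronous, batched logging (this is a generic sketch, not the Langfuse SDK or its API): the request path only pushes an event onto a queue, and a background thread ships events to a sink in batches. `BatchLogger`, `sink`, and the batch size and flush interval are all assumptions made for the example.

```python
import queue
import threading


class BatchLogger:
    """Queue events on the hot path; a worker thread flushes them in batches."""

    def __init__(self, sink, batch_size=10, flush_interval=0.5):
        self.sink = sink  # callable that receives a list of events
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._queue = queue.Queue()
        self._stop = object()  # sentinel that tells the worker to exit
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        """Non-blocking from the caller's point of view: just enqueue."""
        self._queue.put(event)

    def close(self):
        """Flush whatever is left and stop the worker."""
        self._queue.put(self._stop)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            try:
                item = self._queue.get(timeout=self.flush_interval)
            except queue.Empty:
                item = None  # timed out: flush a partial batch if there is one
            if item is self._stop:
                break
            if item is not None:
                batch.append(item)
            if batch and (item is None or len(batch) >= self.batch_size):
                self.sink(batch)
                batch = []
        if batch:
            self.sink(batch)  # final flush on shutdown


received_batches = []
logger = BatchLogger(received_batches.append, batch_size=10)
for i in range(25):
    logger.log({"request_id": i})
logger.close()
```

The trade-off compared with a proxy is that a crash can lose events still sitting in the queue, but a slow or unavailable logging backend never blocks the LLM call itself.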
> For my client projects, I've been leaning towards open source platforms like Arize so clients have the option of pulling it inhouse if needed. Most often for HIPAA requirements.
I can echo this. We observe many self-hosted deployments in larger enterprises and HIPAA-related companies, thus we made it very simple to self-host Langfuse. Especially when PII is involved, self-hosting makes adopting an LLM observability tool much easier in larger teams.
There are plenty of other solutions (examples include Presto, Athena, Redshift, or straight up jq over raw log files on disk) which are better suited for this use case. Storing log data in a relational DB is pretty much always an anti-pattern, in my experience.
Here's a video about what we do with the data: https://www.youtube.com/watch?v=KaFkRi5ESi8
PostgreSQL (Neon) is our free self-serve offering because it’s easy to spin up quickly.
May I ask what you specifically were frustrated about? Seems like there are more than enough solutions
Check out our quickstart for an example of what that looks like! https://docs.smith.langchain.com/
Also, caught a few typos on the site: https://triplechecker.com/s/o2d2iR/usevelvet.com?v=qv9Qk