Context: I missed [almost] the entire A.I. wave, but I knew that one day I would have to learn something about it and/or use it. That day has come. I'm on a team that is migrating from one engine to another, let's say "engine A → engine B". Working from A's perspective, we map entries into B's model (inbound), and once B's response comes back, we map it back into A's model (outbound). This is a chore, and much of the work is repetitive, but it comes with edge cases we need to watch out for, and unfortunately there isn't a solid foundation of patterns beyond the Domain-driven design (DDD) thing. It seemed like a good use case for an A.I.
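For concreteness, here's the shape of that inbound/outbound mapping as a minimal Python sketch; the models and field names are made up, not our actual engines:

```python
from dataclasses import dataclass

# Hypothetical models; the real engine A/B entries are more complex.
@dataclass
class EngineAEntry:
    customer_id: str
    amount_cents: int

@dataclass
class EngineBRequest:
    client_ref: str
    amount: float  # engine B wants decimal units

def map_inbound(a: EngineAEntry) -> EngineBRequest:
    """A -> B: rename fields and convert units."""
    return EngineBRequest(client_ref=a.customer_id, amount=a.amount_cents / 100)

def map_outbound(b: EngineBRequest) -> EngineAEntry:
    """B -> A: invert the inbound mapping."""
    return EngineAEntry(customer_id=b.client_ref, amount_cents=round(b.amount * 100))
```

The nice property of writing both directions explicitly is that the round trip (A → B → A) is trivially testable, which is where the edge cases tend to surface.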
Attempts: I began by asking ChatGPT and Bard questions similar to: "how to train LLM on own codebase" and "how to get started with prompt engineering using own codebase".
I concluded that fine-tuning large models is expensive and unrealistic for my RTX 3060 with 6 GB of VRAM, no surprise there; so I searched here on Hacker News for keywords like "llama", "fine-tuning", "local machine", etc., and found out about ollama and DeepSeek.
I tried both ollama and DeepSeek; the former was slow, but not as slow as the latter, which was dead slow using a 13B model. I tried a 6/7B model (I think it was codellama) and got reasonable results and speed. After feeding it some data, I was on my way to try and train on the codebase when a friend of mine suggested I use Retrieval-Augmented Generation (RAG) with a Langchain + Ollama setup; I have yet to try it.
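For anyone unfamiliar, my understanding is that the retrieval half of a RAG pipeline boils down to: chunk the codebase, embed the chunks, and pull the closest chunks into the prompt. A toy stdlib-only sketch of that shape (a real Langchain + Ollama setup would use a neural embedding model instead of this bag-of-words stand-in, but the retrieval step looks the same):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words token counts. A real setup would call an
    # embedding model; only the retrieval shape matters for this sketch.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Return the k chunks most similar to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical code chunks indexed from the codebase.
chunks = [
    "def map_inbound(entry): convert engine A entry to engine B request",
    "def map_outbound(resp): convert engine B response back to engine A model",
    "README: project build instructions",
]
context = retrieve("convert engine B response back", chunks, k=1)
prompt = "Using this code as context:\n" + "\n".join(context) + "\n\nQuestion: ..."
```

The retrieved chunk(s) get stuffed into the prompt, so the model answers with your code in view instead of from its training data alone.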
Any thoughts, suggestions or experiences to share?
I'd appreciate it.
for the repetitive stuff, just use copilot embedded in whatever editor you use.
the edge cases are tricky: to actually avoid them, the model would need an understanding of both the use case (which is easy to describe to the model) and the codebase itself (which is difficult, since descriptions/docstrings are not enough to capture the complex behaviors that can arise from interactions between parts of your codebase).
idk how you would train/finetune a model to somehow have this understanding of your codebase. I doubt just doing next-token prediction would help; you'd likely have to create chat data discussing the intricacies of your codebase and do DPO/RLHF to bake it into your model.
look into techniques like QLoRA that reduce the memory needed during tuning. look into platforms like vast.ai to rent GPUs for cheap.
RAG/agents could be useful, but probably not. you could store info about the functions in your codebase, such as the signature, the functions they call, the docstring, and known edge cases associated with them. if you don't have docstrings, using an LLM to generate them is feasible.
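for what it's worth, building that kind of function index is straightforward with Python's stdlib `ast` module; a sketch (the sample source is illustrative):

```python
import ast

def index_functions(source: str) -> list[dict]:
    """Collect name, arguments, docstring, and called names for each function."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "docstring": ast.get_docstring(node) or "",
                # names this function calls (roughly) - useful retrieval context
                "calls": sorted({c.func.id for c in ast.walk(node)
                                 if isinstance(c, ast.Call)
                                 and isinstance(c.func, ast.Name)}),
            })
    return records

src = '''
def map_inbound(entry):
    "Convert an engine A entry to an engine B request."
    return normalize(entry)
'''
info = index_functions(src)
```

each record can then be stored alongside hand-written edge-case notes and fed to the retriever.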
Paul, if you are up for that, it would be tremendously helpful to have a video (or series) that shows what aider can realistically do given a boring, medium-sized CRUD code base. The logs in the examples are too narrow and also build no intuition about what to do when things go wrong.
The idea is you put your code into the best possible model (GPT4) and tell it what you want and it generates code.
Realistically, since we are in an Azure ecosystem, I would use Codex to try out a solution.
Now I definitely share Linus' sentiment [1] on this topic.
It would be incredible to feed an A.I. some code and ask it to track down bugs.
Can you record input and output at some layers of your system and then use that data to test the ported code? Make sure the inputs produce the same outputs.
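A sketch of that record-and-replay idea (the mapping functions here are hypothetical stand-ins for the old and ported code):

```python
import json

# Hypothetical original and ported implementations of the same mapping layer.
def map_inbound_old(entry: dict) -> dict:
    return {"client_ref": entry["customer_id"], "amount": entry["amount_cents"] / 100}

def map_inbound_new(entry: dict) -> dict:  # the port under test
    return {"client_ref": entry["customer_id"], "amount": entry["amount_cents"] / 100}

# Step 1: record real inputs and outputs from the old system.
recorded = [{"input": e, "output": map_inbound_old(e)}
            for e in ({"customer_id": "c1", "amount_cents": 1234},
                      {"customer_id": "c2", "amount_cents": 0})]
# Round-trip through JSON, as it would be when stored on disk.
golden = json.loads(json.dumps(recorded))

# Step 2: replay the recorded inputs through the port and compare outputs.
failures = [case for case in golden
            if map_inbound_new(case["input"]) != case["output"]]
assert not failures
```

The recordings double as regression tests: any divergence between old and new behavior shows up as a failing case, including the edge cases nobody documented.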
But you still have to read the tests and decide if that's what you want the code to do, and make sure the descriptions aren't gobbledygook.
Presumably what matters in this project is correctness, not how many unnecessary cycles you can burn.
This sounds like a job for protobufs or some kind of serialization solution. And you already know there are dragons here, so letting an LLM try to solve this is just going to mean more rework/validation for you.
If you don't understand the problem space, hire a consultant. LLMs are not consultants (yet). Either way, I'd quit wasting time trying to feed your codebase into an LLM and just do the work.
Then don't use AI for it.
Bluntly.
This is a poor use-case; it doesn't matter what model you use, you'll get a disappointing result.
These are the domains where using AI coding currently shines:
1) You're approaching a new well established domain (eg. building an android app in kotlin), and you already know how to build things / apps, but not specifically that exact domain.
Example: How do I do X but for an android app in kotlin?
2) You're building out a generic scaffold for a project and need some tedious (but generic) work done.
Example: https://github.com/smol-ai/developer
3) You have a standard, but specific question regarding your code, and although related Q/A answers exist, nothing seems to specifically target the issue you're having.
Example: My nginx configuration is giving me [SPECIFIC ERROR] for [CONFIG FILE]. What's wrong and how can I fix it?
The domains where it does not work are:
1) You have some generic code with domain/company/whatever specific edge cases.
The edge cases, broadly speaking, no matter how well documented, will not be handled well by the model.
Edge cases are exactly that: edge cases. The common corpus of 'how to X' material does not cover them, so they will not be handled, and the results will require you to review and complete them manually.
2) You have some specific piece of code you want to refactor 'to solve xxx', but the code is not covered well by tests.
LLMs struggle to refactor existing code, and the difficulty is proportional to the code length. There are technical reasons for this (mainly the randomness of token sampling), but tl;dr: it's basically a crapshoot.
Might work. Might not. If you have no tests, who knows? You have to manually verify both the new functionality and the old functionality, but maybe it helps a bit, at scale, for trivial problems.
3) You're doing something obscure or using a new library / new version of the library.
The LLM will have no context for this, and will generate rubbish / old deprecated content.
Obscure requirements have an unfortunate tendency to mimic the few training examples that exist, and may generate verbatim copies, depending on the model you use.
...
So. Concrete advice:
1) sigh~
> a friend of mine came and suggested that I use Retrieval-Augmented Generation (RAG), I have yet to try it, with a setup Langchain + Ollama.
Ignore this advice. RAG and langchain are not the solutions you are looking for.
2) Use a normal coding assistant like copilot.
This is the most effective way to use AI right now.
There are some frameworks that let you use open-source models if you don't want to use OpenAI.
3) Do not attempt to bulk generate code.
AI coding isn't at that level. Right now, the tooling is primitive, and large scale coherent code generation is... not impossible, but it is difficult (see below).
You will be more effective using an existing proven path that uses 'copilot' style helpers.
However...
...if you do want to pursue code generation, here's a broad blueprint to follow:
- decompose your task into steps
- decompose your steps into functions
- generate or write tests and function definitions
- generate an api specification (eg. .d.ts file) for your function definitions
- for each function definition, generate the code for the function passing the api specification in as the context. eg. "Given functions x, y, z with the specs... ; generate an implementation of q that does ...".
- repeatedly generate outputs for the above until you get one that passes the tests you wrote.
This approach broadly scales to reasonably complex problems, so long as you partition your problem into module sized chunks.
I personally like to put something like "you're building a library/package to do xxx" or "as a one file header" as a top level in the prompt, as it seems to link into the 'this should be isolated and a package' style of output.
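The generate-until-tests-pass loop from the last step might look like this; `llm_generate` and the toy test runner are placeholders for your actual model call and test suite:

```python
# Sketch of the "generate until the tests pass" loop.
def generate_until_pass(prompt, run_tests, llm_generate, max_attempts=5):
    for attempt in range(max_attempts):
        candidate = llm_generate(prompt, attempt)
        if run_tests(candidate):
            return candidate
    return None  # give up; fall back to writing it by hand

# Toy demo: this fake "model" only produces a correct implementation on try 3.
attempts_log = []
def fake_llm(prompt, attempt):
    attempts_log.append(attempt)
    return ("def add(a, b): return a + b" if attempt == 2
            else "def add(a, b): return a - b")

def run_tests(code):
    ns = {}
    exec(code, ns)  # in real use, run the generated code's test suite sandboxed
    return ns["add"](2, 3) == 5

winner = generate_until_pass("Given functions x, y, z...; implement add",
                             run_tests, fake_llm)
```

The key point is that the tests, not the model, decide when to stop, which is why writing the tests and API specification first matters so much in this workflow.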
However, I will caveat this with two points:
1) You generate a lot of code this way, and that's expensive if you use a charge-per-completion API.
2) The results are not always coherent and functions tend to (depending on the model, eg. 7B mistral) inline implementations for 'trivial' functions instead of using functions (eg. if you define Vector::add, the model will 50/50 just go a = new Vector(a.x + b.x, a.y + b.y)).
I've found that the current models other than GPT4 are prone to incoherence as the problem size scales.
7B models, specifically, perform significantly worse than larger models.
I'd add the MR review use case.
I had limited success feeding an LLM (a Dolphin finetune of Mixtral) the contents of a merge request from my team. It was a few thousand lines of added integration-test code, and I just couldn't be bothered / had too little time to really delve into it.
I slapped in the diff and went through about 10 prompt strategies to get anything meaningful. My initial impression was that it had clearly been finetuned on too-short responses: it kept putting in "etc.", "and other input parameters", "and other relevant information". At one point I was ready to give up; it was clearly hallucinating.
Or so I thought: it turned out a new edge case had been added to existing functionality without me ever noticing (despite being in the same meetings).
I think it actually saved me a lot of hours of pestering other team members.
They just released a v1.5 (https://huggingface.co/deepseek-ai/deepseek-coder-7b-instruc...), but for some reason, they reduced the context length from ~16k to ~4k.
However, I have a question regarding its specific deployment method: How can I merge the parts of the Safetensors format? Specifically, I'm referring to files named 'model-00001-of-00002.safetensors' and 'model-00002-of-00002.safetensors'.
My motivation is straightforward: I aim to combine the Safetensor 'shards' and then utilize the 'convert.py' script from the llama.cpp project to transform a single .safetensors file into the GGUF format. This conversion facilitates running the models on WasmEdge.
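In case it helps: each HF shard is itself a valid safetensors file holding a disjoint subset of the tensors, so one approach (untested by me) is to load every shard and re-save the union. The merge logic below is plain Python; the actual safetensors calls are isolated in a function that isn't run here:

```python
def merge_state_dicts(shards):
    """Merge per-shard {key: tensor} dicts into one, refusing key collisions."""
    merged = {}
    for shard in shards:
        for key, tensor in shard.items():
            if key in merged:
                raise ValueError(f"duplicate tensor key across shards: {key}")
            merged[key] = tensor
    return merged

def merge_files(shard_paths, out_path):
    # Requires `pip install safetensors torch`; not executed in this sketch.
    from safetensors.torch import load_file, save_file
    save_file(merge_state_dicts([load_file(p) for p in shard_paths]), out_path)

# Example call (file names from the question):
# merge_files(["model-00001-of-00002.safetensors",
#              "model-00002-of-00002.safetensors"],
#             "model.safetensors")
```

It may also be worth checking whether llama.cpp's convert script accepts sharded checkpoints directly; newer versions of it may make the merge unnecessary.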
I appreciate any guidance on this matter. Thank you.
Says so on https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instr...
They have their own license to prevent things like propaganda or military use.
I'd love to hear of a less ambiguous way to represent these.
In this context, I read it as a is better than b which is better than c.
Would you mind linking to a concise text which could lead me through setting up Mixtral on my own machine?
With those it's always "read the HOWTO, attempt to recreate its state", so I prefer sticking with the low-level route, where I also learn a bit more about the internals.
C/C++ user-friendliness has come as far as all the other languages and ecosystems. Really, the only reason to "fear" it is memes propagated to that effect. It's not a gun.
So I'd suggest just compiling llama.cpp and installing huggingface-cli to download GGUF-format models, which is all ollama is doing anyway, but with more dependencies and a much more opaque outcome.
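For the record, the low-level route is roughly this (the repo and model file names are examples; build steps, binary names, and CLI flags vary between llama.cpp and huggingface_hub versions):

```shell
# Build llama.cpp (older versions use plain `make`; newer ones use CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a GGUF model with huggingface-cli (ships with huggingface_hub)
pip install huggingface_hub
huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
    mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --local-dir models

# Run it (the binary is `main` in older releases, `llama-cli` in newer ones)
./main -m models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -p "Hello" -n 64
```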
DSC, on the other hand, crawled its way to poor answers and injected snippets of unrelated code into the output for me.
(M1 Max, 64 GB RAM)
Works well, but it's closer to a very smart code complete than something that generates novel blocks of code.
It outputs relatively correct Haxe code, but it did hallucinate that there is a library called 'haxe-tiled' to read TMX map files...