* Trained on 17GB of code from the top 10,000 most popular Debian packages. The source files were deduplicated using a process similar to the OpenWebText preprocessing (essentially locality-sensitive hashing to detect near-duplicates).
* I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) code for training. Training took about 1 month on 4x RTX8000 GPUs.
* You can download the trained model here: https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab here: https://moyix.net/~moyix/csrc_dataset_large.json.gz https://moyix.net/~moyix/csrc_vocab_large.zip
Happy to answer any questions!
Did people find it to be as challenging when you showed it to them as some of us are here? Did you expect that level of complexity?
Yes, I think people definitely find it challenging. I'm keeping track of the correct and total guesses for each snippet, right now people are at almost exactly 50% accuracy:
correct | total | pct
--------+-------+-----
   6529 | 12963 |  50

The list of packages the real snippets are drawn from is here (maybe if you want to avoid using them... ;) ):
https://moyix.net/~moyix/sample_pkgnames.txt
Note that the GPT samples are prompted with 128 characters randomly selected from those same packages, so you will see GPT2-generated code that mentions the package name etc. However, these packages were not used for training.
What was the scenario? I had a couple of small, fixed-size char buffers and I wanted to swap their valid portions, but the obvious choice of swap_ranges(a, a + max(na, nb), b) would run into this issue. (n.b. this wouldn't be correct for non-POD types anyway, but we're talking about chars.)
On top of it being annoying not to be able to do the convenient thing, it made debugging harder, because the "correct" solution does not preserve the bit patterns (0xCC/0xCD or whatever) that the debug build injects into uninitialized arrays, making it harder to tell when I later read an uninitialized element from a swapped-from array.
It's possible training for a month may be too much.
Sadly, I can't go back and see it again.
I got 8/9
TIFF is a real thing, so some human was involved in some part of that code; it has just been garbled by GPT-2. In other words, the training set shows quite visibly in the result.
I understand that GPT-2/3 is just a very smart parrot with no semantic knowledge of what it is outputting. Let's take a very dumb Markov chain that was "trained" on the following input:
a.c
```
int array[6];
array[5] = 1;
```

b.c
```
int array[4];
array[3] = 2;
```
I guess a Markov chain could theoretically produce the following code:
out.c
```
int array[4];
array[5] = 1;
```
which is undefined behaviour. But it is produced from two programs where no undefined behaviour is present.

A better question would be: how can we guarantee that certain program invariants (like the absence of undefined behaviour) are preserved in the produced code? Or if there are no guarantees, can we calculate a probability?

(Sorry, not an expert on machine learning, just excited about a potentially new way to fuzz programs. Technically, one could instrument C programs with sanitizers and produce backward static slices from the sanitizer instrumentation to the beginning of the program, and you get a sample of a program without undefined behaviour... so there is already the potential for the training set to be expanded beyond what Csmith can provide.)
The real examples have worse comments at times.
The only flaw is that it shows fake code most of the time so you can game it that way.
bool hasdied // has died
and then a `// done` for seemingly no reason after initializing variables... where did this code come from?!

Functions with lots of arguments while the body consists of "return true;"
I guess it tells what AI often tells us about ourselves: That what we do makes much less sense than we think it does. It is thus easy to fake.
How is it possible to churn out so much music or so many books, or so much software? Well, because most creative works are either not very original or are quite procedural or random.
And this kind of work could be automated indeed (or examined if it needs to be done in the first place).
I'm also pretty sure there are formatting, commenting, and in-string-text "tells" that reliably indicate whether something is GPT-2. Maybe I should try training an AI to figure that out...
GPT-2's code looks correct at a glance, but when you try to understand what it's doing, you realize it could not have been written by a human.
It's similar to the articles produced by GPT3; they have the right form but no substance.
Wrong, of course. Now maybe the human concerned was far down some inheritance tree and simply wanted to document misuse of a deliberately limited class, but assert()ing would have been too punitive. Or maybe it was indeed "right form but no substance", but authored by an SWE.
I did manage to get 100% correct after a while but it takes a thorough reading of the code.
It reminds me of how GPT-3 is good at producing a certain sort of bad writing.
My guess as to why this happens: we humans have abilities of logical reasoning, clarity, and purposefulness that GPT doesn't have. When we use those abilities, we produce writing and code that GPT can't match. If we don't, though, our output isn't much better than GPT's.
It's harder to do with some of the smaller excerpts though, and I'm sure there are probably examples of terrible human programmers who write worse code than GPT-2.
I really wish the site let me view my previous guesses.
You have no idea. It'll make you hate computers
What's next? Advices? Feedbacks? Rests?
I give ups.
From <https://icannwiki.org/.codes>:
The .CODES TLD is attractive and useful to end-users as it better facilitates search, self-expression, information sharing and the provision of legitimate goods and services.
.kim - Kim (Korean surname)
.lol - LOL: laughing out loud
.organic - organic gardeners, farmers, foods, etc.
.plumbing - plumbing businesses
[0] https://en.wikipedia.org/wiki/List_of_Internet_top-level_dom...