Anthropic's original take home assignment open sourced (opens in new tab)

(github.com)

639 pointsmyahio4mo ago376 comments

376 comments

I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.

epolanski4mo ago

Computing is a very broad topic. Even Linus or Carmack have no skills or knowledge about countless topics that would be mundane to you.

It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.

Our ability to process and master new topics is part of the job.

I'm sure you've done that countless times.

TrackerFF4mo ago

Well it is a specialized problem. If you've never worked on anything similar previously, it is going to take time. Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems - after college I interviewed for various electronics/hardware companies where you'd get asked to optimize low-level code - which would have looked quite foreign, if you had never actually worked on such problems before.

Onavo4mo ago

If you ask an EE to debug react state management code without prior exposure they won't do too well either. But on the other hand they can easily pick up most of it after a week long crash course while training a performance engineer who can optimize code for a specific architecture would take months.

3 more replies

johnnyanmac4mo ago

>Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems

I'll take any interviews at this point in time.

But yes, every domain has its jargon. I work tangentially to this and quickly understood this as a GPGPU problem. A relatively elementary one if you studied this space, though a time limit of 2 hours seems overly restrictive if you aren't actively studying this stuff.

fergie4mo ago

I'm 30 years in, and literally don't understand the question.

WithinReason4mo ago

After a quick look this is can be seen as a low level GPU/TPU optimization problem where you have to consider the throughput and depth of different arithmetic pipelines. If you want to hire people who understand how to do that you unfortunately have to give them such a convoluted task and emulate the relevant parts of HW. (In reality this is probably more like TPU since it has scalar pipelines, but the optimization methods are not that different)

The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.

1 more reply

bsder4mo ago

Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.

However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".

That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.

Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.

I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!

1 more reply

mike_hearn4mo ago

The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.

If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:

    Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
    available time, as measured by test_kernel_cycles on a frozen separate copy
    of the simulator.

However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line:

https://github.com/anthropics/original_performance_takehome/...

This function is described only as:

    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.

The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the python function reference_kernel2 which is found in problem.py.

What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.

At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off, since then the concept has been largely dead. I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core, and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple bits of data simultaneously.

This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.

And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.

So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.

Does that help? Sounds like a fun exercise :)

Edit: I just checked and Google TPUs are much more VLIW like so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.

8 more replies

measurablefunc4mo ago

Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any places where they actually disallow cheating b/c it says they only check the final state of the memory² so seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest number of instructions to compute the final state but again, it's not clear what they're actually measuring b/c if you know the final state you can cheat & there is no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.

¹https://github.com/anthropics/original_performance_takehome/...

²https://github.com/anthropics/original_performance_takehome/...

1 more reply

PeterStuer4mo ago

Which part exactly are ypu having trouble with?

- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator

karmajunkie4mo ago

Thank goodness, I thought it was just me...

mangatmodi4mo ago

Smart is different than the knowledge. If you learn about these concepts andwork on these problems, then you will be able to solve them.

It's not about you being average, just a different knowledge set.

xenihn4mo ago

It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.

chistev4mo ago

What we know is a drop, what we don't know is an ocean.

elzbardico4mo ago

There's a big chance you're falling in a subtle form of imposter syndrome that manifests itself by largely over-estimating the average skill level.

But this is good. Staying humble makes you hungrier for learning.

ActorNightly4mo ago

Yours is a good mentality to have because it creates the emotional drive to learn more, so don't lose that. That being said, this isn't really that complicated. Its just a matter of taking enough time to look at the code and understand how its structured. I feel like the thing that differentiates developer skill is pretty much being able to do that, specifically in the process of having the model of the program in your head.

sigbottle4mo ago

Does it?

For me, I've had that mentality for the longest time and I didn't get anything done because, well, "I'm just average".

For me, a little bit of arrogance (there's no way I couldn't do X, let's go do it), even if I end up "looking stupid" (see, I told you it was that hard!), was far more valuable to my development

gervwyk4mo ago

Don’t stress, its very likely that this problem was vibe coded :) It’s insane how much better Claude Code is compared to alternatives lately.

LouisSayers4mo ago

It's the type of thing you'd be exposed to in a computer science degree - operating systems / compilers.

Always room to learn in software :)

deadbabe4mo ago

If you think you’re average, you’re not average.

apsurd4mo ago

disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.

the hot take is, there are other games.

tuetuopay4mo ago

This is the opposite of leet code.

Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked here is exactly the core of what a performance engineer will do at anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)

This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.

saagarjha4mo ago

This is explicitly not Leetcode, in fact its goal is to attract optimizers

sevenzero4mo ago

Also leetcode does not really provide insight into ones ability to design business solutions. Whether it be system design, just some small feature implementation or communication skills within a team. Its just optimizers jerking each other off on some cryptic problems 99.999999999% of developers will never see in real life. Maybe it would've been useful like 30 years ago, but all commonly used languages have all these fancy algorithms baked into their stdlib, why would I ever have to implement them myself?

2 more replies

pvalue0054mo ago

I suspect this was released by Anthropic as a DDOS attack on other AI companies. I prompted 'how do we solve this challenge?' into gemini cli in a cloned repo and it's been running non-stop for 20 minutes :)

bjackman4mo ago

Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".

I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.

aiiotnoodle4mo ago

I've noticed using antigravity and vscode, Gemini 3 pro often comes back with model too busy or something like that and basically 500s.

Seems like capacity because it works a lot better late at night.

I don't see the same with the claude models in antigravity.

3 more replies

mixel4mo ago

I saw this too. Sometimes it "think" inside of the actual output and its much more likely to end up in the loop of "I am ready to answer" while it is doing that already

sva_4mo ago

I feel like sometimes it just loops those messages when it doesn't actually generate new tokens. But I might be wrong

1 more reply

bird08614mo ago

Which Gemini model did you use? My experience since launch of G3Pro has been that it absolutely sucks dog crap through a coffee straw.

pvalue0054mo ago

/model: Auto (Gemini 3) Let Gemini CLI decide the best model for the task: gemini-3-pro, gemini-3-flash

After ~40 minutes, it got to:

The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.

It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.

I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.

5 more replies

bird08614mo ago

Hilarious that this got a downvote, hello Satya!

Mashimo4mo ago

> sucks dog crap through a coffee straw.

That would be impressive.

2 more replies

languid-photic4mo ago

Naively tested a set of agents on this task.

Each ran the same spec headlessly in their native harness (one shot).

Results:

    Agent                        Cycles     Time
    ─────────────────────────────────────────────
    gpt-5-2                      2,124      16m
    claude-opus-4-5-20251101     4,973      1h 2m
    gpt-5-1-codex-max-xhigh      5,402      34m
    gpt-5-codex                  5,486      7m
    gpt-5-1-codex                12,453     8m
    gpt-5-2-codex                12,905     6m
    gpt-5-1-codex-mini           17,480     7m
    claude-sonnet-4-5-20250929   21,054     10m
    claude-haiku-4-5-20251001    147,734    9m
    gemini-3-pro-preview         147,734    3m
    gpt-5-2-codex-xhigh          147,734    25m
    gpt-5-2-xhigh                147,734    34m

Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".

lawrencechen4mo ago

codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." ~53 minutes.

jstummbillig4mo ago

Will you look at this man's prompting skills?!

dudewhocodes4mo ago

Serious prompt engineering right here

mettamage4mo ago

Wow, is gpt-5-2-codex-xhigh really that good in general? Is this the 200$ per month version?

1 more reply

HarHarVeryFunny4mo ago

That Claude Opus 4.5 result of 4,973 is what you get if you just vectorize the reference kernel. In fact you should be under 4,900 doing that with very little effort (I tried doing this by hand yesterday).

The performance killer is the "random" access reads of the tree node data which the scalar implementation hides, together with the lack of load bandwidth, and to tackle that you'd have to rewrite the kernel to optimize the tree data loading and processing.

ponyous4mo ago

Very interesting thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it ended it seems like there is a lot more potential.

a24j4mo ago

Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.

languid-photic4mo ago

Sure!

https://github.com/voratiq/voratiq

1 more reply

raphaelj4mo ago

Could you try with some open-weighted models, e.g. Qwen3-coder, GLM-4.7 or Devstral-2?

kevinday4mo ago

I tried GLM-4.7 running locally on a beefy GPU server, in about 3 minutes it got to 25846 cycles, but then struggled in circles for about 90 minutes without making any meaningful progress, making the same mistakes repeatedly and misdiagnosing the cause most of the time. It seems to understand what needs to happen to reach the goal, but keeps failing on the implementation side. It seemed to understand that to beat the target an entirely new approach would be required (it kept leaning towards a wavefront design), but wasn't seeing the solution due to the very limited ISA.

forgotpwd164mo ago

Could you make a repo with solutions given by each model inside a dir/branch for comparison?

kitrak954mo ago

Are you giving instructions to a stranger on the internet?

2 more replies

giancarlostoro4mo ago

I do wonder how Grok would compare, specifically their Claude Code Fast model.

game_the0ry4mo ago

> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.

This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.

paxys4mo ago

This is simply to enter the recruiting pipeline. once you're in you will do the same leetcode interviews as everyone else.

alt2274mo ago

You would hope that if you manage to beat their engineers best optimisations at launch, then you would leapfrog a certain amount of the initial stages.

Then again, this may just be a way to get free ideas at optimising their product from outside the box.

1 more reply

driverdan4mo ago

Is this a fact or an assumption?

yodsanklai4mo ago

It would take something like one week full time to work on this. It's not something you can do if you have a full-time job and apply to several other companies. I find it unreasonable to ask a candidate to spend that much time for an uncertain result.

It's true that being ready for leetcode takes practice, but at least it's standard so you can re-use the skills to other interviews. Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.

tcoff914mo ago

As long as there are qualified candidates willing to do unreasonable tasks for the chance to work at a company, there's not much incentive for the company to change their system. Those people will also probably work unreasonably hard and make unreasonable sacrifices for the company.

menaerus4mo ago

> It's not something you can do if you have a full-time job

> I find it unreasonable to ask a candidate to spend that much time

And same for some reason does not apply to leetcode style interviews?

> It would take something like one week full time to work on this

I am not sure if this is satire or what? You need months of continuous preparation to be ready for the leetcode style interview.

> Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.

No, it is not. This is specifically the type of job you would be doing tomorrow at Anthropic team if hired. And they are specifically hiring people who are already good enough at that very task. The same cannot be said for the leetcode, not even remotely comparable.

abra04mo ago

This is a really fun problem! I suggest anyone who likes optimization in a very broad sense to try their hand at it. Might be the most fun I've had while interviewing. I had to spend a week-worth of evenings on it to fully scratch the itch, and I managed to get 1112 cycles. But that was mostly manual, before the current crop of agentic models (clopus 4.5, gpt5.2). I wonder how far you can RalphWiggum it!

lukah4mo ago

I've never heard AI-assisted coding referred to as "RalphWiggum"ing a problem, and now I will have to use that always. Thank you.

usgroup4mo ago

https://awesomeclaude.ai/ralph-wiggum

clocksmith3mo ago

Did you get an offer?

avaer4mo ago

It's pretty interesting how close this assignment looks to demoscene [1] golf [2].

[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf

It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...

wiz21c4mo ago

I was in the demoscene long ago and that kind of optimisation is definitely in the ballpark of what we did: optimize algorithm down to machine code level (and additionally, cheat like hell to make you believe we ran the algorithm for real :-)).

But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sound like random forest prediction. Anyone knows what the code does ?

saagarjha4mo ago

It’s some useless problem like a random tree walk or something like that, the actual algorithm is not particularly important to the problem

1 more reply

KeplerBoy4mo ago

perfetto is pretty widely used for such traces, because building a viewer for your traces is a completely avoidable pain.

nice_byte4mo ago

it's designed to select for people who can be trusted to manually write ptx :-)

sureglymop4mo ago

Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.

As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.

forgotpwd164mo ago

Unless misread, 2 hours isn't the time limit for the candidate to do this but the time Claude eventually needed to outperform best returned solution. Best candidate could've taken 6h~2d to achieve this result.

fhd24mo ago

Their Readme.md is weirdly obsessed with "2 hours":

"before Claude Opus 4.5 started doing better than humans given only 2 hours"

"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"

"Claude Opus 4.5 after 2 hours in our test-time compute harness"

"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"

So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours", models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.

Would be quite curious to know though. How I usually design take home assignments is:

1. Candidate has several _days_ to complete (usually around a week).

2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.

But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.

alcasa4mo ago

No the 2 hours is their time limit for candidates. The thing is that you are allowed to use any non-human help for their take homes (open book), so if AI can solve it in below 2 hours, it's not very good at assessing the human.

1 more reply

amirhirsch4mo ago

I'm at 1137 with one hour with opus now... Pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage...

I think I'm going to get sub 900 since i just realized i can in-parallel compute whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4 with less delay.....

WithinReason4mo ago

Submit it to the leaderboard: https://www.kerneloptimization.fun/

amirhirsch4mo ago

I think I can hit #1 (current #1 is 1000). sub 900 not possible though.

Let me put down my thought process: You have to start to think of designing a 6-slot x8-len vector pipeline doing 48 hashes in parallel first which needs at least 10 steps —- if you convert three stages to multiply adds and do parallel XORs for the other three) —- the problem with 10 cycle hashing is you need to cram 96 scalar xors along side your vector pipeline, so that will use all 12 ALUs for 8 of those cycles. Leaving you only 24 more scalar ops per hash cycle which isn’t enough for the 48 tree value xors..

so you must use at least 11 steps per hash, with 96 xors (including the tree value xor) done in the scalar alus using 8 steps, and giving 3*12 Alu ops per hash cycle. You need 12 more ops per hash to do odd/even, so you must be 12 stages, and just do all of the hash ops in valu, 4 cycles of 12 alus doing modulo, 8 cycles x 12 alus free

With 12 steps and 48 parallel you’re absolute minimum could be 4096/48 x 12 = 1,024 cycles, since stage 10 can be optimized (you don’t need the odd/even modulo cycle, and can use some of those extra scalar cycles to pre-xor the constant can save you ~10 cycles. 1024 gonna be real hard, but I can imagine shenanigans to get it down to 1014, sub-1000 possible by throwing more xor to the scalar alus.

1 more reply

menaerus4mo ago

Why do you need an X account for it? Seems like a ridiculous requirement

lalaland11254mo ago

How do you avoid the load bottleneck?

amirhirsch4mo ago

======================================================================

BROADCAST LOAD SCHEDULE

======================================================================

Round | Unique | Load Strategy

------|--------|------------------------------------------

   0  |    1   | 1 broadcast → all 256 items

   1  |    2   | 2 broadcasts → groups

   2  |    4   | 4 broadcasts → groups

   3  |    8   | 8 broadcasts → groups

   4  |   16   | 16 broadcasts → groups

   5  |   32   | 32 broadcasts → groups

   6  |   63   | 63 loads (sparse, use indirection)

   7  |  108   | 108 loads (sparse, use indirection)

   8  |  159   | 159 loads (sparse, use indirection)

   9  |  191   | 191 loads (sparse, use indirection)

  10  |  224   | 224 loads (sparse, use indirection)

  11  |    1   | 1 broadcast → all 256 items

  12  |    2   | 2 broadcasts → groups

  13  |    4   | 4 broadcasts → groups

  14  |    8   | 8 broadcasts → groups

  15  |   16   | 16 broadcasts → groups

Total loads with grouping: 839

Total loads naive: 4096

Load reduction: 4.9x

amirhirsch4mo ago

take advantage of index collisions, optimizing round 0 and 11, speculative pre-loading, and the early branch predictor (which now I am doing looking at bits output at stage 3)

1 more reply

fabian44mo ago

[flagged]

1 more reply

bytesandbits4mo ago

Having done a bunch of take home for big (and small) AI labs during interviews, this is the 2nd most interesting one I have seen so far.

petters4mo ago

And the answer to the obvious follow-up question is...?

mrklol4mo ago

Milk before cereals

1 more reply

darkwater4mo ago

Maybe it's under NDA :)

kevthecoder4mo ago

reader92744mo ago

fries

1 more reply

bytesandbits3mo ago

imbue

koolba4mo ago

What is the actual assignment here?

The README only gives numbers without any information on what you’re supposed to do or how you are rated.

glalonde4mo ago

"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator." from perf_takehome.py

vermilingua4mo ago

Think that means you failed :(

nice_byte4mo ago

being cryptic and poorly specified is part of the assignment

just like real code

in fact, it's _still_ better documented an self contained than most of the problems you'd usually encounter in the wild. pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.

2 more replies

NightBlossom4mo ago

I just withdrew my application over this test. It forces an engineering anti-pattern: requiring runtime calculation for static data (effectively banning O(1) pre-computation).

When I pointed out this contradiction via email, they ignored me completely and instead silently patched the README to retroactively enforce the rule.

It’s not just a bad test; it’s a massive red flag for their engineering culture. They wasted candidates' time on a "guess the hidden artificial constraint" game rather than evaluating real optimization skills.

hackern39724mo ago

This isn't the gotcha moment you think it is. Storing the result on disk is some stupid "erm achkually" type solution that goes against the spirit of the optimization problem.

They want to see how you handle low level optimizations, not get tripped over some question semantics.

NightBlossom4mo ago

You are missing the point. This isn't "storing result on disk." In high-performance engineering, if the input is static and known at build time, the only correct optimization is pre-computation.

I didn't simply "skip" the problem. I implemented a compiler that solves the problem entirely at build time, resulting in O(0) runtime execution.

Here is the actual "Theorem" I implemented in my solution. If a test penalizes this approach because it "goes against the spirit," then the test is fundamentally testing for inefficiency.

""" Theorem 1 (Null Execution): Let P: M → M be a program with postcondition φ(M). If ∃M' s.t. φ(M') ∧ M ≅ M', then T(P) = 0.

Complexity: O(n) compile-time, O(0) runtime """

If they wanted to test runtime loop optimizations, they should have made the inputs dynamic.

nine_k4mo ago

This is a kind of task that's best solved by possibly spending more than the allocated 2 hours on it, once any obvious low-hanging fruit is picked. An optimization task is what a machine does best. So the real problem would be to construct a machine that would be able to run the optimization. A right optimization framework that results from the effort could also efficiently solve many more similar problems in the future.

I understand that this test is intended to somehow test the raw brianpower, the ability to tackle an unfamiliar and complicated domain, and to work under stress. But I hope it's not representative of the actual working conditions at Anthropic. It's like asking a candidate to play a Quake deathmatch when hiring to a special forces assault squad.

saagarjha4mo ago

> So the real problem would be to construct a machine that would be able to run the optimization.

This is a valid way to solve the problem.

tucnak4mo ago

The snarky writing of "if you beat our best solution, send us an email and MAYBE we think about interviewing you" is really something, innit?

ahussain4mo ago

They wrote:

That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”

0x3f4mo ago

I suppose you could interpret it either way, but having dealt with their interview pipeline I'd choose the snark.

2 more replies

lovich4mo ago

That paraphrases to

"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"

It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.

3 more replies

riffraff4mo ago

I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".

Bootvis4mo ago

Should have asked Claude how to write it better.

maerch4mo ago

In that case, removing „perhaps“ would have helped a lot. It is not about maybe being hired, but about maybe being interviewed.

1 more reply

NewJazz4mo ago

They may not be able to hire folks in certain jurisdictions. Or even interview them. (Iran, NK)

kristopolous4mo ago

If you're an asshole that wants millions of dollars...i mean there's still places to say no

sourcegrift4mo ago

Pride comes before fall thankfully

altmanaltman4mo ago

its anthrophic. their entire marketing is just being an pompous ass and AI fear mongering.

FriendlyMike4mo ago

They should just have you create a problem that can't be solved by an llm in two hours. That's the real problem here

ec1096854mo ago

Solvable in more than 2 but not less than 2 would be the real trick.

OisinMoran4mo ago

"You have 1 minute to design a maze that takes 2 minutes to solve"

NitpickLawyer4mo ago

The writing was on the wall for about half a year (publicly) now. The oAI 2nd place at the atcoder world championship competition was the first one, and I remember it being dismissed at the time. Sakana also got 1st place in another atcoder competition a few weeks ago. Google also released a blog a few months back on gemini 2.5 netting them 1% reduction in training time on real-world tasks by optimising kernels.

If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.

cgearhart4mo ago

I think this is the actual “bitter lesson”—the scalable solution (letting LLMs bang against the problem nonstop) will eventually far outperform human effort. There will come a point—whether sooner or later—where this’ll be the expected norm for handling such problems. I think the only question is whether there is any distinction between problems like this (clearly defined with a verifiable outcome) vs the space of all interesting computer programs. (At the moment I think there’s space between them. TBD.)

lostmsu4mo ago

1% doesn't sound like a lot at all.

_aavaa_4mo ago

That depends on how close to the theoretical max you think they are.

myahioOP4mo ago

Sakana is a grift from what I understand

NitpickLawyer4mo ago

Eh. I'd call them overly enthusiastic :) I know they publish hype-y stuff, they jumped the gun on a few things, I get that. But their recent result was on a "live" contest, and they did share agent traces, so that's likely a legit result.

tayo424mo ago

I wonder if the Ai is doing anything novel? Or if it's like a brute force search of applying all types of existing optimizations that already exist and have been written about.

piokoch4mo ago

How something that generates next token, given a list of previous tokens, can do something novel?

rellfy4mo ago

By that same logic, humans would not be able to do anything novel either.

LarsKrimi4mo ago

I liked the core challenge. Finding the balance of ALU and VALU, but I think that the problem with the load bandwidth could lead to problems

Like optimizing for people who assume the start indices always will be zero. I am close to 100% sure that's required to get below 2096 total loads but it's just not fun

If it however had some kind of dynamic vector lane rotate that could have been way more interesting

eisbaw4mo ago

I got to 1364 cycles for now, semi-manually: Using design space exploration organized via backlog.md project, and then recombination from that. 20 agents in parallel.

Asked to generate drawio for the winner so I can grok it more easily, then I gave feedback.

Edit: 1121 cycles

karmasimida4mo ago

Same just make it a survival game

eisbaw4mo ago

1023 cycles

seamossfet4mo ago

I'm getting flashbacks from my computer engineering curriculum. Probably the first place I'd start is replacing comparison operators on the ALU with binary arithmetic since it's much faster than branch logic. Next would probably be changing the `step` function from brute iterators on the instructions to something closer to a Btree? Then maybe a sparse set for the memory management if we're going to do a lot of iterations over the flat memory like this.

Maro4mo ago

> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.

Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?

Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?

saagarjha4mo ago

4 hours

mrklol4mo ago

Oh, I thought candidates got 2 hours but now I am confused too

throwaway0123_54mo ago

> Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours

Is this saying that Claude matched the best human performance, where the human had two hours? I think that is the correct reading, but I'm not certain they don't mean that Claude had two hours, and matched the best human performance where the human had an arbitrary amount of time. The former is impressive but the later would be even more so.

pickpocket4mo ago

I cleared this assignment but did not clear the follow up interview that was way easier than this. So I gave up on tech interviews in general, stayed where I was.

arsl164mo ago

I got this but I am an embedded SWE, might not be my cup of tea

kristianpaul4mo ago

“If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.”

afro884mo ago

> at launch

Does this confirm they actually do knee cap models after the launch period to save money, without telling users?

mediaman4mo ago

No, they later updated the harness for this and it subsequently got better scores.

sevenzero4mo ago

The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.

nottorp4mo ago

Is it "write 20 astroturfing but somewhat believable posts about the merits of "AI" and how it is going to replace humans"?

atomlib4mo ago

I'm afraid that position is already filled by the CEO.

falloutx4mo ago

It should be "can you gaslight a CEO into firing 90% of their software engineers?"

demirbey054mo ago

It's showcase more than being take home assignment. I couldnt understand what the task is ,only performance comparisons between their LLM

measurablefunc4mo ago

The task is ill-defined.

saagarjha4mo ago

You make it faster

1 more reply

torginus4mo ago

Are you allowed to change the instruction sequence? I see some optimization opportunities - it'd be obviously the correct thing to do an optimizing compiler, but considering the time allotted, Id guess you could hand-optimize it, but that feels like cheating.

saagarjha4mo ago

Yes, in fact this will be one of the first things you will want to do.

Incipient4mo ago

>so we can be appropriately impressed and perhaps discuss interviewing.

Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.

I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.

sponnath4mo ago

I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.

qbane4mo ago

Remember: It is a company that keep saying how much production code can be written by AI in xx years, but at the same time recruiting new engineers.

yodsanklai4mo ago

Thanks for noticing this. I got the same feeling when reading this. It may not sound like much, and it doesn't mean it's an insufferable place to work, but it's a hint it might be.

Rant: On a similar note, I recently saw a post on Linkedin from Mistral, where they were bragging to recruit candidates from very specific schools. That sounded very pretentious (and also an HR mistake on several levels IMHO).

mips_avatar4mo ago

Going through the assignment now. Man it’s really hard to pack the vectors right

svilen_dobrev4mo ago

if anyone is interested to try their agent-fu, here's some more-real-world rabbit-hole i went optimizing in 2024. Note this is now dead project, noone's using it, and probably same for the original. i managed to get it 2x-4x faster than original, took me several days then. btw There are some 10x optimizations possible but they break few edge cases, so not entirely correct.

https://github.com/svilendobrev/transit-python3

htrp4mo ago

Idle side note: surprised that https://github.com/anthropic is just some random dude in Australia

arsl164mo ago

Fellas should I even attempt it? I got it recently and lets say it brings back memories of computer architecture class.

spencerflem4mo ago

Oh wow it’s by Tristan Hume, still remember you from EyeLike!

Graziano_M4mo ago

I recognized the name and dug around too. I played DEFCON CTF with him back in the day!

karmasimida4mo ago

I am able to beat this 1487 benchmark by switching between LLMs, doesn't seem that hard lol. Albeit, I do not fully understand what the solution is, loll

lostmsu4mo ago

Yeah, GPT 5.2 on high got down to 1293 on the 5th try (about 32mins).

piokoch4mo ago

Interesting... Who would spend hours working for free for some company that promised only that they would invite you for a job interview. Maybe.

Aurornis4mo ago

When this was being used it was probably given to candidates who had already started the interview loop and been screened.

The current e-mail invitation in the README is just another avenue for exceptional people to apply. If someone is already highly qualified from their background and resume they can go through the front door (direct application). For those who have incredible talent but not necessarily the background or resume to unlock the front door yet, this is a fun way to demonstrate it.

cjrp4mo ago

I guess someone who enjoys solving these kinds of problems anyway, and thinks the potential upside if they do get hired is worth it.

saagarjha4mo ago

Oh, this was fun! If you like performance puzzles you should really do it. Actually I might go back and see if I can improve on it this weekend…

greesil4mo ago

This is a knowledge test of GPU architecture?

avaer4mo ago

Kind of, but not any particular GPU.

The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...

But presumably similar principles apply.

benreesman4mo ago

It's a test of polyhedral layout algebra, what NVIDIA calls CuTe and the forthcoming C++ standard calls std::mdspan.

This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.

saagarjha4mo ago

You can get pretty far without needing to care about this fwiw

1 more reply

sublimefire4mo ago

Did a bit of soul searching and manually optimised to 1087 but I give up. What is the number we are chasing here? IMO I would not join a company giving such a vague problem because you can feel really bad afterwards, especially if this does not open a door to the next stage of the interview. As an alternative we could all instead focus on a real kernel and improve it :)

trishume4mo ago

Author of the take-home here: That's quite a good cycle count, substantially better than Claude's, you should email it to performance-recruiting@anthropic.com.

pshirshov4mo ago

Yet Claude is the only agent which deadlocks (blocks in GC forever) after an hour of activity.

potato-peeler4mo ago

What does clock cycles mean? Don’t think they are referring to the cpu clock?

NightBlossom4mo ago

I could only cut it down to 41 cycles.

pickpocket4mo ago

i cleared this one but didn't clear the follow up interview that was way easier than this

mayankd4mo ago

Problem solving is eternal!

zeroCalories4mo ago

It shocks me that anyone supposedly good enough for anthropic would subject themselves to such a one sided waste of time.

pclmulqdq4mo ago

I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.

heavyset_go4mo ago

I don't do take home assignments, but when I did, I would offer to do it at my hourly rate, even if it was just an hour. It's time I would otherwise spend making money.

Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.

Does help that I have a very public web presence and portfolio, though.

3 more replies

Aurornis4mo ago

> I generally have a policy of "over 4 hours and I charge for my time.

Worth mentioning that demanding to be paid to apply for a company is usually equivalent to rejecting the job. Most companies are going to end the interview there. Few HR departments would allow one applicant to be paid for the same interview loop as other candidates.

I was helping out in a mentoring program during the ZIRP period when the idea of charging companies for take-home interviews started to become popular. I can’t think of anyone it actually worked for in that group. I’ve heard anecdotes online of some people doing it with success, but any company like Anthropic is just going to close your application and move on if you request to be paid for applying. They have a zillion other qualified candidates in line.

If someone is giving a take-home problem that looks like you’re actually doing work for the company, that’s a different story. This problem is not actually work, obviously.

1 more reply

whateveracct4mo ago

4 hours continuous or no? I can't imagine finding 4 hours of straight focus.

2 more replies

djmips4mo ago

If you look at it as a puzzle game then it's not any different than the time you use to play other games.

aleph_minus_one4mo ago

> it's not any different than the time you use to play other games.

This assumes that the candidate has a lot of time for playing other games.

browningstreet4mo ago

I’ve been sent the Anthropic interview assignments a few times. I’m not a developer so I don’t bother. At least at the time they didn’t seem to have technical but not-dev screenings. Maybe they do now.

throwa3562624mo ago

Care to elaborate the first part?

Did you apply for a position? Did they send you the assignment without prior discussion?

sealeck4mo ago

Why is writing code to execute a program using the fewest instructions possible on a virtual machine a waste of time?

0x3f4mo ago

The expected time you spend on it is much less than the expected time they'll spend on it.

efilife4mo ago

you don't get paid for it

mips_avatar4mo ago

It’s kind of an interesting problem.

dhruv30064mo ago

I wonder if OpenAI follows suit.

rvz4mo ago

They should.

SinghCoder4mo ago

why is their github handle anthropics and not anthropic :D

alexpadula4mo ago

Looks rather fun!

mrdootdoot4mo ago

“In English, Data”

mannykannot4mo ago

I beat the target by deleting the parts that were causing the cycle count to be too high. /s

eisbaw4mo ago

submit and see if Anthropic accepts it

jackblemming4mo ago

Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.

And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.

onion2k4mo ago

And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.

You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.

I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.

tmule4mo ago

Your comments history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)

jackblemming4mo ago

And they’re all dumber than John von Neumann, who cares?

1 more reply

muglug4mo ago

If they're hiring performance engineers then they're hiring for exactly these sets of skills.

It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.

Analemma_4mo ago

This would be an inappropriate assignment for a web dev position, but I'm willing to bet that a 1% improvement in cycles per byte in inference (or whatever) saves Anthropic many millions of dollars. This is one case where the whiteboard assignment is clearly related to the actual job duties.

rvz4mo ago

> Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.

Good. That should be the minimum requirement.

Not another Next.js web app take home project.

saagarjha4mo ago

The solution was explicitly graded on creativity fwiw

j / k navigate · click thread line to collapse

376 comments

lbreakjai4mo ago

epolanski4mo ago

Computing is a very broad topic. Even Linus or Carmack have no skills or knowledge about countless topics that would be mundane to you.

It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.

Our ability to process and master new topics is part of the job.

I'm sure you've done that countless times.

TrackerFF4mo ago

Onavo4mo ago

3 more replies

johnnyanmac4mo ago

>Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems

I'll take any interviews at this point in time.

fergie4mo ago

I'm 30 years in, and literally don't understand the question.

WithinReason4mo ago

The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.

1 more reply

bsder4mo ago

Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.

That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.

Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.

I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!

1 more reply

mike_hearn4mo ago

    Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
    available time, as measured by test_kernel_cycles on a frozen separate copy
    of the simulator.

https://github.com/anthropics/original_performance_takehome/...

This function is described only as:

    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.

Does that help? Sounds like a fun exercise :)

8 more replies

measurablefunc4mo ago

¹https://github.com/anthropics/original_performance_takehome/...

²https://github.com/anthropics/original_performance_takehome/...

1 more reply

PeterStuer4mo ago

Which part exactly are ypu having trouble with?

- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator

karmajunkie4mo ago

Thank goodness, I thought it was just me...

mangatmodi4mo ago

Smart is different than the knowledge. If you learn about these concepts andwork on these problems, then you will be able to solve them.

It's not about you being average, just a different knowledge set.

xenihn4mo ago

chistev4mo ago

What we know is a drop, what we don't know is an ocean.

elzbardico4mo ago

There's a big chance you're falling in a subtle form of imposter syndrome that manifests itself by largely over-estimating the average skill level.

But this is good. Staying humble makes you hungrier for learning.

ActorNightly4mo ago

sigbottle4mo ago

Does it?

For me, I've had that mentality for the longest time and I didn't get anything done because, well, "I'm just average".

For me, a little bit of arrogance (there's no way I couldn't do X, let's go do it), even if I end up "looking stupid" (see, I told you it was that hard!), was far more valuable to my development

gervwyk4mo ago

Don’t stress, its very likely that this problem was vibe coded :) It’s insane how much better Claude Code is compared to alternatives lately.

LouisSayers4mo ago

It's the type of thing you'd be exposed to in a computer science degree - operating systems / compilers.

Always room to learn in software :)

deadbabe4mo ago

If you think you’re average, you’re not average.

apsurd4mo ago

disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.

the hot take is, there are other games.

tuetuopay4mo ago

This is the opposite of leet code.

This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.

saagarjha4mo ago

This is explicitly not Leetcode, in fact its goal is to attract optimizers

sevenzero4mo ago

2 more replies

pvalue0054mo ago

bjackman4mo ago

aiiotnoodle4mo ago

I've noticed using antigravity and vscode, Gemini 3 pro often comes back with model too busy or something like that and basically 500s.

Seems like capacity because it works a lot better late at night.

I don't see the same with the claude models in antigravity.

3 more replies

mixel4mo ago

I saw this too. Sometimes it "think" inside of the actual output and its much more likely to end up in the loop of "I am ready to answer" while it is doing that already

sva_4mo ago

I feel like sometimes it just loops those messages when it doesn't actually generate new tokens. But I might be wrong

1 more reply

bird08614mo ago

Which Gemini model did you use? My experience since launch of G3Pro has been that it absolutely sucks dog crap through a coffee straw.

pvalue0054mo ago

/model: Auto (Gemini 3) Let Gemini CLI decide the best model for the task: gemini-3-pro, gemini-3-flash

After ~40 minutes, it got to:

It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.

I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.

5 more replies

bird08614mo ago

Hilarious that this got a downvote, hello Satya!

Mashimo4mo ago

> sucks dog crap through a coffee straw.

That would be impressive.

2 more replies

languid-photic4mo ago

Naively tested a set of agents on this task.

Each ran the same spec headlessly in their native harness (one shot).

Results:

    Agent                        Cycles     Time
    ─────────────────────────────────────────────
    gpt-5-2                      2,124      16m
    claude-opus-4-5-20251101     4,973      1h 2m
    gpt-5-1-codex-max-xhigh      5,402      34m
    gpt-5-codex                  5,486      7m
    gpt-5-1-codex                12,453     8m
    gpt-5-2-codex                12,905     6m
    gpt-5-1-codex-mini           17,480     7m
    claude-sonnet-4-5-20250929   21,054     10m
    claude-haiku-4-5-20251001    147,734    9m
    gemini-3-pro-preview         147,734    3m
    gpt-5-2-codex-xhigh          147,734    25m
    gpt-5-2-xhigh                147,734    34m

Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".

lawrencechen4mo ago

codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." ~53 minutes.

jstummbillig4mo ago

Will you look at this man's prompting skills?!

dudewhocodes4mo ago

Serious prompt engineering right here

mettamage4mo ago

Wow, is gpt-5-2-codex-xhigh really that good in general? Is this the 200$ per month version?

1 more reply

HarHarVeryFunny4mo ago

ponyous4mo ago

Very interesting thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it ended it seems like there is a lot more potential.

a24j4mo ago

Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.

languid-photic4mo ago

Sure!

https://github.com/voratiq/voratiq

1 more reply

raphaelj4mo ago

Could you try with some open-weighted models, e.g. Qwen3-coder, GLM-4.7 or Devstral-2?

kevinday4mo ago

forgotpwd164mo ago

Could you make a repo with solutions given by each model inside a dir/branch for comparison?

kitrak954mo ago

Are you giving instructions to a stranger on the internet?

2 more replies

giancarlostoro4mo ago

I do wonder how Grok would compare, specifically their Claude Code Fast model.

game_the0ry4mo ago

This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.

paxys4mo ago

This is simply to enter the recruiting pipeline. once you're in you will do the same leetcode interviews as everyone else.

alt2274mo ago

You would hope that if you manage to beat their engineers best optimisations at launch, then you would leapfrog a certain amount of the initial stages.

Then again, this may just be a way to get free ideas at optimising their product from outside the box.

1 more reply

driverdan4mo ago

Is this a fact or an assumption?

yodsanklai4mo ago

tcoff914mo ago

menaerus4mo ago

> It's not something you can do if you have a full-time job

> I find it unreasonable to ask a candidate to spend that much time

And same for some reason does not apply to leetcode style interviews?

> It would take something like one week full time to work on this

I am not sure if this is satire or what? You need months of continuous preparation to be ready for the leetcode style interview.

> Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.

abra04mo ago

lukah4mo ago

I've never heard AI-assisted coding referred to as "RalphWiggum"ing a problem, and now I will have to use that always. Thank you.

usgroup4mo ago

https://awesomeclaude.ai/ralph-wiggum

clocksmith3mo ago

Did you get an offer?

avaer4mo ago

It's pretty interesting how close this assignment looks to demoscene [1] golf [2].

[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf

It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...

wiz21c4mo ago

But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sound like random forest prediction. Anyone knows what the code does ?

saagarjha4mo ago

It’s some useless problem like a random tree walk or something like that, the actual algorithm is not particularly important to the problem

1 more reply

KeplerBoy4mo ago

perfetto is pretty widely used for such traces, because building a viewer for your traces is a completely avoidable pain.

nice_byte4mo ago

it's designed to select for people who can be trusted to manually write ptx :-)

sureglymop4mo ago

Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.

As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.

forgotpwd164mo ago

fhd24mo ago

Their Readme.md is weirdly obsessed with "2 hours":

"before Claude Opus 4.5 started doing better than humans given only 2 hours"

"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"

"Claude Opus 4.5 after 2 hours in our test-time compute harness"

"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"

Would be quite curious to know though. How I usually design take home assignments is:

1. Candidate has several _days_ to complete (usually around a week).

But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.

alcasa4mo ago

1 more reply

amirhirsch4mo ago

I'm at 1137 with one hour with opus now... Pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage...

I think I'm going to get sub 900 since i just realized i can in-parallel compute whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4 with less delay.....

WithinReason4mo ago

Submit it to the leaderboard: https://www.kerneloptimization.fun/

amirhirsch4mo ago

I think I can hit #1 (current #1 is 1000). sub 900 not possible though.

1 more reply

menaerus4mo ago

Why do you need an X account for it? Seems like a ridiculous requirement

lalaland11254mo ago

How do you avoid the load bottleneck?

amirhirsch4mo ago

======================================================================

BROADCAST LOAD SCHEDULE

======================================================================

Round | Unique | Load Strategy

------|--------|------------------------------------------

   0  |    1   | 1 broadcast → all 256 items

   1  |    2   | 2 broadcasts → groups

   2  |    4   | 4 broadcasts → groups

   3  |    8   | 8 broadcasts → groups

   4  |   16   | 16 broadcasts → groups

   5  |   32   | 32 broadcasts → groups

   6  |   63   | 63 loads (sparse, use indirection)

   7  |  108   | 108 loads (sparse, use indirection)

   8  |  159   | 159 loads (sparse, use indirection)

   9  |  191   | 191 loads (sparse, use indirection)

  10  |  224   | 224 loads (sparse, use indirection)

  11  |    1   | 1 broadcast → all 256 items

  12  |    2   | 2 broadcasts → groups

  13  |    4   | 4 broadcasts → groups

  14  |    8   | 8 broadcasts → groups

  15  |   16   | 16 broadcasts → groups

Total loads with grouping: 839

Total loads naive: 4096

Load reduction: 4.9x

amirhirsch4mo ago

take advantage of index collisions, optimizing round 0 and 11, speculative pre-loading, and the early branch predictor (which now I am doing looking at bits output at stage 3)

1 more reply

fabian44mo ago

[flagged]

1 more reply

bytesandbits4mo ago

Having done a bunch of take home for big (and small) AI labs during interviews, this is the 2nd most interesting one I have seen so far.

petters4mo ago

And the answer to the obvious follow-up question is...?

mrklol4mo ago

Milk before cereals

1 more reply

darkwater4mo ago

Maybe it's under NDA :)

kevthecoder4mo ago

reader92744mo ago

fries

1 more reply

bytesandbits3mo ago

imbue

koolba4mo ago

What is the actual assignment here?

The README only gives numbers without any information on what you’re supposed to do or how you are rated.

glalonde4mo ago

"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator." from perf_takehome.py

vermilingua4mo ago

Think that means you failed :(

nice_byte4mo ago

being cryptic and poorly specified is part of the assignment

just like real code

2 more replies

NightBlossom4mo ago

I just withdrew my application over this test. It forces an engineering anti-pattern: requiring runtime calculation for static data (effectively banning O(1) pre-computation).

When I pointed out this contradiction via email, they ignored me completely and instead silently patched the README to retroactively enforce the rule.

hackern39724mo ago

This isn't the gotcha moment you think it is. Storing the result on disk is some stupid "erm achkually" type solution that goes against the spirit of the optimization problem.

They want to see how you handle low level optimizations, not get tripped over some question semantics.

NightBlossom4mo ago

You are missing the point. This isn't "storing result on disk." In high-performance engineering, if the input is static and known at build time, the only correct optimization is pre-computation.

I didn't simply "skip" the problem. I implemented a compiler that solves the problem entirely at build time, resulting in O(0) runtime execution.

Here is the actual "Theorem" I implemented in my solution. If a test penalizes this approach because it "goes against the spirit," then the test is fundamentally testing for inefficiency.

""" Theorem 1 (Null Execution): Let P: M → M be a program with postcondition φ(M). If ∃M' s.t. φ(M') ∧ M ≅ M', then T(P) = 0.

Complexity: O(n) compile-time, O(0) runtime """

If they wanted to test runtime loop optimizations, they should have made the inputs dynamic.

nine_k4mo ago

saagarjha4mo ago

> So the real problem would be to construct a machine that would be able to run the optimization.

This is a valid way to solve the problem.

tucnak4mo ago

The snarky writing of "if you beat our best solution, send us an email and MAYBE we think about interviewing you" is really something, innit?

ahussain4mo ago

They wrote:

0x3f4mo ago

I suppose you could interpret it either way, but having dealt with their interview pipeline I'd choose the snark.

2 more replies

lovich4mo ago

That paraphrases to

"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"

It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.

3 more replies

riffraff4mo ago

I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".

Bootvis4mo ago

Should have asked Claude how to write it better.

maerch4mo ago

In that case, removing „perhaps“ would have helped a lot. It is not about maybe being hired, but about maybe being interviewed.

1 more reply

NewJazz4mo ago

They may not be able to hire folks in certain jurisdictions. Or even interview them. (Iran, NK)

kristopolous4mo ago

If you're an asshole that wants millions of dollars...i mean there's still places to say no

sourcegrift4mo ago

Pride comes before fall thankfully

altmanaltman4mo ago

its anthrophic. their entire marketing is just being an pompous ass and AI fear mongering.

FriendlyMike4mo ago

They should just have you create a problem that can't be solved by an llm in two hours. That's the real problem here

ec1096854mo ago

Solvable in more than 2 but not less than 2 would be the real trick.

OisinMoran4mo ago

"You have 1 minute to design a maze that takes 2 minutes to solve"

NitpickLawyer4mo ago

If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.

cgearhart4mo ago

lostmsu4mo ago

1% doesn't sound like a lot at all.

_aavaa_4mo ago

That depends on how close to the theoretical max you think they are.

myahioOP4mo ago

Sakana is a grift from what I understand

NitpickLawyer4mo ago

tayo424mo ago

I wonder if the Ai is doing anything novel? Or if it's like a brute force search of applying all types of existing optimizations that already exist and have been written about.

piokoch4mo ago

How something that generates next token, given a list of previous tokens, can do something novel?

rellfy4mo ago

By that same logic, humans would not be able to do anything novel either.

LarsKrimi4mo ago

I liked the core challenge. Finding the balance of ALU and VALU, but I think that the problem with the load bandwidth could lead to problems

Like optimizing for people who assume the start indices always will be zero. I am close to 100% sure that's required to get below 2096 total loads but it's just not fun

If it however had some kind of dynamic vector lane rotate that could have been way more interesting

eisbaw4mo ago

I got to 1364 cycles for now, semi-manually: Using design space exploration organized via backlog.md project, and then recombination from that. 20 agents in parallel.

Asked to generate drawio for the winner so I can grok it more easily, then I gave feedback.

Edit: 1121 cycles

karmasimida4mo ago

Same just make it a survival game

eisbaw4mo ago

1023 cycles

seamossfet4mo ago

Maro4mo ago

> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.

Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?

Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?

saagarjha4mo ago

4 hours

mrklol4mo ago

Oh, I thought candidates got 2 hours but now I am confused too

throwaway0123_54mo ago

> Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours

pickpocket4mo ago

I cleared this assignment but did not clear the follow up interview that was way easier than this. So I gave up on tech interviews in general, stayed where I was.

arsl164mo ago

I got this but I am an embedded SWE, might not be my cup of tea

kristianpaul4mo ago

afro884mo ago

> at launch

Does this confirm they actually do knee cap models after the launch period to save money, without telling users?

mediaman4mo ago

No, they later updated the harness for this and it subsequently got better scores.

sevenzero4mo ago

The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.

nottorp4mo ago

Is it "write 20 astroturfing but somewhat believable posts about the merits of "AI" and how it is going to replace humans"?

atomlib4mo ago

I'm afraid that position is already filled by the CEO.

falloutx4mo ago

It should be "can you gaslight a CEO into firing 90% of their software engineers?"

demirbey054mo ago

It's showcase more than being take home assignment. I couldnt understand what the task is ,only performance comparisons between their LLM

measurablefunc4mo ago

The task is ill-defined.

saagarjha4mo ago

You make it faster

1 more reply

torginus4mo ago

saagarjha4mo ago

Yes, in fact this will be one of the first things you will want to do.

Incipient4mo ago

>so we can be appropriately impressed and perhaps discuss interviewing.

Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.

I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.

sponnath4mo ago

I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.

qbane4mo ago

Remember: It is a company that keep saying how much production code can be written by AI in xx years, but at the same time recruiting new engineers.

yodsanklai4mo ago

Thanks for noticing this. I got the same feeling when reading this. It may not sound like much, and it doesn't mean it's an insufferable place to work, but it's a hint it might be.

mips_avatar4mo ago

Going through the assignment now. Man it’s really hard to pack the vectors right

svilen_dobrev4mo ago

https://github.com/svilendobrev/transit-python3

htrp4mo ago

Idle side note: surprised that https://github.com/anthropic is just some random dude in Australia

arsl164mo ago

Fellas should I even attempt it? I got it recently and lets say it brings back memories of computer architecture class.

spencerflem4mo ago

Oh wow it’s by Tristan Hume, still remember you from EyeLike!

Graziano_M4mo ago

I recognized the name and dug around too. I played DEFCON CTF with him back in the day!

karmasimida4mo ago

I am able to beat this 1487 benchmark by switching between LLMs, doesn't seem that hard lol. Albeit, I do not fully understand what the solution is, loll

lostmsu4mo ago

Yeah, GPT 5.2 on high got down to 1293 on the 5th try (about 32mins).

piokoch4mo ago

Interesting... Who would spend hours working for free for some company that promised only that they would invite you for a job interview. Maybe.

Aurornis4mo ago

When this was being used it was probably given to candidates who had already started the interview loop and been screened.

cjrp4mo ago

I guess someone who enjoys solving these kinds of problems anyway, and thinks the potential upside if they do get hired is worth it.

saagarjha4mo ago

Oh, this was fun! If you like performance puzzles you should really do it. Actually I might go back and see if I can improve on it this weekend…

greesil4mo ago

This is a knowledge test of GPU architecture?

avaer4mo ago

Kind of, but not any particular GPU.

The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...

But presumably similar principles apply.

benreesman4mo ago

It's a test of polyhedral layout algebra, what NVIDIA calls CuTe and the forthcoming C++ standard calls std::mdspan.

This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.

saagarjha4mo ago

You can get pretty far without needing to care about this fwiw

1 more reply

sublimefire4mo ago

trishume4mo ago

Author of the take-home here: That's quite a good cycle count, substantially better than Claude's, you should email it to performance-recruiting@anthropic.com.

pshirshov4mo ago

Yet Claude is the only agent which deadlocks (blocks in GC forever) after an hour of activity.

potato-peeler4mo ago

What does clock cycles mean? Don’t think they are referring to the cpu clock?

NightBlossom4mo ago

I could only cut it down to 41 cycles.

pickpocket4mo ago

i cleared this one but didn't clear the follow up interview that was way easier than this

mayankd4mo ago

Problem solving is eternal!

zeroCalories4mo ago

It shocks me that anyone supposedly good enough for anthropic would subject themselves to such a one sided waste of time.

pclmulqdq4mo ago

I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.

heavyset_go4mo ago

I don't do take home assignments, but when I did, I would offer to do it at my hourly rate, even if it was just an hour. It's time I would otherwise spend making money.

Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.

Does help that I have a very public web presence and portfolio, though.

3 more replies

Aurornis4mo ago

> I generally have a policy of "over 4 hours and I charge for my time.

If someone is giving a take-home problem that looks like you’re actually doing work for the company, that’s a different story. This problem is not actually work, obviously.

1 more reply

whateveracct4mo ago

4 hours continuous or no? I can't imagine finding 4 hours of straight focus.

2 more replies

djmips4mo ago

If you look at it as a puzzle game then it's not any different than the time you use to play other games.

aleph_minus_one4mo ago

> it's not any different than the time you use to play other games.

This assumes that the candidate has a lot of time for playing other games.

browningstreet4mo ago

throwa3562624mo ago

Care to elaborate the first part?

Did you apply for a position? Did they send you the assignment without prior discussion?

sealeck4mo ago

Why is writing code to execute a program using the fewest instructions possible on a virtual machine a waste of time?

0x3f4mo ago

The expected time you spend on it is much less than the expected time they'll spend on it.

efilife4mo ago

you don't get paid for it

mips_avatar4mo ago

It’s kind of an interesting problem.

dhruv30064mo ago

I wonder if OpenAI follows suit.

rvz4mo ago

They should.

SinghCoder4mo ago

why is their github handle anthropics and not anthropic :D

alexpadula4mo ago

Looks rather fun!

mrdootdoot4mo ago

“In English, Data”

mannykannot4mo ago

I beat the target by deleting the parts that were causing the cycle count to be too high. /s

eisbaw4mo ago

submit and see if Anthropic accepts it

jackblemming4mo ago

Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.

And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.

onion2k4mo ago

And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.

tmule4mo ago

Your comments history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)

jackblemming4mo ago

And they’re all dumber than John von Neumann, who cares?

1 more reply

muglug4mo ago

If they're hiring performance engineers then they're hiring for exactly these sets of skills.

It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.

Analemma_4mo ago

rvz4mo ago

> Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.

Good. That should be the minimum requirement.

Not another Next.js web app take home project.

saagarjha4mo ago

The solution was explicitly graded on creativity fwiw

j / k navigate · click thread line to collapse