At that company (a ~200-engineer, privately held software company) we found a few things:

- In-person tests were less predictive than take-home tests.
- Tests that did not provide automated test cases as examples were less predictive than those that did.
- There was virtually no predictive power to 'secret test cases' that we ran without providing to the candidate.
- No other part of the interview pipeline was predictive at all. Not whiteboarding, not presenting, not personality interviews, not culture-fit testing, not credentials, not where experience came from. Nothing. That held across all interviewers and candidates.
A few caveats about this:

- This was before take-home testing had become widespread and many companies had screwed it up. At the time we were doing this, it was seen as novel and interesting by candidates, not as just one more painful hoop to jump through.
- We never interviewed enough candidates to reach true statistical significance.
- False negatives were our biggest concern; they are extremely hard to measure (and measuring them can potentially open you up to lawsuits). The best we ended up doing was making our pipeline less selective to account for them. This did not seem to reduce employee quality.
In a more meta sense, that experience led me to believe that strict hiring pipelines are largely not useful. Bad candidates still get through and good candidates don't. Also, many other things have a far bigger impact on productivity than whether a candidate was 'good'. It turns out humans do not produce at consistent levels all the time, and factors outside of what you can interview for matter more: company process, employee health, life events, etc. all have far more impact on employee productivity than their 'score' at interview time.
Did you test the predictive power of individual interviewers? At a company I worked at previously we did, and this was by far the best overall predictor: some interviewers simply did a much better job of identifying candidates likely to succeed than others. That could also explain why you saw little predictive power when you looked at those other items across all interviewers: the variance between interviewers essentially "swamps" any smaller differences between interview techniques.
Note this didn't surprise me that much, as you see this dynamic in lots of other "person-to-person" endeavors. For example, when looking at whether one type of psychotherapy intervention is better than another, most of the data that I've seen shows that by far the most important factor is the skill and "match" between therapist and client, far more important than any individual modality.
Again, we didn't have enough data points for real statistical validity, so it could be that, but I became convinced that it didn't matter who was interviewing or what the format of the interview was. Some candidates are good at interviewing and some aren't, but that didn't carry over to the job.
This brings back some unpleasant memories of a take-home I got from a FAANG.
Basically, I was given a loose spec to implement, with no real data or test cases (and was told, when I asked, that none would be provided). After submitting my work I received a terse rejection with zero constructive feedback for my six hours of work. Uncool.
The only exception I've made is if the company pays for the time.
The point of a coding interview is to eliminate, as fast as possible, people who simply can't code. I'm being completely serious here. They can even have a CS degree (or claim to, though if you look closely they were in an easier program and just took CS electives) yet be unable to write a simple program on the board in an hour.
It's also why I don't like take-homes. First, it's trivial to cheat (I don't mean looking stuff up online; I mean flat-out having someone else do the work), and because of that the final stage would still have to be an in-person whiteboard (or pair programming over Slack, which still means an engineer spending 40+ minutes with the candidate).
I adopt the style the role calls for. I capitalize on opportunities to make decisions I can discuss: "I used tape instead of jest because this example product will be distributed to many developers. The reduced API surface area keeps us focused on the how, not the what."
I tone that down if the role seems like more rote work, at which point I try to highlight my ability to solve problems and learn quickly. For example, a comment above some network call: "// I was getting a CORS error and found out I can run my own proxy for this"
I'm not putting in half a day of work for zero pay to help you with your first-pass weed-out phase before we've even bothered to check that we align otherwise and that this looks like a good fit. Thanks, bye, next (employer) candidate.
If you are talking about short-term trials, many devs are bound by anti-moonlighting employment agreements that either outright bar working for someone else or require notification.
For long-term trials, you severely limit your hiring pool because that is effectively temp-to-hire, which many devs simply will not do.
How did you measure the candidate once hired?
What factors were indicative of a "good" hire vs. a "bad" hire?
If we are recruiting a senior, we would expect them to easily complete basic technical tests. If they are more junior we might use them only as an indicator of their ability.
I don't particularly expect a strong correlation between how well they did in the tests and their long-term ability since their value is made up of many things, only one of which is their ability in the tests.
There was only one time I was still unsure and didn't want to waste the candidate's time, so rather than telling him sorry, I set him a paid coding test to develop a microservice in order to judge his style, how long he took, what questions he asked, etc. I didn't think the result was good enough, but because we paid him, we parted on good terms and he came away with some useful feedback.
I have found that investing the time to correctly onboard new team members makes a huge difference. Correctly onboard an average/good hire and they go on to produce solid output and often thrive. On the other hand, you could have a great new hire but because of no/poor onboarding they "sink" instead of swim.
This seems like one of those occasions where improving your reliability by just a few percent (even if far from statistical significance) can massively reduce costs in the long run. (Maybe Kahneman even used interviewing as a concrete example of this in his latest book?)
Do you check on the applicants who were denied based on their test and see where they ended up working? E.g., you are a mid-tier startup that rejects someone who ends up working at Amazon as a high-level engineer: do you mark that as a failure?
I'd be careful to presume you can know these things from an interview.
> unless (of course) the candidate that was actually hired ends up being an even worse fit (ergo the need to fix your hiring process).
Total lack of self awareness in the corporate world really is an amazing thing to behold. I suppose this is "iterating" (in HR speak, not code speak): taking a set of criteria which generates a wrong conclusion, and then applying all that to ancillary things to find more wrong answers.
Unless one's focus is research and development, there is a non-zero cost to training for production skills, so it's best to start with someone who understands the delivery process.
Linear metrics are probably less useful, inasmuch as it will become rather obvious as to which employees are self-starting and work well with others, versus those that require motivation or are staunch individualists.
Timed algo challenges encompass a slew of antipatterns in terms of how good code is actually written and shipped. To begin with, pitting someone against a clock and hidden test cases (and a foreign editor) is actively optimizing against solutions that are readable to other human beings -- or to the person writing them, a year from now. The nature of running them in a browser means that it can't evaluate a person's capacity to actually use tools outside of core language functionality. Never mind that building the entire exercise around predetermined test cases precludes any way to gauge whether the person taking it has an understanding of writing tests.
And that's assuming your test environment doesn't add obnoxious and arbitrary restrictions of its own. Like telling you that using documentation is cheating. (Btw imagine listening for ctrl+t here, but not ctrl+n.) Or offering you "the language of your choice," but then throwing API call exercises at you while limiting your choice of JavaScript runtimes to a bare installation of Node -- the only one still in active development, out of a list that also includes every browser you would use to actually access the test -- that doesn't support Fetch.
The submitted question seems to just brush over this aspect, but so far when I've tried to evaluate interviewing techniques, that has been the primary obstacle: people just can't agree on what success means once employed, so anything that tries to correlate interviewing with it will be equally junk.
I interviewed hundreds of technical people in my career, across dev, test, and ops skill sets. I saw limited correlation between tests and aptitude. If you talk to someone about a project they've done, you know pretty quickly:
1) Can they communicate technical ideas?
2) Can I develop a rapport with this person and work together?
3) Do they understand what they built? Can they talk about the tradeoffs they made? Did they learn anything from the experience?
A fizz buzz test isn't a terrible idea, but you also need an interviewer who knows how to administer it within the wider context of the interview. An interviewer who doesn't understand it themselves isn't qualified to administer it.
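For anyone unfamiliar with the exercise, it is tiny by design: walk the numbers 1 through 100, substituting "Fizz" for multiples of 3, "Buzz" for multiples of 5, and "FizzBuzz" for multiples of both. A minimal sketch (the function name and structure are mine, not something from this thread):

```python
def fizz_buzz(n: int) -> str:
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:          # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for i in range(1, 101):
    print(fizz_buzz(i))
```

The point of the screen is not the solution itself but watching whether the candidate can produce something this simple at all, which is why the interviewer needs to know what they are looking for.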
https://catonmat.net/programming-competitions-work-performan...
We have short, standardized, broad interviews. We look for what can be added to the team rather than poking holes, and we're still trying to improve.
So far we’ve hired 7 decent and 3 great people. No truly bad people have made it through that pipeline yet.
I can’t say anything about why, and I’d be prejudiced in any case.
Of course, it's really not possible at all to do this at the level of rigor expected of, say, clinical trials. Each new hire will know what type of interview you put them through, and there is no reliable way you can prevent them from telling others.
On top of being hard to measure, the data points generated through hiring are just too few, and the data collection process is too long and subjective.
Just ask your team if they like the new hire and whether they can make progress together. Things like: Do you like working with the new hire? Is the new hire bringing new insights to the team? Is the new hire easy to work with? Is the new hire learning new things?
And most importantly: can the team let go of a mismatch fast enough? Overall, I would say it is just not worth trying to measure hiring.
However, we do hire some contractors essentially without an interview, and it is fairly apparent that's a bad idea.