There's a lot of good insight in the article about the correct way to approach the problem, but asking anyone to come up with it on the spot is unrealistic. You have the benefit of having seen the problem before with time on your side to reflect on it. They haven't.
When I do interviews like this, I prefer to talk through the problem with the candidate, as if we were actual teammates working on it together. That more closely resembles life on the job, which to me is the point of interviewing someone.
The "is this person good enough for me" interview allows geniuses who are assholes through. I prefer to filter for good teammates.
I wouldn't expect everyone to see the optimizations immediately -- you can sort the days individually, you can drop all but a small number of pages per customer -- but I would be disappointed if someone couldn't be hinted there by the end of a 30 minute interview.
Failing to even get the O(n log n) solution tells me that someone should never have graduated.
Well yes, but that's not what this is about. It's about social pressure, not ability.
I could program this for you right now if I wanted to; most "optimisations" seemed the "obvious ones" to me, so I guess I would have passed. I could also program it in a hurry at ludicrous speed under pressure because all of production is down or whatever.
But ... I wouldn't be able to program this in an interview setting.
Last time I was asked a bunch of even easier things (in a timed interview, no less), where I literally have the exact code on my GitHub. I just couldn't finish it. Not because I can't "get" it, but because programming while two strangers are judging your every keystroke and commenting on what you're doing is just nerve-wracking.
Or, basically this sketch: https://www.youtube.com/watch?v=kQa5NsdYSts
This is something I wonder about in my comment sidethread - to me, the natural solution is the O(n) one in O(n) space.
I see that the O(n log n) solution requires O(1) space to run. But it requires O(n) space to return a result! Is that actually better than O(n) time and O(n) space?
I think that rather, I would have done well because I know after doing hundreds of these both as the interviewer and the interviewee that using some sort of Map along with some vague rumblings of “space/time complexity tradeoff” is very often the solution for these types of questions. So I would have immediately gone there first by instinct.
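As a sketch of that map-first instinct in Python (the whitespace-separated "customer page ..." log line format is an assumption, not something the article specifies):

```python
from collections import defaultdict

def loyal_customers(day1_lines, day2_lines, min_pages=2):
    """Customers who appear on both days and visited >= min_pages unique pages."""
    # Map each customer to the set of unique pages they visited on day 1.
    day1_pages = defaultdict(set)
    for line in day1_lines:
        customer, page = line.split()[:2]   # assumed format: "customer page ..."
        day1_pages[customer].add(page)

    loyal = set()
    day2_pages = defaultdict(set)
    for line in day2_lines:
        customer, page = line.split()[:2]
        if customer in day1_pages:          # must have been seen on day 1 too
            day2_pages[customer].add(page)
            # Union of pages across both days must reach the threshold.
            if len(day1_pages[customer] | day2_pages[customer]) >= min_pages:
                loyal.add(customer)
    return loyal
```

One pass per file, one Map lookup per record: the "space/time tradeoff" reflex in about 15 lines.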
Only the basics that are close to everyday programming work: write a for-loop, know what a Map/dict is, and Google for “how to read a file”. If a candidate can’t do that, they can’t really program. ChatGPT can probably code this answer.
The coding exercise will involve writing tests, design of classes, working with floating point arithmetic, working across thread barriers, some fairly standard maths, and so on.
I actually can't say we've hired a bad candidate yet, and they all report that the interview is a positive experience.
In my fairly long career I've been through about 10-15 interview processes. A few of them were quite toxic and those were the "just cram leetcode for months" type. I never want to inflict that on someone else.
If a colleague came to me to discuss a question this basic, I'd want them PIPed out. This is miles from the complexity of problems that require collaboration. It's a shell pipeline that I'd write with sort/uniq in less than five minutes. I just did it to make sure my estimate wasn't wrong:
sort <(sort -u <(cut -d' ' -f1,2 /tmp/day1) <(cut -d' ' -f1,2 /tmp/day2) | cut -d' ' -f1 | uniq -d) <(sort <(cut -d' ' -f1 /tmp/day1 | sort | uniq) <(cut -d' ' -f1 /tmp/day2 | sort | uniq) | uniq -d) | uniq -d
Maintainable? No. Done iteratively in less than five minutes? Yes. If your perspective is that it's unrealistic to ask anyone to come up with an answer to this question on the spot, you should take a long, hard look at yourself and the quality of your colleagues.
You didn't ask the question you were supposed to ask. :)
Senior in a perfect world, sure, but the reality is that for decades people have been desperate for talent, so you don't have to do much to get by.
I knew a popular principal engineer at a Fortune 50 company who was loved by execs. He would check in 5 or 10 copies of the same file. The opposite of the DRY principle.
> Occasionally, a candidate will think they’re done at this point. 90% of the times that a candidate was done without questioning the quadratic nature of that solution, the final outcome of the loop was No Hire. So that’s another signal for me.
I would have been one of those candidates. The author said they didn't like tricky questions and wanted a signal on how the candidate might approach real-world problems. Well, this is indeed tricky -- unless you drop a bunch of constraints at the beginning. For a real-world project, I would just use all the resources I can access to finish it. I am not going to go the extra mile to optimize it in every possible way. Premature optimization can be evil. I provided the solution, it works and meets all your requirements, so I am done.
Want me to make it fast/memory efficient? You have to say it. Forgot to mention it in the first iteration? No problem, cut me a ticket and I will see if I can sneak it into my next sprint.
It wouldn't even occur to me to go with the naive O(n^2) solution because it has such obvious drawbacks with large inputs.
And it's an interview question... yes, you're getting a No-Hire if you just leave it there. Although I personally would prompt you if you're not interviewing for a senior position.
In real life people write a lot of O(n^2) code without realizing it, and usually it's just some unnecessary loop inside another loop. I want to know that you care about some things.
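A tiny illustration of the kind of accidental quadratic loop meant here, next to the linear fix:

```python
# Accidental O(n^2): list membership inside a loop re-scans the list each time.
def dedupe_quadratic(items):
    out = []
    for x in items:
        if x not in out:      # O(n) scan per item -> O(n^2) overall
            out.append(x)
    return out

# Same result in O(n): a set makes each membership check O(1).
def dedupe_linear(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```

Both look equally innocent in a code review; only the second survives a large table in prod.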
I don't understand why this has to be a red flag. What's wrong with just saying "it works, but can you make it faster"? As an interviewer, I say this a lot and I never judge candidates on this at all.
A realistic dataset for the problem they described could easily be tens of millions of records even if you're not a Google-sized site.
> Want me to make it fast/memory efficient? You have to say it. Forgot to mention it in the first iteration? No problem, cut me a ticket and I will see if I can sneak it into my next sprint.
It’s probably a lot easier to just not hire people who deliver poor quality work in the first place. Then you don’t have to worry about whether or not they can go back and fix it later.
"How large can the file potentially be" would be my first question. Depending on the size I might even throw sqlite, an RDBMS, or even a full text search engine at the problem. Or I might not, it depends on the actual scenario.
But if your description is to keep it simple I'm going to do that in good faith.
Me: Child, please go wash your hands.
Child: rinses hands under cold running water for 2 seconds
Me: With soap! And warm water! For more than two seconds!
Child: You want me to wash my hands with soap? You have to say it.
The soap is independently helpful, but it does occur to me to wonder how much of the additional value it brings is due to the automatic requirement to spend more time with your hands in the water, lathering up and rinsing off.
Me: Should I wash both hands?
Parent: Yes.
Me: Should I wash them in any order or the order is important?
Parent: In any order.
Me: Should I wash them fast or thoroughly? I need to ask as that may affect the way I'm solving this interview problem.
They fester and/or blow up regularly otherwise.
There are some features that will never be used at any significant scale.
E.g. "this is a one-off report" and "ok, how would you do it differently if we wanted to turn this into a feature at scale?"
It's great if an engineer gets there on their own, but there are so many tangents one could go on that I won't ding someone for picking the wrong one, or asking for the most relevant one.
I would expect seniors not to need that level of guidance on the job, with the caveat that expectations should be reasonably set in design docs, tickets, various bits of context, etc.
What if this is a one-off to produce a business report? Would it make sense to use programmer time to create an O(n) structure in memory, or just loop through the files line by line and let the CPU take a minute or five, or thirty? What is the programming language - something that has a library for this or something very low level where we’d read the file byte by byte?
If we’re dealing with the latter, a small amount of data, and a one off report, I don’t care at all in my work whether an engineer I’m managing somehow writes it in O(n^3).
It’s interesting how quick the author is to judge - ask this question for points, don’t even think about that, don’t mention arrays because they’re fixed size (despite implementations of dynamically allocated arrays totally existing, and the candidate might be coming from one), and so on. Some humility would be nice.
Although I think what they wrote is very valuable, as this is how many interviews go. And I have to at least appreciate the author’s approach for trying to start a conversation, even if he still takes a rather reductive approach to evaluating candidates.
The attempts to circumstantially excuse away shitty answers go against the desire to just not be shit.
The author here seemed able and willing to talk through constraints & issues. Their first paragraphs practically begged for it, as a sign of maturity. Rather than just excuse away shitty solutions, my hope is, even if you are not a super competent can-do coder, you at least can talk through and walk through problems with people that are trustable, to arrive at reasonably competent capable answers.
> About 80% of the candidates go for the naive solution first. It’s easiest and most natural.
The “naive solution” will be easier to understand and maintain. Why make it harder if it doesn’t add value?
But even more importantly, with a slightly better solution I don’t get woken up in the night once a week because some buffoon left behind a hundred of these little performance landmines that worked great when the table had ten rows on their dev box but causes an outage in prod the second we hit some critical threshold of data.
This takes all sorts of forms from people using quadratic time or quadratic memory to my personal favorite: pulling entire database tables across the network to do basic analytics.
The authors always have the same excuse, that “it worked good enough for what was needed at the time”, ignoring the simple fact that the version which wouldn’t have caused an outage would have been just as easy from day one.
Unfortunately people don’t think about actual engineering cost of “optimal” solutions. Engineering costs are part of operational costs and need to be juxtaposed against compute.
You can get a lot more mileage out of running the largest EC2 instance for a year vs hiring a junior engineer.
> In theory you could write a pretty simple SQL query, and sure, Big Tech companies have giant data warehouses where you can easily do this sort of thing. But for the scope of a coding interview, you wouldn’t want to. Since this is not a distributed systems problem and the data fits in memory, why introduce the additional complexity and dependencies of a database for something that you can solve with 20 lines of simple code?
My first thought on an actual implementation was, if this is a one-off request, to import it into sqlite. No need to set up a big system, and I think it would be easier/faster than writing those 20 lines of code. Also a hell of a lot easier to iterate on minor spec tweaks like the unique pages overall vs per day clarification. And probably less likely to have off-by-one type of bugs, since the simple logic is handled by the database itself. Bonus, it does handle the case where the dataset is larger than memory!
...
> I’ve actually expressed the problem in an ambiguous way.
So when other people do it, hard and tricky questions are bad, but when you deliberately set your candidate up for failure by withholding concrete information, that's clever and insightful. Got it.
Or more productively put: The author obviously enjoys tearing down simple questions with complex implications (one often does in the sw field) and reflects over their candidates, but seemingly lacks the self-reflection to understand what makes questions hard or tricky and why interviewers like to pick them.
I don't see the author's withholding of all problem parameters as tricky at all, but an attempt to accurately mimic the ambiguities of the real world to see what the candidate does with it.
Especially for more junior people it's selecting on confidence, because especially in a hiring setting they don't want to "seem stupid" by asking questions. I guess this is also a problem for more senior people. It's just a different setting than "actual job".
If you want to ask if anything could be clarified then ask.
To me the worst part is the nonsense rationale to argue away using a database to store and query this data. Taken from the article:
> In theory you could write a pretty simple SQL query, and sure, Big Tech companies have giant data warehouses where you can easily do this sort of thing. But for the scope of a coding interview, you wouldn’t want to. Since this is not a distributed systems problem and the data fits in memory, why introduce the additional complexity and dependencies of a database for something that you can solve with 20 lines of simple code?
The blogger's problem statement explicitly mentions the system needs to track business metrics. Business metrics include things like clickstream metrics and the task of analyzing business metrics is exploratory in nature and requires dashboarding. Only an utter moron would look at those requirements and say "hey, this is a task for logs dumped onto a text file". There is no way in hell that this task does not involve in the very least a database.
This is yet another example of a technical recruiter evaluating candidates, and rejecting them, because they did not present the wrong solution to the wrong problem that the recruiter somehow convinced himself was the right one.
Will the data fit in memory? Well, of course it will, it's an interview... oh you expected me to ask you anyway?
I should obviously load both files into the hashmap, that way it works for an arbitrary amount of files instead of just two... oh, you expected me to write a solution for literally the exact problem you stated without considering its practicality? Even though before you wanted the opposite, when you asked about the algorithmic complexity?
Guess I'm failing.
But you are right that optimizing specifically for two files seems wrong, especially as the data contains a timestamp, so you could simplify the problem and ignore the number of files completely.
> Now, given two N log files we want to generate a list of ‘loyal customers’ that meet the criteria of: (a) they came on ALL days, and (b) they visited at least L unique pages.
while keeping linear time complexity and
O(num records in first file * L)
memory complexity. It is not too different from the solution given in the article (just use 3 maps instead of two). That means multiple files don't need out-of-core processing, as long as the maps for one file at a time fit in memory.
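A sketch of that multi-file generalization in Python (here the maps are rolled into one candidates map that is rebuilt per day; the "customer page" line format and the page-set cap at L are assumptions):

```python
def loyal_customers(files, L):
    """Customers present in every file who visited >= L unique pages overall.

    `files` is a list of iterables of "customer page" lines, one per day.
    Memory is bounded by the candidates surviving from the first day, with
    at most L pages kept per candidate.
    """
    candidates = None   # customer -> set of pages seen so far (capped at L)
    for day in files:
        seen_today = {}
        for line in day:
            customer, page = line.split()[:2]
            if candidates is None:
                pages = seen_today.setdefault(customer, set())
            elif customer in candidates:
                pages = seen_today.setdefault(customer, candidates[customer])
            else:
                continue  # missed an earlier day, can never be loyal
            if len(pages) < L:  # cap: we only need to know we reached L
                pages.add(page)
        candidates = seen_today  # only customers seen every day survive
    return {c for c, pages in (candidates or {}).items() if len(pages) >= L}
```

Each day is a single pass, customers absent on any day are dropped immediately, and the per-customer page set never grows past L.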
"explain select" is a cool source of interview questions :)
If you're consistently getting reliable answers and finally decide to build a system for these types of reports, clearly this guy's real-world experience on Amazon's Clickstream product is going to be far more valuable than whatever anyone who is brand new to the problem can come up with, even if they choose the "correct" algorithm from the start.
Because I bet you that for most real world products that create more than a single fixed format report, you actually want your data set up in a completely different way than what you initially thought. You'll probably learn, for example, that you want to aggregate data per week instead of per day. Or perhaps you want to link this data to an internal users database, or perhaps your boss wants a notification when new data is added. Or perhaps you'll learn that loading it into a single 1GB SQLite DB solves your problem without even needing to think about any algos.
My team builds out a lot of one-off projects for customers, and a lot of them don't need to be especially performant. That has always been the case for reporting/analytical projects.
You know what does make a difference though? The amount of development effort spent on the solution. The most performant solution can very easily cost the company much more in terms of development time spent. So in that sense, it's sub-optimal.
Grep, uniq, wc, and a few others can be treated as pipeline data transformers to answer questions like this interview question. As long as you make some smart decisions about the order of operations, you can usually get performance on par with what you might write custom code for.
I would love to interview a candidate that can show they can use command line tools effectively.
Don't get me wrong - it's a totally fair question, frankly one I would have been happy to receive when I was interviewing for those roles.
I'm also a fan of whiteboarding coding interviews in general as a way of evaluating talent so no objections there.
There was just something about this specific question that struck me as boring, soulless - like, who cares? I think my objection might be that it too closely resembles a menial task I might actually be given - something that I hope to God the upcoming LLM advances automate away.
"they don’t know the size of the data upfront", okay. I spend one scan finding the highest customer number and probably make some 10MB index-associated buffer. If I'm fancy I find the range and use offset indices to reduce the overall size. You already said it fits in memory and I'm not a distributed programmer. Space is cheap in boring reality.
I guess it's one of those cool brain teasers that gets you excited to use your skills from college. Not many get to in reality. Or they prefer other domain-specific skills.
A cleaner solution would be to load both days 1 and 2 into the same hashmap. Then you can iterate the map and count whatever condition you want.
My design was to create two hash maps, one for customer to a list of days and one for customer to list of pages, though after reading the article I realized my lists should really be sets. Then you can easily account for any change to the definition of a loyal customer, as all you need to do is use two O(1) lookups and then check the size of the lists. Easy, flexible, and little room for error.
Especially when the question scenario is generating "business metrics" which tend to see a lot of tweaking and iteration.
Having engineers who make contextually appropriate designs and architectures is at least as important as having engineers who are math-whizzes.
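The two-hash-map design described above can be sketched in Python (assuming records have already been parsed into (customer, day, page) tuples; thresholds are parameters so the "loyal" definition can be tweaked):

```python
from collections import defaultdict

def find_loyal(records, required_days=2, min_pages=2):
    """records: iterable of (customer, day, page) tuples."""
    days = defaultdict(set)    # customer -> set of days visited
    pages = defaultdict(set)   # customer -> set of unique pages visited
    for customer, day, page in records:
        days[customer].add(day)
        pages[customer].add(page)
    # Loyal: visited on enough distinct days AND enough unique pages.
    return {c for c in days
            if len(days[c]) >= required_days and len(pages[c]) >= min_pages}
```

Changing the definition of "loyal" is just a change to the two thresholds, which is exactly the flexibility argued for here.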
Any experienced engineer should have no trouble with it. There's no hiding here - the candidate just needs to deliver something working, something relatively clean, and be reasonably pleasant to pair with. No leetcode grinding necessary, though I have found that those who did well on this problem also generally got high scores from my colleagues who do ask LC questions.
It's really just a scenario with some mockup third-party API docs, where the applicant needs to write some pseudocode that checks different conditions, arranges the data, and ties together the different calls.
It might not be testing every possible skill the applicant has, but at least it's in-line with one of the tasks we actually expect them to perform regularly.
"Let us hire this person because they didn't solve the problem but were a great conversationalist problem solver" - said nobody at a standardized hiring committee!
The first answer that popped into my head was a shell pipeline, "cat file1 file2 | sort -k [pattern for customer ID] | awk -f ..." where the awk part just scans the sort output and checks for both dates and two pages within each customer-ID cluster. So maybe 10 lines of awk. It didn't occur to me to use hash tables. Overall it seems like a lame problem, given how big today's machines are: 10,000 page views per second for 2 days, crunching each record into 64 bits, means you can sort everything in memory. If 10 million views per second then maybe we can talk about a hadoop cluster. But 10k a second is an awfully busy site.
I actually had a real-life problem sort of like this a while back, with around 500 million records, and it took a few hours using the Unix command line sort utility on a single machine. That approach generally beat databases solidly.
That is basically how everything worked back when 1MB was a lot of memory. The temp files were even on magtape rather than disk. Old movie clips of computer rooms full of magtape drives jumping around, were probably running a sorting procedure of some type. E.g. if you had a telephone in the 1960s, they ran something like that once a month to generate your phone bill with itemized calls. A lot of Knuth volume 3 is still about how to do that.
These days you'd do very large sorting operations (say for a web search engine indexing 1000's of TB of data) with Hadoop or MapReduce or the like. Basically you split the data across 1000s of computers, let each computer do its own sorting operation so you can use all the CPU's and RAM at the same time, and then do the final merge stage between the computers over fast local networks.
I've used the Unix sort program on inputs as large as 500GB and it works fine with a few GB of memory. It does take a while, but so what.
The data structures look sensible and it did most of what the interviewer wanted on the first try.
It did make the wrong initial assumption that we wanted 2 unique pages per day. When prompted with a clarification, it made a sensible fix.
When asked to optimize, it went for big hammers like parallel processing and caching. As opposed to saving memory by only storing one file in the data structure as the author discussed.
https://aider.chat/share/?mdurl=https://gist.github.com/paul...
Lots of good software engineers will have the instinctive knowledge of good or costly solutions without mapping it to BigO.
Also, what is funny is that, in my opinion, Big O is used and required by people who want to look smart but are not necessarily so. Because what we do with Big O is really limited. Almost nowhere will the discussion go further than O(1), O(log n), O(n), O(n^2). Because beyond that the maths becomes hard to understand. But in my opinion, algorithm complexity goes way beyond that when you use it in real life.
Maybe you don't work in the right places then.
I suppose this is the step where I become a "poor candidate". I think it's important to acknowledge changing client requirements at this point. Sure, loading both files into memory is less memory efficient, but it's much easier to tweak this algorithm later if you do. You can change it to count over 3 different days, or 2 days in a 5 day time period, or any number of other things. You can save some memory if you don't, but you'll arrive at a solution that is much less flexible.
I mean the real solution is to load all the data into a database of course, but even given the constraints of the problem I'd still argue for loading each entire file into memory as the more general and flexible solution, for when our pretend clients inevitably change their pretend minds.
>you don’t need to actually keep every single page from Day 1 in the Map, just two, since the problem is “at least two pages” so a Set of size 2 or even an array of size 2 will use less memory than an unbounded Set.
And with this I think we've crossed over from the practical to leetcode. At this point you're asking the candidate to add a bunch of new code paths (each one should be tested) and make their solution a lot less general. Going from a pretty general algorithm that can be tweaked pretty easily to something super specific with a bunch of new sources of bugs.
>Or, if you’ve already determined that a customer is loyal, you don’t need to waste CPU cycles going thru the logic again next time you encounter that customer in Day 2.
No, please load it all into your data structures properly, even if you "waste" a bit of time. All these weird little conditionals sprinkled throughout your code when you're ingesting the data are going to be sources of problems later. They might save a bit of memory, a few cycles, but they significantly increase the complexity of the code, and make a refactor or tweaks much much harder.
If this developer started doing stuff like that in an interview with me, well it would raise some red flags.
>If you want to make the problem even more interesting, you can add a third file. I will leave that as an exercise for the reader as well!
See, our imaginary customers imaginary minds did end up changing. Bet you wish you had loaded both files into memory now.
As someone going through this style of interview at the moment (but not having interviewed at Google, Microsoft or Amazon), two things jump out at me:
- If you're going to ask this question and get it done in 1 hour, does the code really matter? I'd argue that if you can get to a good or optimal solution, 99 times out of 100 you can write the code. If I got this question and didn't know better, I'd be stressing about writing the code within an hour. Knowing that we wanted to spend most of the time discussing the algos and data structures would be really useful to me. Maybe Google/Amazon/Microsoft interviews really stress this in their preamble, I don't know.
- The big "issue" I see with this question is that it relies on the interviewer knowing exactly how to steer the conversation. I think I could get to this solution with the hints, and the author seems to imply that it's ok to need a few hints. But an interviewer that doesn't know the right hints to give (or phrases them poorly) is going to turn this question into a train-wreck. This isn't an issue for the author, they clearly know this question backwards and forwards. But giving this question as a 'standard' question that others will deliver? I think it could easily end up being too conservative and cutting out a lot of otherwise smart developers.
In general, that's my criticism of this style of question: they all claim that they're about 'seeing how you think'. But I think expecting interviewers to be able to elicit a conversation that really shows 'how a candidate thinks' is much more on the interviewer rather than the interviewee. You're expecting people whose primary job is writing software to be really good at delivering interviews.
Instead, you're going to have candidates who most of the time will do well if they can pattern-matching against problems they've seen in the past, and poorly otherwise. I can see how questions like this seem good on paper, and I'm glad this question works for the author. But it's the combination of interviewer and question that makes it effective, not just the question alone. A better title for this post might be 'My favourite way of interviewing candidates', because this post is mostly to do with the author's mental model of how to run an interview with this question.
... I'm guessing I didn't get the job.
I feel that the author filtered out lots of great candidates with this problem, which might be something to pause on. On the other hand, interviewing to get a good signal is indeed a tricky business, so I can sympathize.
You can do a little better than that. Each item in your map is in exactly one of 3 states:
- We’ve seen this customer visit one unique page with (xx) url on the first day
- We’ve seen this customer visit two unique pages - but only on the first day.
- We’ve seen the customer visit one unique page (xx) and they’ve visited on both days.
In the second state you don’t actually care what the URLs are. And I think the logic for the 3rd state is identical to the logic for the 1st state - since you only add them as a “loyal customer” by visiting them again on the second day. So I think you can get away with using an Option<Url> to store the state instead of a Set or a list. (Though I’d probably use a custom parametric enum for clarity).
It’s a great problem to model using a state machine.
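A sketch of that state machine in Python (representing the per-customer state as a (State, page-or-None) pair; records are assumed to be (customer, page) tuples):

```python
from enum import Enum

class State(Enum):
    ONE_PAGE_DAY1 = 1   # seen exactly one unique page on day 1 (carry the page)
    MULTI_DAY1 = 2      # seen two+ unique pages, so the URL no longer matters

def loyal_customers(day1, day2):
    """day1/day2: iterables of (customer, page). Returns the loyal customers."""
    state = {}   # customer -> (State, page-or-None)
    for customer, page in day1:
        st = state.get(customer)
        if st is None:
            state[customer] = (State.ONE_PAGE_DAY1, page)
        elif st[0] is State.ONE_PAGE_DAY1 and st[1] != page:
            state[customer] = (State.MULTI_DAY1, None)  # URL now irrelevant

    loyal = set()
    for customer, page in day2:
        st = state.get(customer)
        if st is None:
            continue  # not seen on day 1, can't be loyal
        # Loyal if they already had 2+ pages, or day 2 brings a new page.
        if st[0] is State.MULTI_DAY1 or st[1] != page:
            loyal.add(customer)
    return loyal
```

As the comment notes, in the MULTI_DAY1 state no URL is stored at all, so per-customer memory is constant regardless of how many pages they visited.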
I don't mind O(n²) if I get good result but Amazon's search seldom gives me good results.
Same with Microsoft's search in the start menu. Doesn't find Excel if I type "exc".
I don't like the trick of failing candidates if they don't ask a question. 90% of this style of interview want candidates to rifle through solutions. If you want to talk about requirements, be explicit about it.
I'm really amazed that this "best interview" question really just boils down to leetcode for a _Senior Staff_ level interview. I don't know about y'all, but the _Senior Staff_ and _Principal_ developers I've worked with aren't wasting their time on shit like this. They're ironing out requirements. They're working with stakeholders. They're architecting systems. They're figuring out how to deliver the value the customer wants - and ensuring that it's actually what the customer wants.
-----
There's a place for performance, but the fast running turd is still a turd.
A good interview should involve more than just a coding problem. But it should absolutely require at least one coding problem. It’s mind boggling the number of “senior” people with good resumes I’ve screened out in interviews over the years because, simple as a problem like this is, they really had no idea how to even start solving it.
I don’t know about the poster, but when I’ve done interviews - especially for senior people - there are a lot of different types of assessment I’d want to do before hiring them. I’d also want to assess their social skills somehow (eg get them to present to the team about something interesting). And ask some high level systems architecture questions, talk about their background, and more.
My username at a large org is my first name. In any situation where someone wants to link to me or mention me, typing my username brings up a list of every person at the org with my name, alphabetically. My last name inevitably sorts me down toward the bottom.
You would think that an exact literal username match would have priority, but no. Typing any prefix of my name similarly sorts everyone else before me too.
What do you get? I tried just now, and `E` gives me Excel. In fact, I typed "Esc" first by accident, and I got something different for "Es" but "Esc" gave me Excel too.
“Load to a relational store and use sql” would be a reasonable answer that, I’m sure, would be acceptable in most cases.
I only know how certain SQL engines implement queries because I was tasked at that point to increase the speed of the queries, and went deep into the debugging level detail to understand the exact cost of each action.
But I haven't done that in 7 years, and couldn't tell you much more than the tools used to figure it out.
I cannot remotely understand the requirements people make up for our jobs that have nothing to do with doing our jobs.
Sorting the files can be a logarithmic operation with constant memory.
A great candidate would know that this is a log-linear operation O(n*log(n)), not a logarithmic one O(log(n)). Very low memory and compute requirements.
Map<CustomerId, Metadata>
Where Metadata is a dateVisited bitset plus a bloom filter (32/64 bits depends on how accurate you need it to be).
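A rough sketch of that compact-metadata idea in Python (the hash function and bit widths are assumptions; note the tiny bloom filter can undercount distinct pages on collisions, never overcount):

```python
import hashlib

BLOOM_BITS = 32   # tiny per-customer bloom filter for the unique-page estimate

def page_bit(page):
    # One hash -> one bit position in the 32-bit filter.
    h = int(hashlib.blake2s(page.encode(), digest_size=4).hexdigest(), 16)
    return 1 << (h % BLOOM_BITS)

def loyal_customers(records, num_days=2, min_pages=2):
    """records: iterable of (customer, day_index, page).

    Per-customer metadata is just two small ints: a day bitset and a
    32-bit bloom filter approximating the set of distinct pages.
    """
    meta = {}   # customer -> [day_bitset, page_bloom]
    for customer, day, page in records:
        m = meta.setdefault(customer, [0, 0])
        m[0] |= 1 << day          # mark the day visited
        m[1] |= page_bit(page)    # mark the page's bloom bit
    all_days = (1 << num_days) - 1
    # Popcount of the bloom is a lower bound on distinct pages, so a rare
    # hash collision can only make us miss a loyal customer, not invent one.
    return {c for c, (days, bloom) in meta.items()
            if days == all_days and bin(bloom).count("1") >= min_pages}
```

A handful of bytes per customer instead of an unbounded page set, at the cost of a small false-negative rate from bloom collisions.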
> Since this is not a distributed systems problem and the data fits in memory, why introduce the additional complexity and dependencies of a database for something that you can solve with 20 lines of simple code?
Because this is a question about getting the right data, and SQL Databases are...extremely good for filtering, sorting and grouping... data. Besides, every page visit from every client is a unique observation, and the principle of...tidy data suggests that every observation use a database row.
Why solve this with 20 lines of code, when you can solve it in a 4 line SQL query?
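For the record, a four-line query in the spirit of this comment might look like the following (sketched with sqlite3 so it's runnable; the schema, column names, and sample data are made up):

```python
import sqlite3

# In-memory table of (day, customer, page) visits -- illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (day INTEGER, customer TEXT, page TEXT)")
con.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    (1, "alice", "home"), (2, "alice", "pricing"),
    (1, "bob", "home"),   (2, "bob", "home"),
    (1, "carol", "home"),
])

# "Loyal" = appeared on both days and hit at least 2 unique pages.
rows = con.execute("""
    SELECT customer
    FROM visits
    GROUP BY customer
    HAVING COUNT(DISTINCT day) = 2 AND COUNT(DISTINCT page) >= 2
""").fetchall()
print([r[0] for r in rows])  # ['alice']
```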
It creates a team of people who are good at tests, and good in testing environments.
However, there are so many things that make an engineer great that have nothing to do with how they solve problems but who they are, how dedicated to improvement they are, etc.
But they may not be someone who:
- thinks quickly on their feet
- finds this type of situation tolerable
- may have disabilities one can't see that would make this kind of interview difficult for them
- may have personality challenges or anxiety in social situations that make interviews like this impossibly difficult
The list of reasons that “whiteboard testing interviews” don’t work well is long.
I don’t think there’s anything wrong with this approach if that’s the kind of organization you want to build. But it does tend to create homogeneity and act as a gatekeeper against “those who do not fit in.”
Some of the very best engineers I have ever hired would never make it through this interview. But they were amazing engineers who did world-class work.
How is the author determining which candidates are great? Do "great" candidates answer the questions the best, or is the interviewer following up 1-2 years after hire and examining the impact of the person?
Great candidates aren't those who can answer DS & algorithms questions the best, but it seems that the author thinks that way.
That seems overcomplicated. For each customer on day 1, you either have multiple pages or you have a single page. If you see them on day 2 and they had multiple pages on day 1, then they are loyal. Or if they had a different page on day 1 than day 2, they’re loyal. (Or two different pages on day 2, but this comes along for free.)
So the data structure can be:
map<customerid, (MultiplePages | pageid)>
Where MultiplePages is a choice in a sum type that doesn’t store any associated data. Or you can do:
map<customerid, optional<pageid>>
Where the none state of the optional means there are multiple pages, but this is a bit of an odd use of an optional.
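A minimal Python sketch of that sum-type idea, using a sentinel object to play the MultiplePages case (record shape and names are assumptions):

```python
# Sentinel standing in for the MultiplePages case of the sum type.
MULTIPLE_PAGES = object()

def loyal_customers(day1, day2):
    """day1/day2: iterables of (customer_id, page_id) visit records."""
    seen = {}  # customer_id -> single page_id, or MULTIPLE_PAGES
    for cust, page in day1:
        if cust in seen and seen[cust] != page:
            seen[cust] = MULTIPLE_PAGES   # second unique page on day 1
        else:
            seen[cust] = page

    loyal = set()
    for cust, page in day2:
        first = seen.get(cust)
        if first is None:
            continue                      # not seen on day 1
        if first is MULTIPLE_PAGES or first != page:
            loyal.add(cust)               # multiple day-1 pages, or a new page
    return loyal

print(loyal_customers([("a", "x"), ("a", "y"), ("b", "x")],
                      [("a", "x"), ("b", "x"), ("c", "z")]))  # {'a'}
```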
For reference: http://www.leancrew.com/all-this/2011/12/more-shell-less-egg...
If I used this question I'd add to it.
"Pretend you have to do the work on an Arduino Uno, which has very few resources. Your Uno can request input and produce output from the disk where these are stored, at whatever offset you wish. The log files are 100GB each and sit on a desktop computer with a modern Linux on it. Each log line is 512B. You can create files if you need to through some unspecified protocol with the desktop computer. But the desktop computer must be dumb. It will only write to and read from disk. You can send it any disk system call you wish. Step 2: Now do it without sorting or anything absurdly slow."
The point is to ask for actual creative solutions instead of the pattern recognition problems that most of these problem formats are.
You want something that a weekend of drilling won't change the result of.
1. Find customers appearing in both log files. From this create a Set of customers that appear in both. In python it's just creating a set from the first file, creating another set from the second file, and intersecting them. O(N)
2. Concatenate both files to treat them as one. Create a Map with key: Customer and value: Set[Page]. This is basically iterating through the log: when you see a customer, add them to the map if they aren't there yet, and add the page to their set. O(N)
3. Filter the map for all customers with length(Set[Page]) > 1. To get the Set of all customers that visited more than one page. O(N)
4. Intersect the set of customers who visited multiple pages with the set of customers that appeared in both log files to get the answer. O(N). You can do this with the & operator in python.
The irony is I'm more able to do this in python than I am in SQL. For SQL stuff at work I just look it up; I never remember the exact syntax. Total runtime O(N), total memory O(N).
This is just your basic hashmap stuff for fast look up and aggregation. Usually fang questions are harder than this.
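The four steps above sketch out to roughly this (note the "both files" and final combine steps are set intersections; a union would give you customers who appeared in either file):

```python
def loyal_customers(day1_log, day2_log):
    """Each log: a list of (customer_id, page_id) records."""
    # Step 1: customers appearing in BOTH files -- set intersection, O(N).
    both_days = {c for c, _ in day1_log} & {c for c, _ in day2_log}

    # Step 2: treat both logs as one; map customer -> set of pages, O(N).
    pages = {}
    for cust, page in day1_log + day2_log:
        pages.setdefault(cust, set()).add(page)

    # Step 3: customers with more than one unique page, O(N).
    multi_page = {c for c, p in pages.items() if len(p) > 1}

    # Step 4: in both files AND multiple pages -- intersection again, O(N).
    return both_days & multi_page

day1 = [("a", "x"), ("a", "y"), ("b", "x")]
day2 = [("a", "x"), ("b", "y"), ("c", "z")]
print(sorted(loyal_customers(day1, day2)))  # ['a', 'b']
```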
But most jobs are not log file processing.
It's ridiculous to generalise this sort of thing:
"We are cooking soup here at Amazon Soup Kitchen. My favorite interview question is to ask candidates to bake a cake, that's the real test of any cook."
Finally, it's worth mentioning that while the question + answer might correlate well with the hiring decision, there's no mention of how well it predicts future performance. That said, there's survivorship bias at play, so using it against performance might be iffy.
A big part of this job is dealing with ambiguity and communication. If you feel the requirements of your task aren't clear, then go to the person who made the task and clarify them. What's the alternative, exactly? Staying silent and waiting? Wasting time implementing the wrong solution?
I've already explained simple and obvious scenarios where the context can impact the candidates in sub-optimal ways.
If someone says, "Solve this" and the candidate attempts to do so, but that's not what was really expected? That's trickery (and foolish). On the other hand, if the interviewer wants "Here's a problem, let's discuss..." then *that* is what the interviewer should lead with.
One of my teams was very surprised that I rejected a PRD for being too vague and put my foot down that it would not be picked up until specific questions were answered. They were, I would not say meek, but resigned to the inevitability of product managers pushing poorly-thought-out PRDs and not having any say in the matter.
I took my time training them to say no and spot ambiguities, and I like to think they have all become better developers and product managers for it. I always pick questions that have more than one obviously correct interpretation to see if the candidate notices it.
The idea that you trust your interviewer to provide a directly actionable question is strange. I would expect that in campus hirings, and for entry level fresh grads, but not to anyone with even a year of experience. At senior levels, it becomes more and more important to spot ambiguities and clarify them before they result in misunderstandings, wasted efforts, and worse. I trust a good interviewer to have a question that can provide them useful data points on candidates experience, skill, thought process and attitude.
Whether spotting ambiguities in the question has any correlation with future performance is harder to answer, but methodical people with attention to detail are preferable to the alternative.
If the candidate comes from an environment where questions are penalised, they would be a bad fit for a team that values and expects questioning. It is somewhat unfair, but either way, the interviewers are selecting for their preferred qualities.
I do agree the real job has to deal with ambiguity. But this is the interview; the interviewee goes in with a completely different mindset. "Playing games" to see how they respond... that's for the likes of the NSA, CIA, etc.
"We can't find good candidates"? Nah. Your hiring process sucks. Get a mirror.
What’s the old xkcd: “communicating poorly then acting smug when you’re misunderstood is not cleverness.”
I think the hard part is recognizing that a problem is ambiguous. Telling someone that a problem is ambiguous kind of defeats the point. IMO, it’s not about reading someone’s mind, but recognizing that there are multiple interpretations to what somebody has SAID. That seems less like a mind-reading technique and more like, you know, a communication skill.
I have gotten lots of ambiguous problems during my career, it seems only fair to have them appear during an interview.
Great candidates treat software like a business with changing requirements and code that is read by multiple people, poor candidates treat software like a math challenge where the only goal is to use as few resources as possible.
This is what happens with Codeforces/ACM-ICPC. Suddenly everyone is hyper-driven to crank problems out day after day under pressure, and much more interesting areas, like databases or turning a business idea into a usable app, get neglected.
Let's not fool ourselves: no one is going to solve a hard or medium LC problem under time pressure in 30 minutes unless they've seen a similar problem before, which leads to hiring worse engineers who pass interviews.
Once a metric becomes the goal it ceases to be a good metric.
Nothing about O(blah), though; that's way too specialized, deep in optimization land.
that said, coding != thinking..
Some people cannot think of a solution at all, 50% go for 3 loops one-inside-another, copy-pasted 2x2 times, a few do the simplest 2 maps with 2-3 funcs, and once a "master" invented a class with quite a few methods containing copy-paste inside :/ Probably one could do some analysis of who goes which way and why, but I never bothered.
It is also a good ground to show the different ways to add key-value to a map if it's not there - and their pros/cons (one can hook some O(blah) stuff here. i don't. There's just simpler vs faster, with variants).
And yes, having an (subtly) unclear requirement (that should be identified and asked about), is important part of the learning - that's 90% of cases in life.
Personally I prefer something like fizzbuzz which is a pure code question, it applies to candidates of all levels and tells you if they can reason through problems.
Yet he still thinks it is not a tricky question.
But great article and I learned something from it.
There should be a process for clarifying tickets with the team, because you don't just drop tickets on developers "out of nowhere" and expect them to ask questions, especially if the company culture is "drop the ticket on the dev and let him figure it out".
Someone has to either write the ticket in enough detail that you can hand it to a dev without them needing to ask questions, OR hold a refinement session where the team asks questions, so you get input from multiple people, because a single person is not able to understand all the details of the system.
Here's my reasoning for the problem: we have 2 files, one for each day, and we want to see who came both days on different pages. That's a job for join, with some sort|uniq in the middle.
- for each page, we want unique CustomerId -> PageId "mappings", but in UNIX land that's just 2 columns on the same row:
cat dayX | awk '{print $3, $2}' | sort | uniq
- now I have two lists, I join them join day1_uniq day2_uniq
this gives me, for each customer, its id and then all its pages on the same line. Customers who came only on one day are not in the output.
- now I want to see if there are at least 2 different pages. There's no easy UNIX way to do this because it's all on a single line, so we'll use awk to build a hashmap. We don't need a map of all pages; we only need to see if there are at least 2 different ones
cat both_days | awk '{delete pages; for (i = 2; i <= NF; i++) {pages[$i] = 1; if (length(pages) == 2) {print; next}} }'
(Note: length() on an array and delete on a whole array are gawk, not POSIX.) Result: a list of all customers having visited at least 2 different pages across both days. Everything is dominated by the initial sort. I haven't run this; it's a rough draft.
What is the problem with using later job-interview stages as validation for early stages, and what would be a better metric for validation?
but on a serious note, the good solution is sort of obvious, and in general you encounter much more interesting problems on a daily basis. But then again, I am working in Clojure, where working with data structures is much easier and more straightforward than in Java.
> Did I mean 2 unique pages per day or overall?
The author needs a course in logic. "They visited at least two unique pages." is not ambiguous. Visiting page A on day 1 and visiting page B on day 2 makes the sentence true.
Interviews are just not the same thing as real-life requirements gathering, so people's thought processes will not be the same. Even if you try to roleplay as someone who doesn't understand how to state requirements precisely, all of the normal procedures and thought processes for sussing that out are compromised by the interview setting. And your ability to assess those processes is compromised by how familiar you are with the question and how much you've deconstructed it.
It's not the time to be tricky (though the author somehow simultaneously believes it is and is not a trick question). There are so many more interesting things that could be gleaned from a person's interview performance that "is this interviewer playing fuck-fuck games with me" doesn't rate whatsoever.
The user has visited A and B on day 1, and A on day 2. So the total page hits is (A, B, A). Remove duplicates and you have (A, B) which makes the sentence true.
Imagine he said:
> They bought at least two unique products
What would you expect the requirements to be?
But that is purely a cultural convention. O(n²) is great for many important problems. For example, parsing a sentence in natural language is Ω(n³); getting it in O(n²) would be evidence that the candidate was a deity.
Why select for familiarity with the details and conventions of the interviewing process? How is that supposed to be helpful?
> Candidates then switch it around to have CustomerId as the Key, and PageId as the Value of the Map. But that’s not particularly great either because it overlooks the fact that you can have many pages per customer, not just one. Some candidates have the intuition that they need a Collection of pages as the Value of the Map
But this is wrong. You're completely fine having a map from CustomerId to a single PageId, because the problem statement specifies that you're collecting customers who have visited more than one page. If you process a record that says CustomerId = X, PageId = Y, and you look up CustomerId in your map, these are the possibilities:
1. map[CustomerId] has no entry. You write Y as the new entry for that CustomerId.
2. map[CustomerId] is already Y. You do nothing.
3. map[CustomerId] has an entry that is not Y. You now know that CustomerId represents one of the customers you're trying to find.
At no point did you need the map values to represent more than one page.
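The three cases above sketch out like this in Python (names are mine; this checks only the "≥2 unique pages" criterion over the concatenated records, with the both-days condition assumed to be handled separately):

```python
def multi_page_customers(visits):
    """visits: iterable of (customer_id, page_id) records."""
    first_page = {}   # customer -> the single page we've seen for them
    multi = set()     # customers known to have >= 2 unique pages
    for cust, page in visits:
        seen = first_page.get(cust)
        if seen is None:
            first_page[cust] = page   # case 1: no entry -> write the page
        elif seen != page:
            multi.add(cust)           # case 3: a second unique page
        # case 2: same page again -> do nothing
    return multi

visits = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
print(sorted(multi_page_customers(visits)))  # ['a']
```

The map never needs to hold more than one page per customer, exactly as the comment argues.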
> The condition “customers that visited at least 2 unique pages” tends to be a little harder for candidates to get right, so if they’re stuck I throw a little hint: you have a Set of pages from Day1, and a single page from Day2… how can you determine that this is at least two unique pages?
> Poor candidates will loop through the elements in the Set to check if the page from Day2 is in there. This turns your O(n) algorithm into O(n²) again. The number of candidates who have done this is surprising.
> Better candidates will do a .contains() on the Set which is an O(1) operation on a hash set. But there is a catch with the logic.
> The intuition to get this right is this: If you are inside that If loop and the customer visited at least two pages in Day1, and they visited any page in Day2, they’re loyal, regardless of which page they visit in Day2. Otherwise, they only visited only one page in Day1, so the question is: is this a different page? If so they’re loyal, else it’s a duplicate so you don’t know and should keep going. So your If statement has an Or:
> [code sample involving the interviewer's terrible solution]
> There’s a need for attention to detail, like using “>” instead of “>=” or missing the “!” in the second statement. I saw these fairly often. I didn’t worry. Great candidates spotted them quickly as they double-checked the algorithm when they were done. Good candidates spotted them after a little bit of hinting. That gave me a good signal on debugging skills.
Why in the world is this presented as a desirable solution? You have a Set of visited pages from day 1 and a single visited page from day 2. You want to know whether the total number of visited pages is more than 1.
Add the page from day 2 to the Set [O(1)], and then count the Set [also O(1)].
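In code that's just the following (Python sketch; it mutates the day-1 set, which is fine if the per-customer check is the last thing done with it):

```python
def is_loyal(day1_pages, day2_page):
    """day1_pages: set of pages seen on day 1; day2_page: the page
    just seen on day 2. Loyal iff the union has more than one page."""
    day1_pages.add(day2_page)    # O(1) amortized
    return len(day1_pages) > 1   # O(1)

print(is_loyal({"home"}, "home"))     # False
print(is_loyal({"home"}, "pricing"))  # True
```

No branching on which page came from which day, and no `>` vs `>=` traps to debug.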
> What if you pre-processed the files and sorted them by CustomerId, then by PageId?
> If the files are sorted, then the problem is easier and it’s just a two-pointer algorithm that you can execute in O(n) with O(1) of memory.
> Since the second sort key is by PageId, you follow another two-pointer algorithm to determine that there are at least two unique pages. So it’s a 2-pointer algorithm within a 2-pointer algorithm. It’s kind of a fun problem! I’ll leave the actual implementation as an exercise for the viewer.
> If you want to make the problem even more interesting, you can add a third file. I will leave that as an exercise for the reader as well!
If you're preprocessing the files, why not concatenate them before (or while) sorting them? The asymptotic resource requirements are the same and you end up with one file that can be processed in O(n) time and O(1) space. (Though the result of the algorithm necessarily takes up O(n) space, so I'm not sure how much this should count as an improvement in terms of space requirements...)
This additional preprocessing step makes the generalization to three files trivial. The algorithm is identical: concatenate the files, sort the monofile, and then walk through the sorted entries.
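A sketch of that concatenate-sort-walk version (sorting in memory here for brevity, whereas the article's variant assumes the files are pre-sorted externally; record shape is an assumption):

```python
def loyal_customers(day1, day2):
    """Concatenate, sort by (customer, page), then walk once through
    the runs of records belonging to each customer."""
    records = sorted([(c, p, 1) for c, p in day1] +
                     [(c, p, 2) for c, p in day2])
    loyal = []
    i, n = 0, len(records)
    while i < n:
        cust = records[i][0]
        pages, days = set(), set()
        while i < n and records[i][0] == cust:   # consume this customer's run
            pages.add(records[i][1])
            days.add(records[i][2])
            i += 1
        if len(pages) >= 2 and len(days) == 2:
            loyal.append(cust)
    return loyal

day1 = [("a", "x"), ("b", "x")]
day2 = [("a", "y"), ("b", "x"), ("c", "z")]
print(loyal_customers(day1, day2))  # ['a']
```

Adding a third file really is just another list comprehension tagged with day 3 (and `len(days) == 3`); the walk itself doesn't change.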