To answer the question though, I think probably writing a robust web scraper to search event listings and turn them into a sharable calendar. It'd be trivial these days, but I did it in 1999 in Perl with regexes.
Hah. Always. Hindsight bias and impostor syndrome are a fun mix! I remember writing a blog suite (with comments!) in Perl in the late 90s; back then, without S.O. and other knowledge-sharing beyond some Usenet forums, we invented the wheels as we went along... it was all hard.
Turns out the bias current node (external RBIAS resistor sets bias current) for PCIe was routed too close to an inductor for a power rail. When the CPU was warm, the power rail pulled more current, causing the inductor to ring more, causing the crosstalk on the bias net to screw up the PCIe subsystem and hang the CPU.
Found the issue accidentally on a layout change. Had to prove it by drilling out the via and re-routing the signal with wire.
Then the next hardest problem I found was implementing Genetic Programming in Python (year 2005) http://paraschopra.com/sourcecode/GP/index.php
It was fun but extremely hard for me (at that age!).
After that, in 2008, I think the trickiest part for me was writing the initial visual editor for Visual Website Optimizer. It involved a reverse proxy and inserting JavaScript code into the reverse-proxied page, letting the user visually edit the page contents.
Fun days. These days I hardly get to code, though last year I gifted my wife a website (http://wowsig.com) which was super fun.
Along the way I'm pretty sure I also figured out how to build SSA form such that your alias analysis results are available to be used at SSA construction, avoiding redundant computation [2]. I never got to chase that down, but it was really interesting.
[1] http://paulbiggar.com/research/#phd-dissertation, esp chapter 6.
* A test suite we wrote for a client's project before a massive refactor was stalling randomly, but would continue when you tried to diagnose the problem. Turns out their user creation code used /dev/random, and the system was running out of entropy, so the code was blocking. Moving the mouse or typing on the keyboard would add entropy and cause the tests to resume. The fix was to use /dev/urandom for tests.
* Found a weird issue with an embedded network stack where a limited-broadcast packet to more than 3 devices would get responses from only a few of them, but directed packets to each device would work fine. Devices reported successfully receiving and transmitting when monitored over a serial console. The issue turned out to be a bug in the ARP implementation: it would incorrectly store any ARP response it saw (rather than only the responses the device had requested). Since the embedded system had a small ARP cache due to memory constraints, when multiple devices wanted to respond they would all send ARP requests, and the flood of responses would flush the ARP cache; when the network stack then wanted to send its response, it didn't know what MAC to use and just dropped the packet on the floor. A workaround was to increase the ARP cache size.
Funny how this goes completely against the typical operant conditioning a user undergoes when working with computers. Usually if your software hangs, you want to touch nothing and let it finish. But in this case it's actually additional user activity that's needed.
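The /dev/random stall above is easy to avoid in test code. A minimal sketch in Python (the function name is mine; `os.urandom` draws from the kernel CSPRNG and never blocks waiting for entropy, unlike reads from /dev/random on older Linux kernels):

```python
import os

def make_user_token(nbytes=16):
    # os.urandom reads from the non-blocking kernel CSPRNG
    # (the /dev/urandom pool), so tests never stall on entropy.
    return os.urandom(nbytes).hex()

token = make_user_token()
```

For test fixtures this is exactly what you want: unpredictable enough for uniqueness, with no dependency on the system's entropy estimate.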
http://www.quora.com/Whats-the-hardest-bug-youve-debugged
My favourite answers:
Crash Bandicoot: http://www.quora.com/Whats-the-hardest-bug-youve-debugged/an...
Flash Player: http://www.quora.com/Whats-the-hardest-bug-youve-debugged/an...
500-miles email: http://www.ibiblio.org/harris/500milemail.html
Eventually we found that the internal JTAG pull-up resistances would not be sufficient at certain voltages/temperatures. So it wasn't latch-up in the end; the JTAG would halt the processor.
We only found it after days of testing in an oven cycling temperature while stimulating a coil (RF field) close to the device while varying the supply voltage to cause the condition.
All the while the client was not happy that his devices randomly stopped working, so we were under quite a bit of pressure.
And of course we only started looking at the hardware after we spent quite a bit of time thinking it was a software bug somewhere.
My problem was JavaScript logic failing to fire, and the answer was to wait for document load. Simple, I know, but I had nobody around me (physically) who could help, and explaining the problem to people in forums seemed impossibly abstract, primarily because I did not understand what the problem was. It was the context around my code that I had to fix, but I kept looking in the code itself.
That was probably one of my first big "ah ha!" moments and these moments are one of the reasons I still love programming. Tenacity, luck and skill became irrevocably connected that day.
I've solved many other problems over the years, far tougher than this one, but maybe never tougher for me in a relative scale. If I had never solved that problem, I sometimes believe my life path would have been totally different.
Nowadays most of the problems I face have been solved by someone else in a slightly different context and searching for/implementing existing solutions is almost trivial.
The formula is how much protein, carbs, and fat to eat, plus an appropriate exercise regimen of three half-hour workout sessions a week. No supplements or anything else; just food and small amounts of exercise to stimulate a hormone response. This is way more complex than some tracker or calorie counter: it takes into account insulin spikes, metabolic damage assessments, glycogen storage, and much more. The hard part was integrating ~10 different disciplines across the sciences. Everyone had a piece of the puzzle, but we had to put it together.
That is then fed into an app that picks foods for you based on your formula, which is then constantly refined based on your results. We took 1000 people through test runs, tweaking our code to get it right. Now it works for everyone we put on it who actually uses the system.
Our next challenge is the psychology and habit forming parts of the app we have built.
Oh, and of course competing with well-funded competitors in the space, but at least nobody can claim our results, because they just track things instead of letting people really plan for health.
Edit: Since you asked, it's called mPact (for metabolism impact) and the corporate site is at http://mPact.io
For me, the first hard thing was implementing this https://en.wikipedia.org/wiki/Dead-end_elimination#Generaliz...
Probably because it was the first algorithm I implemented with no reference implementation to look at.
The second hardest was a high-performance proxy that could redirect to another proxy and collect specific types of non-encrypted data.
Didn't make it into my first paper, hopefully will end up in my thesis :)
At the time it was given to me it was a rough demo with no clear path forward to shipping. We had no metrics to tell how good it was, how good it had to be, or whether we were even making progress. We had no team of computer vision experts to work on core algorithms. We had no idea if the problem was solvable at any amount of power consumption. There were more than a few people within the company who thought it couldn't be done.
I want to be very clear about credit. I put this as the hardest thing I have ever done but I was only the manager in charge of the project. While I built the team and owned the problem, I did not write the code or design the algorithms. I had incredible people who did outstanding engineering work and researchers who advanced the boundaries of computer vision. It was a privilege to work with them and I am proud of them.
Back in the CRT monitor days, I was working for a computer repair company. There was this particular client (in the defence industry) whose monitors started flickering and developing a greenish hue at their sides after a week.
Every week, we had to go to his office to swap the monitors and bring the faulty ones back to recalibrate (it was costly, but hey, it's a Defence contract and those pay big bucks).
It didn't matter whether a monitor was brand new or a recalibrated one; it just started flickering and developed a greenish hue after a week, and it only happened in that room. Other monitors outside that room and on other levels were fine, so the room was dubbed the Poltergeist Room (as they blamed spirits for messing with it).
One day after the monitor exchange, I returned to the office and my supervisor asked why I hadn't replied to his multiple pages (we were using pagers back then). I realised I had been in the Poltergeist Room when the pages were sent and therefore never received them. It then dawned on me: "Could it be some electromagnetic interference from the level directly above or below playing havoc?"
I went back to the client the next day to tell him what I thought and he (being electronics trained) realised that above him was a defence lab carrying out EMF experiments, which could have caused the monitor problems. He got to work to build a simple Faraday cage to prevent EMF from getting to the monitor. Since then, the monitors worked perfectly.
I had encountered some seriously incorrect outputs from the application server. The output in question was a function of internal state and the current time (rounded to hours; it was a kind of "hourly" display). The application server was set to log many input/output pairs, so I was able to identify a non-trivial number of such errors, but I was unable to determine the cause. Common causes like memory corruption, time zones (the business logic heavily depended on local time), NTP synchronization, and even an interpreter bug were considered and rejected. Finally, after two weeks or so, I tried simulating the function with varying current times and fixed internal state, and surprisingly a portion (but not all) of the past output matched the observed output!
It turned out that glibc's `localtime` can misbehave by ignoring the local timezone when it is unable to read `/etc/localtime`, and the Linux box the server ran on had some issue reading that file (I never fully identified it; this read was probably the only disk I/O from that server anyway). In light of this finding I exhaustively inspected the past logs; it turned out the gross error rate was on the order of 10^-4 (!), and the way `localtime` was used meant the error could only alter a portion of the output. Studying the glibc code revealed that setting the `TZ` environment variable would disable the UTC fallback, so I did, and the error was gone.
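The `TZ` mechanism is easy to demonstrate from Python on a Unix box. This is a sketch of the mechanism only, not the original server's code; `time.tzset()` makes the C library re-read `TZ`:

```python
import os
import time

# Force an explicit timezone so localtime() never has to fall
# back to reading /etc/localtime (the fallback path that bit here).
os.environ["TZ"] = "UTC"
time.tzset()

# With TZ=UTC, local time and UTC agree for any epoch value.
epoch = 0
local = time.localtime(epoch)
utc = time.gmtime(epoch)
```

Pinning `TZ` explicitly in the service's environment sidesteps any dependence on the file read succeeding at runtime.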
Lesson: Learn your moving parts, even if you don't know them in advance.
Edit: Whoops, I realized this was ambiguous. I was using an iPad camera to track it and displaying the result as well as using the detection to trigger a camera shutter.
I was happy with this, until a friend challenged me to make it realtime. I managed to re-implement the same thing by using the built-in vector drawing (and as a bonus, this also gave anti-aliasing) and managed to get this down to 15ms.
The third version was using the 3D acceleration, and managed to get 100 lights to render in realtime. Was pretty proud of myself and I wrote an article about it, which was cited a few times by different people.
We started Friday after normal working hours by checking that the backup worked (it did), then proceeded to upgrade the server with a RAID backplane and three new SCSI disks, installed Windows NT, installed the backup software, and started a restore while getting some takeaway.
The restore only took about 15 minutes, and to our horror we discovered that the previous IT admin had set it up to do an incremental backup to the same DAT tape, overwriting it every day!
OK, no worries, we had not used the old disk, so we installed it and turned on the computer... Nothing happened... Strange. We removed the RAID backplane, installed everything as it had been... Still nothing.
After 24+ hours working on the problem, including several hours talking to Compaq support (best support ever!), we had to go home for some sleep. When I got back to the server room I fired up Norton Disk Editor and painfully figured out that the MBR on the disk was all zeros; luckily the rest of the disk looked like correct data!
Several hours later, just before Sunday turned to Monday, I finally got an MBR written using NDE and NDD, booted the system, and saw everything was all right.
On Monday we told the customer we'd had some problems and would do the upgrade another day (after we had taken multiple backups :)
But... I'm taking on the biggest challenge right now: I'm coding my Onyx database client idea for the third time. The hardest problem was to start o3db. I failed badly with Onyx 20 years ago, burning out while holding over half a million lines of C++ in flow together with nearly 10k lines of my own 4GL, during the third customer installation of Onyx. I was very shy of coding UI/UX afterwards and escaped deep into server stuff and machine learning, as far away from the user as possible.
So, my biggest challenge was to start Onyx again: a user-facing UI/UX for common business database applications with its own fourth-generation language. I've decided on Scheme as an intermediate language this time, and the prototype is running well. I now have a non-recursive Scheme interpreter and GUI running in the browser, able to process the meta tables that define an application. It's still a long road to my vision. But to restart a project I had failed at with burnout 20 years ago, and to code it with current technology, was the biggest personal challenge.
/join #o3db on freenode, if interested in a startup to create common business database clients for the web.
Tweak -> Compile -> build deployable package -> push to phone -> wait 6 minutes -> test on phone -> repeat....
"Aha!" you say. "You're smashing the stack! Function b() is writing outside its stack frame."
But function b() was provably not doing that.
Function b() called msgrcv(), which has a very badly designed API. It takes a pointer to a structure, and a size parameter. The structure is supposed to be a type field (long), and then a buffer (array of char). The size parameter is supposed to be the size of the buffer, not the size of the structure. The original code that implemented this came from a contractor, and they made the very natural mistake of passing the size of the whole structure as the size parameter. This meant that an extra long was read from the message queue, and smashed the stack.
But that should mess up the stack frame of function b(). How did it mess up a variable in function a()? Well, the compiler put that variable in a register, not on the stack. So when b() was called, it had to save off the registers it was going to use, and a()'s local variable wound up in b()'s stack frame.
It took me most of a month, off and on, to figure that out.
- I worked out, on pen and paper, sorting networks on my own a few years before the Wikipedia article on them existed. I was looking for shortcuts in a Quicksort implementation. I hadn't read Art of Computer Programming yet, which is probably the only other place I would've been likely to read about it. It hadn't been covered in any of the other programming literature that I was devouring at the time.
- I wrote a variable interpolator in COBOL. COBOL has no string operators or anything resembling a string data type. This one was tricky. I was working as a programmer/operator at a school district at the time and the central hub of their IT was a Unisys mainframe that ran COBOL and WFL. There weren't any punch cards anymore, but everything ran as if there were; for any given job to run, say, report cards, you had to go into the WFL job and edit a two-digit school code in half a dozen places, in "digital punch cards", which would then be fed one after the other into COBOL programs. This was error-prone and I wanted a way to define a couple of variables at the top of the job file and then have everything work after that.
- I worked for a BigCo that used Remedy for its internal support systems. There were some latent training issues in the internal support department and support requests kept getting modified by unknown people, which would cause the requests to get mishandled and would irritate various other departments. I found a way to sneak some code into the Remedy forms system and I cobbled together a very rudimentary communications protocol between several forms so that all changes to any form got logged to another form, along with the user's id. Remedy had no loop logic at the time. That actually made it to a Remedy developer's group mailing list once and I was a big fish in a very tiny little puddle for a day.
- I reverse-engineered portions of the .dbf format that FoxPro uses, and wrote software that could convert .dbf files into MySQL tables. The date format was tricky. It was an 8 byte field where the first four bytes were a little-endian integer of the Julian date (so Oct. 15, 1582 = 2299161), and the next four bytes were the little-endian milliseconds since midnight. This is not documented anywhere.
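For the curious, the date layout described above decodes in a few lines of Python (a sketch based purely on that description; the function and constant names are mine):

```python
import struct
from datetime import date, datetime, timedelta

# Offset between a Julian Day Number and Python's proleptic
# Gregorian ordinal (JDN 2440588 == 1970-01-01 == ordinal 719163).
JDN_TO_ORDINAL = 1721425

def decode_foxpro_datetime(raw):
    # Two little-endian 32-bit ints: Julian day, then ms since midnight.
    jdn, ms = struct.unpack("<ii", raw)
    d = date.fromordinal(jdn - JDN_TO_ORDINAL)
    return datetime(d.year, d.month, d.day) + timedelta(milliseconds=ms)

# Oct 15, 1582 (JDN 2299161) at noon (43,200,000 ms since midnight):
sample = struct.pack("<ii", 2299161, 43_200_000)
decoded = decode_foxpro_datetime(sample)
```

The anchor date matches the one given above: JDN 2299161 is the first day of the Gregorian calendar.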
Those are some of my favorites anyway. 30 years of programming, there's been some fun stuff along the way.
Documentation for the tools available seemed to varyingly assume that you either a) understood IPSec well enough and only needed to know how to use this one tool, or b) knew everything you needed to know, minus a few hints on the syntax of individual files.
Eventually I got everything working, but performance was abysmal. Sometimes. Sometimes SSH sessions opened instantly. Sometimes they opened slowly but then worked fine afterwards. Some tools were awful and others worked okay.
Eventually I realized that the IPSec configuration set up two tunnels to Amazon, but only set up actual routing (defining endpoints) for one of them. Thus Amazon was load-balancing packets over both tunnels and my Linux implementation was dropping 50% of packets. For established TCP connections this was fine because we had basically zero latency to VPC so retransmits (for what we were doing) were almost free since they would be discovered when the next packet arrived successfully, but for SYN/ACK packets a drop would result in an annoying wait.
Unfortunately, the tools don't allow you to define redundant/overlapping routes, so I couldn't set up two tunnels; I had to just configure one tunnel and leave the other one down so AWS wouldn't try to send data over it, and then just hope that that endpoint didn't go down at an inopportune time before I'd either set up some kind of load balancing scenario on my internal network (internal BGP maybe? ugh!) or given up entirely on the project.
After weeks of working on this specific task (the VPN setup) and making literally zero progress some days, googling for literal hours with no useful results, and trying various permutations, when I got it working I felt like I was the only person on the planet who'd ever done this before, since I was pretty sure that no one on the internet had ever written about it at least.
Even though the project was ultimately scrapped, I still feel like I learned a lot, and maybe I should feel like it was wasted time, but it also felt like quite an achievement to succeed.
Years ago I came up with a simple equation for determining priority of software engineering bug fixes and small features:
Priority = (Benefit the feature provides to the product) / (Time to complete the feature)
where benefit is defined by the business side using any scale (say 1-100), and time to complete is defined by the assigned software engineer using any unit (perhaps man-hours). Regardless of what range the numbers fall in, 0 to 1 or 0 to 42, you end up with an ordered list of tasks which equally value business value alongside engineering time.
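The whole scheme fits in a few lines; a sketch with made-up task data (benefit on a 1-100 scale, time in hours):

```python
def prioritize(tasks):
    # tasks: list of (name, benefit, hours).
    # Highest benefit-per-hour first, per Priority = Benefit / Time.
    return sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)

tasks = [
    ("export button", 30, 2),   # ratio 15.0
    ("dark mode",     80, 40),  # ratio 2.0
    ("login bugfix",  95, 5),   # ratio 19.0
]
ordered = prioritize(tasks)
```

Note the units cancel out of the ordering entirely, which is why the two sides can score on whatever scales they like.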
I came up with this while working at a medium sized company. I was frequently tasked with too many things to do. Despite tasks being organized in a Redmine-like tool, the implementation was still done in random order because nobody could define priority. This led to much miscommunication about what I was working on in the recent past and future. I used the equation to better communicate my activity and future plans with the business side. Given an ordered list of tasks from this equation, anyone could see clearly what was being worked on next.
The business side resisted attaching a numeric benefit to the features, presumably because that's hard. But it's equally hard to define the time to complete a software engineering task, and I eventually convinced them we needed to at least try to be scientific about both.
n.b.: I used this while working on a mature system. For a newer project or for tasks with more dependencies, it's probably still complicated to define priority. In the setting I was in, it worked great.
My boss's boss however thought it was condescending and nobody aside from myself ever made use of it. I hope to make use of it again one day, but after one bad experience with a medium sized company, I've stuck with smaller places where this is not as necessary.
A few years ago I was contracting for a company that had a Native American Casino as a client. They wanted to build a gamified app/site to engage their customers more.
The single hardest problem was trying to look at the situation from the player's point of view. Gambling like this (slot machines) is inherently an illogical thing to do: they know they're never going to make back the money they put in, but they walk away with a smile night after night.
Trying to rationalise it (so we could understand their goals and what they might want out of an app/site targeted at them as players) proved impossible for basically everyone on the team.
It did go live eventually but I don't think it's taken off as they hoped.
I am currently working on my thesis in artificial intelligence which to me seems tough because I have never written a thesis before. However, at work, I am dealing with technical software engineering problems that will seem easy after I have solved them.
My first industry project involved creating a generic form builder which could ultimately be used as a survey tool to draw statistics from. This seemed extremely challenging at the time, but now that all of the design decisions have been made and the complexities solved, I could redo it pretty easily (even though we shouldn't reinvent the wheel).
Good thought provoking question though! Thanks!
I left this company long ago but they appear to be going strong still. http://www.giraffic.com/ . I'm sure they improved on that work a lot since then.
Since you asked, one cool part of that technology is that the order of received packets is not important for assembling the stream. Basically every 1 second of video is reassembled without regard to the order of packets received. You need N packets to assemble a "data frame"; IIRC pending incomplete data frames were stored in a simple hash table, but honestly it was so long ago I don't remember.
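That order-independent reassembly can be sketched with a plain dict keyed by frame id (all details here are mine; the original implementation surely differed):

```python
class FrameAssembler:
    """Collect packets into fixed-size data frames; arrival order is irrelevant."""

    def __init__(self, packets_per_frame):
        self.n = packets_per_frame
        self.pending = {}  # frame_id -> {seq: payload}, incomplete frames only

    def add(self, frame_id, seq, payload):
        chunks = self.pending.setdefault(frame_id, {})
        chunks[seq] = payload
        if len(chunks) == self.n:
            # All N packets present: emit the frame in sequence order.
            del self.pending[frame_id]
            return b"".join(chunks[i] for i in range(self.n))
        return None  # still waiting on packets

asm = FrameAssembler(3)
asm.add(7, 2, b"C")          # packets arrive out of order
asm.add(7, 0, b"A")
frame = asm.add(7, 1, b"B")  # third packet completes the frame
```

Because each payload is stored under its sequence number, the join at the end reconstructs the frame correctly no matter which packet arrived last.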
Of course, the real solution would be to press the dependency provider to release an x64 version, but we were not a priority of theirs.
The problems I failed to solve are, of course, the ones that seem the hardest. I tried to write a cluster manager / distributed OS by myself, starting almost from scratch, and that was too much. I spent upwards of 4 years on it and had some success, but I'm starting to move on.
In particular, I learned that having a reasonable amount of security with reasonable amount of development effort in a distributed system is still an unsolved problem. It's basically a bottomless pit of work.
TL;DR: Two-level system defects are a 20-year-old unidentified noise source that can be described by an oxygen atom spatially delocalising in an amorphous portion of the underlying circuit.
See http://dx.doi.org/10.1103/PhysRevLett.110.077002 and http://dx.doi.org/10.1088/1367-2630/17/2/023017.
I think React Native is super cool right now, if it's going to make nice UI easier and have good multi-platform support.
Basically it is one of those unfortunate cases where the first weeks make everything look really promising (single codebase and all), and it is only after several months of hard work that you realize there is no way you are going to win this battle.
The way I did this was by using NLTK to compare the hypernym paths of the words in the query against the hypernym paths of the category names. I wrapped it in a tiny Flask app and it was surprisingly fast enough for an MVP.
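The core comparison reduces to scoring overlap between hypernym paths. A toy sketch, with paths represented as plain tuples of strings in the same root-first shape that NLTK's `synset.hypernym_paths()` yields (the example words and helper names are mine):

```python
def path_overlap(path_a, path_b):
    # Hypernym paths run root-first, so a shared prefix means
    # shared ancestry; score by the length of that prefix.
    score = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        score += 1
    return score

def best_category(query_path, category_paths):
    # Pick the category whose hypernym path shares the most ancestry.
    return max(category_paths,
               key=lambda c: path_overlap(query_path, category_paths[c]))

categories = {
    "electronics": ("entity", "object", "artifact", "instrumentality", "device"),
    "clothing":    ("entity", "object", "artifact", "covering", "clothing"),
}
query = ("entity", "object", "artifact", "instrumentality", "device", "phone")
match = best_category(query, categories)
```

In the real thing you'd get the paths from WordNet (e.g. `wn.synsets(word)[0].hypernym_paths()`) and probably aggregate over multiple senses, but the ranking idea is the same.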
Back in 2002 I was writing a floppy disk driver for the little OS I was writing with a friend. It turned out that finding anything other than very sparse documentation was really hard; plus, for some unknown reason, the floppy drive's behavior seemed non-deterministic. Maybe the fact that I was 15 didn't help.
At some point, after many nights spent debugging it, it just worked. I still don't know why. I never changed a line of the code after that moment, for fear of breaking it.
- Anything where you're looking for a race condition. It tends to be hard to reproduce, and instrumentation can make it go away entirely, leaving you with a need to conjecture about what might be happening. Quite satisfying when you find it, but again, because it's rare, you don't know if you've really solved it.
- Built a cross-platform, cross-language messaging system for trading. Combined UDP and TCP, had detection of downed servers. A lot of fiddling with network stuff, performance optimization on all platforms, both VM and native.
This was the 90's. It was surprisingly hard to implement this in a workable, reliable, secure way, and no one in our company of 50 programmers had ever done such a thing before!
I recall being puzzled for way too long at how to prevent someone from coming to a browser that had just logged off from our web app, and clicking the back button a couple of times to be logged in again.
Now, of course, it is a common and easy-ish task.
There's a very big difference between concept and working code. :)
With cars: troubleshooting and fixing a Ferrari 599 without the required factory diagnostic computer. You can't beat a multimeter and some elbow grease. It was a faulty flow meter.
In general: figuring out what to do with my life. It took me a bit but was worth the time. Now I can focus on doing that and just that.
In the end I found out that I had managed to write a crack for the game by accident :D Later on I inspected a crack from another team; it patched the same regions!
And yes, I know a lot of what I just typed will probably put real game programmers' teeth on edge.
It turned out people were running the SLA battery completely flat repeatedly, and subtly "wearing out" the battery.
It was the early days of SOAP, and I had been assigned the task of integrating my employer's software with a third party's, so that the applications could share data. This third party org was a wealthy, powerful mega-corporation; and my employer was, well, not. The third party produced a spec for the interface, expected us to follow it, and offered no help from there.
I built a solution. It worked on my machine. Solved the problem. All was right in the world.
I moved it to the test environment. It worked again. Demoed it for one of our customers, and everyone was pleased.
Deployed it to our first beta tester. One lonely employee working accounts receivable, tucked away in the corner of our customer's office.
It crashed.
I checked everything. I mean everything. There are still particulars of that little Windows 2000 workstation that I can describe vividly. Which programs were installed, which patches were installed, how Windows had been configured, how the firewall worked, I even got permission to install a packet analyzer. My employer only had a handful of customers, and the beta test machine was near our offices, so I was over there personally a lot over the following weeks.
We brought in the customer's network support people. They found nothing. They could see the packets leaving, and an error coming back, but couldn't offer more than that.
We brought in the best networking engineer in my company. He was stumped.
What really shook my confidence was knowing that competitors of mine had gotten this interface working. This wasn't some half-baked project I could blame on someone else. Others had succeeded where I'd failed.
I practically had to walk across broken glass to get on the phone with the third party's development team, but with enough pestering I pulled it off.
The phone call involved me sitting at the beta test workstation and firing off a request so that they could view it hitting their servers live. The developer who I spoke with immediately spotted the problem.
You see, when you send a SOAP request, you send the date and time that you're making the request along with it. The clocks on the client and the server were too far out of sync, my requests appeared to be coming from the future, and so the server disregarded them with a blunt error. Interestingly, the workstation clocks at my company's office weren't too far out of sync, which is why it worked in one place and not another.
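Server-side, that kind of rejection is usually just a timestamp tolerance check. Roughly (a sketch, not the actual third party's logic; the tolerance value is made up):

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # assumed tolerance, for illustration

def accept_request(request_ts, now=None):
    # Reject requests whose claimed timestamp is too far from server
    # time in either direction -- including ones "from the future".
    now = now or datetime.now(timezone.utc)
    return abs(now - request_ts) <= MAX_SKEW

server_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ok = accept_request(server_now + timedelta(minutes=2), now=server_now)
rejected = accept_request(server_now + timedelta(minutes=30), now=server_now)
```

A workstation clock drifted well past the window on one side of the check, and that was the whole bug.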
Stuff I learnt:
1. Third party interfaces require a point of contact at both organisations who can talk with one another. This is non-negotiable.
2. If you send an error message that reads "Error", you're a bad developer and should return your computer science degree to your university and demand a refund.
3. No matter how well written the spec is, something always gets left out.
4. Persistence matters more than anything.
It had three CCD cameras with strip imagers that were combined into a single all-sky image every orbit.
I was given a FORTRAN codebase that dated back to the 70s (supporting functions) and was told to figure out the best way to pick the start and end of the orbit as far as image frames were concerned.
The pointing data was in satellite frame-of-reference quaternions [1], and the satellite orbited about the axis of the Sun-Earth line, approximately.
Approximately was the key. Since it wasn't at a perfect 90 degree angle, the CCD strips each crossed over the plane defined by the Sun-Earth line and the axis orthogonal to the Earth's orbit (I referred to it as "south") at an angle.
So, if you want to stitch together an image of the sky that looks continuous, but the orbit of the imager wobbles a bit, and different discontinuities show up every day, how do you do it?
The leading CCD could be entirely across the southern line when the other CCDs were just starting to cross it. This created a lot of problems with how you define a complete orbit that lacks discontinuities and makes intuitive sense so others can understand the code.
I decided to pick the point where the middle of the central camera crossed the plane as the frame of reference for the start/end point.
Ultimately, this project took me about three months, just to get used to the code base, the spatial coordinates and transformations needed to make sense of the data, and then to finally write the code.
The meaningful changes I made in the commit consisted of about three lines of code.
I found the commit message:
Fixed problem near seam of map where start and end of orbit meet. The orientation of camera 2 at the start of the orbit is now used to draw a reference great circle on the sky. Near this boundary pixels are tested individually to decide whether they are part of the current orbit and should be dropped in the skymap. Introduced torigin to keep track of the time origin for the lowres time map. This is added to the Fits header of the time map as keyword TORIGIN (used to be STIME). Times tfirstfrm and tlastfrm are assigned the time of the first and last frame, respectively, for which at least one pixel was dropped in the skymap. These are written into the main header of the skymap as keywords STIME and ETIME. Added extra extension to lowres maps containg nr of pixels contributing to each lowres bin
[1] https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotati...
My first significant contribution to FOSS was to port OpenOffice.org to work without the then-proprietary Java, so that it could go into Debian main (and other distros with similar requirements). At the time, OO.o took 8 hours to build, or 3 hours with the wonders of ccache, and I was hacking on the build system itself, so incremental builds were often broken. (And the first thing OO.o built was its own implementation of make.) So over the course of a month or so, I would hack on it, rebuild to see it get a bit further, and repeat until it finally built without error. The net result was dozens of patches submitted and merged into Debian and ooo-build, and the 1.1.0-2 changelog entry listed here, which made it all worth it: http://metadata.ftp-master.debian.org/changelogs/main/libr/l... ('The "Wohoo-we-are-going-to-main" release')
The most challenging problems were two different mysterious crashes in BITS (biosbits.org), a Python environment running at the firmware level. Because of the environment, a crash means a sudden unexplained reboot, with no diagnostic information.
First, I was trying to debug a crash in the initial CPU bringup code, which brought the CPU from 16-bit real mode to 32-bit mode. After extensive investigation, including assembly output of characters to the serial port to indicate how far the code got, and hand-comparison of disassembled code with the original, it finally turned out to be a bug in the GNU assembler, mis-assembling an expression with a forward-referenced symbol when in .intel_syntax mode. The forward reference ended up becoming an unresolved relocation (with a 0 placeholder) instead of the intended compile-time constant, resulting in a wild pointer. It was one of the rare instances where the bug really was in the toolchain, combined with an environment that makes debugging a challenge.
The other such bug, in the 64-bit version of the same environment, involved GCC compiling struct assignments into SSE instructions that assume aligned addresses, and GRUB not actually aligning its stack for SSE because it never actually used SSE itself and didn't happen to use struct assignments. Debugging that one involved a quick hack of a general-protection-fault handler that hex-dumped the bytes of code around the instruction pointer, searching for those bytes in the compiled code, and matching that back up with the disassembly and source code.
Most recently, I debugged a race condition in a build system, where disk image manipulation (done by syslinux and mtools) was failing to obtain an flock file lock. The kernel doesn't actually have any way to find out who holds the lock, so I ended up instrumenting the flock syscall to print the conflicting lock holder. Turns out that udev took a file lock on the loopback device as soon as it showed up.
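The conflict itself is easy to reproduce once you know it's there. A second open() of the same file gets its own open file description, so a non-blocking flock attempt on it fails while the first lock is held — the same shape as syslinux/mtools losing the race against udev's lock on the loopback device. A minimal sketch (Linux; `try_lock` is my name):

```python
import fcntl

def try_lock(path):
    """Try to take an exclusive flock without blocking.
    Returns (file, True) on success, (file, False) if someone
    else already holds the lock."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f, True
    except BlockingIOError:
        # EWOULDBLOCK: another open file description holds the lock
        return f, False
```

Without LOCK_NB the second caller just blocks silently, which is exactly the hard-to-see failure mode in the build system.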
The bug wasn't easy to reproduce: all we saw was that, every once in a while, when queried over $wirelessprotocol, the system would begin answering with crap values (it was supposed to measure some physical quantities, and crap values = meaningless, as in negative active power and hundreds of kV on a mains line), and if you kept on pounding it, it would eventually start "acting funny" -- randomly toggling LEDs and handling commands that were never given in the first place -- before eventually crashing. The symptom was very far removed from the root cause; at first, all I was debugging was "system begins answering with trashed values after a while".
I was two days into it when a more experienced colleague stepped in to help me (I was a junior developer at the time). After removing module after module, the bug still wasn't reproducible by any particular sequence of steps, but the behaviour it triggered became fairly uniform, so we began suspecting one process was smashing another process' stack.
We decided a good way to test this assumption was to modify the context switching routine to dump the current top of the stack over a serial line; unfortunately, that introduced additional delays that prevented the bug from occurring, so it didn't help us. We figured, however, that the handler for $wirelessprotocol's query was in the process that smashed the other process' stack, so we modified that handler to send the top of the stack over wireless (this is where not having an MMU helped, ironically :-) ). The base of the other process' stack could be obtained by just tracing context switches.
Sure enough, if enough commands piled up, that process (which was running some pretty intensive stuff, including floating point operations, on a very resource-constrained system) would smash into the next one's stack, messing up its context's registers.
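A related defensive check for exactly this failure on MMU-less systems (not what we did at the time, but a standard technique) is stack painting: fill each task's stack with a sentinel byte at creation, then at each context switch measure the high-water mark and flag tasks that got close to the neighbouring stack. A toy model in Python, all names mine:

```python
SENTINEL = 0xAA

def paint_stack(size):
    """A task's stack, modeled as a bytearray filled with a sentinel.
    The stack is assumed to grow downward from the end of the buffer."""
    return bytearray([SENTINEL] * size)

def high_water_mark(stack):
    """Max bytes of stack ever used: distance from the end of the
    buffer to the lowest overwritten (non-sentinel) byte."""
    for i, b in enumerate(stack):
        if b != SENTINEL:
            return len(stack) - i
    return 0

def check_overflow(stack, margin=16):
    """True if the task came within `margin` bytes of running off the
    bottom of its stack into the next task's memory."""
    return high_water_mark(stack) > len(stack) - margin
```

On real hardware the check is just a loop over the bottom few words of each stack at context-switch time, which is cheap enough not to perturb timing the way our serial dump did.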
In retrospect, this wasn't necessarily a difficult bug per se: the concept is well-understood and the theory behind it is trivial. The biggest problem is that it challenges the fundamental way we debug programs: when the CPU starts doing crap, we assume we've instructed it to do crap, and it's (correctly!) following consistently bad instructions. In this case, the CPU ended up following random instructions.
Late one night, when no one else was there, I ran "top" only to puzzle over a bunch of identical command lines consuming all the CPU:
login -p Mkkuow....
I don't remember the exact username, but this Mkkuow guy was trying to log into all the terminals on each box. I don't clearly recall how I figured this out, but it was the result of capacitive coupling (parasitic capacitance) between the transmit and receive RS-232 wires. The OS would transmit "SunOS login:", then get garbage on the receive line. Then it would prompt for the password a few times, eventually giving up and transmitting the login prompt again.
The actual username I saw is easy to figure out by graphing the ASCII voltage levels and then considering how capacitance works.
The solution was to replace all the cables with lower-capacitance cable. Because that required all new connectors, as well as my time to install them, my manager Karen Coates took some convincing, but in the end the new cable stopped the hangs.
Think about that the next time your code gets you down.