A software engineer, a hardware engineer and a department manager were on their way to a meeting in Switzerland. They were driving down a steep mountain road when suddenly the brakes on their car failed. The car careened almost out of control down the road, bouncing off the crash barriers, until it miraculously ground to a halt, scraping along the mountainside.
The car's occupants, shaken but unhurt, now had a problem: they were stuck halfway down a mountain in a car with no brakes. What were they to do?
"I know," said the department manager. "Let's have a meeting, propose a Vision, formulate a Mission Statement, define some Goals and by a process of Continuous Improvement find a solution to the Critical Problems, and we can be on our way."
"No, no," said the hardware engineer. "That will take far too long, and besides, that method has never worked before. I've got my Swiss Army knife with me, and in no time at all I can strip down the car's braking system, isolate the fault, fix it and we can be on our way."
"Well," said the software engineer, "before we do anything, I think we should push the car back up the road and see if it happens again."
... the big disadvantage of software engineering is that we deal in abstractions. Everything down to the foundation moves.
You’re very positive. When you see a developer dealing with a bug, it seems more like they are dealing with a turd palace floating in a sewer.
But I think it's also partly illuminating the fact that hardware engineers are true engineers, while software engineers mostly aren't.
Also, the story is unfair to the manager. The manager would call everyone in for a war-room meeting on how to fix this urgent production issue. They’d give a quick overview of the problem, then open it up for the experts to talk. While the tech whizzes are talking, they would order pizza or yum cha or something for the team.
You’re saying the power to do that makes software engineers ridiculous or impractical?
Engineering in a digital space gives you the kind of debugging abilities that would be straight-up miraculous in a physical discipline. If I have one thing and I want to make ten more of it to test in ten different ways, it’s literally just CTRL+C, CTRL+V. Let’s see a mechanic do that.
Of course software is also limited by hardware capabilities but we can code whatever we want on that hardware as long as it fits within the provided specs.
Only the brakes don't work - the engine still does. Why would they need to push the car uphill?
The "re-run to reproduce" isn't even the most peculiar part of software engineering; it's "we don't look at what has been tried before".
In the OP story example, there's this part: "One time we added ventilation holes to reduce heat, but they were just big enough for wasps to nest in". In an ideal world, hardware engineers are supposed to know not to have any holes larger than a few millimeters, and other such things, whereas software engineers are not (yet) expected to know much of anything like that. The most prominent example I can think of is "remember to add an index to the database when adding a new type of query to the app", since the load testing that would catch it tends not to be done on each and every release.
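As a hedged illustration of the index point, using Python's built-in sqlite3 module: asking the database for its query plan shows the difference an index makes without any load test, so a cheap per-release check could look at it. The table, column and index names below are made up, and the exact wording of SQLite's plan output varies between versions.

```python
import sqlite3

def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's EXPLAIN QUERY PLAN detail text for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the fourth column of each plan row.
    return "; ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

new_query = "SELECT * FROM orders WHERE customer_id = ?"
before = query_plan(conn, new_query, (42,))  # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = query_plan(conn, new_query, (42,))   # lookup through the new index

print(before)
print(after)
```

A test that fails whenever a new query's plan starts with a full scan is one possible guard; whether a scan is actually a problem still depends on table size.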
Go fast and brake things!
"I know," said the department manager. "Let's have a meeting, propose a Vision, formulate a Mission Statement, define some Goals and by a process of Continuous Improvement find a solution to the Critical Problems, and we can be on our way."
The hardware engineer and the software engineer exchanged a knowing look. "Actually, the car is fine, but let's pick a driver who knows what they're doing this time. You simply don't know how to use the brakes, and you were steering randomly left and right while thinking you were on track, management guy."
It betrays a profound misunderstanding of the situation. The other engineering disciplines don't work like that because they're just soooo much more professional than us. They don't work like that because it is a better way. They work like that because for them, it is the only way. You do not build a hotel, and then realize the ceilings need to be six inches higher, and tear the whole thing down and start over.
If they could work by running ceilingHeight += 6 and hitting "Rebuild", see the hotel rebuilt and the automated unit tests automatically double-check the usability of everything inside for handicapped people etc., all for a grand total of about $2.82, they absolutely would.
Shed your inferiority complex. We are not squalling babies drooling on our blocks while Real Men (with all the pejorative connotations modern political sensibilities see in that term fully intended) are building bridges and dams. We engineer with better tools than they could dream of having, and it's completely expected that that results in highly significant changes to our processes.
Do we sometimes fail to bring enough process to a problem? Yup. But if you think that's a problem unique to programming, I prescribe to you spending several hours with https://www.imdb.com/title/tt4788946/ .
1-A practical problem is being solved in a scientific way.
2-Safety, repeatability, understanding of the how and why of the solution are non-negotiable.
3-The person solving the problem has been credentialed as an engineer in both ethics and scientific rigour.
4-Because he is credentialed, there is non-waivable liability for the engineer signing off on the solution if it fails.
No whining or superiority intended, but if any of the four criteria is missing, you're not practicing engineering. In my experience most software development is missing all four. That's not necessarily a bad thing, it's just that most software development isn't engineering.
Again, nothing intended by it. There's no superiority to engineering over development.
It's like how the word "literally" has come to be used as an intensifier, not strictly in the (ahem) literal sense of the word, and "objectively" is well on its way down the same path. You can be angry about that, but it's not going to stop the continuing evolution in how the world uses those terms, and "engineering" as a term is exactly the same.
>4-Because he is credentialed, there is non-waivable liability for the engineer signing off on the solution if it fails.
Both of these are bullshit. The engineers that aren’t yet credentialed but do all of the work that the PE signs off on are still certainly engineering.
> 2-Safety, repeatability, understanding of the how and why of the solution are non-negotiable.
Safety critical systems would like a word.
It's like sculpting clay vs. marble. If you make a boo-boo in clay, it takes you a second to readjust. If you are working in marble and you took out a piece you weren't supposed to, well, it's time to order a new block of marble.
Taking the techniques and processes of marble sculpting into the world of clay would just make you a bad (or at least highly inefficient) clay sculptor.
And there's really no point to a dick measuring contest of whether clay or marble sculpting is more worthwhile - they both have their own place in our society.
Case in point: I modeled a thing today, printed it off, and realized that it would be ever so helpful if one part were about a millimeter thicker. 30 seconds later, version 2 of the part is on its way to the printer.
It's fucking _amazing._ I can't wait for stuff like that to become as mainstream as paper printers are.
Although back then in the 90s it was plugging numbers into tables rather than software. Very little science involved in any of it.
And you can't always do it in software, either. I keep a mental list of "things you can't really retrofit onto a mature piece of software" that I keep kicking myself to turn into a blog post sometime. I suppose a quick & trivial example is that pretty much by definition switching languages involves a rewrite.
(Though that involves a bit of definitional footwork where I declare using a compiler that goes from language A to B isn't really "switching to B". Anyone who has tried that maneuver in real life can attest it certainly isn't the same thing as a real full rewrite in the target language, even if it does sometimes have its utility.)
Still, there's literally an orders-of-magnitude difference in how likely a given retrofit is to work and how easy it is for us to add, and the proof of that is precisely the enormous signature this difference leaves on our processes.
> Shed your inferiority complex. We are not squalling babies drooling on our blocks while Real Men (with all the pejorative connotations modern political sensibilities see in that term fully intended) are building bridges and dams. *We engineer with better tools than they could dream of having, and it's completely expected that that results in highly significant changes to our processes.*
I just want to add that this comes off as originating from an inferiority complex ;) Why is the term "engineer" so important to you? Just do your job, do it right, and ignore criticism.
They are, frankly, welcome to it, right up until they try to ruin my job by adding negative-value processes because of their complex.
30 years later:
/* X systems I modul body I 14.09.1990 */
void xxvcda(int *addr, int sizeof)
Fun stuff. At least this isn't for my job, which has its own fun.
DRF assumes (for safety) that what you have is a public API. For JWT validation errors, you don't normally expect error-level logs, since such an error is normally the API caller's problem, not the service's.
However, if you are using DRF for an internal API, it is indeed useful to change the error handlers to, basically, add tracebacks to all the returned API errors.
By which my friend meant, physics has a way of being checked by physical reality in a way that math or computer science don't.
His area of work was extreme magnetic fields. Experimenting meant building giant copper coils, running enough current in them to melt them in place, and then very quickly detonating explosive around the coil so that, for a fraction of a second, the magnetic field at the center of the coil became the most intense ever built by mankind, before the whole setup was destroyed by the splattering of liquid copper thousands of degrees hot. Errors and miscalculations in that work environment meant that people could die unpleasant deaths very quickly.
So when he looked at math PhD students who at most got chalk dust onto their sweater, calling themselves scientists, he disagreed.
Explosively pumped flux compression generators [1] are fun!
That’s how real EMPs are made, in case anyone is interested in a career in super villainy but doesn’t know where to start.
[1] https://en.m.wikipedia.org/wiki/Explosively_pumped_flux_comp...
For real, though, I've always found things like these fascinating.
I used to be very interested in working on software for weapons (specifically for fighter aircraft) because it has very interesting problems to solve that are usually very direct. The number of irrelevant code paths you can add and the amount of far-away external stuff seem very limited, at least in theory, so it seems like a very interesting field to work in.
I could never get past the feeling that it would be very bad to be part of a process that might ultimately end in killing someone, though, so it was a non-starter.
I've not found mathematicians calling themselves scientists. In fact, it's usually the opposite: they boast that they are not scientists and are not limited by petty reality.
Was Shawn able to access anything on the server that would confirm/deny that the image upload was coming through? Why did the image upload work in the test environment but not in the released version of the app? What was different about the test environment?
In theory, Shawn should have had enough access to the server environment (either by running the servers himself or by asking someone to help him diagnose why an upload failed silently) to get a reasonably quick answer to "why is this upload succeeding but not showing up?"
IMO, those lessons (why the upload worked in test but not in production) are significantly more important than "the image MIME type was set to 'jpg' but should have been 'jpeg'". The bug itself is far less consequential than the reasons the environment made it so hard to find.
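For reference, the registered MIME type is image/jpeg; "image/jpg" is a common but unregistered spelling. A minimal sketch of rejecting it loudly at the upload boundary instead of failing silently later, using Python's standard mimetypes module; the allow-list and function names are hypothetical.

```python
import mimetypes

# Hypothetical allow-list for an upload endpoint.
ALLOWED_TYPES = {"image/jpeg", "image/png"}

def content_type_for(filename: str) -> str:
    """Derive the MIME type from the file extension instead of hand-writing it."""
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed is None:
        raise ValueError(f"cannot determine content type of {filename!r}")
    return guessed

def check_upload(filename: str, declared_type: str) -> None:
    """Fail with an explicit message when the declared type is not accepted."""
    if declared_type not in ALLOWED_TYPES:
        raise ValueError(
            f"unsupported content type {declared_type!r} for {filename!r}; "
            f"expected one of {sorted(ALLOWED_TYPES)}"
        )
```

For example, content_type_for("photo.jpg") gives "image/jpeg", and check_upload("photo.jpg", "image/jpg") raises instead of letting the upload vanish.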
In my case, I had a situation where a desktop application was severely malfunctioning, but errors were not being logged. It took me multiple days to realize that the application was running out of file handles, and that log4net wouldn't log if it couldn't get a file handle. Even though the immediate fix (reverting a very small bugfix) was simple, the real fix was to customize log4net to always keep the log file open. That way, if the application ran out of file handles, the error would still be logged.
This reminds me of a product I worked on where several (in fact most) of the production-critical APIs (banking APIs, transfer APIs, etc.) had major undocumented differences between production and test instances, to the point where, if something was using them, you just couldn't ever be sure that what you had was actually correct. Some of this stemmed from some of the APIs being technically mandated by law, with no interest in actually making them good, but some of them were actual B2B solutions that just sucked for no apparent reason.
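log4net is a .NET library, so the sketch below is not its configuration; it shows the same idea with Python's standard logging module, assuming a hypothetical logger name "app". A FileHandler created with delay=False opens its file when it is constructed, so a later error can still be written even if the process has since run out of file descriptors.

```python
import logging

def make_startup_logger(path: str) -> logging.Logger:
    """Create a logger whose log file is opened once, at startup, and kept open."""
    logger = logging.getLogger("app")  # hypothetical logger name
    # delay=False (the default) opens the file in the constructor, so later
    # records go to the already-open stream instead of needing a new file
    # descriptor at the moment something goes wrong.
    handler = logging.FileHandler(path, mode="a", encoding="utf-8", delay=False)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

The trade-off is one file descriptor held for the life of the process, which is usually a cheap price for logs that survive descriptor exhaustion.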
At points like these it's (IMO) quite defensible to build a very comprehensive adapter that basically does most of the surface work of the API you're using as best you know right now, i.e. almost pretends that it is the other system to a degree where you're re-implementing large parts of it.
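One way to read that adapter advice, sketched in Python: the application talks only to a narrow interface, and an in-memory implementation encodes what you currently believe the production API does. TransferGateway, FakeTransferGateway and pay_invoice are hypothetical names, not any real bank's API.

```python
from dataclasses import dataclass
from typing import Protocol

class TransferGateway(Protocol):
    """The surface of the external transfer API that the app depends on (hypothetical)."""
    def transfer(self, source: str, target: str, amount_cents: int) -> str: ...

@dataclass
class FakeTransferGateway:
    """In-memory stand-in that re-implements the behaviour we believe production has,
    so tests don't depend on the quirks of the vendor's test instance."""
    balances: dict[str, int]
    _next_id: int = 0

    def transfer(self, source: str, target: str, amount_cents: int) -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if self.balances.get(source, 0) < amount_cents:
            raise ValueError(f"insufficient funds in {source!r}")
        self.balances[source] -= amount_cents
        self.balances[target] = self.balances.get(target, 0) + amount_cents
        self._next_id += 1
        return f"tx-{self._next_id}"

def pay_invoice(gateway: TransferGateway, customer: str, amount_cents: int) -> str:
    # Application code only sees the interface, never the vendor client directly.
    return gateway.transfer(customer, "merchant", amount_cents)
```

When production behaviour turns out to differ, the fake gets updated in one place, which documents the discovered difference for everyone else.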
Maybe by "the images simply wouldn't upload." the author meant "did not display", and the file was being uploaded to the data store, was visible when viewed in the data store directly, but would not be displayed in app when requested.
I got the feeling that this is one of those 500-mile email[1] stories, where technical details are omitted for easier storytelling.
"I re-uploaded a version with improved error handling, but image uploads were failing without any feedback. You see, normally code screams its errors at you in red text - silence is the goal. Here silence was the problem."
Silence is not quite the goal. Too many developers think silence is the goal, but the goal is actually accuracy. If there's no error, yes, it should be silent. If there's an error that affects the user, there should be a big red alert box. I believe developers should come to love error messages. Well written error messages reveal causes quickly and save everyone a lot of time.
I hope this developer has learned to show error messages more often. That would be a great outcome.
The applications (CLIs, native, web, etc.) I've seen that present me with non-actionable errors are a perpetual source of irritation.
"Failed to open file"
"File could not be uploaded"
etc
Not only are these useless to the user who can't do a thing about them except try the same thing again, they're useless to the developer or support engineer who might be trying to help them.
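A minimal Python sketch of the alternative: keep the original exception as the cause and put the what, the which file and the why into the message. UploadError and read_for_upload are hypothetical names.

```python
class UploadError(Exception):
    """Raised when a file cannot be prepared for upload (hypothetical)."""

def read_for_upload(path: str) -> bytes:
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError as exc:
        # Say what failed, on which file, and why, instead of "Failed to open file".
        # "from exc" keeps the original OS error attached for whoever debugs it.
        raise UploadError(
            f"Could not read {path!r} for upload: {exc.strerror or exc}"
        ) from exc
```

The user sees which file and roughly why; a support engineer still has the full chained traceback in the logs.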
Tangentially, one thing I often ask other senior technical leaders (especially Director, VP or CTO) is: what is the most costly mistake you have made? If you are a junior engineer, make sure you ask it sometime. Many/most high-level leaders in tech can tell stories in the $100k to $1m range. I've seen people lose millions of dollars on a project and get promoted immediately after. It is important to understand why that can happen and why it can even be a good thing.
I don't agree. Maybe a failure won't result in people dying in a ball of fire, but it can still cause harm. Even minor harm can still add up at scale.
Frustration from a buggy game could lead to real-world road rage or shouting matches. People have killed themselves because a computer sent them a bogus bill. Businesses have failed because software lost valuable data. People have been murdered because of silly social media apps. People have organized pogroms on Twitter. People have been stalked and assaulted using information leaked by Pokemon Go.
Software has real power. If it didn't, there would be no point in writing it.
But I urge you to consider the other side of the spectrum and the pressures that people can put on themselves. For some, in their search for perfection, they can ruin their own lives. They can see every mistake they make as a personal failure. It is useful to remember that in the vast majority of cases people bounce back from these failures.
You will hear over and over how many entrepreneurs fail in their first businesses, often several times. Most often in life you don't just get a second chance, you get many chances. There are only a few places in life where a single failure is truly catastrophic.
So if you find yourself overwhelmed as a junior engineer, as described in this story, if your stomach is in knots and you are terrified your lead is going to eviscerate you in front of a cheering audience, just know that you have more latitude to fail and try again than you might expect.
I also find that attitude troubling.
I've worked on software that could lose people's cherished data. Now I work on software that could cause flooding if it misbehaves.
Take a bit more pride in your work.
This was in the early 2000s in the games industry. I'm not sure if you are familiar with that culture, but it was a time when the engineers were working 12+ hour days for months at a time. People were pouring their heart, souls and sanity into shipping software, often working until they literally broke down. I remember one engineer boasting that he had worked for 2 months straight without taking a single day off.
In that environment the stress was high and technical discussions could often escalate into heated arguments. We often had to remind ourselves that we were making games and many people working there were supposedly living their childhood dreams. It was important to remember that.
The idea that we didn't take pride in our work or didn't do everything in our abilities to ship the highest quality software is beyond incorrect. It was that excessive pride that we needed to guard against by checking in to reality. It wasn't a call to laziness, it was a call to humility.
Not to say I didn't get stressed out debugging. I had one demo while developing software for AT&T's 5ESS telephone switch. We had only one phone line in the test lab configured for our feature. Attempt after attempt, the software just wouldn't work. Knowing the software worked, I checked everything I could, stressing out all the while. Finally I asked the lab tech to check the line. Somehow our only configured line was disconnected. The problem was a stupid hardware issue.
Same here. I always enjoyed the challenge of it, especially in complicated systems.
You need to spot unusual behavior, figure out ways to isolate and reproduce it, come up with hypotheses as to what might be happening, design tests for those hypotheses, run the tests, find the solution, test the solution -- exactly like the interplay of experimental and theoretical science.
An actual debugger also makes the act of debugging much nicer to me. I've never understood people who quibble about using either printf debugging or an actual debugger when using both is such a massive win. Arguably good tracing facilities should be highlighted here as well since they're much nicer than printf debugging when they're available.
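As a small step up from printf debugging, here is a hedged Python sketch of a tracing decorator built on the standard logging and functools modules: each call's arguments and result (or exception) are logged, and the output can be switched on or off through logging configuration instead of editing print statements. The "trace" logger name and the scale function are hypothetical.

```python
import functools
import logging

logger = logging.getLogger("trace")  # hypothetical logger name

def traced(func):
    """Log each call to func with its arguments, plus its return value or exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("call %s args=%r kwargs=%r", func.__qualname__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.debug("raise %s", func.__qualname__, exc_info=True)
            raise
        logger.debug("return %s -> %r", func.__qualname__, result)
        return result
    return wrapper

@traced
def scale(values, factor):
    """Hypothetical function under investigation."""
    return [v * factor for v in values]
```

Nothing is printed until logging is configured, for example with logging.basicConfig(level=logging.DEBUG); a real tracer or debugger still gives far more, this only replaces scattered print calls.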
> Not to say I didn't get stressed out debugging.
This is interesting because I remember very early in my programming journey I used to attach a lot of unnecessary things to not understanding why something was happening (or not happening) and gradually I just came to embrace the "Hang on, why is this so? I don't get it... Ohh, hang on... Wow, it really makes sense why this didn't work!" loop and internalize the fact that it feels really good when I get to the end of it.
The process of not knowing can only really get ruined by other people's expectations and behaviors for me at this point and I've learned over time that this can only be managed by being very assertive about how things are done (language choices, architectural choices, etc.) so as to make the process as easy and quick as possible. Ultimately it's a lot easier (at this point in my career) to convince someone that working with AWS lambda is a bad choice for performance, overall costs, debuggability and development speed (all to varying degrees) than it is to convince them later that there is a good reason that fixing that one problem is taking more time than it ought to.
Someone had been doing manual fixes inserting and removing data for the past 3 years. It became part of his job. He added a recurring event in his calendar just to do that regular clean up. Millions of customers depended on this one individual making sure they had the correct data plan on their phone line.
You can imagine the chaos when he forgets or goes on PTO.
Turns out: if $line->status == STATUS_ACTIVE
one was 'Active' the other was 'active'. No dogs were harmed, but incalculable money was lost over the years.
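The snippet above is PHP; the same case-sensitive comparison mistake, sketched in Python with a hypothetical constant, looks like this:

```python
STATUS_ACTIVE = "Active"  # constant as defined in one (hypothetical) part of the system

def is_active_buggy(status: str) -> bool:
    # Exact comparison: "active" written by another component never matches.
    return status == STATUS_ACTIVE

def is_active(status: str) -> bool:
    # Normalize both sides before comparing; casefold() is a Unicode-aware lower().
    return status.casefold() == STATUS_ACTIVE.casefold()
```

Normalizing once, where the data enters the system, is usually better than sprinkling case-insensitive comparisons around the code base.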
'What have you done? You solved the case that put our entire family through law school!'
That poor soul is no longer indispensable. I am only half joking. Software is about making things more efficient, but I like to look at the human motivations.
Super frustratingly, Macs populating HL7 fields caused intense pain. It turns out that the character ’ when typed on a Mac keyboard is not compatible with all versions of HL7, or perhaps wasn’t compatible with what the HL7 was passed off to. It’s a distant memory now but it was words like o’clock versus o′clock, or something like that which broke radiology report distribution.
It went on for years before being caught.
Edit: HN is displaying the ’ differently to how it looks when I type it, but it’s still the same character. The fact that we couldn’t see the difference when debugging was half the problem, so this is quite funny.
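Since the trouble was characters that look alike on screen, one cheap debugging aid is to print code points and Unicode names instead of glyphs. A minimal Python sketch using the standard unicodedata module; reveal_non_ascii is a hypothetical helper name.

```python
import unicodedata

def reveal_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

print(reveal_non_ascii("o’clock"))  # [(1, 'U+2019', 'RIGHT SINGLE QUOTATION MARK')]
print(reveal_non_ascii("o′clock"))  # [(1, 'U+2032', 'PRIME')]
print(reveal_non_ascii("o'clock"))  # []
```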
Another example I remember from earlier days is the BOM in XML files. When it was wrong, things could crash in all kinds of weird ways, and the cause was impossible to see.
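One common way a UTF-8 byte order mark bites is when bytes are decoded as plain UTF-8 and an invisible U+FEFF character ends up at the start of the text. A hedged Python sketch: the "utf-8-sig" codec strips a leading BOM if one is present; decode_xml_bytes is a hypothetical helper name.

```python
import codecs

def decode_xml_bytes(data: bytes) -> str:
    """Decode UTF-8 XML bytes, dropping a leading byte order mark if present."""
    # 'utf-8-sig' strips a leading EF BB BF; plain 'utf-8' keeps it as U+FEFF,
    # which is invisible in most editors but breaks string comparisons.
    return data.decode("utf-8-sig")

raw = codecs.BOM_UTF8 + b"<?xml version='1.0'?><root/>"
print(repr(raw.decode("utf-8")[0]))  # the invisible character, shown as '\ufeff'
print(decode_xml_bytes(raw))         # starts cleanly with <?xml
```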
I guess there is always a risk that some "helpful" contributor will fix the typo in your definition of HttpHeader::REFERRER as "referer" to make it "referrer" instead, thus completely breaking all the software because nope, that typo is enshrined in the HTTP standard, it's Phillip Hallam-Baker's fault while he was at CERN.
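A small Python sketch of keeping that spelling deliberate: the request header is "Referer" in the HTTP specifications, while the later Referrer-Policy header uses the dictionary spelling. The constant names and build_request are hypothetical; Request and add_header are from the standard urllib.request module.

```python
from urllib.request import Request

# The HTTP request header really is spelled "Referer"; do not "fix" it.
REFERER_HEADER = "Referer"
# The later Referrer-Policy header does use the dictionary spelling.
REFERRER_POLICY_HEADER = "Referrer-Policy"

def build_request(url: str, came_from: str) -> Request:
    """Build a request that tells the server which page the user came from."""
    req = Request(url)
    req.add_header(REFERER_HEADER, came_from)
    return req
```

A comment next to the constant, plus a test pinning its exact value, is often enough to stop a well-meaning typo fix.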
I would guess that some other part of the app doesn't even use the constant at all, and just hardcodes "Active" as a string on its own. Maybe taking the value of a dropdown from the UI and never mapping that back to the actual constant.
Landlord in our office complex installed a touchscreen interface on the outside of the building to dial the various front desks, all of which could buzz a person in. They did this because they had no front-desk receptionist who could see the door.
The thing lasted six months and then started to malfunction badly. The culprit? The interface is running on essentially a big black Android tablet and they installed it on the side of the building that faces East. Mid-spring rolled around and it caught enough sun every day to overheat and fry the touch electronics and part of the screen hardware.
As a software engineer, writing the kind of software I do, I never have to worry about thermal load.
The rail firm posted on Twitter: 'We had severe congestion through Lewisham due to dispatching issues as a result of strong sunlight.'
It added: 'The low winter sun has been hitting the dispatch monitor which prevents the driver from being able to see.'
Funny how the simplest, tiniest bugs are often the hardest to find. Just this morning I burned an hour or two hunting down an off-by-one error. Turns out it was an "index + 1" that I had forgotten to change when I refactored (facepalm).
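A hypothetical reconstruction of how a leftover "index + 1" survives a refactor: the old loop counted from 0 and added 1, the new one counts from 1 but keeps the addition. Comparing old and new versions on the same input catches it immediately.

```python
def numbered_before(items: list[str]) -> list[str]:
    # Original version: 0-based loop index, so "+ 1" gives 1-based labels.
    labels = []
    for index in range(len(items)):
        labels.append(f"{index + 1}. {items[index]}")
    return labels

def numbered_after_buggy(items: list[str]) -> list[str]:
    # Refactored to count from 1, but the old "+ 1" was left in place.
    return [f"{index + 1}. {item}" for index, item in enumerate(items, start=1)]

def numbered_after(items: list[str]) -> list[str]:
    # Correct refactor: enumerate already starts at 1.
    return [f"{index}. {item}" for index, item in enumerate(items, start=1)]
```

For ["a", "b"], the original and the correct refactor both give ["1. a", "2. b"], while the buggy one gives ["2. a", "3. b"].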
Many of us suffer from mental health issues, up to and including burnout. Why do we burn out if we're not dealing with life or death? Something is psychologically wrong with this job.
This is, obviously, a multidimensional issue, but one of the probable causes is Alienation. Ultimately, some software developers can, consciously or not, question the meaningfulness of spending 40+ hours a week on stuff that is, to a large extent, pointless, even harmful. Sure, the money is good, but how much cash do you need to buy meaning? How deep into the hedonic treadmill can you go before snapping?
I'm not even sure if these questions apply to myself, it's just something that keeps coming up in my own therapy and among friends in the trade.
Most telling that no sales/marketing/product managers showed up to console the guy. Satire just couldn’t plausibly stretch that far. But maybe I’m too jaded. Would love to see a reply that posited a section of this story where an empathetic individual from one of those domains gave solace to the main character.
A bunch of them are for minor injuries in infants, where the advice is basically: don't do an X-ray. If you can see a break on the image you'd do A, but if you can't you'd figure the break might be too small to show up and do A anyway. Kids don't need more radiation, so just do A immediately without requesting an X-ray.
I have wondered if my cancer diagnosis is at the edge of this case. There's a step where they do a needle biopsy. But, as far as I can tell that biopsy always either says "Cancer" or "Don't know" and I'm not sure what else they'd do for "Not sure" beyond the next step in the cancer diagnosis...
Then the Engineering Manager walked by. "There's a dot," he said. "Get rid of it before Wednesday."
With all the other things going right with the project, why this dot--a single pixel--was so important drove us all crazy. I ran through the assembly code that handled all this over and over again and couldn't see anything wrong. Never a reported error by the assembler. Neither could the project manager. Stayed at work all night to wrap it all up and, on Wednesday morning, everything was done and working perfectly just as the Engineering Manager walked into the room at 8am.
"That dot is still there."
Like the author of the article, I questioned whether I should be in this line of work. I continued rewriting, assembling, and testing every variation of the code I could. At 3:00PM on Friday, I found the issue.
MOVE B #0,D0
Do you see it? Imagine this is the 1990s, with a green screen monitor and a PDP-11.
I understand it’s not applicable in all positions/companies and heavily depends on the team and project size, but I’ve seen it work too many times.
Following this, the next best thing is testing (as others have commented).
> While many other professions struggle to understand and resolve their issues, we have the advantage of being able to experiment multiple times a day with just a few clicks.
So much strained positivity, as if the author was tasked to find something to be thankful for. Look, software engineering is a great job if you enjoy the work. But even in this parable, the problem is the author's week of distress and flailing. It has nothing to do with the work and everything to do with the author.
All the others he turns towards, no matter their discipline, their tools for investigation, or the constraints they faced in a dilemma, relate stories where they have a mature understanding of how their industry works and how they navigate its system. They encounter problems in the course of their work and they resolve them in the course of their work. Maybe it takes a little while to proceed with diagnosis. Maybe it takes a long while to integrate improvements into a later product. Maybe they need to forgo some procedure that they prefer to use.
By the author's account, the embedded engineer, the hardware engineer, the CEO and the veterinarian all face greater challenges when solving problems in their work, yet they all speak of their road through those challenges with a confidence that the author lacks. They try to soothe the author and empathize, but none of their stories hint at a week of panicked flailing.
So if they handle their work so much more confidently, is it true that their dilemmas are worse and that the author is lucky to be a software engineer?
If the author listened to their own invented characters, the realization to come to is not that the author is lucky to work in a field with purported "privileges" and that everyone else has it worse (gross!), but that everyone faces dilemmas in their work and that the real skill is in staying cool and confidently relying on the processes of their discipline. And this has everything to do with the person doing the work and nothing to do with the discipline. The realization is that challenges arise in all professions and that you can proceed through them without distress and flailing if you allow yourself patience and confidence.
It's funny because they wrote this as a parable, but they missed the real lesson in the very piece they wrote. Four people reassure them that "We've been there! We all go through this!" and their takeaway is "I need to remember that I'm lucky and that everybody else has it even worse."
And while there are many developers who write software that directly impacts the lives of other human beings in one way or another, generally speaking we need to recognize that we are in a fairly privileged position, as the author states.
Because I, too, have been there.
Enum types have saved me so many "typo"-related errors. Even in this case, it's not exactly a typo, but the encoding process should make you ask questions about the domain and the meaning of the strings you operate on.
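A minimal Python sketch of that idea with the standard enum module: raw strings are parsed into an enum once, at the boundary, and anything unknown fails loudly instead of silently never matching. LineStatus and its values are hypothetical.

```python
from enum import Enum

class LineStatus(Enum):
    """Hypothetical phone-line statuses."""
    ACTIVE = "active"
    SUSPENDED = "suspended"

    @classmethod
    def parse(cls, raw: str) -> "LineStatus":
        """Map a raw string from another system onto a known status, or fail loudly."""
        try:
            return cls(raw.strip().lower())
        except ValueError:
            raise ValueError(f"unknown line status {raw!r}") from None
```

After parsing, the rest of the code compares LineStatus members, so "Active" versus "active" can no longer diverge.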
Off-topic: is it really like that in the US? (I'm assuming he's from the US). Like, if you get stuck for a few days, you start to worry about being fired or reprimanded?
Thank you for sharing your misadventure with us.
Not salty, just broke my interest. So now I will go about my day without reading your article. p.s. I'm sure it's very good though
Consider saving that popup until I've read some of the article
And a reminder to myself to do good.
Software development has all of diagnoses, audits and proofreading.
It's an all-encompassing discipline.