Their crew (and flight) scheduling software more or less simulates a "perfect" day of operations. Airplanes take off on time, land, and continue on. If anything disrupts this simulation, crew members have to call in and talk to someone to update the computer system, telling it that both the airplane and crew members are not where the system thought they were.
Once the call center got overwhelmed it was a cascading failure with Southwest quickly not understanding where most of their flight crew happened to be at any given moment. It appears they feel the only way to solve this is get everything (planes and crews) back into "starting position" to restart the simulation.
The Sabre reservation system was created by IBM for American Airlines in the 1950s — when a computer filled an entire building floor and was built out of vacuum tubes — and remains actively used today by multiple airlines. The programming language has been switched several times over the past 60+ years, but essential compatibility remains.
Edit: Ok reading comments jogged my memory. It was something to do with the 'Sabre' system which was from the 1950's.
https://blog.geaerospace.com/technology/big-wins-in-flight-e...
Skysolver is a GE Flight Services trademark - there’s a video here showing how it works and SW planes in the video. Contrary to the reddit claim, it does appear to use a predictive algorithm.
Highlight quote from the video: “It is humanly impossible when there’s a major disruption for somebody to figure out what the optimal approach is to get them back on schedule”
Edit: I could see tracking off-duty crew being difficult and done by phone - employees don’t carry beacons and would you want that surveillance from your company? In this situation many crew could be home for the holidays and far from their last known location, and stranded due to problems with other airlines, trains, or roads.
"So the storm came and it impacted ground ops so bad that many many crews were now “unaccounted” for and the system in place couldn’t keep up. Then it happened for several more days. By Xmas evening the CS department had essentially reached the inability to do anything but simple, one off assignments. And to make matters worse, the phone system was updated not too long ago and it was not working well."
"I used to work for a large company trying to fill the void [huge gap in the market for good aviation scheduling software], and our software was damn good too. SW was one of the airlines interested, we would demo it exactly like the scenario today, but it was "too expensive" and they stayed on their homebuilt stuff."
This is a common situation in any industry. A company has a solution it self-developed in its infancy to support itself, and as it grows, it comes to rely very heavily on this software. At a certain point, everyone involved with the software -- from the developers to the users to the customers -- knows that it is not on par with third-party software developed by a team dedicated to all of its potential use-cases and pitfalls, but it is very difficult for the business to replace it because the ongoing cost is a relatively low maintenance cost, and the replacement cost is a one-time, relatively high purchase fee (with, usually, a similar if not higher ongoing maintenance cost). So the immediate reaction is to stay with the low ongoing cost, even though everyone "knows" that the long-term benefits of the third-party solution far outweigh the financial savings of keeping the in-house solution, whether those benefits might be avoiding a national week-long service outage or something simpler, like the ability for staff to get more done in other areas of the company that need attention when getting larger.
I have been there. It can be very difficult to make the financial case when the future benefits are somewhat speculative or intangible. Accountants tend to devalue those benefits, relative to the hard numbers they already have. It does not help that system replacement projects often go over budget and introduce other problems that cost more money. Replacing home-grown systems is hard because they are not just software replacements; they typically also require reworking business processes.
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.
If justifying speculative future benefits to the CFO is the first strike against these sorts of projects, this is the second strike — the operations team has likely been burned by failed IT modernization projects before. How hard are you going to fight for something you aren’t 100% sure is even going to work?
i'd wager complex scheduling software is probably licensed, not sold outright. many many millions a year to license.
but yeah, lot of good factors you cite.
Although it's also a regular occurrence that the replacement of an old 'just works' system brings about a week of downtime, mass disruption and still basic features missing.
Having personally experienced the delays and issues with SW earlier this year I can suggest not flying them other than direct flights. Them not having any hubs means if one plane is late or a crew times out because of delays you are stranded. We got stranded at one airport overnight and got routed thru 2 other airports, changing planes at one, before reaching our destination. They have zero capacity of crews/planes on standby at any airport, probably because of the former CEO leadership and pandemic complications.
Edit: simple queueing theory can explain why having no reserve capacity leads to system breakdown.
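A rough sketch of that queueing argument, assuming the simplest textbook model (an M/M/1 queue: one server, random arrivals, random service times). The service rate and the "rescheduling desk" framing are made-up illustrations, not Southwest data. In that model the mean time spent waiting in line is rho / (mu - lambda), where rho = lambda / mu is utilization, so the wait grows without bound as utilization approaches 100%.

```python
def mean_wait_in_queue(arrival_rate: float, service_rate: float) -> float:
    """Mean waiting time in queue (Wq) for an M/M/1 queue, in the same time unit as the rates."""
    if arrival_rate >= service_rate:
        return float("inf")  # no spare capacity: the backlog grows without bound
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)


# Hypothetical: requests per hour that one rescheduling desk can handle.
SERVICE_RATE = 10.0

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    wait_hours = mean_wait_in_queue(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"utilization {utilization:.0%}: mean wait about {wait_hours * 60:.0f} minutes")
```

Going from 50% to 99% utilization in this toy model takes the mean wait from minutes to hours, which is the sense in which running with no reserve makes any disruption spiral.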
In 2021, Gary Kelly (former CFO/CEO) announced his retirement and picked Bob Jordan (CS trained, having previously worked on the AirTran integration) to take over as CEO, which seems a realization that Southwest has some serious technical debt to address.
All businesses have highest priorities at a given point in time.
In Southwest's early years, they were legal and operational, hence Herb Kelleher.
2000-2020, I can't say financials weren't top of the list, as Southwest migrated off its older plane models.
2020+, maybe they're technical.
A good company picks the CEO it needs for the moment, not always one particular type.
And it seems to me they've cut direct flights -- still plentiful on many of the hops (say, LAS <-> LAX) but scarcer on longer routes (say SLC <-> LAX).
And on the shorter routes, driving is roughly competitive time-wise...
Add on to that the latest technology woes, the storm was just the thing that tipped over a company that had cut operational capacity to the bone.
I don't see any non-bogus reason to, for example, stop accepting doctor's letters from telehealth visits only in cities where they're particularly short staffed.
They arrange flights and crews so that the right number of planes and people are in the right places at the right times.
There's some tolerance in the system. So if the plane from New York to Cincinnati is late, it's ok-ish. The flight from Cincinnati to Dallas should be able to make it in time if things aren't too bad. Then the flight from Dallas to Phoenix should take off when it should. The Phoenix to Las Vegas flight will never know there was a problem.
It also matters for crew. Pilots can only fly for so many hours. So if you have someone stuck in a holding pattern, that cuts into their allowed hours.
However, if that plane from New York to Cincinnati shits the bed, it'll fuck over Cincinnati to Dallas, Dallas to Phoenix, and Phoenix to Las Vegas. The failure just cascades. You lose planes, you lose crews, nothing is matching up and everything is fucked.
Now imagine this happens a few hundred times. Thousands of flights are affected.
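A minimal sketch of the tolerance-versus-cascade idea above, assuming each stop can absorb a fixed amount of slack and passes the rest of the delay on to the next leg. The slack values and delay sizes are invented for illustration.

```python
def propagate_delay(initial_delay_min: int, slack_per_stop_min: list[int]) -> list[int]:
    """Return the delay (in minutes) that each later leg inherits from the first one.

    Each stop absorbs up to its scheduled slack; whatever is left carries over.
    """
    delays = []
    delay = initial_delay_min
    for slack in slack_per_stop_min:
        delay = max(0, delay - slack)
        delays.append(delay)
    return delays


# Hypothetical slack at Cincinnati, Dallas, and Phoenix, in minutes.
SLACK = [20, 20, 20]

print(propagate_delay(45, SLACK))   # a late plane: the delay is gone within a few legs
print(propagate_delay(300, SLACK))  # a badly broken leg: every later leg is still hours late
```

With 20 minutes of slack per stop, a 45-minute delay is absorbed by the third leg, while a 300-minute delay is still 240 minutes by the last one.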
Other airlines don't have this problem because they can just not do a flight. They fly people into a hub, then out of a hub. Delta will go from New York to Atlanta, and back again. Cincinnati to Atlanta, and back again. They work more like buses. Miss a bus, catch the next one. So if you crap out a day's worth of flights, you can still put those people on planes and get them out. You know they're either at the hub or on their spoke. So if they're not in Atlanta, they're in their city.
Most airlines are hub and spoke. Elements shift right (planes leave the hub) then shift left (return). Sometimes they shift two places right. If something disrupts this, like a plane being stuck in [1] when it needs to be in [0], it's generally recoverable when the event clears, or you can use one of those 2-step moves in another flight to pick up stranded passengers on your way back to the starting position [0].
Southwest operate a point-to-point model. Assets start at [0] and have to traverse every point on the list through to [N] to succeed. N can be 3, 4 or higher. You can imagine what happens if a disruption happens in the middle of this list. Everybody upstream of the break is left planeless, everybody downstream who wants to go up can't make it further than the airport before the break, and everybody at the breakpoint is miserable.
So Southwest then fall back to a new scheduling model for which their system was not designed. It's like having a graph traversal system and trying to get it to solve matrix equations. Yeah, there are some similarities, but it's not really the same.
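A toy comparison of the two models described above, assuming one aircraft per route and a single closed airport. The routes, the hub, and the function names are hypothetical; real networks are far messier.

```python
def legs_lost_point_to_point(route: list[str], closed: str) -> list[tuple[str, str]]:
    """Legs one aircraft cannot fly once its chain reaches a closed airport."""
    legs = list(zip(route, route[1:]))
    for i, (origin, dest) in enumerate(legs):
        if closed in (origin, dest):
            return legs[i:]  # the aircraft never gets past the break
    return []


def legs_lost_hub_and_spoke(hub: str, spokes: list[str], closed: str) -> list[tuple[str, str]]:
    """Out-and-back legs lost when one airport closes in a hub-and-spoke network."""
    lost = []
    for spoke in spokes:
        if closed in (hub, spoke):
            lost += [(hub, spoke), (spoke, hub)]
    return lost


chain = ["New York", "Cincinnati", "Dallas", "Phoenix", "Las Vegas"]
spokes = ["New York", "Cincinnati", "Dallas", "Phoenix"]

print(legs_lost_point_to_point(chain, "Dallas"))
print(legs_lost_hub_and_spoke("Atlanta", spokes, "Dallas"))
```

Closing Dallas costs the chained aircraft every leg from Cincinnati onward, while the hub network loses only the Atlanta-Dallas round trip. The sketch also shows the hub model's own weak point: pass the hub itself as `closed` and every leg is lost.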
The pilot gave the current CEO Bob Jordan (who started in early 2022) a vote of confidence and stated that he has strongly signaled his intention to improve the state of SW's systems, but obviously was unable to do so before this meltdown.
Source: https://www.reddit.com/r/SouthwestAirlines/comments/zxg6op/t...
Their stock price is only down ~10% since the meltdown, which seems extremely optimistic to me.
I wholeheartedly disagree. I don't fly SW except to save money, but everyone else I've spoken to with an opinion on airlines loves to say how much they love the carrier.
Coworkers and I went to look at the sea of unclaimed bags yesterday, and I mentioned that SW has drastically underpaid ground crews and have flight crews sleeping in crew rooms in airports, and they said that they'd repeatedly heard how much flight crews enjoy working for the company.
I really don't understand the appeal, but it's definitely out there.
Alaska, Hawaiian, [Delta, American, United, Southwest] (rough tie here), Frontier, Spirit
It depends on what you value. If you always fly first class, of course SW is worse. SW does offer a lot of direct flights that no other airlines offer. For me saving several hours in not changing planes is worth a slightly worse flight experience.
Also SW doesn't gouge you when cancelling or changing flights, unlike all the others.
In the Midwest I know a ton of business travelers (>1 trip/month) who default to Southwest. The weekly travelers tend to prefer the legacy carriers if they're available (so I'm sure there's something to your point even here) but IME Southwest rules the monthly/short notice flyers around me.
Everyone in my family preferred to fly southwest (at least before this round of incidents). Among other things, this was based on a history of better customer service especially in extenuating circumstances. They've been at the top of e.g. JD power customer satisfaction rankings for years.
If you're trying to say no one flies economy because it's their choice... I could certainly afford not to, but I _choose_ otherwise.
This had to be before spirit and frontier came onto the scene.
zero damage. This will all be forgotten in a matter of weeks. ppl will continue flying with whoever has the best rates.
What causes things like this is overbooking. Overbooking happens in more than just butts in seats. If you're not going to have enough ground crew to handle your SLA on flights, flights should be preemptively cancelled to keep them under that SLA. Airlines keep that SLA as low as safety permits. The max is set by the FAA at 3 hours for domestic flights and 4 hours for international flights (both ends).
Southwest's software is doing its job; it has adjustable tolerances. It will even take into account weather conditions reducing ground crew, therefore raising the SLA. But the effect of the weather conditions on the ground crew is also adjustable by humans. As a matter of fact, while one would think it would just reference historical data, that's not exactly true. It references predictive models that are adjusted by experienced individuals. Those individuals can be ordered to adjust the parameters outside their honest assessment to allow for steps at the beginning of the process to operate smoothly.
Weather prediction has a horizon. Southwest allowed excess bookings beyond that horizon, or they allowed excess last minute bookings. This weather event was massive and one-sided, yes, but it was also completely predictable. A human made the decision to widen the guardrails. There aren't a ton of people allowed to do that, I can think of maybe 3.
Disclosure: I wrote software for SWA many years ago. Nothing I've said here is privileged, in fact most airlines operate this exact way.
I consulted for a few large carriers and this was my conclusion also. In fact most very large American enterprises have all the same issues. With airlines the problems simply become really visible all at once really quickly when things go wrong.
For the folks here on HN whose experience is mainly in Silicon Valley: it's hard to appreciate how little the execs at these companies care about software. They don't care, not even a little bit, and they definitely don't care what your opinions are. Their priority is growth, the stock price, and answering to the board (not necessarily in that order). The only carrier I heard about that cared for their IT was Continental and they were bought by United.
Compounding the issue is how staffing works at these companies. Being a full-time employee at an airline is considered attractive because you get benefits including cheap airline tickets. My understanding is that the airline ticket perk used to be much more awesome than it is today.
In any case, airlines bend over backwards not to have full-time employees. Depending on the company they have a vast army of contractors (on shore and off shore) who are on a revolving door policy lasting from 6 to 18 months. These folks come in, get trained up for a few months, turn around tickets in a grueling and dehumanizing environment, then they get to take a hike for a while until they can come back on another short-term stint. I had a colleague who had been full-time at SWA, he said his job literally only consisted of training people up on the systems, he rarely wrote much code himself, he was there to 'keep it all together.'
But honestly, the current crisis is not a surprise. The IT systems at big American enterprises are truly horrific. It is decades of homegrown software "integrated" with decades of acquisitions where systems are smashed together on short timelines in service of quarterly goals.
If you want to find the true culprit here, look up at the broader structure of the economic system. This mess is created by how we run the economy.
Assuming this is the case, where folks can override during an incident like this, it sounds like the simulation tooling doesn’t have a human loop interface to demonstrate potential effects post override.
Does that type of simulation interface exist in the industry? And would it have helped?
They need a "reset" to spend a few days inputting as much of that data as possible into their computer system, at which point they can start operating with some level of efficiency and start functioning again.
Some observations, though.
- Via the departure control, manifests, etc...they DID know where their crews went. Any employee flying non-rev, deadhead, or on duty is recorded by employee number. There was likely a problem with keeping that synced with the crew management platform.
- Large parts of this have little to do with technology. Point to point versus hub and spoke means less slack. And recently, SWA has removed even more slack trying to drive revenue with aggressively optimistic schedules, overbooking, tight crew slack, etc. If they didn't pull back the schedule in small, manageable pieces like the other airlines did, there's no system that would save them. Seems likely they didn't pull back enough before the storm. There's a brink you can't go past, and a rate of cancels you shouldn't exceed. You have to guess right earlier.
- Crew unions have historically negotiated out anything that looks like big brother, location tracking for example...even on-demand "I'm here". So lots of things that could have made this better didn't exist, on purpose.
- A fair amount of the reset is just practicalities of using planes empty of pax to get rested crews staged in the right cities and bags to the right places.
One of their pilots once confided the reason he had to have them plant a drink cart in the aisle while he popped out to use the bathroom was because they're given such little time between connections they can't even stop at a urinal. (<15 minutes)
They flew too close to the sun, and they had a cascading failure. Womp womp!
> Unlike competitors that use a so-called hub-and-spoke system to funnel passengers to large airports, Southwest is focused on point-to-point service, flying the same aircraft — Boeing Co. 737s — on trips that may hopscotch around the US.
With a hub and spoke system, all the planes go from A to HUB and then from the HUB to somewhere. If the route A-HUB gets saturated, they can put more planes on that, and those planes can always be found at the HUB. This applies to crews too.
You'll have something that looks like this: https://www.airlineroutemaps.com/maps/Delta_Air_Lines/North_...
This comes at the cost of having oversupply at some spots, and it's harder to offer the "ideal" routes as everyone needs to transfer to another plane with a layover somewhere... and your baggage is more likely to get lost. There's a bit not to like as a passenger on such an airline unless it's a nice one leg route - but then who wants to fly to Detroit?
Southwest is different - they go from anywhere they want to anywhere they want with non-stop flights, picking the most lucrative routes they can. This lowers the effective cost per flight, and the flights are likely to be non-stop. Everything that a customer wants.
Southwest routes from 2001 https://www.flickr.com/photos/erussell1984/15863298679 - you can see the lack of hubs there.
This is done through some crafty constraint programming to try to make sure that all the capacity is where it needs to be when it needs to be there.
However, when the capacity hits "holidays - everything is at max", along with "big storm prevents flights from going to where they need to be for the next leg" this system breaks down and planes and crews are out of position or need to sleep. Their software was able to handle this constraint system when it was a smaller company with fewer routes - there were fewer constraints.
The "reset" is not "shut down the computers and start them back up" but rather "let all the crews get their required sleep and then go to the spot where they need to be in order to handle the load - not where they currently are (out of position)".
Hypothetically, this is solvable if you have enough compute... but that's a lot of compute that needs to be recomputed each time something changes (weather, crew gets sick, passenger load changes) and that ends up being impractical and expensive.
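To give a feel for why "enough compute" is a big ask, here is a deliberately naive sketch that assigns crews to flights by trying every ordering. It is not how any airline's scheduler works (real systems use constraint solvers and heuristics); the crews, airports, and hour limits are invented, and the only constraints modeled are "crew is at the departure airport" and "crew has enough duty hours left".

```python
from itertools import permutations
from math import factorial

# Hypothetical data: crew name -> (current airport, duty hours remaining).
crews = {"crew_a": ("MDW", 6), "crew_b": ("BWI", 2), "crew_c": ("HOU", 8)}
# Hypothetical flights: (departure airport, flight hours).
flights = [("BWI", 2), ("MDW", 5), ("HOU", 3)]


def feasible_assignments(crews, flights):
    """Yield every one-crew-per-flight assignment that satisfies location and hours."""
    for ordering in permutations(crews):
        if all(
            crews[name][0] == origin and crews[name][1] >= hours
            for name, (origin, hours) in zip(ordering, flights)
        ):
            yield dict(zip(ordering, flights))


solutions = list(feasible_assignments(crews, flights))
print(solutions)

# The number of orderings to try grows factorially with the number of crews.
for n in (3, 10, 20):
    print(n, "crews ->", factorial(n), "orderings")
```

Three crews mean six orderings to check; twenty crews already mean about 2.4 quintillion, and every weather change or sick call restarts the search. That is why practical schedulers lean on pruning and heuristics rather than exhaustive recomputation.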
---
Related reading:
NYT - What Caused the Chaos at Southwest : https://www.nytimes.com/2022/12/28/travel/southwest-airlines...
WSJ - How Southwest Airlines Melted Down : https://www.wsj.com/articles/southwest-airlines-melting-down... --- https://news.ycombinator.com/item?id=34165791
When it becomes non-perfect and you get a number of events that throw it off (like storms preventing certain legs from being completed, so the plane never reaches the proper spot, and holidays causing disproportionate load in certain parts of the network), then everything gets messy.
However, you can get to a very good spot with heuristics. This particular issue with SW looks like bad data collection (crew has to phone in their location!) combined with a lack of reserves, bad weather, and a holiday surge.
What I think is happening with SW is their planes are like pilots in Elite Dangerous, flying around the country to random destinations as the opportunity arises. This is not what point-to-point means to me, this is more like... sky Uber?
What they don't have is a crew that does MDW - ABQ - MDW - ABQ - MDW - ... but rather a crew that hypothetically does MDW - ABQ - HOU - BWI - MDW . The exact route is based on what capacity people have purchased on different legs and may change from loop to loop based on demand.
However, this means that if there's something that disrupts part of that (a storm cancels all the flights in MDW and HOU for a day) and there's an odd demand pattern, the plane that needs to do the BWI to MDW route is currently in ABQ. And while there is a plane in BWI that could pick up the leg of BWI to MDW, that crew is currently on a mandatory rest period... but you could use the crew that is scheduled to be deadheaded in from AUS to DCA (and then send them by ground to BWI)... but that plane is delayed.
At some point they say/said "stop, cancel everything - demand is now 0, for all planes, reposition the planes so that we can start flights on Dec 29 according to the purchased demand."
1. Large holiday volume of flights AND a lot of cancelations due to weather happened at the same time.
2. Their software system was unable to recommend optimal next steps for flight crews because it had not been designed to take this specific situation into account.
Other airlines also had problem #1. But since they mostly operate on a hub model it is easy to tell their flight crews where they should go once the weather gets better.
It reminds me of the long history of failures of the post-Soviet passenger railroad booking system.
They constantly have issues; it is just impossible to reliably buy tickets in the peak season.
The situation has become much better in recent years (to be honest, we just don't see the failures), simply because of the free market: the railroads gave up a large share of the load to other transport (air, buses, private cars), and a large share of tourists go abroad.
First, the old Soviet style of management, unfortunately preserved in a government-backed monopoly.
Second, constantly underpaid computer/software departments.
How about their crews? In-house mobile-phone app?
The Southwest telephone-based system sounds super low-tech. AirTags might work better.
Airlines would like nothing more than to be able to fly passengers in drones and fire all their flight crew.
The software failed; just jerry-rig something up to get people where they need to be, and let the accountants and programmers clear up the mess after the fact.
[Edit]Software helps optimize things, but as long as the crews and suppliers have sufficient faith that they will be treated fairly, you could run this all with paper and pen for a few days while the software gets straightened out.
Decide which hubs are most important, and have someone work out what goes where by hand. Once those routes are up and running, you can work outward to the less trafficked airports. Just work on getting the most people to the places that help the most.
Crews know how many hours they've worked, and can track that themselves for the moment, or perhaps the Captain could do that. Everyone has cell phones, and could route around this damage.
The consensus here on HN seems to be to give in without trying, which I find disturbing.
I think they probably are trying to jerry-rig a system, but the airline industry is heavily regulated for safety reasons (a good idea that has been extremely successful), so it's very difficult to get a plane in the air if you don't know what the fuck you're doing.
How many minutes for that someone to work out where a plane is, how to crew it legally, and where it should go?
How many planes, crews, and flights are backlogged?
Also, there are 11 hubs if I count correctly[2], so an average of 71 planes/hub.
Surely a team of people could manage that amount of information in each hub on a temporary basis.
the rules for this are insanely complicated and practically require automation to keep track of if you have more than a small handful of crew members.
https://www.ecfr.gov/current/title-14/chapter-I/subchapter-G...