He said that when he had recently graduated and just moved from Norway to Sweden, his first job there was as a technician for the local railway. One of his first assignments was a call where the computer controlling all the track switches had stopped working. Luckily there was a backup, but they needed him to fix it immediately.
He arrived and took a look at the computer and its backup running next to it. He started out by measuring voltages on both, comparing them to see what the difference was. After a while some men in suits came in and asked how things were going. He said it was all good; they said it wasn't, because now all trains had stopped. Apparently, he had short-circuited the backup.
"And that's when I decided to go for a theoretical career!" my teacher happily concluded. The classroom was left in a stunned silence.
I absolutely loved his classes. I bumped into him a few years ago when I was holding a recruitment thing for Ghost Games and it happened to be just after his lecture. I told him how much I enjoyed his courses.
"Those who can, do; those who can't, teach; those who can't teach work here."
-George Bernard Shaw
'Those who can, do; those who can't, teach; those who can't teach, teach teachers.'
I used to be so impressed at European train stations that they had a single sheet that would give the entire day's arrivals and departures, down to which track. I assume that behind that were schedules that weren't so far from being hand-calculated with equivalent sheets that told people and equipment where to be at what times.
I think we lose something when we jump so far forward that you cannot fall back gracefully without a system completely breaking down.
My guess is that the failure is related to signalling or some other knowledge that exists “between” the trains (and brains). Having systems that can negotiate right of way is necessary to increase the number of trains and the speed of each individual train beyond their pre-computer-era limits.
Having all trains stop for a day is probably a judgment call to lower the overall duration: a full stop is a pain in the neck, but you know when you're done. If you could not wait, you could probably have (quite) a few trains run without problems. However, you would then have to work for a very long time to get all the trains back to where they belong before resuming the regular timetable. That would be the preferable option during a time of crisis.
There was a lot of extra impact because this system also feeds the passenger information, so the app, the website, and the information screens at the stations were not working either, leaving passengers in the dark.
Do not forget that there might be several companies running on the same track: cargo and passenger trains, etc.
Managing a busy track is hell.
But if they don't work you don't take off, it's simply not worth the risk.
A crew is stranded in space because of a computer malfunction, and no one on board remembers how the systems used to be programmed.
It’s more than that: the trains are negotiated at political levels! The “8:18 to Marseilles” (fictitious example) could be a headline in the news if the region refuses to fund it; the workers’ union may have gone on strike to keep it, while the inhabitants’ HOA has negotiated with the city to keep it under 12€. They are the object of a convergence of fixed interests. They run for generations: the train I took as a kid is still arriving on the same track today (and at one stop, it stops at track E; tracks A-D were dismantled but never renumbered because of this legacy).
The good thing with rails is that they aren’t going away, and trains can be negotiated for decades, they’re far from being scheduled on-the-fly.
As another poster mentioned, there are many, many independent variables in a railway system. In the old days, people recovered from problems by gut feel, and experience worked well enough, although there is usually no "right" answer. The newer systems apply some statistical analysis and try to make the best decision, usually with humans doing the last part.
I'm not sure about "because safety", though, since signalling systems protect trains from each other, and they only have automation in the sense that the first train that arrives gets its route set; the system cannot make trains crash unless there is a critical bug.
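That first-come-first-served route setting can be sketched as a toy interlocking. This is my own illustration of the principle, not any real signalling vendor's logic: a route is granted only if none of its track sections are already locked by another train, so even with no central scheduler, conflicting movements are simply refused.

```python
# Toy interlocking sketch (illustrative only): the first train to request a
# route locks its track sections; later conflicting requests are refused
# (the signal stays red) until the holder releases the route.

class Interlocking:
    def __init__(self):
        self.locked = {}  # track section -> train currently holding it

    def request_route(self, train, sections):
        """Grant the route iff no section is locked by a different train."""
        if any(self.locked.get(s, train) != train for s in sections):
            return False  # conflicting route already set
        for s in sections:
            self.locked[s] = train
        return True

    def release_route(self, train):
        """Free every section held by this train (route completed)."""
        self.locked = {s: t for s, t in self.locked.items() if t != train}
```

With this, two trains whose routes share a section can never both be cleared at once, which is the "cannot make trains crash" property the comment describes.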
That said, train systems themselves usually maintain a ton of active operational procedures that remain part of staff training and would theoretically allow them to be more resilient; these procedures are kept current, and you will often see them used in emergencies. E.g., on automated systems, trackway signaling often sits in an unlit "off" state, and if the automated train control system fails, it can often be lit up for human-controlled movement of the trains (usually required to be done at reduced speed) during irregular operations, like bringing stuck trains to the end of their lines and letting the people in them out at the next station. And any trains that still have a human conductor in a cab (or a cab for a human conductor at all) have procedures for that human to operate the train directly, even on segments of track where automated train control otherwise controls all train movements.
We could try to run these systems in these degraded states under these emergency procedures, but most modern systems have safety analysis and engineering that focus on bringing the system to a safe halt state, fixing the underlying issue, and returning to full operational state, rather than messing around with rolling along in degraded states with an unknown but presumably increased potential for catastrophic engineering failures, with outcomes including loss of life.
(In the US it is generally illegal for a human being to operate a train unaided by some form of train control above 79mph. Most other systems have similar rules. We no longer trust humans themselves to operate these vehicles just because so many human lives can be at stake when they fail.)
Source: I did a stint working on system security and a little bit of electromagnetic-compatibility safety engineering for rail systems. It was enlightening to see how the capital-E engineers I was working alongside handled these concepts and designed risk out of modern systems. The bulk of my work was in the US, but the parts that weren’t were on systems that spanned multiple continents and/or train cars which were distributed worldwide. (To my knowledge none of my work has ever related to the system discussed here.)
I would appreciate one of those post-outage "what happened" reports like with the AWS and Facebook outages last year. But outside of IT I don't think anyone really expects those over here. And there might be national security considerations preventing such disclosure until any chance of a repeat has been engineered away, at which point everyone will have forgotten.
> It affected the system that generates up-to-date schedules for trains and staff.
...boy oh boy did trying to look into what NS uses for crew scheduling ever send me down a rabbit hole.
I don't know if the systems have changed but while poking around online I found this[1] doc about the Netherlands' timetable revamp around 2006 and it talks about the complexity of TURNI-- their on-the-fly crew scheduling system.
> A typical workday at NS includes approximately 15,000 trips for drivers and 18,000 for conductors. The resulting number of duties is approximately 1,000 for drivers and 1,300 for conductors. This leads to extremely difficult crew scheduling instances. Nevertheless, because of the highly sophisticated applied algorithms, TURNI solves these cases in 24 hours of computing time on a personal computer. Therefore, we can construct all crew schedules for all days of the week within just a few days.
Then I found more detail about TURNI's implementation in this[2] paper about optimizing crew scheduling for timetables.
> In the railway industry the sizes of the crew scheduling instances are, in general, a magnitude larger than in the airline industry. Moreover, crew can be relieved during the drive of a train resulting in much more trips per duty than typical in airlines. In other words, the combinatorial explosion is much higher. The latter has made the application of these models in the railway industry prohibitive until recently.
Cool stuff.
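For intuition, crew scheduling of this kind is commonly modeled as a set-covering problem: pick a minimum-cost subset of candidate duties so that every trip is covered. TURNI's actual approach (per the paper, column generation on enormous instances) is far more sophisticated; the sketch below is just a greedy heuristic on toy data of my own invention, to show the shape of the problem.

```python
# Toy set-covering heuristic (NOT TURNI's algorithm): repeatedly pick the
# duty with the best cost per newly covered trip until all trips are covered.
# Assumes the instance is feasible (some duty covers every remaining trip).

def greedy_crew_schedule(trips, duties):
    """trips: set of trip ids.
    duties: dict name -> (cost, set of trips that duty covers).
    Returns a list of chosen duty names covering all trips."""
    uncovered = set(trips)
    chosen = []
    while uncovered:
        name, (cost, covered) = min(
            ((n, d) for n, d in duties.items() if d[1] & uncovered),
            key=lambda item: item[1][0] / len(item[1][1] & uncovered),
        )
        chosen.append(name)
        uncovered -= covered
    return chosen
```

At NS scale (~15,000 driver trips per day), the candidate-duty pool is astronomically large, which is why the paper resorts to column generation rather than enumerating duties up front like this sketch does.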
Finally, gleaning from ns.nl's careers page[3], everything else in their IT land outside this system runs on SAP (likely including the actual distribution of the crew-scheduling output), so if I had to gamble I'd say the failure happened somewhere in the integration between them.
[1] https://homepages.cwi.nl/~lex/files/Interfaces.pdf
[2] https://repub.eur.nl/pub/11701/ei200803.pdf
[3] https://werkenbijns.nl/werkgebieden/it/sap-specialist-bij-ns...
sidenote: If anyone out there is an SAP specialist ns.nl looks like a pretty great place to work: 36hr week, 5 weeks vacation, pension, and free unlimited 2nd class + low cost 1st class train travel.
Those are pretty standard terms here in .NL. 36 hours is considered fulltime by most employers, including government and semi-government. Pension is offered almost everywhere and at least 4 weeks vacation is the legal minimum.
Maybe except for the _unlimited_ train travel but most employers do offer free train travel between home and work.
This also means that if an IT system goes down on a Sunday, a lot of employees won’t even pick up the phone until Monday 9 AM.
I have a reflex repulsion for off-the-shelf "enterprise" software. You pay millions of dollars to subscribe to a suite of (seemingly) bad, bloated software that takes more work to customize and integrate than it would to fully replace. An army of developers trying to keep a business running with the world's largest Swiss Army knife.
My gut tells me that big names like SAP and Salesforce are easier to sell to executives, and they make a company seem like it knows what it's doing. And maybe feeling pressure to choose a "respected" brand leads to fewer alternatives? Or maybe all the alternatives are quickly bought out by the behemoths. I really don't know. Maybe someone can enlighten me. It just seems like an enormous turd of inefficiency to me.
1 hour more than the standard French contract, and there are also far more holidays here.
> free unlimited 2nd class + low cost 1st class train travel.
.. okay, that's actually a great perk. IIRC the French national railway company, SNCF, provides only a limited number of tickets per employee (like 10 per year? someone please correct me if I'm wrong). Their UX is also atrocious, though: the websites are just shit at workflows, the apps are weird and break easily, the ticket machines are OK but have really bad touch screens, and the train announcement screens each run Windows XP (with literally hundreds, if not thousands of them in big train stations). From what I recall, NS was acceptable.
For example, The Salvation Army is one of the biggest jokes in all of charitable history.
I won't go into details here, but I would NEVER give a dime to the Salvation "army" (of free slave labor under the guise of "salvation"). It's tax and money fraud to its core; I went through a SHIT ton of financial documents to see what their true operation is disguised as. AVOID.
TL;DR: TSA uses its place as a church-like service to get special tax incentives; even the buildings in SF were donated to them. They run a for-profit grey market of boutiques that they supply with "donations", and they have running trades with Chinatown in SF. They take subsidies, donations, and other benefits. They require all those they "help" to sign up for state benefits, sign the benefits over to their org, and then they take the benefits and feed donated and expired processed food to their labor force.
They only take in the able-bodied, force them to go to church every Sunday, and make them use external addiction help such as AA, making it clear they are not a treatment organization.
They maintain a very small number of actual employees and play on the egos of those in their "program", using military-style rankings to appease the egos of the ignorant...
They then profit like mad, corruption is rampant, and they funnel all their "good" donations to their side hustle of boutiques and other grey-market sales.
It's a f'n racket.
---
The point being that if you do due diligence on a company/org, you can often learn a lot from public financial info, comments from even just a few employees, etc.
And we can assume that a position in the IT department just opened up...
https://en.wikipedia.org/wiki/2017_Washington_train_derailme...
Thanks!
The other part of my guess is that they have a SQL db which is /likely/ stored on a Windows Server instance. I’d even surmise that this instance may be hosted on Azure, but that’s speculation, not a good guess.
On Sunday there is also not as much pressure to get things running as there would be on Monday. Or perhaps this is the incentive to pay now and get things fixed in time for the work week? In that case Friday night would again have made more sense, unless the attackers have some very specific insight into how much slower restoring without paying is.
My alternative theory is an expired certificate that makes some core systems just not talk to each other anymore. The announcements on the stations, for example, were also out, and they lost control of the mobile apps (couldn't make the apps show that trains didn't run, the in-app scheduler showed all was A-OK), and that sounds quite dissimilar from the core train operation service, making me think it's more of an infrastructure than a specific system's problem. On the other hand, once the app could be controlled again (assuming this singular underlying cause), you'd think they could then also start putting train service back in place and that didn't happen for hours still.
I can't really make the pieces fit together for any theory, so then presumably something multi-faceted (one thing tripping one or two other things so it escalated from restarting one component to not being able to restart the trains anymore the whole day).
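If the expired-certificate theory were right, the irony is that it is one of the cheapest failure modes to monitor for. A minimal sketch of such a check (entirely hypothetical, not NS's tooling; the hostname and thresholds are made up):

```python
# Hedged sketch: alert before internal-service TLS certificates expire.
# cert_days_remaining() parses the 'notAfter' string format that Python's
# ssl.SSLSocket.getpeercert() returns, e.g. 'Jun  1 12:00:00 2030 GMT'.
import socket
import ssl
from datetime import datetime, timezone

def cert_days_remaining(not_after, now=None):
    """Whole days until the certificate expires (negative if already expired)."""
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def check_endpoint(host, port=443):
    """Connect, fetch the peer certificate, and report days remaining."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return cert_days_remaining(tls.getpeercert()["notAfter"])
```

A cron job calling `check_endpoint` for each internal service and paging below, say, 14 days would catch this whole class of "everything stopped talking to everything" outage.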
Well, as long as you're clearly stating _this_ part of your otherwise validated and highly likely pre-post-mortem is speculation... </s>
That especially. I wonder how expensive that thing is going to be.
This was, however, not due to a computer glitch, but to problems in SBB's electrical grid, which led to overload of some of the lines and then to a complete shutdown of the entire grid.
Warnings were displayed on control consoles, but were drowned out among thousands of other meaningless warnings (there's a lesson for a UI designer here).
I was in the ICE from Basel to Zurich at the time and after fuming for a few minutes I figured: "What can you do?" and moved to the bar wagon. There we had something of an impromptu party until there was no more power for the cash register and they stopped sales.
In the end it was a (relatively) funny adventure for most and a huge embarrassment for SBB. I don't think they like to be reminded of that fateful day.
[0] https://www.swissinfo.ch/eng/swiss-train-network-shuts-down/...
Funny how the entire system has ground to a halt and the bartender is still worried that their cash register receipts and inventory won’t reconcile and shuts that part down too.
It can even be considered a national security issue in the case of food, medicines and other vital services.
Not a native English speaker. So it's easy for me to confuse the terms. You're right, of course.
Note: I do not consider words like 'unfortunately' and 'extremely unpleasant' apologetic. Also, I understand there are cultural differences between the Netherlands and the USA- no joke.
How would "We are truly sorry for this incident" make it any better?
I do mind the fact that they don't pay the people their money back if they try to find alternatives to get to their destination. That's 100% unacceptable.
Is there any reason to assume that this is a fact? Usually in these situations, NS is pretty good about paying people back.
/s
About the only thing they could maybe have done was to fall back on back-and-forth shuttle trains between major node stations. But that would only provide a very basic service, with incorrect train materiel being used for the normally expected number of passengers. It would be just enough to get passengers and personnel home. I would like to see something like a pen-and-paper schedule that they can fall back on, but they don't have it. (They did something like that at the first corona lockdowns, although with more planning involved.)
Also probably worth noting: trains here are used fairly frequently. Unfortunately, though, this isn't the first time trains have stopped - snow, for example, is a common reason for delayed/reduced service. (Snow isn't very common here in the NL.)
That’s ludicrous, I wish people helping other people this way was more common.
Better to force the operators to fix the boom arm / whatever other cause of power glitches with fault codes like this. Even if they false-trip when there are grid level outages.
Being unable to accelerate or unable to use regenerative braking is not in any way a safety problem. Either situation can happen at any time. (Electric trains have banks of resistors used to dump the unwanted kinetic energy when it's not possible to return it to the power supply, and this was the normal method for decades before regenerative braking became common.)
For instance, running high-speed rail without computer signalling would be a huge risk there's no sense in taking (due to the speeds involved, by the time you see a problem like a train or damage on the tracks in front of you, it's too late to stop).
Scale creates problems that are partially solved via software systems, which only handle a subset of the existing variety and the new variety introduced by scaling. This makes the systems leaky with regard to unhandled variety and incredibly rigid (i.e., not robust) against exceptions, exceptions that could often readily be handled by humans. Because software is a snapshot of some small group of humans' knowledge of a system, it is completely unable to respond to live operational exceptions. Stafford Beer's work is interesting background on this type of thing.
> A much more efficient automation.
There are many, many examples where that is true and many where it is not. For the not-true cases, just see literally any customer service interaction with a large company in the past half decade. I suppose this could be arguable, though, if one asks "efficient for whom?".
The usage of certain lines at a certain time is calculated in periods of 6 seconds!
We're waaay beyond normal human coordination capacity.
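To make that concrete: if line usage really is planned in 6-second slots (the slot length is the commenter's figure; the model below is my own illustration), then conflict detection reduces to checking that no two trains claim the same (segment, slot) pair.

```python
# Illustrative sketch: discretize track occupancy into 6-second slots and
# flag any pair of trains whose plans claim the same segment in the same slot.
SLOT = 6  # seconds per planning slot (figure taken from the comment above)

def claimed_slots(segment, enter_s, exit_s):
    """All (segment, slot-index) pairs occupied between enter_s and exit_s."""
    first = enter_s // SLOT
    last = -(-exit_s // SLOT)  # ceiling division
    return {(segment, t) for t in range(first, last)}

def conflicts(plans):
    """plans: list of (train, segment, enter_s, exit_s).
    Returns the set of clashing train pairs."""
    seen, clashes = {}, set()
    for train, seg, enter_s, exit_s in plans:
        for slot in claimed_slots(seg, enter_s, exit_s):
            if slot in seen and seen[slot] != train:
                clashes.add(frozenset((seen[slot], train)))
            seen[slot] = train
    return clashes
```

With tens of thousands of trips per day, each spanning many segments and slots, the size of this occupancy table is one way to see why the planning is indeed "waaay beyond normal human coordination capacity".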
It's now Monday morning here and the trains are back to normal in time for rush hour. So it's not really a big deal.
> The IT failure occurred at the end of the morning. It affected the system that generates up-to-date schedules for trains and staff. This system is important for safe and scheduled operations: if there is an incident somewhere, the system adjusts itself accordingly. This was not possible due to the failure.
If I had to guess, it probably was an issue with scheduling/timetable software rather than anything to do with the trains, rails, etc. Nothing exceedingly serious or difficult to correct.
"Due to the enormous impact of the failure in the IT system, it is unfortunately not possible to run any trains today." // => "Due to the enormous impact of the failure in the IT system, NS cannot run trains today."
"Although the cause of the failure has now been resolved, the impact is considerable. To be able to start up reliably, systems must be updated and trains must be brought to the right place. That takes time. For our passengers, this is extremely unpleasant news." // => "Although we have fixed the cause of the failure, the failure impacted most of our customers. To fix this failure so that trains would start reliably, we needed to return all trains to a central depot, which took nearly 12 hours. We apologize for the extremely unpleasant interruption."
"The expectation is that tomorrow morning the normal timetable can largely be resumed. The night trains can still run." // => We expect to resume minor operation with night trains first and then major operations tomorrow morning.
"The IT failure occurred at the end of the morning. It affected the system that generates up-to-date schedules for trains and staff. This system is important for safe and scheduled operations: if there is an incident somewhere, the system adjusts itself accordingly. This was not possible due to the failure." // => The IT failure occurred at 11am in the scheduling system, a major requirement for safe operation; if train operations are interrupted at one location, the system normally reroutes other trains to prevent collisions. This system failed.
"The international trains are not affected by this failure. For information about the timetables of other transporters, passengers can consult the websites of these transporters." // => The failure did not affect international trains. If you have an inquiry about other transportation services, please consult those services.
"The journey planner is updated." // => We updated the journey planner accordingly to account for the interruptions.
// Generally, the use of passive voice indicates poorly taught grammar; however, the use could also indicate a desire to avoid responsibility.
"Who are we? NS is active in the public transportation sector. We encourage the use of public transportation and keep the Netherlands moving. Our travellers are our 1st, 2nd and 3rd priority in all of our activities, and we do our utmost to make their trip as pleasant and sustainable as possible from door to door."
This is also fascinating. I would expect to find "NS (Nederlandse Spoorwegen) provides rail services for the entire country". Instead they choose vague grammar, more the idea of NS rather than what NS actually is.
"I am a programmer" => "I encourage the use of computers to provide solutions for common business problems". Just a strange use of grammar.
The soft dictatorship works very well, Hungary will go even more down in the future. I'll probably leave Hungary before Hungary leaves EU/NATO and gets into a war.
So Hungarians residing in the Netherlands would have been able to vote at Hungarian embassies/consulates in the Netherlands (if they made it to the embassy).
edit: It was a joke. :/
Other systems that might normally provide safety critical redundancy could be providing the sole measure of safety, with no other redundancy available in case one of those fails.
“Unsafe” is always defined based on context.
There was an incident a few years ago where a LIRR track crew member stepped in front of a train going 78mph and died.
https://www.ntsb.gov/investigations/AccidentReports/Reports/...
The employee that died had been on duty for 38 of the 50 hours leading up to the fatality. One of the NTSB's recommendations is to implement software that avoids worker fatigue:
"The FRA encourages the use of certified biomathematical models, such as the Fatigue Audit InterDyne Model and the Fatigue Avoidance Scheduling Tool (FAST) by railroads to help them develop work schedules for safety-sensitive employees that align with healthy work-rest scheduling practices; however, these safety measures do not apply to roadway workers. 14 The work schedules developed through biomathematical models avoid many pitfalls causing worker fatigue that arise from excessively long work hours, highly variable work shift times that disrupt human circadian rhythms, and infringement on sleep opportunity times."
TL;DR: it's not about the trains crashing into each other. It's about sleepy people making bad decisions that gets them or their colleagues killed.
Conspiracy theories are so fun.
If it is true that the Netherlands' and Italy's transport infrastructure were also down, I wonder if Russian-supported hackers are to blame.
It's been Russian hackers so many times in the last few years (Hunter Biden's laptop, Trump elections, etc...) that it's a sure bet this time it's the same.
/s
Nobody ever got fired for blaming Russians.
See also: assassinations - maybe the deaths of the top Polish government and military officials and of Total's CEO in plane crashes were just freak accidents, but a nagging suspicion that Russia was involved will always remain...