Here’s the unit I ended up with: https://github.com/CGamesPlay/infra/blob/master/private-serv...
There's an idea in software that's a bit like the corollary of Chekhov's gun. Chekhov's gun says not to introduce story elements that will never come to pass. But it's nearly as jarring to keep an important story arc a complete surprise until the end: producing the gun moments before the curtain goes down would be quite a WTF, because we didn't know that was a possibility. Some storytelling does trade in that kind of surprise, but it's a niche.
Introducing things early fights with Locality of Reference, but when we're talking about things of deep, dramatic importance (like an attempted murder, or a reaper process) it's important to introduce that "character" early in the story so that people know that it exists. Failing to do so is a form of deus ex machina and we only appreciate that in very small doses.
So framed that way, I don't see a problem with having to start a killswitch while you're spinning everything up. It's there, people can see it, and know to ask questions about it.
[Unit]
# https://stackoverflow.com/questions/36729207/trigger-event-on-aws-ec2-instance-stop-terminate
Description=unlink agent from remote server
Before=shutdown.target

[Service]
Type=oneshot
EnvironmentFile=-/etc/environment
KillMode=none
ExecStart=/bin/true
ExecStop=/opt/service-name/shutdown-unlink
RemainAfterExit=yes
User=root

[Install]
WantedBy=multi-user.target
If I recall correctly, the KillMode=none is important as it causes the shutdown-unlink binary to escape systemd's process supervision. Without it, you may find systemd immediately halting your shutdown unit (and killing the process) when it hits the shutdown target.

"KillMode=none" shouldn't be necessary, unless your command leaves processes running that need to stay running (and in that case you have other problems, since those processes may not actually finish before the system is shut down). Systemd waits for your service to stop before reaching shutdown.target (per the default Before= and Conflicts= dependencies).
Since your command probably needs networking, you also want "After=network.target" to ensure your command runs before the network is torn down, as covered in sibling comments.
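As a sketch, that could be a drop-in override next to the unit above (the unit name here is a placeholder):

```ini
# /etc/systemd/system/unlink-agent.service.d/ordering.conf  (hypothetical unit name)
# Stop order is the reverse of start order, so starting After=network.target
# means this unit is stopped before networking is torn down at shutdown.
[Unit]
After=network.target
```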
You have to deal with crash/power failure recovery anyway. So do your housekeeping on startup. Shutdown should be a quick and simple termination.
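One minimal sketch of that pattern, using a clean-shutdown marker file (all paths and names here are invented for illustration):

```shell
#!/bin/sh
# Startup housekeeping sketch. The marker is written by a graceful shutdown
# and removed at startup, so its absence at boot means the last stop was abrupt.
MARKER=${MARKER:-/tmp/myapp-clean-shutdown}   # invented path

if [ ! -f "$MARKER" ]; then
    echo "unclean shutdown detected, running recovery"
    # replay the journal, clear stale locks, rebuild caches, etc.
fi
rm -f "$MARKER"
```

The shutdown side then only needs to `touch` the marker; everything expensive lives on the startup path, where a crash can always re-run it.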
The 'most' there is doing some heavy lifting
It is actually quite a common practice for those being audited for disaster recovery to do exactly that -- yank cables. More realistically, flip some switches.
We do it once a year, set aside a region and time... then test our processes
It serves a few purposes, most importantly -- are our services fault tolerant, and can we bring them back?
I think it's reasonable to trap the signals and make a best-effort attempt, knowing that PID 1 (or the environment) will eventually have to SIGKILL you -- ready or not
Just because we can't save all of the state doesn't mean we shouldn't try
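A minimal sketch of that best-effort handler, assuming a plain POSIX shell agent (the state file and the loop are invented):

```shell
#!/bin/sh
# Best-effort checkpoint on SIGTERM. A follow-up SIGKILL cannot be caught,
# so the handler must be quick and the startup path must cope without it.
STATE=${STATE:-/tmp/agent.state}   # invented state file

# Run the "agent" in a background subshell so this sketch can signal it.
(
    checkpoint() {
        echo "checkpointed at $(date +%s)" > "$STATE"
        exit 0
    }
    trap checkpoint TERM INT
    while :; do sleep 1; done   # stand-in for the real work loop
) &
agent=$!

sleep 1               # give the agent time to install its trap
kill -TERM "$agent"   # what PID 1 / the supervisor would send
wait "$agent"
```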
Some do. And the rest occasionally forcibly reboot (kernel panic or hardware failure), need to be manually forcibly rebooted (due to a frozen UI), or unexpectedly lose battery, all leading to the same outcome. At least, that’s been my experience with just about every computer, phone, tablet, smartwatch, game console, and smart TV I’ve ever owned. Plus a number of routers. Is your experience different?
It is a table stakes expectation for most servers that they will not lose data when the power goes out, or when the kernel panics, or when the server itself crashes or runs out of memory. If your software requires graceful shutdown, that seems to imply that it will lose data in all those cases.
You can perhaps use graceful shutdown to perform some optimization that allows subsequent startup to go faster, e.g. put things in a clean state that avoids the need for a recovery pass on the next startup... but these days with good journaling techniques "recovery" is generally very fast. When that's the case, it's arguably better to always perform non-graceful shutdown to make sure you are actually testing your recovery code, otherwise it might turn out not to work when you need it.
So yeah, I agree with SoftTalker. Assume all shutdowns will be sudden and unexpected, and design your code to cope with that.
These things did happen, can happen, and will happen.
Even in modern cloud environments. AWS might consider the hardware your EC2 VM is running on unstable, prompting you to replace/move the VM within 24 hours (if it has not already been brought down by hardware failure).
Here's a contrived analogy: modern airplanes are designed to stay in the air even if an engine burns out, but we would still rather fly with both engines at full power whenever possible.
The article gives the examples that "A load balancer might stop accepting new connections and disable its readiness endpoint. A database might flush to disk. An agent might inform a cluster it’s leaving the group." All of these seem like they're worth doing, and improve expected case shutdown behavior, though you should also write and test the abrupt shutdown case.
It's easier said than done, of course, but crash-only software is a worthwhile goal IMO.
TimeoutStopSec=0
That's cost me more than one hard-power-button powerdown (on a desktop machine with the system partition on SSD - unnerving). One of the innumerable things that systemd stops on shutdown gets stuck - permanently - and the machine goes into a state from which, to my knowledge, the only way out is a powerdown or reset.
I ended up searching for the above and replacing them with a reasonable timeout (several minutes).
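A sketch of that fix as a drop-in override, so the vendor unit isn't edited in place ("example.service" stands in for whichever unit shipped TimeoutStopSec=0):

```ini
# /etc/systemd/system/example.service.d/stop-timeout.conf
[Service]
# Give the unit a bounded, but generous, window to stop instead of forever.
TimeoutStopSec=5min
```

After adding it, `systemctl daemon-reload` picks up the override.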
Someone asked about the opposite on Twitter: https://twitter.com/DannoHung/status/1585350836074446869
By definition, it sends a SIGTERM signal to all of the daemons that it started. But as the script wasn't started earlier and doesn't keep track of a running PID, you don't have a clean way to do it.
I don't understand why they only implemented the SIGTERM call without any alternative.
Edit: Note that my example runlevels are for Solaris, other UNIX/Linux OSes will vary.
It's not.
> you just need to drop a script into /etc/init.d symlink it from the appropriate rc directories.
You just need to drop a unit file into /etc/systemd/system/ and symlink it from the appropriate /etc/systemd/system/${target}.wants/ directories.
Don't tell me that "shutdown.target.wants" and "reboot.target.wants" are harder than "rc0.d" and "rc6.d".
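The manual version of that symlinking, sketched with an invented unit name (ROOT defaults to a scratch directory so the sketch can be dry-run; this is roughly what `systemctl enable` does for you):

```shell
#!/bin/sh
# Hand-rolled "enable" for a unit. ROOT defaults to a scratch directory for a
# dry run; set ROOT= (empty) to operate on the real system.
ROOT=${ROOT:-/tmp/demo-root}
UNIT=my-unlink.service   # invented unit name

mkdir -p "$ROOT/etc/systemd/system/multi-user.target.wants"
: > "$ROOT/etc/systemd/system/$UNIT"   # stand-in for installing the unit file
ln -sf "../$UNIT" "$ROOT/etc/systemd/system/multi-user.target.wants/$UNIT"
```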
A lot of the article is about ordering of dependencies (don't stop a dependency until after the dependent has stopped). Don't tell me that adding `Before=` and `After=` lines in the unit file is harder than having to remember all of the dependencies and manually figure out the correct "NN" for it all to work correctly.
A lot of the article is about either having your daemon handle SIGTERM, or coming up with the appropriate `ExecStop=` command. The same command you'd be writing in your rc script (the "handle SIGTERM" stuff being for if your rc script simply says `kill $PID`).
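Putting those two points together, a hedged sketch of such a unit (every name below is invented):

```ini
# /etc/systemd/system/my-agent.service  (all names invented)
[Unit]
Description=Example agent that must stop before the service it depends on
# Start after (and therefore stop before) its dependency; no "NN" prefixes.
After=network.target example-db.service
Wants=example-db.service

[Service]
ExecStart=/usr/local/bin/my-agent
# Optional explicit stop command; the default is simply SIGTERM to the main PID.
ExecStop=/usr/local/bin/my-agent --drain

[Install]
WantedBy=multi-user.target
```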
That is: The complex parts of the article are things that were complex with sysvinit too.
AT power supplies didn't have any mechanism for the system to tell the power supply it wasn't needed any more. So when you shut down the computer, it would wind up at a screen with a message approximating "it is now safe to switch off your computer", at which point the system would halt.
ATX power supplies added the ability for the OS to trigger an actual power off. But that's a different end-state to halting, and if you halt the system then it stays on. You may wonder why anyone would want to halt when power off is an option, and to be honest I'm not entirely sure -- possibly because you have a hardware watchdog which will trigger a reboot of a halted machine but not of a powered off machine?
I find that CentOS systems that I've used for a while seem to hang on shutdowns; halt -fp is a way to get them down quickly. It is important to terminate any sensitive processes beforehand.
systemctl --force [poweroff|reboot]
From the man page, this means that "shutdown of all running services is skipped, however all processes are killed and all file systems are unmounted or mounted read-only, immediately followed by the powering off."

Chrome is giving me an instant NXDOMAIN error.
Dig shows that
$ dig psdn.io @1.1.1.1
...
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19283
...
;; QUESTION SECTION:
;psdn.io. IN A
so then I prefix "www." like in the URL...

$ dig www.psdn.io @1.1.1.1
...
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 64024
...
;; QUESTION SECTION:
;www.psdn.io. IN A
;; ANSWER SECTION:
www.psdn.io. 300 IN CNAME poseidon-www.pages.dev.
Okay, fine:

$ dig poseidon-www.pages.dev @1.1.1.1
...
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 55471
...
;; QUESTION SECTION:
;poseidon-www.pages.dev. IN A
...wat?? (Where there's no ANSWER section, none was returned, just an AUTHORITY section.)
This is reproducible for me with 1.1.1.1, 8.8.8.8 and 9.9.9.9.
dig www.psdn.io @1.1.1.1
; <<>> DiG 9.16.33-RH <<>> www.psdn.io @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37362
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; QUESTION SECTION:
;www.psdn.io. IN A
;; ANSWER SECTION:
www.psdn.io. 126 IN CNAME poseidon-www.pages.dev.
poseidon-www.pages.dev. 126 IN A 172.66.45.44
poseidon-www.pages.dev. 126 IN A 172.66.46.212