It's a cool feature that's no doubt amazing for applications that need it, but it brings a fair amount of complexity vs other deployment strategies.
https://elixirforum.com/t/how-to-tweak-mix-release-to-work-w...
> I’ve spent some time understanding how to do hot code reloading with releases built using mix release, and here I’d like to detail the steps needed, in hopes that it will help someone.
Allows the clients to remain connected and be none the wiser that there was an update at all.
For larger updates, when in-memory data structures or the supervision tree change, we just do hard restarts.
* The server runs in a docker container which has an ssh server installed and running in the background. The reason for SSH is simply because that's what edeliver/distillery uses.
* The CI (a local GitHub runner) runs in a docker container as well, which handles building and deploying the updated releases when merged on master.
* We use edeliver to deploy the hot upgrades/releases from the CI container to the server container. This happens automatically unless stopped which we do for larger merges where a restart is needed.
* The whole deployment process is done in a bash script which uses the git hash for versioning, edeliver for deploying and in the end it runs the database migrations.
I'm not going to say it's perfect but it's allowed us to move pretty damn fast.
https://www.youtube.com/watch?v=XQS9SECCp1I
But I almost never hear Erlang/Elixir/Gleam folks talk about this benefit of the Erlang VM now, even though it seems fairly unique and interesting. Has the community moved away from it? Is it just not that useful?
But if you really need it, it's really great to have that option (e.g. very long running systems which are split in front/back etc), and it can be used in creative ways too (like the Drone example).
Here is a lightning talk I gave about how to use hot-reload for music / MIDI interactions: https://www.youtube.com/watch?v=Z8sGQM6kLvo
"…thanks to hot reloading, which — for once — is useful…"
That seems to sum up the sentiment that hot swapping in Erlang has uses but they're generally not aligned with what Erlang is typically employed for. It seems like it would be great for tight game dev loop feedback and iteration too, for example, but that's not a traditional use of Erlang either.
Most people are probably running web services or something similar, and might as well shift machines in and out of a cluster, or wait for old processes to disband on their own because the new code is backwards compatible with the code in already-running processes, and so on.
It can also be relatively hard to do without causing damage to the system. Those who need and can manage it probably don't need it marketed.
"Is it just that people are more comfortable with blue-green deploys, or are blue-green deploys actually better?"
It depends. If you can do a blue-green shift where you gradually add 'fresh' servers/VMs/processes and drain the old, that's likely to be the most convenient and robust approach in many organisations. On the other hand, if you rely on long-running processes in a way where changing their PIDs breaks the system, then you pretty much need to update them with this kind of hot patching.
"Does Erlang offer any features to minimize damage here?"
The BEAM allows a lot of things in this area, on pretty much every level of abstraction. If you know what you're doing and you've designed your system to fit the provided mechanisms, the platform gives you a lot of support for hot patching without sacrificing robustness and uptime. But it adds another layer of possible bugs and risks: it's not just your usual network and application logic that might cause a failure; your handling of updates might itself be a source of catastrophe.
In practice you need to think long and hard about how to deploy, and test thoroughly under very production-like conditions. It helps that you can know for sure what production looks like at any given time: the BEAM VM can tell you exactly what processes it runs, what the application and supervisor trees look like, hardware resource consumption, and so on. You can use this information to stage fairly realistic tests with regard to load and whatnot, so if your update has an effect on performance and unexpected bottlenecks show up, you might catch it before it reaches your users.
And as anyone who has updated a profitable, non-trivial production system directly can tell you (like a lot of PHP devs of ye olden times), it takes a rather strong stomach even when it works out fine. When it doesn't, you get scars that might never fade.
If you have any kind of state in a gen_server and the state or the assumptions about it have changed, you need to write that code_change thingy that migrates the state both ways between two specific versions. If by some chance this function is bugged, the process is killed (which is okay), so you need to nail down the supervision tree to make things restartable without getting into restart loops. Remember writing database migrations for Django or whatever ORM of the day? Now do that, but for the in-memory structures you have.
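A minimal sketch of what such a code_change callback looks like. The module name, state shape, and version strings here are all made up for illustration:

```erlang
%% Sketch of a code_change callback for a hypothetical counter server
%% whose state grew from {state, Count} in vsn "1" to
%% {state, Count, MaxSeen} in vsn "2".
-module(counter_upgrade_demo).
-export([code_change/3]).

%% Upgrade from "1": derive the new field from what's already there.
code_change("1", {state, Count}, _Extra) ->
    {ok, {state, Count, Count}};
%% Downgrade back to "1": drop the extra field. If either clause
%% crashes, the process dies and the supervisor's restart strategy
%% decides what happens next.
code_change({down, "1"}, {state, Count, _MaxSeen}, _Extra) ->
    {ok, {state, Count}}.
```

Note the downgrade clause has to exist too, since releases can be rolled back.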
Now, while a function is running it can't be updated, of course, so you need gen_server to call you back from outside the module. If you like to save function references instead of process references in your state, you need to figure out which version you will actually be calling.
If you change the arity of your record (add or remove fields), then the old record tuples no longer match your patterns.
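For illustration (names invented): records are just tagged tuples, so adding a field changes the tuple size, and data built with the old definition stops matching:

```erlang
%% #user{} with three fields is really {user, Id, Name, Email}. Adding
%% a field changes the tuple size, so data built with the old
%% two-field definition stops matching.
-module(record_arity_demo).
-export([id_of/1]).

-record(user, {id, name, email}).   % v2: `email` was just added

id_of(#user{id = Id}) -> Id.

%% id_of({user, 1, "ann"}) is an old-shape (v1) tuple of size 3, not 4,
%% so it now fails with function_clause instead of returning 1.
```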
Since updates are not atomic, you will have two versions of the code running at the same time, potentially sending messages that old/new stuff does not expect, and both old and new code should not bug out. And if they do bug out, you have been smart enough to figure out how to recover and actually test that.
Then there's this: if something from version V-2 is somehow still running after the update to V-1 and you start updating to the latest V, things happen (only two versions of a module can be loaded at once, so the oldest gets purged and processes still executing it are killed).
You can deal with all that, of course, and Erlang gives you tools and recipes to make it work. Sometimes you have to make it work, because restarting and losing state is not an option. Also, it's probably fun to deal with complex things.
Or you could just do the stupid thing that is good enough and let it crash and restart instead of figuring out ten different things that could go wrong. Or take a 15-minute maintenance window while your users are all asleep (yes, not everybody is running critical infra 24/7 like a Discord group with game memes). Or just do blue-green and sidestep it all completely.
Specifically they talk about running with no down time using hot code reloading here: https://youtu.be/pQ0CvjAJXz4?t=2667 but the whole talk is quite interesting regarding availability.
Warning: the video is quite quiet.
I don’t understand the particulars, but one selling point of biff is it’s got built-in support for updating things directly in prod via the REPL.
There’s a fun interview with the biff guy on the podcast “the REPL”. He talks about how much fun it is to develop directly on the prod server, and how horrified people are by it lol.
I manage a few websites written in Lisp, and updating them is as simple as push code, recompile and it works.
Simply loading new code is easy, ensuring the whole system works seems to require a bit more effort.
https://youtu.be/epORYuUKvZ0?si=gkVBgrX2VpBFQAk5
OP talks in the summary about the importance of understanding the process. It's very much true, but you need to understand not only the process your tooling provides, but also what's going on in the background and what hasn't been taken care of for you by your tools. I'm afraid these things are rarely understood about hot upgrades, even by experienced Erlang engineers.
If the portion of the app you were hot upgrading was an OTP process like a GenServer, you could, at least in theory, wait on some sort of atomic coordination mechanism before making that fully qualified function call after the new code has loaded.
We use hot code reloading at my work, but haven't had a reason to atomically sync the reload. Most of the time it's a tmux session with `synchronize-panes` and that suffices. If your application can handle upgrades within a module smoothly, it's rare to have a need for some sort of cluster-level coordination of a code change, at least one that's atomic.
It's not just a routing change.
Even within a single VM, hot loading doesn't stop the world; during the load, some schedulers will switch before others. There are guarantees, though: when a process runs new code and sends a message to another local process, that process will have the new code available when it reads the message. (It may still be running the old code, depending on how it's called, though.)
Dealing with multiple versions active is part of life in most distributed systems though. You can architect it away in some systems, but that usually involves having downtime in maintenance windows.
A typical pattern is making progressive updates, where if you want to change a request, first you deploy a server that can handle old and new requests, then you deploy the client that sends the new request, then you can deploy a server that no longer accepts old requests.
For new replies, if the new reply comes with a new request, that works like above... a client that sent a new request must handle the new reply. Otherwise, update the client to handle either type of reply, then update the server to send the new reply, finally remove handling of the old reply in the clients.
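The first step of that dance can be as simple as a server clause that accepts both shapes while clients migrate. The request format here is hypothetical:

```erlang
%% Step 1 of a rolling protocol change: accept both the old {get, Key}
%% and the new {get, Key, Opts} request; the old shape gets default
%% options, so old and new clients can coexist during the rollout.
-module(compat_demo).
-export([handle_request/1]).

handle_request({get, Key}) ->
    handle_request({get, Key, #{}});
handle_request({get, Key, Opts}) ->
    {ok, Key, maps:get(timeout, Opts, 5000)}.
```

Once stats show no more old-shape requests arriving, the first clause can be deleted in a later deploy.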
It gets a bit harder if your team dynamics mean one person/group doesn't control both sides... Then you need stats to tell you when all the clients have switched.
Sometimes you do need more of a point-in-time switch. If it needs to be pretty good, you can just set a config through a dist 'broadcast'. If it needs to be better than that, you can have the servers and clients change behavior after a specific time... but make sure you understand the realities of clock synchronization and think about what to do for requests in flight. If that's not good enough, you can drop or buffer requests for a little while before your target time, make sure there are no in-progress requests, then resume processing with the new version.
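A rough sketch of the time-based variant (all names assumed), with the caveat that clock skew between nodes bounds how sharp the cutover really is:

```erlang
%% Time-based cutover: every node picks behaviour per-request by
%% comparing the wall clock against an agreed switch-over timestamp.
%% Only as "atomic" as your clocks are synchronized.
-module(cutover_demo).
-export([handle/2]).

handle(Request, SwitchAtMs) ->
    case erlang:system_time(millisecond) >= SwitchAtMs of
        true  -> {v2, Request};   % new behaviour after the deadline
        false -> {v1, Request}    % old behaviour until then
    end.
```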
Also:
"Live program changes in the Dart VM"
https://github.com/dart-lang/sdk/blob/main/docs/Hot-reload.m...
"Live reloading for your ESP32"
If you have something_function, then the first anonymous fun used in it will be compiled as -something_function/1-fun-0-, with zero being the index and each captured variable passed as an extra argument. Now, if you change the host function to have more funs before it, the indexing will drift.
So I would expect the body of the fun to still be resolved from the old version of the module, but I didn't actually try it.
Source: I did run erlc -S at least once.
Add: now that I think of it, will a call to a local function from the old version of the module ever escape into the new one without first returning to gen_server and letting it call the new version? Another comment says that calls within a module never do, so the assumption was correct.
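You can actually observe the generated fun names at runtime with erlang:fun_info/2, without reaching for erlc -S. A tiny sketch (module name made up):

```erlang
%% The compiler names anonymous funs after their host function plus an
%% index, so the name of the first fun in something_function/1 contains
%% "fun-0". Inserting another fun earlier in the body shifts the index.
-module(fun_name_demo).
-export([something_function/1]).

something_function(X) ->
    F = fun(Y) -> Y + X end,      % first fun in this function: index 0
    erlang:fun_info(F, name).     % e.g. {name, '-something_function/1-fun-0-'}
```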
The interaction between hot reloads and function captures in general is a bit subtle, particularly when it comes to how a function is captured. A fully qualified function capture is reloaded normally, but a capture using just a local name refers to the version of the module at the time it was captured, but is force upgraded after two consecutive hot upgrades, as only two versions of a module are allowed to exist at the same time. For this reason, you have to be careful about how you capture functions, depending on the semantics you want.
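A small illustration of the two capture styles (module name invented):

```erlang
%% A local capture (fun greet/0) is bound to the module version current
%% at capture time; a fully qualified capture (fun ?MODULE:greet/0)
%% re-resolves the module on every call, so it follows hot reloads.
-module(capture_demo).
-export([make/0, greet/0]).

greet() -> hello_v1.

make() ->
    Local  = fun greet/0,            % pinned to this code version
    Remote = fun ?MODULE:greet/0,    % re-resolved at each call
    {Local, Remote}.
```

After a hot upgrade, the Remote capture would call the new greet/0, while the Local one keeps pointing at the old code until that version is purged.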
In actual production, people prefer to operate at the container level plus traffic management, and don't touch anything deeper than the container.
How do you think video games like World of Warcraft or Path of Exile deploy restartless hotfixes to millions of concurrent players without killing instances? I don't think it's a matter of "prefer to", it's a matter of "can we completely disrupt the service for users and potentially lose some of the state"? Even if that disruption lasts a mere millisecond, in some context it's not acceptable.
I've never seen a game where they hot reload code inside the game server itself; it's usually downtime or rolling updates.
This way they can properly test everything and roll back any fixes if required... even banking systems regularly go down for maintenance.
It would seem far more likely that they separate the stateful (database) and stateless (game logic) layers, and just spin up a new instance of the stateless server layer behind a reverse proxy while spinning down the old instance. It's basically how all websites update without downtime.
Fred Hebert (and many of the folks he has worked with) do not operate that way: <https://ferd.ca/a-pipeline-made-of-airbags.html>
One nice quote (out of many) from the article:
> The thing that stateless containers and kubernetes do is handle that base case of "when a thing is wrong, replace it and get back to a good state." The thing it does not easily let you do is "and then start iterating to get better and better at not losing all your state and recuperating fast".
(And if one wants to argue with that quote, please read the entire essay first. There's important context that's relevant to fully understanding Hebert's opinion here.)
Erlang was originally designed for implementing telephony protocols, where interrupted phone calls were not an acceptable side effect of application updates.
The sibling post “that’s how they update without downtime” is super naive. It is absolutely not how they do it.
I mean, this seems to be "best practices" these days, but I certainly don't prefer it. At least the orchestration I use is amazingly slow. And cold loading changes is terrible for long running processes... this makes deployment a major chore.
It's less terrible if you're just doing mostly stateless web stuff, but that's not my world.
In the time it takes to run terraform plan, I could have pushed Erlang code to all my machines, loaded it, and (usually) confirmed I fixed what I wanted to fix.
Low cost of deploy means you can do more updates which means they can be smaller which makes them easier to review.
In the RSS feed too: Wed, 07 Dec 2016 https://kennyballou.com/index.xml
Leave hot loading to local/development environments, not production deploys.
Loading configs on the fly can also have some of this risk, but it is much easier to reason about typically.
And to make the black magic work you have to engage with it. Most of the time people don't even bother to write a proper import.meta.hot.accept handler in JavaScript. Developers simply hate chores, which is evident in their unwillingness to write proper unit tests (despite knowing that tests work), or in writing just enough to get the build past the coverage cop.
A dedicated small team running something like WhatsApp? Sure, look into the arcane and let it look back at you (although high insight makes one more susceptible to madness, you know). But most of the time you will do a better job with PHP in a stupid restartable box behind seven load-balancing proxies.
You can also have a streamlined procedure to release stuff. Most changes in my erlang based system consist of "push to staging branch, click to deploy and test, pull to master, click deploy button". Can't be simpler than that. Most changes in such systems are also pretty simple. When you need to add something big, typically not many things are dependent on that, so deploy is also pretty simple.
> But most of the time you will do better job with PHP in a stupid restartable box behind seven load balancing proxies.
Yeah, we're talking about more complicated things here. If you have something simple, you don't need to use Erlang; `python -m http.server` will be even simpler than your PHP in a stupid restartable box, because you don't need a special box, just one small command.
Erlang is not very fast, but that's not what it was built for.
I suspect the aura of mysticism around yet another JIT VM is not that warranted.
what happened? :D
In serious emergencies I even sometimes end up quickly SSH-ing to a prod server and changing the file directly. Which is kind of horrifying, but hey, the customer is happy it got fixed immediately, and I get to relax and take my time writing a proper fix. Beats sweating, watching the pipeline build, and asking around for people to approve my merge request.
APIs are like database engines; they should rarely change. Making it easy to change them is an anti-pattern.
Engineers don't build bridges with replaceable pillars or skyscrapers with replaceable foundations. When aerospace engineers tried building a plane with replaceable engines, we got Boeing 737 Max...
https://jalopnik.com/how-airlines-decide-to-replace-jet-engi....