It's a cool feature that's no doubt amazing for applications that need it, but it brings a fair amount of complexity vs other deployment strategies.
https://elixirforum.com/t/how-to-tweak-mix-release-to-work-w...
> I’ve spent some time understanding how to do hot code reloading with releases built using mix release, and here I’d like to detail the steps needed, in hopes that it will help someone.
Allows the clients to remain connected and be none the wiser that there was an update at all.
For larger updates, when in-memory data structures or the supervision tree change, we just do hard restarts.
* The server runs in a docker container which has an ssh server installed and running in the background. The reason for SSH is simply because that's what edeliver/distillery uses.
* The CI (a local GitHub runner) runs in a docker container as well, which handles building and deploying the updated releases when merged on master.
* We use edeliver to deploy the hot upgrades/releases from the CI container to the server container. This happens automatically unless stopped which we do for larger merges where a restart is needed.
* The whole deployment process is done in a bash script which uses the git hash for versioning, edeliver for deploying and in the end it runs the database migrations.
I'm not going to say it's perfect but it's allowed us to move pretty damn fast.
https://www.youtube.com/watch?v=XQS9SECCp1I
But I almost never hear Erlang/Elixir/Gleam folks talk about this benefit of the Erlang VM now, even though it seems fairly unique and interesting. Has the community moved away from it? Is it just not that useful?
But if you really need it, it's really great to have that option (e.g. very long running systems which are split in front/back etc), and it can be used in creative ways too (like the Drone example).
Here is a lightning talk I gave about how to use hot-reload for music / MIDI interactions: https://www.youtube.com/watch?v=Z8sGQM6kLvo
"…thanks to hot reloading, which — for once — is useful…"
That seems to sum up the sentiment that hot swapping in Erlang has uses but they're generally not aligned with what Erlang is typically employed for. It seems like it would be great for tight game dev loop feedback and iteration too, for example, but that's not a traditional use of Erlang either.
Most people are probably running web services or something similar, and might as well shift machines in and out of a cluster, or wait for old processes to disband on their own because the new code is backwards compatible with the code in already-running processes, and so on.
It can also be relatively hard to do without causing damage to the system. Those who need and can manage it probably don't need it marketed.
"Is it just that people are more comfortable with blue-green deploys, or are blue-green deploys actually better?"
It depends. If you can do a blue-green shift where you gradually add 'fresh' servers/VMs/processes and drain the old, that's likely to be the most convenient and robust approach in many organisations. On the other hand, if you rely on long-running processes in a way where changing their PIDs breaks the system, then you pretty much need to update them with this kind of hot patching.
"Does Erlang offer any features to minimize damage here?"
The BEAM allows a lot of things in this area, on pretty much every level of abstraction. If you know what you're doing and you've designed your system to fit the provided mechanisms, the platform gives you a lot of support for hot patching without sacrificing robustness and uptime. But it adds another layer of possible bugs and risks: it's not just your usual network and application logic that might cause a failure; your handling of updates might itself be a source of catastrophe.
In practice you need to think long and hard about how to deploy, and test thoroughly under very production-like conditions. It helps that you can know for sure what production looks like at any given time: the BEAM VM can tell you exactly what processes it runs, what the application and supervisor trees look like, hardware resource consumption, and so on. You can use this information to stage fairly realistic tests with regard to load and whatnot, so if your update has an effect on performance and unexpected bottlenecks show up, you might catch it before it reaches your users.
And as anyone who has updated a profitable, non-trivial production system directly can tell you (like a lot of PHP devs of ye olden times), it takes a rather strong stomach even when it works out fine. When it doesn't, you get scars that might never fade.
If you have any kind of state in a gen_server and the state or the assumptions about it have changed, you need to write that code_change thingy that migrates the state both ways between two specific versions. If by some chance this function is bugged, the process is killed (which is okay), so you need to nail down the supervision tree to make things restartable without getting into restart loops. Remember writing database migrations for Django or whatever ORM of the day? Now do that, but for the in-memory structures you have.
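A minimal sketch of what such a code_change callback looks like. The module name, state shape, and version strings here are all made up for illustration:

```erlang
%% Sketch of a code_change callback for a hypothetical counter server
%% whose state grew from {state, Count} in vsn "1" to
%% {state, Count, MaxSeen} in vsn "2".
-module(counter_upgrade_demo).
-export([code_change/3]).

%% Upgrade from "1": derive the new field from what's already there.
code_change("1", {state, Count}, _Extra) ->
    {ok, {state, Count, Count}};
%% Downgrade back to "1": drop the extra field. If either clause
%% crashes, the process dies and the supervisor's restart strategy
%% decides what happens next.
code_change({down, "1"}, {state, Count, _MaxSeen}, _Extra) ->
    {ok, {state, Count}}.
```

Note the downgrade clause has to exist too, since releases can be rolled back.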
Now, while a function is running it can't be updated, of course, so you need gen_server to call you back from outside the module. If you like to save function references instead of process references in your state, you need to figure out which version you will actually be calling.
If you change the arity of your record (add or remove fields), then the old record tuples no longer match your patterns.
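For illustration (names invented): records are just tagged tuples, so adding a field changes the tuple size, and data built with the old definition stops matching:

```erlang
%% #user{} with three fields is really {user, Id, Name, Email}. Adding
%% a field changes the tuple size, so data built with the old
%% two-field definition stops matching.
-module(record_arity_demo).
-export([id_of/1]).

-record(user, {id, name, email}).   % v2: `email` was just added

id_of(#user{id = Id}) -> Id.

%% id_of({user, 1, "ann"}) is an old-shape (v1) tuple of size 3, not 4,
%% so it now fails with function_clause instead of returning 1.
```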
Since updates are not atomic, you will have two versions of the code running at the same time, potentially sending messages that old/new stuff does not expect, and both old and new code should not bug out. And if they do bug out, you have been smart enough to figure out how to recover and actually test that.
Then there's this: if something from version V-2 is somehow still running after the update to V-1 and you start updating to the latest V, things happen (only two versions of a module can be loaded at once, so the oldest gets purged and processes still executing it are killed).
You can deal with all that, of course, and Erlang gives you tools and recipes to make it work. Sometimes you have to make it work, because restarting and losing state is not an option. Also, it's probably fun to deal with complex things.
Or you could just do the stupid thing that is good enough and let it crash and restart instead of figuring out ten different things that could go wrong. Or take a 15-minute maintenance window while your users are all asleep (yes, not everybody is running critical infra 24/7 like a Discord group with game memes). Or just do blue-green and sidestep it all completely.
Specifically they talk about running with no down time using hot code reloading here: https://youtu.be/pQ0CvjAJXz4?t=2667 but the whole talk is quite interesting regarding availability.
Warning: the video is quite quiet.
I don’t understand the particulars, but one selling point of biff is it’s got built-in support for updating things directly in prod via the REPL.
There’s a fun interview with the biff guy on the podcast “the REPL”. He talks about how much fun it is to develop directly on the prod server, and how horrified people are by it lol.
I manage a few websites written in Lisp, and updating them is as simple as push code, recompile and it works.
Simply loading new code is easy, ensuring the whole system works seems to require a bit more effort.
https://youtu.be/epORYuUKvZ0?si=gkVBgrX2VpBFQAk5
OP talks in the summary about the importance of understanding the process. It's very much true, but you need to understand not only the process your tooling provides, but also what's going on in the background and what hasn't been taken care of for you by your tools. I'm afraid these things are rarely understood about hot upgrades, even by experienced Erlang engineers.
If the portion of the app you were hot upgrading was an OTP process like a GenServer, you could, at least in theory, wait on some sort of atomic coordination mechanism before making that fully qualified function call after the new code has loaded.
We use hot code reloading at my work, but haven't had a reason to atomically sync the reload. Most of the time it's a tmux session with `synchronize-panes` and that suffices. If your application can handle upgrades within a module smoothly, it's rare to have a need for some sort of cluster-level coordination of a code change, at least one that's atomic.
It's not just a routing change.
Even within a single VM, hot loading doesn't stop the world; during the load, some schedulers will switch before others. There are guarantees, though: when a process runs new code and sends a message to another local process, that process will have the new code available when it reads the message. (It may still be running the old code, depending on how it's called, though.)
Dealing with multiple versions active is part of life in most distributed systems though. You can architect it away in some systems, but that usually involves having downtime in maintenance windows.
A typical pattern is making progressive updates, where if you want to change a request, first you deploy a server that can handle old and new requests, then you deploy the client that sends the new request, then you can deploy a server that no longer accepts old requests.
For new replies, if the new reply comes with a new request, that works like above... a client that sent a new request must handle the new reply. Otherwise, update the client to handle either type of reply, then update the server to send the new reply, finally remove handling of the old reply in the clients.
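The first step of that dance can be as simple as a server clause that accepts both shapes while clients migrate. The request format here is hypothetical:

```erlang
%% Step 1 of a rolling protocol change: accept both the old {get, Key}
%% and the new {get, Key, Opts} request; the old shape gets default
%% options, so old and new clients can coexist during the rollout.
-module(compat_demo).
-export([handle_request/1]).

handle_request({get, Key}) ->
    handle_request({get, Key, #{}});
handle_request({get, Key, Opts}) ->
    {ok, Key, maps:get(timeout, Opts, 5000)}.
```

Once stats show no more old-shape requests arriving, the first clause can be deleted in a later deploy.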
It gets a bit harder if your team dynamics mean one person/group doesn't control both sides... Then you need stats to tell you when all the clients have switched.
Sometimes you do need more of a point-in-time switch. If it needs to be pretty good, you can just set a config through a dist 'broadcast'. If it needs to be better than that, you can have the servers and clients change behavior after a specific time... but make sure you understand the realities of clock synchronization and think about what to do for requests in flight. If that's not good enough, you can drop or buffer requests for a little while before your target time, make sure there are no in-progress requests, then resume processing with the new version.
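A rough sketch of the time-based variant (all names assumed), with the caveat that clock skew between nodes bounds how sharp the cutover really is:

```erlang
%% Time-based cutover: every node picks behaviour per-request by
%% comparing the wall clock against an agreed switch-over timestamp.
%% Only as "atomic" as your clocks are synchronized.
-module(cutover_demo).
-export([handle/2]).

handle(Request, SwitchAtMs) ->
    case erlang:system_time(millisecond) >= SwitchAtMs of
        true  -> {v2, Request};   % new behaviour after the deadline
        false -> {v1, Request}    % old behaviour until then
    end.
```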
Also:
"Live program changes in the Dart VM"
https://github.com/dart-lang/sdk/blob/main/docs/Hot-reload.m...
"Live reloading for your ESP32"
If you have something_function, then the first anonymous fun used in it will be compiled as -something_function/1-fun-0-, with zero being the index and each captured variable passed as an extra argument. Now, if you change the host function to have more funs before it, the indexing will drift.
So I would expect the body of the fun to still be resolved from the old version of the module, but I didn't actually try it.
Source: I did run erlc -S at least once.
Add: now that I think of it, will a call to a local function from the old version of the module ever escape into the new one without first returning to gen_server and letting it call the new version? Another comment says that calls within a module never do, so the assumption was correct.
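You can actually observe the generated fun names at runtime with erlang:fun_info/2, without reaching for erlc -S. A tiny sketch (module name made up):

```erlang
%% The compiler names anonymous funs after their host function plus an
%% index, so the name of the first fun in something_function/1 contains
%% "fun-0". Inserting another fun earlier in the body shifts the index.
-module(fun_name_demo).
-export([something_function/1]).

something_function(X) ->
    F = fun(Y) -> Y + X end,      % first fun in this function: index 0
    erlang:fun_info(F, name).     % e.g. {name, '-something_function/1-fun-0-'}
```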
The interaction between hot reloads and function captures in general is a bit subtle, particularly when it comes to how a function is captured. A fully qualified function capture is reloaded normally, but a capture using just a local name refers to the version of the module at the time it was captured, but is force upgraded after two consecutive hot upgrades, as only two versions of a module are allowed to exist at the same time. For this reason, you have to be careful about how you capture functions, depending on the semantics you want.
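A small illustration of the two capture styles (module name invented):

```erlang
%% A local capture (fun greet/0) is bound to the module version current
%% at capture time; a fully qualified capture (fun ?MODULE:greet/0)
%% re-resolves the module on every call, so it follows hot reloads.
-module(capture_demo).
-export([make/0, greet/0]).

greet() -> hello_v1.

make() ->
    Local  = fun greet/0,            % pinned to this code version
    Remote = fun ?MODULE:greet/0,    % re-resolved at each call
    {Local, Remote}.
```

After a hot upgrade, the Remote capture would call the new greet/0, while the Local one keeps pointing at the old code until that version is purged.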
In actual production, people prefer to operate at the container level plus traffic management, and don't touch anything deeper than the container.
How do you think video games like World of Warcraft or Path of Exile deploy restartless hotfixes to millions of concurrent players without killing instances? I don't think it's a matter of "prefer to", it's a matter of "can we completely disrupt the service for users and potentially lose some of the state"? Even if that disruption lasts a mere millisecond, in some context it's not acceptable.
I've never seen a game where they hot reload code inside the game server itself; it's usually downtime or rolling updates.
This way they can properly test everything and roll back any fixes if required... even banking systems regularly go down for maintenance.
It would seem far more likely that they separate the stateful (database) and stateless (game logic) layers, and just spin up a new instance of the stateless server layer behind a reverse proxy while spinning down the old instance. It's basically how all websites update without downtime.
Fred Hebert (and many of the folks he has worked with) do not operate that way: <https://ferd.ca/a-pipeline-made-of-airbags.html>
One nice quote (out of many) from the article:
> The thing that stateless containers and kubernetes do is handle that base case of "when a thing is wrong, replace it and get back to a good state." The thing it does not easily let you do is "and then start iterating to get better and better at not losing all your state and recuperating fast".
(And if one wants to argue with that quote, please read the entire essay first. There's important context that's relevant to fully understanding Hebert's opinion here.)
Erlang was originally designed for implementing telephony protocols, where interrupted phone calls were not an acceptable side effect of application updates.
The sibling post “that’s how they update without downtime” is super naive. It is absolutely not how they do it.
I mean, this seems to be "best practices" these days, but I certainly don't prefer it. At least the orchestration I use is amazingly slow. And cold loading changes is terrible for long running processes... this makes deployment a major chore.
It's less terrible if you're just doing mostly stateless web stuff, but that's not my world.
In the time it takes to run terraform plan, I could have pushed Erlang code to all my machines, loaded it, and (usually) confirmed I fixed what I wanted to fix.
Low cost of deploy means you can do more updates which means they can be smaller which makes them easier to review.
In the RSS feed too: Wed, 07 Dec 2016 https://kennyballou.com/index.xml
Leave hot loading to local/development environments, not production deploys.
Loading configs on the fly can also have some of this risk, but it is much easier to reason about typically.
And to make the black magic work you have to engage with it. Most of the time people don't even bother to write a proper import.meta.hot.accept handler in JavaScript. Developers simply hate chores, which is evident in their unwillingness to write proper unit tests (despite knowing that tests work), or in writing just enough to get the build past the coverage cop.
A dedicated small team running something like WhatsApp? Sure, look into the arcane and let it look back at you (although high insight makes one more susceptible to madness, you know). But most of the time you will do a better job with PHP in a stupid restartable box behind seven load-balancing proxies.
You can also have a streamlined procedure to release stuff. Most changes in my erlang based system consist of "push to staging branch, click to deploy and test, pull to master, click deploy button". Can't be simpler than that. Most changes in such systems are also pretty simple. When you need to add something big, typically not many things are dependent on that, so deploy is also pretty simple.
> But most of the time you will do better job with PHP in a stupid restartable box behind seven load balancing proxies.
Yeah, we're talking about more complicated things here. If you have something simple, you don't need to use Erlang; `python -m http.server` will be even simpler than your PHP in a stupid restartable box, because you don't need a special box, just one small command.
Erlang is not very fast, but that's not what it was built for.
I suspect the aura of mysticism around yet another JIT VM is not that warranted.
what happened? :D
In serious emergencies I even sometimes end up quickly SSH-ing to a prod server and changing the file directly. Which is kind of horrifying, but hey, the customer is happy it got fixed immediately, and I get to relax and take my time writing a proper fix. Beats sweating, watching the pipeline build, and asking around for people to approve my merge request.
APIs are like database engines; they should rarely change. Making it easy to change them is an anti-pattern.
Engineers don't build bridges with replaceable pillars or skyscrapers with replaceable foundations. When aerospace engineers tried building a plane with replaceable engines, we got Boeing 737 Max...
https://jalopnik.com/how-airlines-decide-to-replace-jet-engi....