Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true. Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.
Reference counting is really just another kind of GC. I'd highly recommend perusing this paper for more details: A Unifying Theory of Garbage Collection. https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-...
One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count. Modern memory architectures have much higher read bandwidth than write bandwidth, so reference counting typically has much lower throughput than tracing GC does.
However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.
Alloc/free can introduce arbitrary pauses last I checked, so yes, there are pauses. Any time spent doing bookkeeping for resources rather than running your code counts as GC time.
One reason is that GC is already universally used to mean only tracing garbage collection, and trying to defend its wider meaning is a pointless uphill battle.
Another is that it suits the job much better, because not every AMM technique works by producing garbage and then collecting it, you know.
And the confusing thing is that garbage collection (GC) doesn’t collect garbage, while reference counting (RC) does.
GC doesn’t look at every object to decide whether it’s garbage (how would it determine nothing points at it?); it collects the live objects, then discards all the non-live objects as garbage.
RC determines an object has become garbage when its reference count goes to zero, and then collects it.
That difference also is one way GC can be better than RC: if there are L live objects and G garbage objects, GC has to visit L objects, and RC G. Even if GC spends more time per object visited than RC, it still can come out faster if L ≪ G.
That also means that GC gets faster if you give it more memory, so that it runs with a smaller L/G ratio. (The large speed differences across modern memory cache hierarchies make that not quite true, but I think it still holds, ballpark.)
It overlooks the difference in how likely this is to occur (without large enough object graphs freed from the top, it may never be an issue), when it occurs (at any time vs. on cleanup that may not be latency sensitive), and how much control the programmer has over RC costs (determinism lets you profile this and apply mitigations).
RC with borrow checking can avoid a lot of refcount increments.
Tracing GC typically needs write barriers, so it's not free either.
I fail to see how it would be deterministic in a highly dynamic program. Imagine a game, for example, where the user can drag and drop different things onto a "parent" object. Observability is imo an entirely different axis.
> RC with borrow checking can avoid a lot of refcount increments.
That's the same thing as escape analysis: with language support, many objects can be effectively "removed" from the purview of the GC, decreasing load and greatly improving performance. It's a language-level feature, not inherent in the form of GC we do (RC vs tracing).
Using std::shared_ptr in a performance-sensitive context (e.g. after startup completes) is code smell.
Using pointers as important elements of a data structure, such that cycles are possible at all, is itself code smell. A general graph is usually better kept as elements in a vector, deque, or even hash table, compactly, with indices instead of pointers, and favoring keeping elements used near one another in the same cache line. Overuse of pointers to organize data tends to pointer-chasing, among the slowest of operations on modern systems.
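A minimal sketch of what that looks like in practice (in Rust; the type and field names here are illustrative, not from the original article): the graph is one flat `Vec`, edges are `usize` indices into it, and cycles are harmless because nothing is refcounted per node.

```rust
// A general graph stored compactly in a Vec, with usize indices
// instead of pointers. Neighboring nodes sit close together in memory.
struct Node {
    value: i32,
    neighbors: Vec<usize>, // indices into Graph::nodes, not pointers
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }

    fn add_node(&mut self, value: i32) -> usize {
        self.nodes.push(Node { value, neighbors: Vec::new() });
        self.nodes.len() - 1
    }

    fn add_edge(&mut self, from: usize, to: usize) {
        self.nodes[from].neighbors.push(to);
    }

    // Sum the values of a node's direct neighbors.
    fn neighbor_sum(&self, n: usize) -> i32 {
        self.nodes[n].neighbors.iter().map(|&i| self.nodes[i].value).sum()
    }
}

fn main() {
    let mut g = Graph::new();
    let a = g.add_node(1);
    let b = g.add_node(2);
    let c = g.add_node(3);
    g.add_edge(a, b);
    g.add_edge(a, c);
    g.add_edge(b, a); // a cycle: fine here, no refcounts to leak
    assert_eq!(g.neighbor_sum(a), 5);
    println!("ok");
}
```

Freeing the whole graph is then a single `Vec` deallocation, with no per-node traversal at all.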
Typical GC passes consist of little else but pointer chasing.
But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.
Systems are made fast by avoiding expensive operations not dictated by the actual problem. Reference counting, or any other sort of GC, counts as overhead: wasting time on secondary activity in preference to making forward progress on the actual reason for the computation.
Almost invariably neglected or concealed in promotion of non-RC GC schemes is overhead imposed by touching large parts of otherwise idle data, cycling it all through CPU caches. This overhead is hard to see in profiles, because it is imposed incrementally throughout the runtime, showing up as 200-cycle pauses waiting on memory bus transactions that could have been satisfied from cache if caches had not been trashed.
If a core is devoted to GC, then sweeps would seem to cycle everything through just that core's cache, avoiding trashing other cores' caches. But the L3 cache used by that core is typically shared with 3 or 7 other cores, so it is hard to isolate that activity to one core without wastefully idling the others. Furthermore, that memory bus activity competes with algorithmic use of the bus, slowing those operations.
Another way GC-dependence slows programs is by making it harder, or even impossible, to localize cost to specific operations, so that reasoning about performance becomes arbitrarily hard. You lose the ability to count and thus minimize expensive operations, because the cost is dispersed throughout everything else.
This is one of the biggest misconceptions about RC. You don't need to increment the refcount just to read the referenced data, because you already hold a reference whose count was incremented when it was handed to you. That semantic is captured very well by Rust's Arc type: the count is incremented when the Arc is cloned and decremented when the cloned Arc is dropped. But you can still get a regular reference to the data, since the compiler can locally enforce the borrow, ownership, and lifetime rules.
For example, in a web server, you might have the app’s config behind an Arc. It gets cloned for each request (thus rc inc’d), read a lot during the req, then dropped (thus rc dec’d) at the end of the handler.
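A sketch of that pattern (the `Config` type and field are made up for illustration): one atomic increment when a worker clones the Arc, zero refcount traffic for all the reads inside the handler, one decrement when the clone drops.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical app config; the field is illustrative.
struct Config {
    max_body_bytes: usize,
}

// The handler only borrows the config: reads here touch no refcount.
fn handle_request(config: &Config, body_len: usize) -> bool {
    body_len <= config.max_body_bytes
}

fn main() {
    let config = Arc::new(Config { max_body_bytes: 1024 });

    let mut workers = Vec::new();
    for _ in 0..4 {
        // One atomic increment per "request" (per clone)...
        let config = Arc::clone(&config);
        workers.push(thread::spawn(move || {
            // ...then any number of refcount-free reads via a plain borrow.
            handle_request(&config, 512)
        })); // the clone drops when the thread ends: one atomic decrement
    }
    for w in workers {
        assert!(w.join().unwrap());
    }
    println!("ok");
}
```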
RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system to handle these edge cases.
There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long effort kicked off to build a Rube Goldberg machine for closing that knowledge gap.
Depends on the GC algorithm used. Various GC algorithms only trace reachable objects, not unreachable ones.
Reference counting does the opposite, more or less. When you deallocate something, it's tracing unreachable objects.
One of the problems with this is that reference counting touches all the memory right before you're done with it.
> And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.
This is not a typical solution.
Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.
Garbage collection does not need to be used to collect all objects.
> There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long rube goldberg machine was built around closing that gap.
When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."
Embedded in this statement are usually some assumptions which should be challenged. For example, "memory should be freed as soon as it is no longer needed".
All tracing GC algorithms scan only live memory, and they typically do so in an array-like scan (writing some bits in the object header when a pointer to that object is discovered), not in linked-list order.
> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
This is patently untrue. Contemporary GCs have had card marking/scanning for 10+ years now.
But a compacting GC copies the data it's scanned into a contiguous stream, dramatically improving locality, cache utility and stream detection. And this affects not only subsequent GCs but also the application itself, which may traverse its object graph far more often than the GC does.
This is pretty trivial to avoid. When your thread finds itself freeing a big chain of garbage objects, you can have it stop at any arbitrary point and resume normal work, and find some way to schedule the work of freeing the rest of the chain (e.g. on another thread). It's much more complex and expensive to do this for tracing live data, because then you need to manage the scenario where the user program is modifying the graph of live objects while the GC is tracing it, using a write or read barrier; whereas for garbage, by definition you know the user can't touch the data, so a simple list of "objects to be freed" suffices.
"Reads become writes" (indeed, they become atomic read-modify-writes when multiple threads might be refcounting simultaneously) is a problem, though.
You have to do this if you want deterministic deallocation, because your holding a read-only reference to that object might be exactly what keeps it around for longer. So you need to track that.
(Deterministic deallocation also means having to recursively free unreachable objects. That's often described as an arbitrary "pause" behavior in RC systems, but it's actually inherent in the requirement for deterministic behavior. If you don't care about determinism for some class of objects, you can amortize that pause by sending them to a separate cleanup thread.)
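The "ship the garbage to a cleanup thread" idea can be sketched like this (a toy, assuming we're willing to give up deterministic destruction for these objects): the worker hands the head of a long chain to a channel instead of dropping it, and the recursive free runs off the hot path.

```rust
use std::sync::mpsc;
use std::thread;

// A long chain of boxed nodes; dropping the head normally frees the
// whole chain recursively, which is the "pause" under discussion.
struct Node {
    _payload: [u8; 64],
    next: Option<Box<Node>>,
}

fn build_chain(len: usize) -> Box<Node> {
    let mut head = Box::new(Node { _payload: [0; 64], next: None });
    for _ in 1..len {
        head = Box::new(Node { _payload: [0; 64], next: Some(head) });
    }
    head
}

fn main() {
    let (tx, rx) = mpsc::channel::<Box<Node>>();

    // Cleanup thread: dropping received chains happens off the hot path.
    let cleaner = thread::spawn(move || {
        for chain in rx {
            drop(chain); // the recursive free runs here, not in the worker
        }
    });

    let chain = build_chain(10_000);
    // Instead of pausing to drop it here, ship the garbage off. Safe by
    // construction: nothing else can still reach a dead chain.
    tx.send(chain).unwrap();

    drop(tx); // close the channel so the cleaner exits
    cleaner.join().unwrap();
    println!("ok");
}
```

No barriers are needed precisely because, as the comment above says, the user program can no longer touch data that is already garbage.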
My own read on this is that it blurs the line with deferred collection/counting, because you could either use it to complement deferral making it cheaper, or avoid deferral because you're getting enough of the benefits of deferral by proving objects dead instead of discovering that they are dead.
The issue with GC is it is a fluid implementation detail that is often necessary to understand deeply.
This led to one of the more entertaining C# memory-leak stories, where Princeton's entry to the DARPA Grand Challenge ended up failing because every frame they detected obstacles, created an object for each, and subscribed each obstacle object to an event. They missed that the event subscription was keeping those objects alive, and every piece of tumbleweed in the desert helped leak memory until the car just stopped! https://www.codeproject.com/Articles/21253/If-Only-We-d-Used...
The codebase I work with has had many pathological crashes due to this behavior.
So basically, in C#, when you use += to subscribe to events in a big system where object lifetimes are independent of each other, you're back to a C/C++ mindset: you should check that there's a -= call for the subscribed object when the subscribing object is about to go out of scope. Otherwise you get random crashes when events are delivered to an object that should have been dead.
This is one of the reasons I don't like "event" and += in C#. It's a leaky abstraction, like you said.
There's WeakEventManager [0] but that's available only in "classic" dotnet framework (and in "new" dotnet but only if you're targeting Windows) since it lives in the WPF namespace. It can be used outside of it, but you still take a dependency on System.Windows.
There are some other bespoke solutions too.
There's an open issue on the dotnet repo to add a weak event manager to the standard libs [1]. It's very well worth reading through it, it also has links to the other bespoke solutions available.
[0] https://docs.microsoft.com/en-us/dotnet/api/system.windows.w...
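The weak-subscription idea behind WeakEventManager can be sketched in Rust (names here are made up; the C# version holds handlers via WeakReference): the event source keeps only weak references, so subscribing does not keep the obstacle alive, and no explicit -= is needed.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Illustrative stand-ins for the DARPA story's obstacle objects.
struct Obstacle {
    id: u32,
}

struct EventSource {
    // Weak refs: subscribing does NOT keep obstacles alive.
    subscribers: RefCell<Vec<Weak<Obstacle>>>,
}

impl EventSource {
    fn subscribe(&self, obstacle: &Rc<Obstacle>) {
        self.subscribers.borrow_mut().push(Rc::downgrade(obstacle));
    }

    // Count subscribers still alive, pruning dead ones as we go.
    fn live_subscribers(&self) -> usize {
        let mut subs = self.subscribers.borrow_mut();
        subs.retain(|w| w.upgrade().is_some());
        subs.len()
    }
}

fn main() {
    let source = EventSource { subscribers: RefCell::new(Vec::new()) };

    let tumbleweed = Rc::new(Obstacle { id: 1 });
    let rock = Rc::new(Obstacle { id: 2 });
    source.subscribe(&tumbleweed);
    source.subscribe(&rock);
    assert_eq!(source.live_subscribers(), 2);

    // The frame ends; the obstacle goes away. No -= call needed, and
    // the subscription doesn't leak it.
    drop(tumbleweed);
    assert_eq!(source.live_subscribers(), 1);
    let _ = rock.id;
    println!("ok");
}
```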
If you're used to objects being destructed when they go out of scope ala C++ then yeah adapting to the lifecycle of objects in Java/C# takes some doing. But I think there's benefit to be had.
Which pauses you are meaning?
Reference counting is not free, but there are no long pauses (long compared to GC; e.g. in the JVM under certain workloads you can get 100ms pauses).
Nim switched from GC to RC and it even increased performance.
Benchmarks show they are within 30 percent of each other: https://www.techspot.com/images2/news/bigimage/2021/03/2021-...
When I can easily replace the deallocator (thus excluding most non-RC production GCs), I can (re)write the code to avoid GC pauses, e.g. by amortizing deallocation, destructors, etc. over several frames - perhaps by returning ownership of some type and its allocations to the type's originating thread, reducing contention while I'm at it. I have done this a few times. By "coincidence", garbage-generation storms causing noticeable delays are surprisingly uncommon IME.
As programs scale up and consume more memory, "live data" outscales "garbage" - clever generational optimizations aside, I'd argue the former gets expensive more quickly, and is harder to mitigate.
It's also been my experience that tracing or profiling any 'ole bog standard refcounted system to find performance problems is way more easy and straightforward than dealing with the utter vulgarity of deferred, ambiguously scheduled, likely on a different thread, frequently opaque garbage collection - as found in non-refcounted garbage collection systems.
So, at best, you're technically correct here - which, to be fair, is the best kind of correct. But in practice, I think it's no coincidence that refcounting systems tend to automatically and implicitly amortize their costs and avoid GC storms in just about every workload I've ever touched, and at bare minimum, reference counting avoids GC pauses... in the code I've written... by allowing me easier opportunities to fix them when they do occur. Indirectly causal rather than directly causal.
> if you read a heap-object out of a data structure, you have to increment the object's reference count.
This isn't universal. Merely accessing and dereferencing a shared_ptr in C++ won't touch the refcount, for example - you need to copy the shared_ptr to cause that. Rust's Arc/Rc need to be clone()d to touch the refcount, and the borrow checker reduces much of the need to do such a thing defensively, "in case the heap object is removed out from under me".
Of course, it can be a problem if you bake refcounting directly into language semantics and constantly churn refcounts for basic stack variables while failing to optimize said churn away. There's a reason why many GCed languages don't use reference counting to optimize the common "no cycles" case, after all - often, someone tried it out as an obvious and low hanging "optimization", and found it was a pessimization that made overall performance worse!
And even without being baked into the language, there are of course niches where heavy manipulation of long-term storage of references will be a thing, or cases where the garbage collected version can become lock-free in a context where such things actually matter - so I'll 100% agree with you on this:
> There are no hard lines; this is about performance tradeoffs, and always will be.
It's just another engineering decision. On modern systems, and especially with any runtime that can do the majority of the GC threaded and on an otherwise-unused core, you need to have some pretty serious performance requirements for GC to ever get to being your biggest problem. You should almost always know when you're setting out to write such a system, and then, sure, think about the GC strategy and its costs. However for the vast bulk of programs the correct solution is to spend on the order of 10 seconds thinking about it and realizing that the performance costs of any memory management solution are trivial and irrelevant and the only issue in the conversation is what benefits you get from the various options and what the non-performance costs are.
It is in some sense as big a mistake (proportional to the program size) to write every little program like it's a AAA game as it is to write a AAA game as if it's just some tiny little project, but by the sheer overwhelming preponderance of programming problems that are less complicated than AAA games, the former happens overwhelmingly more often than the latter.
Edit: I can be specific. I just greased up one of my production systems with Go memstats. It periodically scans XML files via network requests and parses them with a parser that cross-links parents, siblings, and children using pointers and then runs a lot of XPath on them, so, it's kinda pessimal behavior for a GC. I tortured it far out of its normal CPU range by calling the "give me all your data" JSON dump a 100 times. I've clicked around on the website it serves to put load on it a good 10x what it would normally see in an hour, minimum. In 15 minutes of this way-above-normal use, it has so far paused my program for 14.6 milliseconds total. If you straight-up added 14.6 milliseconds of latency to every page it scanned, every processing operation, and every web page I loaded, I literally wouldn't be able to notice, and of course that's not what actually happened. Every second worrying about GC on this system would be wasted.
Like a lot of premature optimization, it isn’t a problem until it is… but solutions aren’t unattainable.
It's nice when the runtime solves a problem you've had to solve yourself, but it also takes a bit of your fun away, even if your coworkers are relieved not to have to deal with 'clever' code anymore.
Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this. But isn't that an argument for reference counting?
Not 80%, but still annoying enough to dump it: https://discord.com/blog/why-discord-is-switching-from-go-to...
It is less boring than building up the skills to fix the plane in mid-flight.
The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.
Garbage collection has a huge, and generally entirely unappreciated win when it comes to threaded code. As with most things, there are tradeoffs, but every reference counting implementation that I've used has turned any concurrent access to shared memory into a huge bottleneck.
RAII gets you a lot of the way there.
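What RAII buys you here can be shown with a tiny Rust sketch (Rust's Drop is the same mechanism as C++ destructors): cleanup runs at the closing brace, deterministically and in reverse declaration order, with no finalizer or GC involvement - which is exactly what you want for resources like file descriptors.

```rust
use std::cell::RefCell;

// Record the order of cleanup events so determinism is observable.
thread_local! {
    static EVENTS: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

// Stand-in for a resource like a file descriptor.
struct Resource(&'static str);

impl Drop for Resource {
    fn drop(&mut self) {
        // Runs at scope exit, every time - this is where a real
        // resource would release its fd.
        EVENTS.with(|e| e.borrow_mut().push(self.0));
    }
}

fn main() {
    {
        let _a = Resource("a closed");
        let _b = Resource("b closed");
        EVENTS.with(|e| e.borrow_mut().push("scope ending"));
    } // deterministic: b drops first, then a, exactly here

    EVENTS.with(|e| {
        assert_eq!(*e.borrow(), vec!["scope ending", "b closed", "a closed"]);
    });
    println!("ok");
}
```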
So reference counting works by the programmer knowing the lifetime of each object, allowing them to increment/decrement the refcount only once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management. It's unsafe and unergonomic, and it's pretty telling that the author needs this pattern to make RC appear competitive. It's because actually doing reference counting safely is really expensive.
> If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?
Because they've made reference counting a part of their C extension API and ABI. If they wanted to use a GC, they'd instead need a very different API, and then migrate all the modules to the new API. (I.e. a way for those native extension to register/unregister memory addresses containing pointers to Python objects for the GC to see.)
Early on the deterministic deallocation given by reference counting would also have been treated by programmers as a language level feature, making it so that a migration would have broken working code. But I don't think that was ever actually guaranteed in the language spec, and anyway this was not carried over to various alternative Python implementations.
In Python reference counting precedes tracing garbage collection.
So they didn't 'go through the hassle of implementing reference counting' after they already had tracing garbage collection. Instead they went through the hassle of implementing tracing garbage collection after they already had reference counting.
(And as you say for backwards compatibility reasons, they can't get rid of reference counting.)
https://docs.python.org/3/reference/datamodel.html#objects-v...
So, idiomatic Python does not rely on this, and uses with-statements for deterministic cleanup.
But, of course, as with any language, there's plenty of non-idiomatic Python out in the wild.
Programmer, or compiler. In the latter case it's automatic reference counting, which I've never heard called "manual memory management".
Swift is inferior here because it uses reference counting GC without much work towards mitigating its drawbacks like cycles (judging by some recent posts, some of its fans apparently aren't even aware RC has drawbacks), while more established GC languages had much more time to mitigate their GC drawbacks - e.g. Java's ZGC mitigates latency by being concurrent.
My anecdata indicate that Java apps are not as responsive as ObjC/Swift for the most part.
The real reason why a tracing GC was a failure in Objective-C was due to the interoperability with the underlying C semantics, where anything goes.
The implementation was never stable enough beyond toy examples.
Naturally automating the Cocoa release/retain calls made more sense, given the constraints.
In typical Apple fashion they pivoted into it, gave the algorithm a fancy name, and then in a you're holding it wrong style message, sold their plan B as the best way in the world to manage memory.
When Swift came around, having the need to easily interop with the Objective-C ecosystem naturally meant to keep the same approach, otherwise they would need the same machinery that .NET uses (RCW/CCW) to interop with COM AddRef/Release.
What Apple has is excellent marketing.
Any true GC strategy (i.e. one that collects cycles) will fundamentally touch and allocate more memory than malloc/free, whereas reference counting is pretty close to malloc/free performance; it doesn't need to touch any memory not involved. At OS scale, that's a huge performance advantage. You can scope down the memory involved in GC using advanced, modern GC techniques, but you're still going to be behind malloc/free in overall efficiency - cache efficiency, memory maintenance overhead, and additional memory required for bookkeeping. And reference counting will be pretty darn close to malloc/free.
Reference counting and garbage collection have a very clear difference: when the referenced objects are destroyed (not deallocated). In RC it happens when the count reaches zero. In GC it happens some time later. That difference is crucial for having or not having deterministic performance in your program.
Things are rarely as clear cut as we would want to believe.
Yes. In languages with destructors/finalizers called from the garbage collector, things can get very complicated. C# and Java have this problem.
Go avoids it by having scope-based "defer" rather than destructors.
They found that the Swift version spent 76% of the time doing reference counting, even slower than Go, which spent 0.5% in the garbage collector.
Also, I fail to see the advantage of RC in the case of a presumably mostly immutable language: a tracing GC is even faster there, since there are no changes to the object graph after allocation, which makes a generational approach scale very well and run almost completely in parallel.
I see great reasons for both systems being useful, but both systems also bring their own warts.
Yes, ref counting affects cache and branch prediction, but GC is a whole subsystem running in parallel with your main code, constantly cleaning up after you. Which is best will always depend on the application.
Some languages lean heavily one way or the other, too. Scripting with ref counting would be a nightmare, as would running a garbage collector on an 8-bit micro. Since the article's talking C & C++, a pro-ref-counting stance of course makes sense.
Not sure it's entirely deterministic. A variable going out of scope can trigger deallocation of a large object graph, and it's not always clear just by looking at the code what will happen (especially if objects have destructors with side effects, your object graph is highly mutable, and your code is on a hot path). A common trick is to delay deallocation to a later time, but then again you can't be sure when your destructors will run. Another issue is cycles: if your RC system has cycle detection, your program will behave differently depending on whether a cycle formed at runtime or not.
Why? Old version of Python used ref counting only, and Python still largely relies on reference counting (but has a GC to detect cycles).
Run ./waf configure build && ./build/tests/collectors/collectors and it will spit out benchmark results. On my machine (Phenom II X6 1090), they are as follows:
Copying Collector 8.9
Reference Counting Collector 21.9
Cycle-collecting Reference Counting Collector 28.7
Mark & Sweep Collector 10.1
Mark & Sweep (separate mark bits) Collector 9.6
Optimized Copying Collector 9.0
I.e. for total runtime it is not even close; tracing GC smokes ref counting out of the water. Other metrics, such as number of pauses and maximum pause times, may still tip the balance in favor of ref counting, but those are much harder to measure. Though note the abysmal runtime of the cycle-collecting ref counter. It suggests that cycle collection could introduce the exact same pause times ref counting was supposed to eliminate. This is because in practice cycles are very difficult to track and collect efficiently. In any case, it clearly is about trade-offs; claiming tracing GC always beats ref-counting GC or vice versa is naive.
I would not be surprised to find that even a naive mark and sweep collector is faster than naive refcounting on some workloads. One obvious thing to consider is that the work is delayed, you can perform the sweeping 'as needed'. Even the marking doesn't have to run on any deterministic schedule.
The thing is that, from my naive perspective, run of the mill tracing collector algorithms are just way more advanced than your typical refcount. Most refcounting is just that - either an integer, atomic integer, or both, that gets incremented and decremented based on a number of operations applied to the underlying type. The naive approach has no delays.
Tracing GCs on the other hand, although perhaps not naive ones (could you link me info on the quickfit algorithm? I can not find anything online), might contain epochs that bump allocate in the majority of cases. That'll be particularly nice for benchmarks where allocations are likely very short lived and may actually never need to get to the mark/sweep phase. Your algorithm isn't really documented and I just really don't feel like looking at C right now.
Although naive refcounting is very common it's not the only game in town. Depending on the language you can group refcounts together - for example, imagine you have:
(assuming all fields are automatically refcounted)

    struct Foo {
        bar: Bar,
        baz: Baz,
    }
In theory, a "copy" of this type would involve 3 increments, possibly atomic increments. Each increment would also require a heap pointer dereference, and there would be no locality of those integers behind the pointers. That would be the trivial implementation.
But depending on the language you could actually flatten all of those down to 1 RC. This is language dependent, and it requires understanding how these values can be moved, referenced, etc, at compile time. You could also store all reference counts in tables associated with structures, such that you have locality when you want to read/write to multiple counters. The pointer dereference is going to be brutal so having locality there will be a nice win. I'd be curious to run your benchmarks through valgrind to see how much the refcount is just spending time on memory fetches that get invalidated in the cache immediately.
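A concrete sketch of the flattening (the counter instrumentation is mine, just to make the increment counts visible; `Bar`/`Baz` are the comment's illustrative types): the naive layout pays one increment per refcounted field, while the flattened layout inlines the fields into one allocation and pays a single increment per copy.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Illustrative field types.
struct Bar { x: u64 }
struct Baz { y: u64 }

// Naive layout: every field refcounted separately. Copying a Foo
// means one increment per field, behind separate heap pointers with
// no locality between the counters.
struct FooNaive {
    bar: Arc<Bar>,
    baz: Arc<Baz>,
}

// Flattened layout: the fields live inline in one allocation, so a
// "copy" is a single refcount increment on the outer Arc.
struct FooFlat {
    bar: Bar,
    baz: Baz,
}

// Instrumentation: count how many refcount increments we perform.
static INCREMENTS: AtomicUsize = AtomicUsize::new(0);

fn count_clone<T>(a: &Arc<T>) -> Arc<T> {
    INCREMENTS.fetch_add(1, Ordering::Relaxed);
    Arc::clone(a)
}

fn main() {
    let naive = FooNaive { bar: Arc::new(Bar { x: 1 }), baz: Arc::new(Baz { y: 2 }) };
    // Copying the naive version: one increment per refcounted field
    // (two here; three if Foo itself were also behind an Arc).
    let _copy = FooNaive { bar: count_clone(&naive.bar), baz: count_clone(&naive.baz) };
    assert_eq!(INCREMENTS.load(Ordering::Relaxed), 2);

    let flat = Arc::new(FooFlat { bar: Bar { x: 1 }, baz: Baz { y: 2 } });
    let copy = count_clone(&flat); // one increment total
    assert_eq!(INCREMENTS.load(Ordering::Relaxed), 3);
    assert_eq!(copy.bar.x + copy.baz.y, 3);
    println!("ok");
}
```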
Anyway, an example of a pretty slick refcounting GC is what Pony built: https://tutorial.ponylang.io/appendices/garbage-collection.h... https://www.ponylang.io/media/papers/OGC.pdf
Pony has different types for:

1. Local, Immutable
2. Local, Mutable
3. Shared, Immutable
4. Shared, Mutable
You can read the paper where they discuss how they track local variables vs shared variables, the implementation of counter tables, etc.
So I guess to summarize:
1. The results make sense, or as much sense as anything. I'd be interested in more details on the algorithms involved and your benchmark methodology.
2. "Naive" tracing GCs are actually pretty advanced, and advanced refcount implementations are pretty scarce.
Throughput-wise, it's hard to beat naive tracing gc. The algorithms are just too simple and they don't "interfere" with "normal operations" like ref counting does. Assuming the same allocation pattern (i.e no cheating by stack allocating objects), a tracing gc would likely (again, throughput-wise) beat manual memory management too. The additional benefit tracing gives you is easy heap compaction. Thus future pointer-chasing and memory allocations will be more efficient. With ref counting, compaction is harder.
True, you could delay sweeping, but ime, marking time dominates so you don't gain much. Even with a huge heap of several gigabytes, sweeping is just a linear scan from lowest to highest address.
Quick fit is a memory allocator, see: http://www.flounder.com/memory_allocation.htm Most gcs do not keep the heap contiguous so you need it in a layer below the gc. Quick fit is the algorithm almost everyone uses and it is very good for allocating many small objects of fixed sizes (8, 16, 32, etc.). It could be swapped out with malloc/free pairs instead, at the price of some performance.
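A toy simulation of the quick fit idea (my sketch, not the linked implementation): one free list per small size class, so allocating a common size is a constant-time pop. A real allocator hands out raw memory; here we just track block offsets into a fixed arena to show the structure.

```rust
// Size classes for small objects; real allocators use more of them.
const CLASSES: [usize; 4] = [8, 16, 32, 64];

struct QuickFit {
    // free_lists[i] holds arena offsets of free blocks of CLASSES[i] bytes.
    free_lists: [Vec<usize>; 4],
    bump: usize,       // next unused offset in the arena
    arena_size: usize,
}

impl QuickFit {
    fn new(arena_size: usize) -> Self {
        QuickFit { free_lists: [vec![], vec![], vec![], vec![]], bump: 0, arena_size }
    }

    // Smallest class a request fits in, or None if it's too big.
    fn class_of(size: usize) -> Option<usize> {
        CLASSES.iter().position(|&c| size <= c)
    }

    // Returns an arena offset, or None if the request can't be served.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let i = Self::class_of(size)?;
        if let Some(off) = self.free_lists[i].pop() {
            return Some(off); // fast path: reuse a freed block of this class
        }
        // Slow path: carve a fresh block from the arena.
        let off = self.bump;
        if off + CLASSES[i] > self.arena_size {
            return None;
        }
        self.bump += CLASSES[i];
        Some(off)
    }

    fn free(&mut self, off: usize, size: usize) {
        let i = Self::class_of(size).expect("freed block must fit a class");
        self.free_lists[i].push(off); // O(1): no coalescing on this path
    }
}

fn main() {
    let mut qf = QuickFit::new(1024);
    let a = qf.alloc(16).unwrap();
    let b = qf.alloc(16).unwrap();
    assert_ne!(a, b);
    qf.free(a, 16);
    let c = qf.alloc(10).unwrap(); // 10 rounds up to the 16-byte class
    assert_eq!(c, a);              // the freed block is reused immediately
    println!("ok");
}
```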
I have to disagree with naive tracing being advanced. My mark & sweep implementation is only about 50 lines and that includes comments: https://github.com/bjourne/c-examples/blob/master/libraries/... A copying collector isn't much more complicated. Neither is beyond the reach of most comp sci students. Yes, optimized multi-generational tracing collectors supporting concurrent and resumable tracing are very complicated. But the same is true of optimized ref counting schemes. :)
Pony looks very interesting. It looks like it is supposed to have less object churn than very dynamic languages like JavaScript which probably makes ref counting very suitable for it.
There is a whole generation of programmers that has come to equate GC with Java's 10-second pauses, or generics/typed variables with Java's implementation of them. Even the return to typed systems (Sorbet, Python's typing, TypeScript) could be read as: typed languages are great; what we really hated was Java's verbose semantics.
As a language, Java's not too bad. It's a bit wordy, in bad need of some syntax sugar, but it's designed to be fairly straightforward and for the most part it does its job well. I don't need a degree in language theory to get started writing it.
Java the programming style, particularly enterprise, is a horrendous over-engineered mess that schools jam down the throats of students who don't know any better. It's designed to (and fails to) enforce a common style that can be written by armies of mediocre developers plodding along inside giant enterprise codebases, so that no matter who wrote the code, some other developer in another department can figure out how to call it.
Java the JVM is a pretty nifty beast. It's made tradeoffs that mean it's not always suited to every use case, but put in its element it really shines. The modern GC algos give developers options based on the program's needs. It's currently struggling to overcome some historical decisions that, while good back in the old days, are now holding it back.
Personally I'm very biased towards Kotlin, which gives me the benefit of the JVM without the barf that is Java. It's not the fastest-executing language out there, but for me it's a perfect balance between development speed, ecosystem of battle-tested libraries, and competitive execution speed.
Anyone who has ever shipped a C# Unity game knows the pain that is the garbage collector. It’s effectively impossible to avoid frame hitches with the GC.
I’ve spent a LOT of time going way out of my way to avoid any and all garbage collections. Which somewhat defeats the purpose of using a GC-based language.
I definitely would say “GC used to be bad but now it’s good”. That tale has been spun for 30+ years at this point!
That said, it could also be a function of the same "problem" Java has in its design - Java by default boxes everything and so every memory allocation increases garbage collection pressure. Go, by using escape analysis and favoring stack allocations, doesn't have this problem and has had sub-millisecond pauses for years now.
You rarely hear people complain about Go's GC despite it being much less mature than the JVM's. But due to actual language design, Go's GC does much less work. I wouldn't say "GC used to be bad but now it's good", but rather that the designs of languages like C# and Java were too dependent on using the GC for memory allocations, and there are other GC languages out there that use the GC less, which can lead to greater performance.
I believe it's actually the opposite - Java has pretty simple, compact and well defined semantics. Too simple and compact for comfort - a little syntactic sugar would have made the language a lot less verbose.
What’s P&L?
> Java's verbose semantics
Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
I would say it isn't that Java's semantics are verbose; it's that the way Java is traditionally written turns every line into 3 lines on your screen:
  public ItalicTextBox makeItalicTextBox(String actualTextIWantToBeItalic) {
      ItalicTextBox itb = italicTextBoxFactoryGenerator.generateFactory().buildItalicTextBox(actualTextIWantToBeItalic);
      return itb;
  }
I think this is actually the pernicious work of Martin's _Clean Code_ which trained a whole generation of Java coders to write nonsense one line functions with a million interior function calls like this, not anything forced by the Java language itself, but it makes for really ugly code, in my exceptionally humble but undoubtedly correct opinion.
I meant PL research as in programming language research. Not sure why my brain decided to put an & there.
>Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
`FizzBuzzEnterpriseEdition` exists and is a meme for a reason. That said, I'm sure Java idioms written today are a lot more sane than they were around the time everyone decided to rewrite everything in Ruby. When I say Java here, I'm talking about an era of Java that probably no longer exists (the JVM also doesn't have 10-second pauses today, and its garbage collectors are widely considered the best ever built).
However, we still don’t have something like auto/let from C++/Rust.
Conversely, it's also possible for reference counting to have perverse performance cases over a truly arbitrary reference graph with frequent increments and decrements. You're not just doing atomic inc/dec, you're traversing an arbitrary number of pointers on every reference update, and it can be remarkably difficult to avoid de/allocations in something like Python where there's not really a builtin notion of a primitive non-object type.
Generally speaking, memory de/allocation patterns are the issue, not the specific choice of reference counting vs gc.
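The "every read of a reference is a write" effect mentioned up-thread is easy to observe in CPython, where merely aliasing an object bumps its reference count (`sys.getrefcount` is a real CPython API; it reports one extra reference for its own argument):

```python
import sys

x = [1, 2, 3]
base = sys.getrefcount(x)   # counts x itself plus getrefcount's argument

y = x                       # a plain "read": aliasing x into y
after = sys.getrefcount(x)

print(after - base)         # -> 1: the read produced a refcount write
```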
[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html
It's a compromise, on memory consumption and performance. Modern GCs are minimising the impact of those factors, but they still remain a part of the design.
RC is a performance compromise.
JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. v8 is different, it uses a moving collector: references to heap objects are held behind a double-redirection so the GC may move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.
Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.
Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially since bugs may only manifest if the GC runs, which it might not in your tests.
One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!
But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.
With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.
Moving is indeed a big pain, and I'm about to back out of it for Oil :-/
----
edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:
https://news.ycombinator.com/item?id=32283641
I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.
I've never heard of a reference counting implementation that can handle memory compaction.
Every time you update a reference count, which is every time you touch any object, you're going to have to write to that RAM, which means stealing it from any other threads using it on any other processors. If you share large trees of data between threads, traversing that tree in different threads will always end up with your threads constantly fighting with each other since there's no such thing as read only memory in reference counting.
When releasing something like a huge list in reference counting, how does the release avoid blowing the stack with recursive releasing? My guess is this just a "don't use a large list whose release may blow the stack with recursive releasing" situation.
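One standard answer is to make the release iterative with an explicit worklist instead of recursing (CPython's "trashcan" mechanism exists for roughly this reason). A sketch with hypothetical `Node`/`decref` names:

```python
# Iterative release for a refcounted linked list: an explicit worklist
# replaces recursion, so releasing a 100k-element chain uses O(1) stack.

class Node:
    def __init__(self, next=None):
        self.rc = 1           # each node starts with one owner
        self.next = next

freed = 0

def decref(node):
    global freed
    worklist = [node]
    while worklist:
        n = worklist.pop()
        n.rc -= 1
        if n.rc == 0:
            freed += 1
            if n.next is not None:
                worklist.append(n.next)   # defer the child, don't recurse

# Build a 100k-node chain, then release the head.
head = None
for _ in range(100_000):
    head = Node(next=head)
decref(head)
print(freed)    # -> 100000
```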
It's possible to add that in theory. But if you are tracing all your memory anyway so you can compact it, you typically might as well collect the garbage, while you are at it.
But: you are in for a treat, someone implemented compaction for malloc/free. See https://github.com/plasma-umass/Mesh
They use virtual memory machinery as the necessary indirection to implement compaction, with neither changing any pointers nor reliably distinguishing pointers from integers.
Well, that depends on how the RC is done. This is key to understand, because if you can control it, the RC becomes cheaper.
You can see this approach in
http://sblom.github.io/openj-core/iojNoun.htm
i.e., if instead of `[Rc(1), Rc(2)]` you do `Rc([1, 2])`, that works great.
When I worked at Facebook, which is structurally and politically incapable of building high-quality client software, I was on a small team of people tasked with making heroic technical fixes to keep the iOS app running despite literally hundreds of engineers working on the same binary incentivized to dump shoddy code to launch their product features that nobody would use as fast as possible (did you know that at one point you could order food through the Facebook app, and that a whole two digit number of people per day used this feature? Etc.)
Objective-C has ARC (automatic reference counting) — every pointer is a refcounted strong reference by default unless special annotations are used. What makes it worse is that large, deep hierarchies are common, making it easy to create reference cycles that leak huge amounts of memory.
For example, the view controller for a large and complicated page (referencing decoded bitmap images and other large objects) is the root of a large tree of sub-objects, some of whom want to keep a reference to the root. Now imagine the user navigates away and the reference to the view controller goes away, but nothing in the tree is deallocated due to the backlink — congratulations, you just leaked 10 MB of RAM!
It’s possible to do this correctly if you actually read the docs and understand what you’re doing, using tools like weak pointers, but when you have hundreds of developers, many of whom got their job either by transferring from an android team or by just memorizing pat answers to all the “Ninja” algorithms interview questions (practically all of which have leaked on Leetcode and various forums), you can be sure that enough of them will fail to do so to create major issues with OOMs.
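The same weak-pointer pattern translates directly to Python's `weakref` module: hold the child-to-parent back-link weakly, and dropping the parent actually frees it (the `Controller`/`Child` names are hypothetical):

```python
import weakref

class Controller:
    def __init__(self):
        self.children = []    # strong references: parent owns children

class Child:
    def __init__(self, parent):
        # Back-link held weakly, so the child doesn't keep the parent alive.
        self._parent = weakref.ref(parent)

    @property
    def parent(self):
        return self._parent()    # None once the parent has been freed

root = Controller()
root.children.append(Child(root))
child = root.children[0]

del root                 # drop the only strong reference to the parent
print(child.parent)      # -> None: no cycle, the parent was freed immediately
```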
To mitigate this, we created a “retain cycle detector” — basically a rudimentary tracing GC — that periodically traced the heap to detect these issues at runtime and phone home with a stack trace, which we would then automatically (based on git blame) triage to the offending team.
It was totally egregious undefined behavior, one thread tracing the heap with no synchronization with respect to the application threads that were mutating it, but the segfaults this UB caused were so much rarer than the crashes due to OOMs that it prevented that we decided to continue running it.
This pretty much nails down what I imagine is the main difference between GC and ARC: with the former you sacrifice performance for ease of use, and with the latter you improve performance by placing some additional work on the programmers.
With allocating as dead, you're basically turning it into a tracing collector for the young generation.
It can be quite "fast."
Very cool stuff!
If you have pure, immutable and lazy, you can get cycles. (That's Haskell.) This is almost as complicated for a GC as not having immutability.
I'm not saying that GC is always the best choice, but this article gets the most important argument wrong:
> 1. Updating reference counts is quite expensive.

> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all.
Yes, it is. Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".
Additionally, every modern GC does generational collection, so for the vast majority of objects, the GC literally does "nothing at all". No matter how little work it does, a RC solution has to do O(garbage) work, while a copying GC can do O(not garbage) work.
Now, that's not to say that GC is automatically better. There are trade-offs here. It depends on the workload, the amount of garbage being created, and the ratio of read to write operations.
The article says:
> I've already stated I'm not going to do benchmarks. I am aware of two orgs who've already run extensive and far-reaching experiments on this: Apple, for use in their mobile phones, and the Python project.
I can counterpoint that anecdata: Google extensively uses Java in high-performance systems, and invented a new GC-only language (Go) as a replacement for (their uses of) Python.
The right answer is to do benchmarks. Or even better yet, don't worry about this and just write your code! Outside of a vanishingly small number of specialized use cases, by the time GC vs RC becomes relevant in any meaningful way to your performance, you've already succeeded, and now you're dealing with scaling effects.
That's not true. Go was invented with the intention of replacing C++ at Google. That didn't really work out, and in practice Go became more of a replacement of Python for some applications at Google.
Also there are some indications that Go didn't gain traction necessarily on the merits of the language itself, but more on the starpower of its authors within Google.
(I mostly agree with the rest of what you wrote.)
Updating an atomic reference count may trigger a cache flush in other CPUs, or stall waiting for them to do so, so it's not so minimal after all.
Well, ok, let's go whole hog, we're collecting garbage again, and it sucks, we get all these baby objects, let's try and optimize the GC: we can keep, I dunno, a count of references to new objects, do some allocation sinking to see if we can avoid making them, put the babies in an orphanage, hey look, RC is GC, QED.
If I needed hard-realtime, I would avoid allocation entirely.
This blog post is an answer to: "Tell me you haven't learned about cache coherence without telling me you haven't learned about cache coherence."
[citation needed]
You and the blog post are arguing opposite things, and neither of you has shown any evidence. I get that you're arguing that reference counted objects are bigger (to store the reference count) and/or might use double indirection (depending on implementation), which are both bad for caches. It's not a bad argument. But the counter-argument that the blog post makes is persuasive as well: it's expensive to run a GC that scans the heap looking for loose objects, and reference counting does not need to do that. GC can also be "stop-the-world", as well as unpredictable and jittery, in a way reference counting is not.
My instinct is that reference counting is actually faster (which matches my personal experience), but really, this is not an argument you can solve by arguing in the abstract, you need actual data and benchmarks.
https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html
Way better than RC.
And... often GC will be able to use arena allocators before falling back to "proper" GC allocation, which will be a lot faster than ref counting everything.
And atomics can get very slow; I've had atomics show up regularly in the profiler.
For my project, the combination that works great so far: unbox all types, use arena allocators if the compiler can guarantee the value doesn't escape, use GC for data that changes often, and ref counting for data that hardly ever changes (luckily cycles are not possible).
I have an example from early in my career where I accidentally created a memory leak in Python from a cyclic reference between a closure and a function argument
https://stackoverflow.com/questions/54726363/python-not-dele...
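For what it's worth, CPython's backup cycle collector will reclaim such a cycle once it runs; plain refcounting alone never will. A small demonstration:

```python
import gc
import weakref

class Big:
    pass

def make_cycle():
    obj = Big()
    obj.callback = lambda: obj   # closure captures obj; obj holds the closure
    return weakref.ref(obj)

gc.disable()                     # make the demonstration deterministic
ref = make_cycle()
print(ref() is None)             # -> False: the cycle keeps itself alive
gc.collect()                     # the tracing backup collector frees it
print(ref() is None)             # -> True
gc.enable()
```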
> 1. Updating reference counts is quite expensive.
> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all. The comparison is the ongoing garbage collection routine, which is going to be more expensive and occur unpredictably.
First off, updating the reference count invalidates the entire cache line in which the reference count lives. For naive reference counting (which I'm assuming the author is talking about since they give no indication they're familiar with anything else), this generally means invalidating the object's cache line (and with an atomic RMW, to boot, meaning you need a bus lock or LL/SC on most systems). So right away, you have created a potentially significant cacheline contention problem between readers in multiple threads, even though you didn't intend to actually write anything. RC Immix, for example, tries to mitigate this in many creative ways, including deferring the actual reference count updates and falling back to tracing for reclamation when the count gets too high (to avoid using too many bits in the header or creating too many spurious updates).
Secondly, you know what's cheaper than an atomic increment or decrement? Not doing anything at all. The vast majority of garbage in most production tracing garbage collectors (which are, with the exception of Go's, almost exclusively generational) dies young, and never needs to be updated, copied, or deallocated (so no calling destructors and no walking a tree of children, which usually involves slow pointer chasing). Even where the object itself doesn't die young, any temporary references to the object between collections don't have to do any work at all compared to just copying a raw pointer, C style. This and bump allocation (which the author also does not engage with) are the two biggest performance wins that tracing garbage collectors typically have over reference counting ones, and solutions like RC Immix must implement similar mechanisms to even become competitive. You don't even need to go into stuff like the potential benefits of compaction, or a reduction in garbage managing code on the hot path (which are more dubious and harder to show) to understand why tracing has some formidable theoretical advantages over reference counting!
But what about in practice? Surely, the overhead of having to periodically run the tracing GC negates all these benefits? Well, bluntly--no, not even close. At least, not unless you care only about GC latency to the exclusion of everything else, or are using something fancier (like deferred RC). You can't reason backwards from "Rust and C++ are generally faster than languages with tracing GCs on optimized workloads" to conclude that reference counting is better than tracing GC--Rust and C++ both go out of their way to avoid using reference counting at all wherever possible.
None of this is secret information, incidentally. It is very easy to find. The fact that the author is apparently so incurious that they never once bothered to find out why academics talk about tracing GC's performance being superior--and that they were so dismissive about it!--makes me pretty doubtful that people will find useful insights in the rest of the article, either.
> Secondly, you know what's cheaper... Not doing anything at all.
These techniques are on the level of resetting a stack pointer or calling `sbrk()`. Incorporating them doesn't produce more-advanced GC schemes, it just means you neglected to consider similar allowances for RC.
The line of contention is at traversing the object graph and pausing threads.
When the counter is 1, you can do anything with the object without affecting any other references.
Like the object could be mutable for a counter=1, and copy-on-write otherwise. Then you can make a (lazy) deep copy by just increasing the counter.
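A sketch of that counter==1 trick, with a hypothetical `CowBuffer` type and an explicit shared counter standing in for the refcount:

```python
class CowBuffer:
    """Copy-on-write via an explicit reference count: share() hands out
    another owner and bumps the count; write() mutates in place only when
    this handle is the sole owner, and copies lazily otherwise."""

    def __init__(self, data, _count=None):
        self.data = data
        self._count = _count if _count is not None else [1]  # shared counter

    def share(self):
        # A "deep copy" is just a counter increment until someone writes.
        self._count[0] += 1
        return CowBuffer(self.data, self._count)

    def write(self, i, value):
        if self._count[0] > 1:
            # Shared: detach from the old counter and copy the data.
            self._count[0] -= 1
            self.data = list(self.data)
            self._count = [1]
        self.data[i] = value     # sole owner: mutate in place

a = CowBuffer([0, 0, 0])
b = a.share()
b.write(0, 99)           # triggers the copy; a is unaffected
print(a.data, b.data)    # -> [0, 0, 0] [99, 0, 0]
```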
Pretty weird argument for one of the slowest languages out there ...
With GC, you can do nothing at all. In a system with lots of garbage, you can do a GC by copying everything accessible from the GC root, then de-allocating all the garbage in a single free.
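A toy version of that "copy the live data, drop everything else at once" idea, with hypothetical `Cell`/`evacuate` names standing in for a semispace copying collector:

```python
# Semispace-copy sketch: live cells reachable from the roots are copied to
# to-space; everything left behind in from-space is garbage and is
# discarded in one shot, without ever being visited.

class Cell:
    def __init__(self, value, refs=()):
        self.value = value
        self.refs = list(refs)
        self.forward = None      # set once the cell has been copied

def evacuate(roots):
    to_space = []

    def copy(cell):
        if cell.forward is None:
            clone = Cell(cell.value)
            cell.forward = clone          # forwarding pointer breaks sharing loops
            to_space.append(clone)
            clone.refs = [copy(r) for r in cell.refs]
        return cell.forward

    new_roots = [copy(r) for r in roots]
    return new_roots, to_space

from_space = [Cell(i) for i in range(5)]
from_space[0].refs = [from_space[1]]      # only cells 0 and 1 are live
roots, to_space = evacuate([from_space[0]])
from_space.clear()                        # the "single free" of all garbage
print(len(to_space))                      # -> 2
```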
The atomic inc/dec also have some nasty effects on parallel code. The cpu ends up thinking you mutated lines you didn’t mean to mutate.
So, GC is usually faster. RC has other benefits (more predictable behavior and timing, uses less memory, plays nicer with OS APIs).
GC is way faster if there is little collection.
In memory or cache intensive applications, garbage collection as a whole can be significantly slower.
The total time spent in GC across a program’s execution time is usually around 30% or so. Maybe more in some cases (some crazy Java workloads can go higher) or less in others (JavaScript since the mutator is slow), but 30% is a good rule of thumb. That includes the barriers, and total cost of all allocations, including the cost of running the GC itself.
Reference counting applied as a solution to memory safety, as a replacement for GC, is going to cost you 2x overhead just for the ref counting operations, and then some more on top of that for the actual malloc/free. When you throw in the fact that GCs always beat malloc/free in object-churn workloads, it's likely that the total overhead of counting refs, calling free(), and using a malloc() that isn't a GC malloc is higher than 2x, i.e. more than 50% of time spent in memory management operations (inc, dec, malloc, free).
It’s a trade off, though. The GC achieves that 30% because it uses more memory. All of the work of understanding the object graph is amortized into a graph search that happens infrequently, leading to many algorithmic benefits (like no atomic inc/dec, faster allocation fast path, freeing is freeish, etc), but also causing free memory to be reused with a delay, leading to 2x or more memory overhead.
That also implies that if you ask the GC to run with lower memory overhead, it'll use more than 30% of your execution time. It's true that if you want the memory usage properties of RC, and you try to tune your GC to get you there, you're going to have a slow GC. But that's not how most GC users run their GCs.
"Increments and decrements happen once and at a predictable time. The GC is running all the time and traversing the universe of GC objects. Probably with bad locality, polluting the cache, etc."
This is only the case with a mark-sweep collector; usually most of your allocations die young in the nursery. With reference counting you pay the counting cost for everything.
"In object-oriented languages, where you can literally have a pointer to something, you simply mark a reference as a weak reference if it might create a cycle."
As someone who has tried to identify memory leaks in production where someone has forgotten to "simply" mark a reference in some deep object graph as weak, this is naive.
"With an 8-byte counter you will never overflow. So...you know...just expand up to 8-bytes as needed? Usually you can get by with a few bits."
So now my "about as minimal as you can get short of nothing at all" check has an unpredictable branch in it?
"If you must overflow, e.g., you cannot afford an 8-byte counter and you need to overflow a 4-byte counter with billions of references, if you can copy it, you create a shallow copy."
I don't even know where to begin with this.
"If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?"
This probably has more to do with finalising resources and deterministic destruction than anything else.
--
Anyone who is interested in actually studying this area would probably find https://courses.cs.washington.edu/courses/cse590p/05au/p50-b... interesting. Also https://gchandbook.org/
I don't think Python ever did pure mark-and-sweep (CPython, at least; I'm sure Jython and other alternate implementations have).
My understanding was that they did pure reference counting, and kludged on a sweep GC to do cycle breaking eventually, as manually breaking cycles in early versions of Python was a pain point. A quick lookup seems to indicate Python 1 was pure reference counting, and they added the cycle breaking when they released Python 2.
Apple has a nice talk on ARC [1] but it got me thinking: if I have to think about reference counting this much I might just as well manage memory all by myself.
The true joy of Garbage Collection is that you can just create objects left and right and let the computer figure out when to clean them up. It's a much more natural way of doing things and lets computers do what they're best at: taking tedious tasks out of the hands of humans.
[1]: https://developer.apple.com/videos/play/wwdc2021/10216/
It might seem that it is simply about pushing your synchronizations problems onto the GC, but the synchronization issue that GC solves internally is different and usually more coarse-grained, so in the end you have significantly smaller synchronization overhead.
Btw, GC is also often RC in disguise. What I mean is that generational garbage collectors are basically a hybrid of tracing GC and RC. See https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... for the details.
Java's GC is concurrent and runs at safe points and stops the world so it avoids this problem.
Racket probably has a state-of-the-art garbage collector. (I don't actually know, but that's where I would start looking.) Clojure obviously has the same garbage collector as any other JVM language.
In one extreme you can build Racket using the Senora GC, which is conservative and non-moving, and is used only for bootstrapping.
On the other extreme, both of the normal versions of Racket have custom moving incremental GC. The docs with some high level explanations are in https://docs.racket-lang.org/reference/garbagecollection.htm...
The implementation details of the main "CS" version are in https://github.com/racket/racket/blob/master/racket/src/Chez... It's a replacement of the default GC of Chez Scheme that has better support for some features that are used in Racket, but I never have looked so deeply in the details.
Yikes
GC is only required if you as a programmer (or programming language) do not provide sufficient information to the compiler or runtime to understand the object graph.
You can find various algorithms in journals or whatnot written with the assumption that there's GC. Algorithms designed with this assumption may not have clear ownership for objects, and those objects may have cyclic references.
It's easy to say, "objects should have clear ownership relationships" but that kind of maxim, like most maxims, doesn't really survive if you try to apply it 100% of the time. Ownership is a tool that is very often useful for managing object lifetimes--it's not always the tool that you want.
Or do you want to manually assign memory addresses to your objects?
If you wanted performance these days, you wouldn't go for that architecture. It's a historical accident that they can't really free themselves from because of backwards compatibility.
Apparently it actually led to memory usage improvements in industrial projects like Redis:
The rant at the end can be boiled down to "I use confirmation bias [1] to make my engineering decisions". The OP has already decided that "GC" is slow, so I'm sure every time a runtime with it misbehaves it's "Well, that darn GC, I knew it was bad!" and every time RC misbehaves it's likely "Oh, well you should have nulled out your link here to break the cycle dummy!"
I really don't like such absolutist thinking in software dev. All of software dev is about making tradeoffs. RC and GC aren't superior or inferior to each other, they are just different and either (or both) could be valid depending on the circumstance.
Yes, this is a good point. It makes overly general claims.
E.g. a GC proponent could claim "well, tracing collectors do no work for dead objects, so they have no overhead!" Which is a good point, but not the whole story. Tracing collectors may need to repeatedly traverse live objects. Sure. But then generational collectors only traverse modified live objects that point to new objects. True. And concurrent collectors can trace using spare CPU resources, incremental collectors can break marking work up into small pauses, on and on. There are zillions of engineering tradeoffs and the GC Handbook covers most of them really well.
But yeah, the correct way to handle resources (not just memory!) is with value semantics and RAII. Because then you know the object will be cleaned up as soon as it goes out of scope with zero additional effort on your part. In places where this is not appropriate, a simple reference counting scheme may be used, but the idea is to keep the number of rc'd objects small. Do not use cyclical data structures. Enforce a constraint: zero cycles. For data structures like arbitrary graphs, keep an array of vertices and an array of edges that reference vertices by index.
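The index-based graph idea from that last sentence can be sketched like this (a hypothetical `Graph` class): the container owns flat arrays of vertices and edges, edges refer to vertices by index, and no object-level cycles exist even when the graph itself is cyclic.

```python
class Graph:
    def __init__(self):
        self.vertices = []           # vertex payloads, owned by the graph
        self.edges = []              # (src_index, dst_index) pairs

    def add_vertex(self, payload):
        self.vertices.append(payload)
        return len(self.vertices) - 1

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

g = Graph()
a = g.add_vertex("a")
b = g.add_vertex("b")
g.add_edge(a, b)
g.add_edge(b, a)                     # a cycle in the graph, not in ownership
print(len(g.edges))                  # -> 2
```

Dropping `g` releases everything in one go; no refcount ever has to chase the cycle.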
If you use a language with GC, you're probably just contributing to global warming.
(Basically, your lifetimes have to be the same as your scopes, which are in a simple tree structure only.)