Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true. Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.
Reference counting is really just another kind of GC. I'd highly recommend perusing this paper for more details: A Unifying Theory of Garbage Collection. https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-...
One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count. Modern memory architectures have much higher read bandwidth than write bandwidth, so reference counting typically has much lower throughput than tracing GC does.
However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.
Alloc/free can introduce arbitrary pauses last I checked, so yes, there are pauses. Any time spent doing bookkeeping for resources rather than running your code counts as GC time.
One reason is that GC is already universally used to mean only tracing garbage collection, and trying to defend its wider meaning is a pointless uphill battle.
Another is that it suits the job much better, because not every AMM technique works by producing garbage and then collecting it, you know.
And the confusing thing is that garbage collection (GC) doesn’t collect garbage, while reference counting (RC) does.
GC doesn’t look at every object to decide whether it’s garbage (how would it determine nothing points at it?); it collects the live objects, then discards all the non-live objects as garbage.
RC determines an object has become garbage when its reference count goes to zero, and then collects it.
That difference also is one way GC can be better than RC: if there are L live objects and G garbage objects, GC has to visit L objects, and RC G. Even if GC spends more time per object visited than RC, it still can come out faster if L ≪ G.
That also means that GC gets faster if you give it more memory, so that it runs with a smaller L/G ratio. (The large speed differences across modern memory cache hierarchies make that not quite true, but I think it still holds, ballpark.)
It overlooks the difference in how likely this is to occur (without large enough object graphs freed from the top, it may never be an issue), when it occurs (at any time vs. on cleanup that may not be latency sensitive), and how much control the programmer has over RC costs (determinism lets you profile this and apply mitigations).
RC with borrow checking can avoid a lot of refcount increments.
Tracing GC typically needs write barriers, so it's not free either.
I fail to see how it would be deterministic in a highly dynamic program. Imagine a game, for example, where the user can drag and drop different things onto a "parent" object. Observability is imo an entirely different axis.
> RC with borrow checking can avoid a lot of refcount increments.
That's the same thing as escape analysis: with language support, many objects can be effectively "removed" from the purview of the GC, decreasing load and greatly improving performance. It's a language-level feature, not inherent in the form of GC we do (RC vs tracing).
Using std::shared_ptr in a performance-sensitive context (e.g. after startup completes) is code smell.
Using pointers as important elements of a data structure, such that cycles are possible at all, is itself code smell. A general graph is usually better kept as elements in a vector, deque, or even hash table, compactly, with indices instead of pointers, and favoring keeping elements used near one another in the same cache line. Overuse of pointers to organize data tends to pointer-chasing, among the slowest of operations on modern systems.
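A minimal sketch of what that looks like in practice (in Rust; the type and field names here are illustrative, not from the original article): the graph is one flat `Vec`, edges are `usize` indices into it, and cycles are harmless because nothing is refcounted per node.

```rust
// A general graph stored compactly in a Vec, with usize indices
// instead of pointers. Neighboring nodes sit close together in memory.
struct Node {
    value: i32,
    neighbors: Vec<usize>, // indices into Graph::nodes, not pointers
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }

    fn add_node(&mut self, value: i32) -> usize {
        self.nodes.push(Node { value, neighbors: Vec::new() });
        self.nodes.len() - 1
    }

    fn add_edge(&mut self, from: usize, to: usize) {
        self.nodes[from].neighbors.push(to);
    }

    // Sum the values of a node's direct neighbors.
    fn neighbor_sum(&self, n: usize) -> i32 {
        self.nodes[n].neighbors.iter().map(|&i| self.nodes[i].value).sum()
    }
}

fn main() {
    let mut g = Graph::new();
    let a = g.add_node(1);
    let b = g.add_node(2);
    let c = g.add_node(3);
    g.add_edge(a, b);
    g.add_edge(a, c);
    g.add_edge(b, a); // a cycle: fine here, no refcounts to leak
    assert_eq!(g.neighbor_sum(a), 5);
    println!("ok");
}
```

Freeing the whole graph is then a single `Vec` deallocation, with no per-node traversal at all.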
Typical GC passes consist of little else but pointer chasing.
But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.
Systems are made fast by avoiding expensive operations not dictated by the actual problem. Reference counting, or any other sort of GC, counts as overhead: wasting time on secondary activity in preference to making forward progress on the actual reason for the computation.
Almost invariably neglected or concealed in promotion of non-RC GC schemes is overhead imposed by touching large parts of otherwise idle data, cycling it all through CPU caches. This overhead is hard to see in profiles, because it is imposed incrementally throughout the runtime, showing up as 200-cycle pauses waiting on memory bus transactions that could have been satisfied from cache if caches had not been trashed.
If a core is devoted to GC, then sweeps would seem to cycle everything through just that core's cache, avoiding trashing other cores' caches. But the L3 cache used by that core is typically shared with 3 or 7 other cores, so it is hard to isolate that activity to one core without wastefully idling the others. Furthermore, that memory bus activity competes with algorithmic use of the bus, slowing those operations.
Another way GC-dependence slows programs is by making it harder, or even impossible, to localize cost to specific operations, so that reasoning about performance becomes arbitrarily hard. You lose the ability to count and thus minimize expensive operations, because the cost is dispersed throughout everything else.
This is one of the biggest misconceptions about RC. You don't need to increment the refcount just to read the referenced data, because you already hold a reference whose count was incremented when it was handed to you. That semantic is captured very well by Rust's Arc type: the count is incremented when the Arc is cloned and decremented when the cloned Arc is dropped. But you can still get a regular reference to the data, since the compiler can locally enforce the borrow, ownership, and lifetime rules.
For example, in a web server, you might have the app’s config behind an Arc. It gets cloned for each request (thus rc inc’d), read a lot during the req, then dropped (thus rc dec’d) at the end of the handler.
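A sketch of that pattern (the `Config` type and field are made up for illustration): one atomic increment when a worker clones the Arc, zero refcount traffic for all the reads inside the handler, one decrement when the clone drops.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical app config; the field is illustrative.
struct Config {
    max_body_bytes: usize,
}

// The handler only borrows the config: reads here touch no refcount.
fn handle_request(config: &Config, body_len: usize) -> bool {
    body_len <= config.max_body_bytes
}

fn main() {
    let config = Arc::new(Config { max_body_bytes: 1024 });

    let mut workers = Vec::new();
    for _ in 0..4 {
        // One atomic increment per "request" (per clone)...
        let config = Arc::clone(&config);
        workers.push(thread::spawn(move || {
            // ...then any number of refcount-free reads via a plain borrow.
            handle_request(&config, 512)
        })); // the clone drops when the thread ends: one atomic decrement
    }
    for w in workers {
        assert!(w.join().unwrap());
    }
    println!("ok");
}
```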
RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system to handle these edge cases.
There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long effort kicked off to build a Rube Goldberg machine for closing that knowledge gap.
Depends on the GC algorithm used. Various GC algorithms only trace reachable objects, not unreachable ones.
Reference counting does the opposite, more or less. When you deallocate something, it's tracing unreachable objects.
One of the problems with this is that reference counting touches all the memory right before you're done with it.
> And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.
This is not a typical solution.
Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.
Garbage collection does not need to be used to collect all objects.
> There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long rube goldberg machine was built around closing that gap.
When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."
Embedded in this statement are usually some assumptions which should be challenged. For example, "memory should be freed as soon as it is no longer needed".
All tracing GC algorithms scan only live memory, and they typically do so in an array-like scan (writing some bits in the object header when a pointer to that object is discovered), not in linked-list order.
> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
This is patently untrue. Contemporary GCs have had card marking/scanning for 10+ years now.
But a compacting GC copies the data it's scanned into a contiguous stream, dramatically improving locality, cache utility and stream detection. And this affects not only subsequent GCs but also the application itself, which may traverse its object graph far more often than the GC does.
This is pretty trivial to avoid. When your thread finds itself freeing a big chain of garbage objects, you can have it stop at any arbitrary point and resume normal work, and find some way to schedule the work of freeing the rest of the chain (e.g. on another thread). It's much more complex and expensive to do this for tracing live data, because then you need to manage the scenario where the user program is modifying the graph of live objects while the GC is tracing it, using a write or read barrier; whereas for garbage, by definition you know the user can't touch the data, so a simple list of "objects to be freed" suffices.
"Reads become writes" (indeed, they become atomic read-modify-writes when multiple threads might be refcounting simultaneously) is a problem, though.
You have to do this if you want deterministic deallocation, because your holding a read-only reference to that object might be exactly what keeps it around for longer. So you need to track that.
(Deterministic deallocation also means having to recursively free unreachable objects. That's often described as an arbitrary "pause" behavior in RC systems, but it's actually inherent in the requirement for deterministic behavior. If you don't care about determinism for some class of objects, you can amortize that pause by sending them to a separate cleanup thread.)
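The "ship the garbage to a cleanup thread" idea can be sketched like this (a toy, assuming we're willing to give up deterministic destruction for these objects): the worker hands the head of a long chain to a channel instead of dropping it, and the recursive free runs off the hot path.

```rust
use std::sync::mpsc;
use std::thread;

// A long chain of boxed nodes; dropping the head normally frees the
// whole chain recursively, which is the "pause" under discussion.
struct Node {
    _payload: [u8; 64],
    next: Option<Box<Node>>,
}

fn build_chain(len: usize) -> Box<Node> {
    let mut head = Box::new(Node { _payload: [0; 64], next: None });
    for _ in 1..len {
        head = Box::new(Node { _payload: [0; 64], next: Some(head) });
    }
    head
}

fn main() {
    let (tx, rx) = mpsc::channel::<Box<Node>>();

    // Cleanup thread: dropping received chains happens off the hot path.
    let cleaner = thread::spawn(move || {
        for chain in rx {
            drop(chain); // the recursive free runs here, not in the worker
        }
    });

    let chain = build_chain(10_000);
    // Instead of pausing to drop it here, ship the garbage off. Safe by
    // construction: nothing else can still reach a dead chain.
    tx.send(chain).unwrap();

    drop(tx); // close the channel so the cleaner exits
    cleaner.join().unwrap();
    println!("ok");
}
```

No barriers are needed precisely because, as the comment above says, the user program can no longer touch data that is already garbage.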
My own read on this is that it blurs the line with deferred collection/counting, because you could either use it to complement deferral making it cheaper, or avoid deferral because you're getting enough of the benefits of deferral by proving objects dead instead of discovering that they are dead.
The issue with GC is it is a fluid implementation detail that is often necessary to understand deeply.
This led to one of the more entertaining C# memory-leak stories, where Princeton's entry to the DARPA Grand Challenge ended up failing because every frame they detected obstacles, created an object for each, and subscribed each obstacle object to an event. They missed that the event subscription was keeping those objects alive, and every piece of tumbleweed in the desert helped leak memory until the car just stopped! https://www.codeproject.com/Articles/21253/If-Only-We-d-Used...
The codebase I work with has had many pathological crashes due to this behavior.
So basically, in C#, when you use += to subscribe to events in a big system where object lifetimes are independent of each other, you're back to a C/C++ mindset: you should check that there's a -= call for the subscribed object when the subscribing object is about to go out of scope. Otherwise you get random crashes when events are delivered to an object that should have been dead.
This is one of the reasons I don't like "event" and += in C#. It's a leaky abstraction, like you said.
There's WeakEventManager [0] but that's available only in "classic" dotnet framework (and in "new" dotnet but only if you're targeting Windows) since it lives in the WPF namespace. It can be used outside of it, but you still take a dependency on System.Windows.
There are some other bespoke solutions too.
There's an open issue on the dotnet repo to add a weak event manager to the standard libs [1]. It's very well worth reading through it, it also has links to the other bespoke solutions available.
[0] https://docs.microsoft.com/en-us/dotnet/api/system.windows.w...
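The weak-subscription idea behind WeakEventManager can be sketched in Rust (names here are made up; the C# version holds handlers via WeakReference): the event source keeps only weak references, so subscribing does not keep the obstacle alive, and no explicit -= is needed.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Illustrative stand-ins for the DARPA story's obstacle objects.
struct Obstacle {
    id: u32,
}

struct EventSource {
    // Weak refs: subscribing does NOT keep obstacles alive.
    subscribers: RefCell<Vec<Weak<Obstacle>>>,
}

impl EventSource {
    fn subscribe(&self, obstacle: &Rc<Obstacle>) {
        self.subscribers.borrow_mut().push(Rc::downgrade(obstacle));
    }

    // Count subscribers still alive, pruning dead ones as we go.
    fn live_subscribers(&self) -> usize {
        let mut subs = self.subscribers.borrow_mut();
        subs.retain(|w| w.upgrade().is_some());
        subs.len()
    }
}

fn main() {
    let source = EventSource { subscribers: RefCell::new(Vec::new()) };

    let tumbleweed = Rc::new(Obstacle { id: 1 });
    let rock = Rc::new(Obstacle { id: 2 });
    source.subscribe(&tumbleweed);
    source.subscribe(&rock);
    assert_eq!(source.live_subscribers(), 2);

    // The frame ends; the obstacle goes away. No -= call needed, and
    // the subscription doesn't leak it.
    drop(tumbleweed);
    assert_eq!(source.live_subscribers(), 1);
    let _ = rock.id;
    println!("ok");
}
```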
If you're used to objects being destructed when they go out of scope ala C++ then yeah adapting to the lifecycle of objects in Java/C# takes some doing. But I think there's benefit to be had.
Which pauses you are meaning?
Reference counting is not free, but there are no long pauses (long compared to GC; e.g. in the JVM under certain workloads you can get 100ms pauses).
Nim switched from GC to RC and it even increased performance.
Benchmarks show they are within 30 percent of each other: https://www.techspot.com/images2/news/bigimage/2021/03/2021-...
When I can easily replace the deallocator (thus excluding most non-RC production GCs), I can (re)write the code to avoid GC pauses, e.g. by amortizing deallocation, destructors, etc. over several frames - perhaps by returning ownership of some type and its allocations to the type's originating thread, reducing contention while I'm at it. I have done this a few times. By "coincidence", garbage-generation storms causing noticeable delays are surprisingly uncommon IME.
As programs scale up and consume more memory, "live data" outscales "garbage" - clever generational optimizations aside, I'd argue the former gets expensive more quickly, and is harder to mitigate.
It's also been my experience that tracing or profiling any 'ole bog standard refcounted system to find performance problems is way more easy and straightforward than dealing with the utter vulgarity of deferred, ambiguously scheduled, likely on a different thread, frequently opaque garbage collection - as found in non-refcounted garbage collection systems.
So, at best, you're technically correct here - which, to be fair, is the best kind of correct. But in practice, I think it's no coincidence that refcounting systems tend to automatically and implicitly amortize their costs and avoid GC storms in just about every workload I've ever touched, and at bare minimum, reference counting avoids GC pauses... in the code I've written... by allowing me easier opportunities to fix them when they do occur. Indirectly causal rather than directly causal.
> if you read a heap-object out of a data structure, you have to increment the object's reference count.
This isn't universal. Merely accessing and dereferencing a shared_ptr in C++ won't touch the refcount, for example - you need to copy the shared_ptr to cause that. Rust's Arc/Rc need to be clone()d to touch the refcount, and the borrow checker reduces much of the need to do such a thing defensively, "in case the heap object is removed out from under me".
Of course, it can be a problem if you bake refcounting directly into language semantics and constantly churn refcounts for basic stack variables while failing to optimize said churn away. There's a reason why many GCed languages don't use reference counting to optimize the common "no cycles" case, after all - often, someone tried it out as an obvious and low hanging "optimization", and found it was a pessimization that made overall performance worse!
And even without being baked into the language, there are of course niches where heavy manipulation of long-term storage of references will be a thing, or cases where the garbage collected version can become lock-free in a context where such things actually matter - so I'll 100% agree with you on this:
> There are no hard lines; this is about performance tradeoffs, and always will be.
It's just another engineering decision. On modern systems, and especially with any runtime that can do the majority of the GC threaded and on an otherwise-unused core, you need to have some pretty serious performance requirements for GC to ever get to being your biggest problem. You should almost always know when you're setting out to write such a system, and then, sure, think about the GC strategy and its costs. However for the vast bulk of programs the correct solution is to spend on the order of 10 seconds thinking about it and realizing that the performance costs of any memory management solution are trivial and irrelevant and the only issue in the conversation is what benefits you get from the various options and what the non-performance costs are.
It is in some sense as big a mistake (proportional to the program size) to write every little program like it's a AAA game as it is to write a AAA game as if it's just some tiny little project, but by the sheer overwhelming preponderance of programming problems that are less complicated than AAA games, the former happens overwhelmingly more often than the latter.
Edit: I can be specific. I just greased up one of my production systems with Go memstats. It periodically scans XML files via network requests and parses them with a parser that cross-links parents, siblings, and children using pointers and then runs a lot of XPath on them, so, it's kinda pessimal behavior for a GC. I tortured it far out of its normal CPU range by calling the "give me all your data" JSON dump a 100 times. I've clicked around on the website it serves to put load on it a good 10x what it would normally see in an hour, minimum. In 15 minutes of this way-above-normal use, it has so far paused my program for 14.6 milliseconds total. If you straight-up added 14.6 milliseconds of latency to every page it scanned, every processing operation, and every web page I loaded, I literally wouldn't be able to notice, and of course that's not what actually happened. Every second worrying about GC on this system would be wasted.
Like a lot of premature optimization, it isn’t a problem until it is… but solutions aren’t unattainable.
It's nice when the runtime solves a problem you've had to solve yourself, but it also takes a bit of your fun away, even if your coworkers are relieved not to have to deal with 'clever' code anymore.
Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this. But isn't that an argument for reference counting?
Not 80%, but still annoying enough to dump it: https://discord.com/blog/why-discord-is-switching-from-go-to...
It is less boring than building up the skills to fix the plane in mid-flight.
The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.
Garbage collection has a huge, and generally entirely unappreciated win when it comes to threaded code. As with most things, there are tradeoffs, but every reference counting implementation that I've used has turned any concurrent access to shared memory into a huge bottleneck.
RAII gets you a lot of the way there.
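What RAII buys you here can be shown with a tiny Rust sketch (Rust's Drop is the same mechanism as C++ destructors): cleanup runs at the closing brace, deterministically and in reverse declaration order, with no finalizer or GC involvement - which is exactly what you want for resources like file descriptors.

```rust
use std::cell::RefCell;

// Record the order of cleanup events so determinism is observable.
thread_local! {
    static EVENTS: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

// Stand-in for a resource like a file descriptor.
struct Resource(&'static str);

impl Drop for Resource {
    fn drop(&mut self) {
        // Runs at scope exit, every time - this is where a real
        // resource would release its fd.
        EVENTS.with(|e| e.borrow_mut().push(self.0));
    }
}

fn main() {
    {
        let _a = Resource("a closed");
        let _b = Resource("b closed");
        EVENTS.with(|e| e.borrow_mut().push("scope ending"));
    } // deterministic: b drops first, then a, exactly here

    EVENTS.with(|e| {
        assert_eq!(*e.borrow(), vec!["scope ending", "b closed", "a closed"]);
    });
    println!("ok");
}
```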
So reference counting works by the programmer knowing the lifetime of each object, allowing them to increment/decrement the refcount only once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management. It's unsafe and unergonomic, and it's pretty telling that the author needs this pattern to make RC appear competitive. It's because actually doing reference counting safely is really expensive.
> If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?
Because they've made reference counting a part of their C extension API and ABI. If they wanted to use a GC, they'd instead need a very different API, and then migrate all the modules to the new API. (I.e. a way for those native extension to register/unregister memory addresses containing pointers to Python objects for the GC to see.)
Early on the deterministic deallocation given by reference counting would also have been treated by programmers as a language level feature, making it so that a migration would have broken working code. But I don't think that was ever actually guaranteed in the language spec, and anyway this was not carried over to various alternative Python implementations.
In Python reference counting precedes tracing garbage collection.
So they didn't 'go through the hassle of implementing reference counting' after they already had tracing garbage collection. Instead they went through the hassle of implementing tracing garbage collection after they already had reference counting.
(And as you say for backwards compatibility reasons, they can't get rid of reference counting.)
https://docs.python.org/3/reference/datamodel.html#objects-v...
So, idiomatic Python does not rely on this, and uses with-statements for deterministic cleanup.
But, of course, as with any language, there's plenty of non-idiomatic Python out in the wild.
Programmer, or compiler. In the latter case it's automatic reference counting, which I've never heard called "manual memory management".
Swift is inferior here because it uses reference counting GC without much work towards mitigating its drawbacks like cycles (judging by some recent posts, some of its fans apparently aren't even aware RC has drawbacks), while more established GC languages had much more time to mitigate their GC drawbacks - e.g. Java's ZGC mitigates latency by being concurrent.
My anecdata indicate that Java apps are not as responsive as ObjC/Swift for the most part.
The real reason why a tracing GC was a failure in Objective-C was due to the interoperability with the underlying C semantics, where anything goes.
The implementation was never stable enough beyond toy examples.
Naturally automating the Cocoa release/retain calls made more sense, given the constraints.
In typical Apple fashion they pivoted into it, gave the algorithm a fancy name, and then in a you're holding it wrong style message, sold their plan B as the best way in the world to manage memory.
When Swift came around, having the need to easily interop with the Objective-C ecosystem naturally meant to keep the same approach, otherwise they would need the same machinery that .NET uses (RCW/CCW) to interop with COM AddRef/Release.
What Apple has is excellent marketing.
Any true GC strategy (i.e. one that collects cycles) will fundamentally touch and allocate more memory than malloc/free, whereas reference counting is pretty close to malloc/free performance; it doesn't need to touch any memory not involved. At OS scale, that's a huge performance advantage. You can scope down the memory involved in GC using advanced, modern GC techniques, but you're still going to be behind malloc/free in overall efficiency - cache efficiency, memory maintenance overhead, and additional memory required for bookkeeping. And reference counting will be pretty darn close to malloc/free.
Reference counting and garbage collection have a very clear difference: when the referenced objects are destroyed (not deallocated). In RC it happens when the count reaches zero. In GC it happens some time later. That difference is crucial for having or not having deterministic performance in your program.
Things are rarely as clear cut as we would want to believe.
Yes. In languages with destructors/finalizers called from the garbage collector, things can get very complicated. C# and Java have this problem.
Go avoids it by having scope-based "defer" rather than destructors.
They found that the Swift version spent 76% of the time doing reference counting, even slower than Go, which spent 0.5% in the garbage collector.
Also, I fail to see the advantage of RC in the case of a presumably mostly immutable language: a tracing GC is even faster there, since there are no changes to the object graph after allocation, which makes a generational approach scale very well and run almost completely in parallel.
I see great reasons for both systems being useful, but both systems also bring their own warts.
Yes, ref counting affects cache and branch prediction, but GC is a whole subsystem running in parallel with your main code, constantly cleaning up after you. Which is best will always depend on the application.
Some languages lean heavily one way or the other, too. Scripting with ref counting would be a nightmare, as would running a garbage collector on an 8-bit micro. Since the article's talking C & C++, a pro-ref-counting stance of course makes sense.
Not sure it's entirely deterministic. A variable going out of scope can trigger deallocation of a large object graph, and it's not always clear just by looking at the code what will happen (especially if objects have destructors with side effects, your object graph is highly mutable, and your code is on a hot path). A common trick is to delay deallocation to a later time, but then again you can't be sure when your destructors will run. Another issue is cycles: if your RC system has cycle detection, your program will behave differently depending on whether a cycle formed at runtime or not.
Why? Old version of Python used ref counting only, and Python still largely relies on reference counting (but has a GC to detect cycles).
Run ./waf configure build && ./build/tests/collectors/collectors and it will spit out benchmark results. On my machine (Phenom II X6 1090), they are as follows:
Copying Collector 8.9
Reference Counting Collector 21.9
Cycle-collecting Reference Counting Collector 28.7
Mark & Sweep Collector 10.1
Mark & Sweep (separate mark bits) Collector 9.6
Optimized Copying Collector 9.0
I.e. for total runtime it is not even close; tracing GC smokes ref counting out of the water. Other metrics, such as number of pauses and maximum pause times, may still tip the balance in favor of ref counting, but those are much harder to measure. Though note the abysmal runtime of the cycle-collecting ref counter. It suggests that cycle collection could introduce the exact same pause times ref counting was supposed to eliminate. This is because in practice cycles are very difficult to track and collect efficiently. In any case, it clearly is about trade-offs; claiming tracing GC always beats ref-counting GC or vice versa is naive.
I would not be surprised to find that even a naive mark and sweep collector is faster than naive refcounting on some workloads. One obvious thing to consider is that the work is delayed, you can perform the sweeping 'as needed'. Even the marking doesn't have to run on any deterministic schedule.
The thing is that, from my naive perspective, run of the mill tracing collector algorithms are just way more advanced than your typical refcount. Most refcounting is just that - either an integer, atomic integer, or both, that gets incremented and decremented based on a number of operations applied to the underlying type. The naive approach has no delays.
Tracing GCs on the other hand, although perhaps not naive ones (could you link me info on the quickfit algorithm? I can not find anything online), might contain epochs that bump allocate in the majority of cases. That'll be particularly nice for benchmarks where allocations are likely very short lived and may actually never need to get to the mark/sweep phase. Your algorithm isn't really documented and I just really don't feel like looking at C right now.
Although naive refcounting is very common it's not the only game in town. Depending on the language you can group refcounts together - for example, imagine you have:
(assuming all fields are automatically refcounted)

    struct Foo {
        bar: Bar,
        baz: Baz,
    }
In theory, a "copy" of this type would involve 3 increments, possibly atomic increments. Each increment would also require a heap pointer dereference, and there would be no locality of those integers behind the pointers. That would be the trivial implementation.
But depending on the language you could actually flatten all of those down to 1 RC. This is language dependent, and it requires understanding how these values can be moved, referenced, etc, at compile time. You could also store all reference counts in tables associated with structures, such that you have locality when you want to read/write to multiple counters. The pointer dereference is going to be brutal so having locality there will be a nice win. I'd be curious to run your benchmarks through valgrind to see how much the refcount is just spending time on memory fetches that get invalidated in the cache immediately.
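A concrete sketch of the flattening (the counter instrumentation is mine, just to make the increment counts visible; `Bar`/`Baz` are the comment's illustrative types): the naive layout pays one increment per refcounted field, while the flattened layout inlines the fields into one allocation and pays a single increment per copy.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Illustrative field types.
struct Bar { x: u64 }
struct Baz { y: u64 }

// Naive layout: every field refcounted separately. Copying a Foo
// means one increment per field, behind separate heap pointers with
// no locality between the counters.
struct FooNaive {
    bar: Arc<Bar>,
    baz: Arc<Baz>,
}

// Flattened layout: the fields live inline in one allocation, so a
// "copy" is a single refcount increment on the outer Arc.
struct FooFlat {
    bar: Bar,
    baz: Baz,
}

// Instrumentation: count how many refcount increments we perform.
static INCREMENTS: AtomicUsize = AtomicUsize::new(0);

fn count_clone<T>(a: &Arc<T>) -> Arc<T> {
    INCREMENTS.fetch_add(1, Ordering::Relaxed);
    Arc::clone(a)
}

fn main() {
    let naive = FooNaive { bar: Arc::new(Bar { x: 1 }), baz: Arc::new(Baz { y: 2 }) };
    // Copying the naive version: one increment per refcounted field
    // (two here; three if Foo itself were also behind an Arc).
    let _copy = FooNaive { bar: count_clone(&naive.bar), baz: count_clone(&naive.baz) };
    assert_eq!(INCREMENTS.load(Ordering::Relaxed), 2);

    let flat = Arc::new(FooFlat { bar: Bar { x: 1 }, baz: Baz { y: 2 } });
    let copy = count_clone(&flat); // one increment total
    assert_eq!(INCREMENTS.load(Ordering::Relaxed), 3);
    assert_eq!(copy.bar.x + copy.baz.y, 3);
    println!("ok");
}
```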
Anyway, an example of a pretty slick refcounting GC is what Pony built: https://tutorial.ponylang.io/appendices/garbage-collection.h... https://www.ponylang.io/media/papers/OGC.pdf
Pony has different types for:

1. Local, Immutable
2. Local, Mutable
3. Shared, Immutable
4. Shared, Mutable
You can read the paper where they discuss how they track local variables vs shared variables, the implementation of counter tables, etc.
So I guess to summarize:
1. The results make sense, or as much sense as anything. I'd be interested in more details on the algorithms involved and your benchmark methodology.
2. "Naive" tracing GCs are actually pretty advanced, and advanced refcount implementations are pretty scarce.
Throughput-wise, it's hard to beat naive tracing gc. The algorithms are just too simple and they don't "interfere" with "normal operations" like ref counting does. Assuming the same allocation pattern (i.e no cheating by stack allocating objects), a tracing gc would likely (again, throughput-wise) beat manual memory management too. The additional benefit tracing gives you is easy heap compaction. Thus future pointer-chasing and memory allocations will be more efficient. With ref counting, compaction is harder.
True, you could delay sweeping, but ime, marking time dominates so you don't gain much. Even with a huge heap of several gigabytes, sweeping is just a linear scan from lowest to highest address.
Quick fit is a memory allocator, see: http://www.flounder.com/memory_allocation.htm Most gcs do not keep the heap contiguous so you need it in a layer below the gc. Quick fit is the algorithm almost everyone uses and it is very good for allocating many small objects of fixed sizes (8, 16, 32, etc.). It could be swapped out with malloc/free pairs instead, at the price of some performance.
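A toy simulation of the quick fit idea (my sketch, not the linked implementation): one free list per small size class, so allocating a common size is a constant-time pop. A real allocator hands out raw memory; here we just track block offsets into a fixed arena to show the structure.

```rust
// Size classes for small objects; real allocators use more of them.
const CLASSES: [usize; 4] = [8, 16, 32, 64];

struct QuickFit {
    // free_lists[i] holds arena offsets of free blocks of CLASSES[i] bytes.
    free_lists: [Vec<usize>; 4],
    bump: usize,       // next unused offset in the arena
    arena_size: usize,
}

impl QuickFit {
    fn new(arena_size: usize) -> Self {
        QuickFit { free_lists: [vec![], vec![], vec![], vec![]], bump: 0, arena_size }
    }

    // Smallest class a request fits in, or None if it's too big.
    fn class_of(size: usize) -> Option<usize> {
        CLASSES.iter().position(|&c| size <= c)
    }

    // Returns an arena offset, or None if the request can't be served.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let i = Self::class_of(size)?;
        if let Some(off) = self.free_lists[i].pop() {
            return Some(off); // fast path: reuse a freed block of this class
        }
        // Slow path: carve a fresh block from the arena.
        let off = self.bump;
        if off + CLASSES[i] > self.arena_size {
            return None;
        }
        self.bump += CLASSES[i];
        Some(off)
    }

    fn free(&mut self, off: usize, size: usize) {
        let i = Self::class_of(size).expect("freed block must fit a class");
        self.free_lists[i].push(off); // O(1): no coalescing on this path
    }
}

fn main() {
    let mut qf = QuickFit::new(1024);
    let a = qf.alloc(16).unwrap();
    let b = qf.alloc(16).unwrap();
    assert_ne!(a, b);
    qf.free(a, 16);
    let c = qf.alloc(10).unwrap(); // 10 rounds up to the 16-byte class
    assert_eq!(c, a);              // the freed block is reused immediately
    println!("ok");
}
```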
I have to disagree with naive tracing being advanced. My mark & sweep implementation is only about 50 lines and that includes comments: https://github.com/bjourne/c-examples/blob/master/libraries/... A copying collector isn't much more complicated. Neither is beyond the reach of most comp sci students. Yes, optimized multi-generational tracing collectors supporting concurrent and resumable tracing are very complicated. But the same is true of optimized ref counting schemes. :)
Pony looks very interesting. It looks like it is supposed to have less object churn than very dynamic languages like JavaScript which probably makes ref counting very suitable for it.
There is a whole generation of programmers that has come to equate GC with Java's 10-second pauses, or generics/typed variables with Java's implementation of them. Even the return to typed systems (Sorbet, Python's typing, TypeScript) could be read as: typed languages are great; what we really hated was Java's verbose semantics.
As a language, Java's not too bad. It's a bit wordy, in bad need of some syntax sugar, but it's designed to be fairly straightforward and for the most part it does its job well. I don't need a degree in language theory to get started writing it.
Java the programming style, particularly enterprise, is a horrendous over-engineered mess that schools jam down the throats of students who don't know any better. It's designed to (and fails to) enforce a common style that can be written by armies of mediocre developers plodding along inside giant enterprise codebases, so that no matter who wrote the code, some other developer in another department can figure out how to call it.
Java the JVM is a pretty nifty beast. It's made tradeoffs that mean it's not always suited to every use case, but put in its element it really shines. The modern GC algos give developers options based on the program's needs. It's currently struggling to overcome some historical decisions that, while good back in the old days, are now holding it back.
Personally I'm very biased towards Kotlin, which gives me the benefit of the JVM without the barf that is Java. It's not the fastest-executing language out there, but for me it's a perfect balance between development speed, ecosystem of battle-tested libraries, and competitive execution speed.
Anyone who has ever shipped a C# Unity game knows the pain that is the garbage collector. It’s effectively impossible to avoid frame hitches with the GC.
I’ve spent a LOT of time going way out of my way to avoid any and all garbage collections. Which somewhat defeats the purpose of using a GC-based language.
I definitely would say “GC used to be bad but now it’s good”. That tale has been spun for 30+ years at this point!
That said, it could also be a function of the same "problem" Java has in its design - Java by default boxes everything and so every memory allocation increases garbage collection pressure. Go, by using escape analysis and favoring stack allocations, doesn't have this problem and has had sub-millisecond pauses for years now.
You rarely hear people complain about Go's GC despite it being much less mature than the JVM's. But due to actual language design, Go's GC does much less work. I wouldn't say "GC used to be bad but now it's good", but rather that the designs of languages like C# and Java were too dependent on using the GC for memory allocations, and there are other GC languages out there that use the GC less, which can lead to greater performance.
I believe it's actually the opposite - Java has pretty simple, compact and well defined semantics. Too simple and compact for comfort - a little syntactic sugar would have made the language a lot less verbose.
What’s P&L?
> Java's verbose semantics
Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
I would say it isn't that Java's semantics are verbose; it's that the way Java is traditionally written turns every line into 3 lines on your screen:
  public ItalicTextBox makeItalicTextBox(String actualTextIWantToBeItalic) {
      ItalicTextBox itb = italicTextBoxFactoryGenerator.generateFactory().buildItalicTextBox(actualTextIWantToBeItalic);
      return itb;
  }
I think this is actually the pernicious work of Martin's _Clean Code_ which trained a whole generation of Java coders to write nonsense one line functions with a million interior function calls like this, not anything forced by the Java language itself, but it makes for really ugly code, in my exceptionally humble but undoubtedly correct opinion.
I meant PL research as in programming language research. Not sure why my brain decided to put an & there.
>Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
`FizzBuzzEnterpriseEdition` exists and is a meme for a reason. That said, I'm sure Java idioms written today are a lot more sane than they were around the time everyone decided to rewrite everything in Ruby. When I say Java here, I'm talking about an era of Java that probably no longer exists (the JVM also doesn't have 10-second pauses today, and its garbage collectors are widely considered the best ever built).
However, we still don’t have something like auto/let from C++/Rust.
Conversely, it's also possible for reference counting to have perverse performance cases over a truly arbitrary reference graph with frequent increments and decrements. You're not just doing atomic inc/dec, you're traversing an arbitrary number of pointers on every reference update, and it can be remarkably difficult to avoid de/allocations in something like Python where there's not really a builtin notion of a primitive non-object type.
Generally speaking, memory de/allocation patterns are the issue, not the specific choice of reference counting vs gc.
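The "every read of a reference is a write" effect mentioned up-thread is easy to observe in CPython, where merely aliasing an object bumps its reference count (`sys.getrefcount` is a real CPython API; it reports one extra reference for its own argument):

```python
import sys

x = [1, 2, 3]
base = sys.getrefcount(x)   # counts x itself plus getrefcount's argument

y = x                       # a plain "read": aliasing x into y
after = sys.getrefcount(x)

print(after - base)         # -> 1: the read produced a refcount write
```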
[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html
It's a compromise, on memory consumption and performance. Modern GCs are minimising the impact of those factors, but they still remain a part of the design.
RC is a performance compromise.
JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. v8 is different, it uses a moving collector: references to heap objects are held behind a double-redirection so the GC may move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.
Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.
Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially since bugs may only manifest if the GC runs, which it might not in your tests.
One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!
But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.
With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.
Moving is indeed a big pain, and I'm about to back out of it for Oil :-/
----
edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:
https://news.ycombinator.com/item?id=32283641
I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.
I've never heard of a reference counting implementation that can handle memory compaction.
Every time you update a reference count, which is every time you touch any object, you're going to have to write to that RAM, which means stealing it from any other threads using it on any other processors. If you share large trees of data between threads, traversing that tree in different threads will always end up with your threads constantly fighting with each other since there's no such thing as read only memory in reference counting.
When releasing something like a huge list in reference counting, how does the release avoid blowing the stack with recursive releasing? My guess is this just a "don't use a large list whose release may blow the stack with recursive releasing" situation.
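One standard answer is to make the release iterative with an explicit worklist instead of recursing (CPython's "trashcan" mechanism exists for roughly this reason). A sketch with hypothetical `Node`/`decref` names:

```python
# Iterative release for a refcounted linked list: an explicit worklist
# replaces recursion, so releasing a 100k-element chain uses O(1) stack.

class Node:
    def __init__(self, next=None):
        self.rc = 1           # each node starts with one owner
        self.next = next

freed = 0

def decref(node):
    global freed
    worklist = [node]
    while worklist:
        n = worklist.pop()
        n.rc -= 1
        if n.rc == 0:
            freed += 1
            if n.next is not None:
                worklist.append(n.next)   # defer the child, don't recurse

# Build a 100k-node chain, then release the head.
head = None
for _ in range(100_000):
    head = Node(next=head)
decref(head)
print(freed)    # -> 100000
```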
It's possible to add that in theory. But if you are tracing all your memory anyway so you can compact it, you typically might as well collect the garbage, while you are at it.
But: you are in for a treat, someone implemented compaction for malloc/free. See https://github.com/plasma-umass/Mesh
They use virtual memory machinery as the necessary indirection to implement compaction, with neither changing any pointers nor reliably distinguishing pointers from integers.
Well, that depends on how the RC is done. This is key to understand, because if you can control it, the RC becomes cheaper.
You can see this approach in
http://sblom.github.io/openj-core/iojNoun.htm
i.e., if instead of `[Rc(1), Rc(2)]` you do `Rc([1, 2])`, that works great.
When I worked at Facebook, which is structurally and politically incapable of building high-quality client software, I was on a small team of people tasked with making heroic technical fixes to keep the iOS app running despite literally hundreds of engineers working on the same binary incentivized to dump shoddy code to launch their product features that nobody would use as fast as possible (did you know that at one point you could order food through the Facebook app, and that a whole two digit number of people per day used this feature? Etc.)
Objective-C has ARC (automatic reference counting) — every pointer is a refcounted strong reference by default unless special annotations are used. What makes it worse is that large, deep hierarchies are common, making it easy to create reference cycles that leak huge amounts of memory.
For example, the view controller for a large and complicated page (referencing decoded bitmap images and other large objects) is the root of a large tree of sub-objects, some of whom want to keep a reference to the root. Now imagine the user navigates away and the reference to the view controller goes away, but nothing in the tree is deallocated due to the backlink — congratulations, you just leaked 10 MB of RAM!
It’s possible to do this correctly if you actually read the docs and understand what you’re doing, using tools like weak pointers, but when you have hundreds of developers, many of whom got their job either by transferring from an android team or by just memorizing pat answers to all the “Ninja” algorithms interview questions (practically all of which have leaked on Leetcode and various forums), you can be sure that enough of them will fail to do so to create major issues with OOMs.
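The same weak-pointer pattern translates directly to Python's `weakref` module: hold the child-to-parent back-link weakly, and dropping the parent actually frees it (the `Controller`/`Child` names are hypothetical):

```python
import weakref

class Controller:
    def __init__(self):
        self.children = []    # strong references: parent owns children

class Child:
    def __init__(self, parent):
        # Back-link held weakly, so the child doesn't keep the parent alive.
        self._parent = weakref.ref(parent)

    @property
    def parent(self):
        return self._parent()    # None once the parent has been freed

root = Controller()
root.children.append(Child(root))
child = root.children[0]

del root                 # drop the only strong reference to the parent
print(child.parent)      # -> None: no cycle, the parent was freed immediately
```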
To mitigate this, we created a “retain cycle detector” — basically a rudimentary tracing GC — that periodically traced the heap to detect these issues at runtime and phone home with a stack trace, which we would then automatically (based on git blame) triage to the offending team.
It was totally egregious undefined behavior, one thread tracing the heap with no synchronization with respect to the application threads that were mutating it, but the segfaults this UB caused were so much rarer than the crashes due to OOMs that it prevented that we decided to continue running it.
This pretty much nails down what I imagine is the main difference between GC and ARC: with the former you sacrifice performance for ease of use, and with the latter you improve performance by placing some additional work on the programmers.
With allocating as dead, you're basically turning it into a tracing collector for the young generation.
It can be quite "fast."
Very cool stuff!
If you have pure, immutable and lazy, you can get cycles. (That's Haskell.) This is almost as complicated for a GC as not having immutability.
I'm not saying that GC is always the best choice, but this article gets the most important argument wrong:
> 1. Updating reference counts is quite expensive.

> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all.
Yes, it is. Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".
Additionally, every modern GC does generational collection, so for the vast majority of objects, the GC literally does "nothing at all". No matter how little work it does, a RC solution has to do O(garbage) work, while a copying GC can do O(not garbage) work.
Now, that's not to say that GC is automatically better. There are trade-offs here. It depends on the workload, the amount of garbage being created, and the ratio of read to write operations.
The article says:
> I've already stated I'm not going to do benchmarks. I am aware of two orgs who've already run extensive and far-reaching experiments on this: Apple, for use in their mobile phones, and the Python project.
I can counterpoint that anecdata: Google extensively uses Java in high-performance systems, and invented a new GC-only language (Go) as a replacement for (their uses of) Python.
The right answer is to do benchmarks. Or even better yet, don't worry about this and just write your code! Outside of a vanishingly small number of specialized use cases, by the time GC vs RC becomes relevant in any meaningful way to your performance, you've already succeeded, and now you're dealing with scaling effects.
That's not true. Go was invented with the intention of replacing C++ at Google. That didn't really work out, and in practice Go became more of a replacement of Python for some applications at Google.
Also there are some indications that Go didn't gain traction necessarily on the merits of the language itself, but more on the starpower of its authors within Google.
(I mostly agree with the rest of what you wrote.)
Updating an atomic reference count may trigger a cache flush in other CPUs, or stall waiting for them to do so, so it's not so minimal after all.
Well, ok, let's go whole hog, we're collecting garbage again, and it sucks, we get all these baby objects, let's try and optimize the GC: we can keep, I dunno, a count of references to new objects, do some allocation sinking to see if we can avoid making them, put the babies in an orphanage, hey look, RC is GC, QED.
If I needed hard-realtime, I would avoid allocation entirely.
This blog post is an answer to: "Tell me you haven't learned about cache coherence without telling me you haven't learned about cache coherence."
[citation needed]
You and the blog post are arguing opposite things, and neither of you has shown any evidence. I get that you're arguing that reference counted objects are bigger (to store the reference count) and/or might use double indirection (depending on implementation), which are both bad for caches. It's not a bad argument. But the counter-argument that the blog post makes is persuasive as well: it's expensive to run a GC that scans the heap looking for loose objects, and reference counting does not need to do that. GC can also be "stop-the-world", as well as unpredictable and jittery, in a way reference counting is not.
My instinct is that reference counting is actually faster (which matches my personal experience), but really, this is not an argument you can solve by arguing in the abstract, you need actual data and benchmarks.
https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html
Way better than RC.
And... often GC will be able to use arena allocators before falling back to "proper" GC allocation, which will be a lot faster than ref counting everything.
And atomics can get very slow; I've had atomics show up regularly in the profiler.
For my project, the combination that works great so far: unbox all types, use arena allocators if the compiler can guarantee the value doesn't escape, use GC for data that changes often, and ref counting for data that hardly ever changes (luckily cycles are not possible).
I have an example from early in my career where I accidentally created a memory leak in Python from a cyclic reference between a closure and a function argument
https://stackoverflow.com/questions/54726363/python-not-dele...
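For what it's worth, CPython's backup cycle collector will reclaim such a cycle once it runs; plain refcounting alone never will. A small demonstration:

```python
import gc
import weakref

class Big:
    pass

def make_cycle():
    obj = Big()
    obj.callback = lambda: obj   # closure captures obj; obj holds the closure
    return weakref.ref(obj)

gc.disable()                     # make the demonstration deterministic
ref = make_cycle()
print(ref() is None)             # -> False: the cycle keeps itself alive
gc.collect()                     # the tracing backup collector frees it
print(ref() is None)             # -> True
gc.enable()
```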
> 1. Updating reference counts is quite expensive.
> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all. The comparison is the ongoing garbage collection routine, which is going to be more expensive and occur unpredictably.
First off, updating the reference count invalidates the entire cache line in which the reference count lives. For naive reference counting (which I'm assuming the author is talking about since they give no indication they're familiar with anything else), this generally means invalidating the object's cache line (and with an atomic RMW, to boot, meaning you need a bus lock or LL/SC on most systems). So right away, you have created a potentially significant cacheline contention problem between readers in multiple threads, even though you didn't intend to actually write anything. RC Immix, for example, tries to mitigate this in many creative ways, including deferring the actual reference count updates and falling back to tracing for reclamation when the count gets too high (to avoid using too many bits in the header or creating too many spurious updates).
Secondly, you know what's cheaper than an atomic increment or decrement? Not doing anything at all. The vast majority of garbage in most production tracing garbage collectors (which are, with the exception of Go's, almost exclusively generational) dies young, and never needs to be updated, copied, or deallocated (so no calling destructors and no walking a tree of children, which usually involves slow pointer chasing). Even where the object itself doesn't die young, any temporary references to the object between collections don't have to do any work at all compared to just copying a raw pointer, C style. This and bump allocation (which the author also does not engage with) are the two biggest performance wins that tracing garbage collectors typically have over reference counting ones, and solutions like RC Immix must implement similar mechanisms to even become competitive. You don't even need to go into stuff like the potential benefits of compaction, or a reduction in garbage managing code on the hot path (which are more dubious and harder to show) to understand why tracing has some formidable theoretical advantages over reference counting!
But what about in practice? Surely, the overhead of having to periodically run the tracing GC negates all these benefits? Well, bluntly--no, not even close. At least, not unless you care only about GC latency to the exclusion of everything else, or are using something fancier (like deferred RC). You can't reason backwards from "Rust and C++ are generally faster than languages with tracing GCs on optimized workloads" to conclude that reference counting is better than tracing GC--Rust and C++ both go out of their way to avoid using reference counting at all wherever possible.
None of this is secret information, incidentally. It is very easy to find. The fact that the author is apparently so incurious that they never once bothered to find out why academics talk about tracing GC's performance being superior--and that they were so dismissive about it!--makes me pretty doubtful that people will find useful insights in the rest of the article, either.
> Secondly, you know what's cheaper... Not doing anything at all.
These techniques are on the level of resetting a stack pointer or calling `sbrk()`. Incorporating them doesn't produce more-advanced GC schemes, it just means you neglected to consider similar allowances for RC.
The line of contention is at traversing the object graph and pausing threads.
When the counter is 1, you can do anything with the object without affecting any other references.
Like the object could be mutable for a counter=1, and copy-on-write otherwise. Then you can make a (lazy) deep copy by just increasing the counter.
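A sketch of that counter==1 trick, with a hypothetical `CowBuffer` type and an explicit shared counter standing in for the refcount:

```python
class CowBuffer:
    """Copy-on-write via an explicit reference count: share() hands out
    another owner and bumps the count; write() mutates in place only when
    this handle is the sole owner, and copies lazily otherwise."""

    def __init__(self, data, _count=None):
        self.data = data
        self._count = _count if _count is not None else [1]  # shared counter

    def share(self):
        # A "deep copy" is just a counter increment until someone writes.
        self._count[0] += 1
        return CowBuffer(self.data, self._count)

    def write(self, i, value):
        if self._count[0] > 1:
            # Shared: detach from the old counter and copy the data.
            self._count[0] -= 1
            self.data = list(self.data)
            self._count = [1]
        self.data[i] = value     # sole owner: mutate in place

a = CowBuffer([0, 0, 0])
b = a.share()
b.write(0, 99)           # triggers the copy; a is unaffected
print(a.data, b.data)    # -> [0, 0, 0] [99, 0, 0]
```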
Pretty weird argument for one of the slowest languages out there ...
With GC, you can do nothing at all. In a system with lots of garbage, you can do a GC by copying everything accessible from the GC root, then de-allocating all the garbage in a single free.
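A toy version of that "copy the live data, drop everything else at once" idea, with hypothetical `Cell`/`evacuate` names standing in for a semispace copying collector:

```python
# Semispace-copy sketch: live cells reachable from the roots are copied to
# to-space; everything left behind in from-space is garbage and is
# discarded in one shot, without ever being visited.

class Cell:
    def __init__(self, value, refs=()):
        self.value = value
        self.refs = list(refs)
        self.forward = None      # set once the cell has been copied

def evacuate(roots):
    to_space = []

    def copy(cell):
        if cell.forward is None:
            clone = Cell(cell.value)
            cell.forward = clone          # forwarding pointer breaks sharing loops
            to_space.append(clone)
            clone.refs = [copy(r) for r in cell.refs]
        return cell.forward

    new_roots = [copy(r) for r in roots]
    return new_roots, to_space

from_space = [Cell(i) for i in range(5)]
from_space[0].refs = [from_space[1]]      # only cells 0 and 1 are live
roots, to_space = evacuate([from_space[0]])
from_space.clear()                        # the "single free" of all garbage
print(len(to_space))                      # -> 2
```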
The atomic inc/dec also have some nasty effects on parallel code. The cpu ends up thinking you mutated lines you didn’t mean to mutate.
So, GC is usually faster. RC has other benefits (more predictable behavior and timing, uses less memory, plays nicer with OS APIs).
GC is way faster if there is little collection.
In memory or cache intensive applications, garbage collection as a whole can be significantly slower.
The total time spent in GC across a program’s execution time is usually around 30% or so. Maybe more in some cases (some crazy Java workloads can go higher) or less in others (JavaScript since the mutator is slow), but 30% is a good rule of thumb. That includes the barriers, and total cost of all allocations, including the cost of running the GC itself.
Reference counting applied as a solution to memory safety, as a replacement for GC, is going to cost you 2x overhead just for the ref counting operations, and then some more on top of that for the actual malloc/free. When you throw in the fact that GCs always beat malloc/free in object-churn workloads, it's likely that the total overhead of counting refs, calling free(), and using a malloc() that isn't a GC malloc is higher than 2x, i.e. more than 50% of time spent in memory management operations (inc, dec, malloc, free).
It’s a trade off, though. The GC achieves that 30% because it uses more memory. All of the work of understanding the object graph is amortized into a graph search that happens infrequently, leading to many algorithmic benefits (like no atomic inc/dec, faster allocation fast path, freeing is freeish, etc), but also causing free memory to be reused with a delay, leading to 2x or more memory overhead.
That also implies that if you ask the GC to run with lower memory overhead, it'll use more than 30% of your execution time. It's true that if you want the memory usage properties of RC, and you try to tune your GC to get you there, you're going to have a slow GC. But that's not how most GC users run their GCs.
"Increments and decrements happen once and at a predictable time. The GC is running all the time and traversing the universe of GC objects. Probably with bad locality, polluting the cache, etc."
This is only the case with a mark-sweep collector; usually most of your allocations die young in the nursery. With reference counting you pay the counting cost for everything.
"In object-oriented languages, where you can literally have a pointer to something, you simply mark a reference as a weak reference if it might create a cycle."
As someone who has tried to identify memory leaks in production where someone has forgotten to "simply" mark a reference in some deep object graph as weak, this is naive.
"With an 8-byte counter you will never overflow. So...you know...just expand up to 8-bytes as needed? Usually you can get by with a few bits."
So now my "about as minimal as you can get short of nothing at all" check has an unpredictable branch in it?
"If you must overflow, e.g., you cannot afford an 8-byte counter and you need to overflow a 4-byte counter with billions of references, if you can copy it, you create a shallow copy."
I don't even know where to begin with this.
"If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?"
This probably has more to do with finalising resources and deterministic destruction than anything else.
--
Anyone who is interested in actually studying this area would probably find https://courses.cs.washington.edu/courses/cse590p/05au/p50-b... interesting. Also https://gchandbook.org/
I don't think Python ever did pure mark-and-sweep (CPython, at least; I'm sure Jython and other alternate implementations have).
My understanding was that they did pure reference counting, and kludged on a sweep GC to do cycle breaking eventually, as manually breaking cycles in early versions of Python was a pain point. A quick lookup seems to indicate Python 1 was pure reference counting, and they added the cycle breaking when they released Python 2.
Apple has a nice talk on ARC [1] but it got me thinking: if I have to think about reference counting this much I might just as well manage memory all by myself.
The true joy of Garbage Collection is that you can just create objects left and right and let the computer figure out when to clean them up. It's a much more natural way of doing things and lets computers do what they're best at: taking tedious tasks out of the hands of humans.
[1]: https://developer.apple.com/videos/play/wwdc2021/10216/
It might seem that it is simply about pushing your synchronizations problems onto the GC, but the synchronization issue that GC solves internally is different and usually more coarse-grained, so in the end you have significantly smaller synchronization overhead.
Btw, GC is also often RC in disguise. What I mean is that generational garbage collectors are basically a hybrid of tracing GC and RC. See https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... for the details.
Java's GC is concurrent and runs at safe points and stops the world so it avoids this problem.
Racket probably has a state-of-the-art garbage collector. (I don't actually know, but that's where I would start looking.) Clojure obviously has the same garbage collector as any other JVM language.
In one extreme you can build Racket using the Senora GC, which is conservative and non-moving, and is used only for bootstrapping.
On the other extreme, both of the normal versions of Racket have custom moving incremental GC. The docs with some high level explanations are in https://docs.racket-lang.org/reference/garbagecollection.htm...
The implementation details of the main "CS" version are in https://github.com/racket/racket/blob/master/racket/src/Chez... It's a replacement of the default GC of Chez Scheme that has better support for some features that are used in Racket, but I never have looked so deeply in the details.
Yikes
GC is only required if you as a programmer (or programming language) do not provide sufficient information to the compiler or runtime to understand the object graph.
You can find various algorithms in journals or whatnot written with the assumption that there's GC. Algorithms designed with this assumption may not have clear ownership for objects, and those objects may have cyclic references.
It's easy to say, "objects should have clear ownership relationships" but that kind of maxim, like most maxims, doesn't really survive if you try to apply it 100% of the time. Ownership is a tool that is very often useful for managing object lifetimes--it's not always the tool that you want.
Or do you want to manually assign memory addresses to your objects?
If you wanted performance these days, you wouldn't go for that architecture. It's a historical accident that they can't really free themselves from because of backwards compatibility.
Apparently it actually led to memory usage improvements in industrial projects like Redis:
The rant at the end can be boiled down to "I use confirmation bias [1] to make my engineering decisions". The OP has already decided that "GC" is slow, so I'm sure every time a runtime with it misbehaves it's "Well, that darn GC, I knew it was bad!" and every time RC misbehaves it's likely "Oh, well you should have nulled out your link here to break the cycle dummy!"
I really don't like such absolutist thinking in software dev. All of software dev is about making tradeoffs. RC and GC aren't superior or inferior to each other, they are just different and either (or both) could be valid depending on the circumstance.
Yes, this is a good point. It makes overly general claims.
E.g. a GC proponent could claim "well, tracing collectors do no work for dead objects, so they have no overhead!" Which is a good point, but not the whole story. Tracing collectors may need to repeatedly traverse live objects. Sure. But then generational collectors only traverse modified live objects that point to new objects. True. And concurrent collectors can trace using spare CPU resources, incremental collectors can break marking work up into small pauses, on and on. There are zillions of engineering tradeoffs and the GC Handbook covers most of them really well.
But yeah, the correct way to handle resources (not just memory!) is with value semantics and RAII. Because then you know the object will be cleaned up as soon as it goes out of scope with zero additional effort on your part. In places where this is not appropriate, a simple reference counting scheme may be used, but the idea is to keep the number of rc'd objects small. Do not use cyclical data structures. Enforce a constraint: zero cycles. For data structures like arbitrary graphs, keep an array of vertices and an array of edges that reference vertices by index.
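The index-based graph idea from that last sentence can be sketched like this (a hypothetical `Graph` class): the container owns flat arrays of vertices and edges, edges refer to vertices by index, and no object-level cycles exist even when the graph itself is cyclic.

```python
class Graph:
    def __init__(self):
        self.vertices = []           # vertex payloads, owned by the graph
        self.edges = []              # (src_index, dst_index) pairs

    def add_vertex(self, payload):
        self.vertices.append(payload)
        return len(self.vertices) - 1

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

g = Graph()
a = g.add_vertex("a")
b = g.add_vertex("b")
g.add_edge(a, b)
g.add_edge(b, a)                     # a cycle in the graph, not in ownership
print(len(g.edges))                  # -> 2
```

Dropping `g` releases everything in one go; no refcount ever has to chase the cycle.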
If you use a language with GC, you're probably just contributing to global warming.
(Basically, your lifetimes have to be the same as your scopes, which are in a simple tree structure only.)