37 Comments

A couple of questions:

1. People usually make a scratch arena locals to a thread using something like __declspec(thread). What if you're on a platform or project that can't use thread-local storage? I usually see people handle this by passing around a "thread context" Care to comment on that?

2. The push and pop operator on the current arena implementation aren't atomic, which mean they aren't thread-safe. How do you go about dealing with memory management for multithreading?

3. Do you think the idea of ArenaTemp can be made more abstract and applied to other kinds of allocators? For example, I may have a generic heap allocator that keeps a free list and allows for out-of-order alloc and free allocations. I can still create an ArenaTemp for this allocator by storing the free list in the ArenaTempBegin and restoring it in the ArenaTempEnd function. The same also goes for a pool allocator, for example. I can still store the current free list and reset it later.

Expand full comment

1. All consumer machines - phones, game consoles, PCs, laptops - have hardware-level thread-local storage support. I don't write for other systems, so I don't ever do anything else. Per-thread storage can be emulated nonetheless - that is, indeed, what "the stack" is (there is one stack per thread).

2. The arena push and pop operations are not intended to be used by multiple threads by themselves. The arena, like the stack, is a useful pattern when accessed *by a single thread at a time*, so either in a single-threaded context, or in some interlocked shared data structure. In general, threads should try to synchronize as little as possible, and so they will often be operating on *different* arenas altogether. For communication *between* threads, an arena is not really the correct "shape" in most cases - another primitive, like a ring buffer, is much more suitable. In other words, the arena implementation is not the right layer to solve for multithreading.

3. You can apply the same principle to other allocators but I don't see any point in making the code more abstract. ArenaTemp is so simple for arenas, and it's not so simple for other allocators. Given that ~95%+ of allocations occur on arenas, I would not pessimize the 95% case for code reusability in the <5% cases - I'd just make another "Temp" idea at that layer (if it was indeed useful).

Expand full comment

Thanks for answering. Regarding multithreaded allocators, I understand that threads should communicate with each other as little as possible, meaning each thread should have its own allocators. But I don't understand why a ring buffer is a better fit here. This also leads me nicely to the underlying question I was trying to ask. You've talked a lot (in this post and some other lectures) about the idea of composing arenas and other allocators together to create several more sophisticated ones (for example, combining a free list with an arena to create a simple pool allocator). Asking about multithreaded or malloc-style heap allocators was just me trying to know more about other composed allocators that you find interesting and handy in your experience.

Expand full comment

Ringbuffers are the industry standard for lock-free multithreading communication. Feel free to check out my MCMP Queue implimentation, which is a multi-consumer multi-producer ringbuffer.

https://github.com/173duprot/mcmpq.h/

Expand full comment

I'm inclined to agree that memory safety is often made an exaggerated concern, and argued with false dilemmas, strawman arguments and pretending that the GC, borrow checker or smart pointers are the ultimate solutions. However, I do think there's valid concerns with widespread use of C. Keep in mind, I only work with C recreationally, not professionally.

In my experience, C has some powerful strengths:

1. It's very easy to parse and build a basic compiler on any hardware

2. It doesn't have that many features, and most of them are both features rooted in what hardware (i.e. ASM) needs to do, and rooted in common programming patterns, without many critical restrictions on what you can do, leaving it very powerful.

3. Due to its ASM-adjacency, high flexibility and well-defined ABI, it's also very easy for other languages to interface with code written in C.

4. It's pretty fast to write and iterate on optimized code in C, if you have the expertise.

5. It's a very popular "high"-level language used by some of the biggest operating systems and many of the most powerful applications.

However there's also a lot of weaknesses I've noticed:

1. It has a pretty bad standard library (even though many others are worse), and (perhaps as a result of that) much of the public ecosystem is not great at making good APIs for useful features. I've found that it's often better writing everything yourself, and if that's the case generally, then that means bigger knowledge barriers and less opportunities to write ad hoc software tools.

2. Null-terminated strings may have made sense in the 1970's, but that trade-off isn't worth it today. You can roll your own length-prefixed strings and API, but syscalls, OS calls and most libraries generally still only accept null-terminated strings. So you need stuff like:

a) char *CopyToNullTerminated(string LengthPrefixed); //Using scratch space to allocate the copy. Could have an overload that used an arena instead.

b) char *VerifyOrCopyToNullTerminated(string LengthPrefixed); //Checking if it's secretly null-terminated, and copy if it isn't.

But that's all unconventional for C and will be yet another company-/project-specific implementation of a common everyday problem that should be part of the standard ecosystem. Including stuff like "here-strings" where you have lots of special symbols, including strings that contain literal code. I like Jai's #string in that regard.

3. It has very underwhelming metaprogramming features in the language or standard library. By metaprogramming I mean code generation at compile-time, like its macros but more powerful. Albeit it can clearly still be achieved quickly in a specific project, by knowing how to write a lexer and encouraging or enforcing (through compile-time validation) standards on your team.

4. It has poorly adapted to the reality that SIMD and multi-threading is a requirement for good performance. In my naive mind, it would've been great if vectors were integrated more cleanly into the language, similar to shader languages. Making it natural to tinker with wide operations. And for multi-threading, I'm too inexperienced to know what the most versatile and powerful concepts are, but at a very minimum, providing simple worker queue APIs that discourage synchronization, as a standard library.

5. As a language that is often praised for its level of control, it still allows undefined behavior, and not giving direct control to programmers for those scenarios.

6. While not a language issue, it's still an everyday fact that most C compilers have bad error messages, even when you turn warnings into errors and disable the warnings that are counterproductive to tinkering. It could be a lot more clear, especially for newbies.

7. It is not easy to learn for programmers who only have experience with GC'ed languages, unless guided by experienced programmers.

Those are the top reasons why I don't program in C professionally. I'm curious what other programmers think, and what trade-offs they see with C.

Expand full comment

instead of having a thread local scratch arena, what about instead just using the same arena where you want stuff to persist after a function call, but wrapping it in a type that has a pointer to that arena + an offset by value, then you pass that to ar function by value.

then you can just allocate stuff that needs to persist vs stuff that needs to be scratch by just using either the top of the arena for example for scratch allocations and the bottom for the persistant allocations, but using the passed by value offset for the scratch allocations so it resets after the function?

and then if you want FunctionA's result to not be a result, and instead be an intermediary thing that should be placed in FunctionB's "scratch" arena, then you can just flip what alloc / scratchAlloc does. i.e. make alloc use the passed by value offset, and scratchAlloc use the arena pointers offset, and have the persistant allocations alloc from the top while the scratch allocations alloc from the bottom?

Expand full comment

Great content. Thank you for the detailed explanation.

Expand full comment

Ah, I love this :D

Expand full comment

Thanks for the article! Very inspiring.

I wonder what you think about pathological cases with the pool allocator, where memory gets tied up in the free list of one struct and becomes unavailable for other allocations.

For example, let's say that, through user error, the level editor for a game allocates and frees 200k entities (the user selected the entire level and copied it 100 times by accident, then pressed undo). Now a potentially large amount of memory gets leaked into the entity free list (and other associated free lists).

Perhaps this is a reason to use a malloc-style allocator, where freeing memory makes it available for other kinds of allocations?

Expand full comment

I'm glad you enjoyed it!

First I'd say that the memory in free lists is not *leaked*, really, at least not any more than a free'd malloc allocation leaks when the allocator keeps it committed. Each element in the free list is still available for subsequent allocations, those allocations just must be of the same type. You're right that a proliferation of types will therefore lead to possibly large free-lists which are unusable for many kinds of allocations. For this reason, among others, I recommend against a proliferation of types - it keeps everything simpler. Fully explaining all the ways in which it simplifies everything is out-of-scope for this comment, but you see the argument in this case.

If you have gigantic free lists like in a case you describe, that memory being physically committed may increase the nominal memory usage of a program, but it will not have detrimental paging effects (because that memory is merely sitting around waiting to be re-allocated - it is not actively being used). Note that the simple free list (stack) works in the order you'd want - the last thing freed is the first thing to be used. So, in your example, the user first undoes the *last paste*, and lastly undoes the *first paste*. Upon more allocations, the first things pulled off the free-list will be at the *lowest address*. In this level editor, then, growing back to the 200k entities gracefully occurs. The 200k entities still sit around being committed, but I'd claim that isn't a "real" problem. If it's worrisome (for some reason), then that constraint will indeed require more complex strategies.

These more complex strategies may include, as you suggest, a more general purpose allocator (reusing freed memory for other sorts of allocations), and/or the allocator attempting to group freed allocations together and physically decommit those pages. You can imagine, for instance, the basic allocation block being 4k entities, and then always preferring already-allocated blocks. That also has pathological cases, but then again, so does `malloc`, and so does everything else - it's just a matter of what you expect and what you care to avoid.

Expand full comment

Thanks for the reply!

I suppose the main questions are 1) how much memory the free lists actually take up and 2) how many worst cases you expect over the lifetime of the program.

I suspect maybe the amount of memory isn't that large even for my 200k entity example. Perhaps a few 10s of MBs, maybe 100s max for a complex game and editor, which isn't much for a modern PC, especially a dev machine.

But those numbers can add up over days of program use if they are spread out over disparate free lists. I suppose this is where minimizing the number of different types can help. I'm just not sure to what extent that is possible given sufficiently complex program requirements (but otherwise I agree that this is good advice for simplifying code).

I was also thinking about (and you already cover this to some extent) the granularity of arenas. Instead of a single permanent arena, you could have one per level. Or even one per entity to cover cases where a single entity requires a lot of dynamic allocations. But this approach introduces its own drawbacks similar to having to call free on individual allocations.

Expand full comment

Hi Ryan! Loved the article so much I just subscribed!

Is the Code Depot with the example locked? Would love to see how you use Arena Allocation in a small example.

Expand full comment

Thanks for subscribing, Jake. I'm glad you liked the article! The Code Depot is locked - I need to make you an account. Please check out the Paid Subscriber Registration post here: https://www.rfleury.com/p/paid-subscriber-registration

Expand full comment

Just sent and thanks a ton!

Expand full comment

Regarding the per-thread scratch arena system, can you have an additional per-thread int scratchCount that increases every time you call GetScratch and decreases every time you call ReleaseScratch? Then every time you need a new scratch arena, rather than searching, you can get it from the index after the int. This will also give the maximum number of unique scratches you need in any instance.

Expand full comment

This is not how this scratch system works. If you call GetScratch twice in a row, both for temp allocations (you don't pass first scratch as a conflict to the other), you will get same arena for both calls. And that is fine, this is how stack works. Even in case when arenas are in conflict, fixed counter would disable alternating same arenas in arbitrarily deep stack as Ryan mentioned with his example.

Expand full comment

So I got a notification about your reply about why did I want that, but I can't find it, so maybe you deleted your reply? I will explain my idea a little more so that we are all on the same page.

The OP was talking specifically about the per-thread scratch system and was not about the stack-like arena fashion. In the article, Ryan introduced the GetScratch function that takes in an array of Arena* and returns a unique arena to the array from the thread pool (technically, it returns a TempArena that reference the unique arena). The original array of arenas can only contain a thread-local scratch arena when and only when that scratch arena was from the thread-local scratch pool by calling GetScratch. My idea was rather than pass in a not-to-conflicts-with array to the GetScratch, it just has a counter that gets increased every time it gets called and returns the element at that index from the pool. Then when the current function is done using the scratch arena, it will call the ReleaseScratch function which will decrease the counter by one so the latter functions can reuse the arena.

Expand full comment

I don't know if by OP you meant your original reply or Ryan's post. But Ryan introduced scratch system in the context of stacks.

As for your implementation, yes that would work too, but then you'd have to dynamically increase arena pool when your memory stack grows beyond pool capacity. (Or allocate large enough pool at start by measuring your code, which could be a bit tricky if you use recursions a lot). I think Ryan's approach is more elegant exactly because of arena re-usage. You don't have to measure your code, you just have to follow simple rule Ryan outlined. As long as only a single persistent arena is present at any point in any codepath, you will not need more than two scratch arenas.

Expand full comment

I meant my original reply.

So it seems like you misunderstand me. You can still have a persistent arena, and all your functions still need to pass in an arena like the function A and function B example. I was talking specifically about the GetScratch and ReleaseScratch functions that were used inside functions A and B.

void *FunctionA(Arena *arena)

{

ArenaTemp scratch = GetScratch(); // increase the counter

// use the scratch arena

// fill result to Arena* arena

ReleaseScratch(scratch); // decrease the counter

return result;

}

void FunctionB(void)

{

ArenaTemp scratch = GetScratch(); // increase the counter

void *result = FunctionA(scratch.arena); // the scratch inside A will not be the same as this scratch

// use result

ReleaseScratch(scratch); // decrease the counter

}

Expand full comment

I can't find a way to turn on the code format in the reply section, so here's a link with some example code about the difference between Ryan's way and mine: https://pastebin.com/h53XsVzi

Expand full comment

Consider this usage code:

ArenaTemp a = GetScratch();

ArenaTemp b = GetScratch();

ReleaseScratch(a);

ArenaTemp c = GetScratch();

In your approach, `b` and `c` would be the same scratch arena.

Expand full comment

You can't get the same arena until the counter goes down. The GetScratch method only gets the scratch at the counter and increments it by one. Kind of like this: #define GetScratch(name) ArenaTemp name = scratchPool[scratchCount++] and #define ReleaseScratch(name) ArenaTempEnd(name); scratchCount--

Expand full comment

Fantastic post! Mind-expanding and elegant :-D.

I just have one question: how do we pull data out of the arena? And why not make arenas hash maps rather than contiguous blocks, so we can pull out the values by name, even though that would be a bit less efficient?

Thanks!

Expand full comment

Push functions return pointer to the memory block that you can then write into. As for hash maps, I believe those are not on the same level of abstraction. Hash map would be classified as a cointainer. You can easily implement hash map on top of arenas. Same goes for dynamic arrays as Ryan mentioned in the article.

Expand full comment

Is the idea that every time we use an arena to allocate new memory to create a variable inside of a function, that we return that variable?

Expand full comment

Right - in the same way that you don't ask the `malloc` allocator for allocations you've performed, you don't ask an arena for those either. That's up to the usage code of an arena. So you push something onto an arena with e.g. `Foo *foo = PushArrayZero(arena, Foo, 1);`. Then, it's up to *you* to use `foo` for whatever, store it somewhere, etc. - bookkeeping does not belong at the layer of the arena.

Expand full comment

Good to see you introduce arenas. Your APIs that take `arena` as parameter remind me of Zig, are the concepts identical?

Expand full comment

I haven’t looked at Zig’s memory allocator strategies much, but from my understanding Zig requires that allocating functions take allocators as parameters. In that sense, it’s the same. It’s different in that I’d expect Zig’s allocators to be abstract (this might not be the case), whereas arenas are a specific type of basic allocator.

The second part is important. Arenas simplify both implementation and usage code. Code that must work with a generic allocator will be more complicated than code that needs to work only with arenas.

Expand full comment

Just double-checked the docs… in Zig you can override the implementation with as much granularity as you like: https://ziglang.org/documentation/0.9.1/#Choosing-an-Allocator

In terms of mass deallocation of memory—i.e., deallocate all memory allocated at once—with arenas I assume others have implemented efficient ways of doing this? - With the system [kernel] allocator its straightforward but often incurs nontrivial overhead.

I like the idea of using an arena to allocate a large tree datastructure, but without granular bookkeeping I wouldn't think more complex data-structures could be efficiently implemented (like Brodal's finger trees).

PS: Enjoyed this line "like a government agency, except in this case, the garbage collector is ostensibly doing something approximating useful work—although both function by stealing valuable resources involuntarily" ^_^

Expand full comment

Yeah, so it seems like my expectation was correct. The problem I have with that design decision (which matches that of Odin as well) is that it forces code written for an abstract allocator - for which the implementation can change - to use the *maximally generic allocator API*. With an arena, the burden of freeing individual allocations all but disappears (except when doing more complex allocators on top of an arena, e.g. a pool allocator) - so code written for an arena is much simpler than code written for a generic allocator (even if that generic allocator no-ops on all of the frees, and indeed is implemented with an arena). The generic interface remains a pessimization for the allocator usage code.

The "granular bookkeeping" you mention (for more complex data structures) is what I describe as "a more complex allocator that composes with an arena", e.g. the free-list + arena introduced in the post. So it's definitely possible, it's just that it requires *composition with an arena*. The approach of e.g. Zig and Odin is to *abstract over the allocator*, but I disagree with their approach - instead, use the simplest allocator as a building block, and add more building blocks as necessary.

Regarding the government line, glad you enjoyed it. I have to sneak in my ancap somewhere. :)

Expand full comment