Table-Driven Code Generation
A simple code generator design I've found to be useful, but still small and easy. Includes a sample version with source code.
Anyone who has written C has been in the situation—you write an enum
, and sooner or later, you’d like to associate values of that enum
with strings. You might want to log a value to plaintext in a human-readable form, or you might want to display the value in a user interface.
Whatever your reasons are, the lowest-tech solution is to write the following:
enum MyEnum
{
MyEnum_A,
MyEnum_B,
MyEnum_C,
MyEnum_COUNT
};
char *myenum_string_table[MyEnum_COUNT] =
{
"A",
"B",
"C",
};
With this extra string table, you can always map a value with type MyEnum
with the expression myenum_string_table[value]
. That being said, you may notice that an issue arises if you fail to maintain the enum
and its string table accordingly:
enum MyEnum
{
MyEnum_A,
MyEnum_X, // new value!
MyEnum_B,
MyEnum_C,
MyEnum_COUNT
};
char *myenum_string_table[MyEnum_COUNT] =
{
"A",
// uh oh, MyEnum_X's string will be "B"!
"B",
"C",
};
You can fix this, to some degree, by just using a switch
instead:
char *StringFromMyEnum(MyEnum v)
{
char *result = "<invalid>";
switch(v)
{
default: break;
case MyEnum_A: {result = "A";}break;
case MyEnum_B: {result = "B";}break;
case MyEnum_C: {result = "C";}break;
}
return result;
}
Now, whenever MyEnum_X
is passed into this function, you’ll at least get an “<invalid>”
string back. This will be easier to catch and fix. Unfortunately, by doing this, we’ve lost the original aspect of your data being statically represented in simple data tables, but this is not necessarily relevant.
You might wonder, though, “why not just specify both in the same place, so it’s much more difficult to screw up?” That’s a good question. This would solve the problem, and doing so might make the solution doable with only static data. Luckily enough, there’s also another low-tech solution for you, which is often called the “X-Macro”:
#define MyEnumTable \
X(A, "A")\
X(B, "B")\
X(C, "C")
enum MyEnum
{
#define X(name, str) MyEnum_##name,
MyEnumTable
#undef X
MyEnum_COUNT
};
char *myenum_string_table[MyEnum_COUNT] =
{
#define X(name, str) str,
MyEnumTable
#undef X
};
This shows us one interesting part of the problem, which is that what we really wanted all along was to specify our data in the form of a table—the enum
name is like a unique key for the table, and the columns are all of the values that can be obtained given that key.
X-Macros also can be a useful tool, but they do come with a few notable drawbacks:
The
enum
value identifiers—MyEnum_A
, for example—is never explicitly written in the above text. Thus, simple editors and parsers (that do not analyze the file after a full C preprocessor pass) will not be able to autocomplete to that identifier, or highlight that identifier as anenum
value.The definition of the table’s columns—
name
andstr
—is duplicated across all generation sites. If you want to add a third column to the table and map theenum
values to another type (for example, anotherenum
, or another string), you must now update all instances of#define X(name, str)
to read#define X(name, str, new_col)
.Each column’s value must conform to C’s syntax rules. If you embed raw textual content within single- or double-quotes, they will be duplicated in the respective generation sites.
These drawbacks—as well as those present within the earlier methods—are entirely artificial. In other words, they are purely consequences of the way that the C language and C preprocessor happened to be formulated—they are not intrinsically required by the problem. Why should specifying your data conveniently as a table suddenly compromise your editor features? It seems pretty ridiculous, at the end of the day.
So, what is the solution? In day-to-day programming, you might find one of the above options—static tables, a simple switch
, or generating one or the other with an X-Macro—perfectly suitable for your problem. None of them are really deal-breakers, although you might sit and wonder why such a simple concept is so poorly supported by even industrially-dominant tools.
That being said, there is a point at which the drawbacks of each method become more obvious, and more of a maintenance problem—or, at the very least, a hindrance to fast iteration time and the “joy of programming” as you tweak and iterate on things. You might see this more easily by drastically increasing the number of rows, drastically increasing the number of tables (possibly all being related and mapping between each other), or drastically increasing the number of columns, and then finally, increasing the amount of time spent maintaining all of the tables and their respective generators. In the case of X-Macros, there is also a point at which they no longer quite work to encode what you need because of their syntactic restrictions.
In those more extreme cases, what do you do? Should C have been designed differently to account for your use-case? Maybe to make X-Macros more usable, editors need to become more complicated and start parsing preprocessed C? Maybe we need to throw C out the window, and try an entirely different programming language that does a bit of a better job (along with a new code generator, debug information generator, editor parsing rules, analyzers…)? And then if that language screws this up in a different way, or fails to solve another problem well, oh well? Maybe you then just need a new language at that point? Maybe programming was the wrong career choice, and you need to depart from civilization and live in a log cabin in the woods?!
C being designed differently is impractical, of course, unless you have a time machine and an unrealistic amount of power over C’s evolution. In the case of editors, they have been written the way they are for a reason. After several decades of code editor competition and evolution, you’d expect that the most successful, modern editors might roughly approximate the “local maximum” balance of features, code complexity, performance, reliability, and so on. The fact that none of them seem to appropriately interpret generations from an X-Macro, for example, is a hint that the complexity of such a feature has simply not been worth it, or it has largely been at the expense of other important factors, like performance and reliability.
This doesn’t seem like a particularly complicated use-case, and a good solution that avoids all of the above drawbacks doesn’t seem like something wildly outside the scope of possibility. So, what gives?
The Level Of Authorship
In my estimation, a great deal of insight for this problem (and others) can be obtained by considering the existing pipeline you use as a C programmer as a series of functional data transformations:
Programmer’s Thoughts
→Text in C Language
Text in C Language
→C Tokens
C Tokens
→C Abstract Syntax Tree
C Abstract Syntax Tree
→C Type-Checked Tree
C Type-Checked Tree
→IR Data Structure
IR Data Structure
→Machine Code
When you consider the above pipeline, there is only one place where you really have direct “write access” as the programmer. That is, of course, the Text in C Language
—it’s what you edit day-in and day-out. I’d like to call this place the “level of authorship”. It is the unique place in the pipeline where a human sits, orchestrating things. Any modification that a human would make further down the pipeline would necessarily be washed away and forgotten in another pass through the pipeline, thus breaking the pipeline’s automation. So, the level of authorship is important—it is the place where the human’s intent is captured and preserved over time.
Steps 2-6 of the above pipeline are all controlled by your C compiler—for this reason, I think it is often unintuitive for programmers to imagine tweaking this pipeline in any way.
This, among other factors, is partly responsible for the modern mind virus which suggests there ought to only be a single level of authorship, and whatever programming language compiler you’re currently using must be sufficiently complex to capture your desired intention (and the associated effects, later on in the pipeline). For this reason, many modern programming language projects believe that the primary problem with our existing languages is that they are not complex enough, and that the compiler needs to step up and provide a richer interface for communication. As a result, compilers inevitably become larger, slower, more complicated, and thus more difficult to rewrite. In effect, this has a centralizing effect, which is detrimental to competition between compilers for a single language, and to programmer self-sufficiency. So, I posit that this is not the correct direction. It is both more practical in day-to-day projects—and better for the computing ecosystem—to try another approach.
Introducing A Second Level Of Authorship
As you might be able to tell from my tone, assuming a single level of authorship is a problem, simply because C—or for that matter, any language, or really any medium of information—does not sufficiently capture the expressiveness you want for all problems.
The solution I’ll cover in this article is one that introduces a second level of authorship.
Let’s start by noting the unsurprising revelation that “compile-time execution”—which many new, fancy languages cram as a complicated feature into their compilers—can be achieved trivially in C, by writing a separate program and executing it before your primary program is built. This clicks into place with the rest of the ecosystem—the compile-time program can be trivially debugged with whatever debugger you use normally (because it is just a program), the artifacts of the compile-time program can be easily inspected and checked into source control, and so on.
For this problem—managing large and complex tables, keyed by a constant value—your program can simply take in files in our new level of authorship format, which can be specially designed for the problem of encoding tables, and they can then output automatically-generated C source code to make the data usable from your other level of authorship (the C source code that you handwrite). The end result is table-driven code generation.
My Solution For Table-Authorship
Given that background, I’ll walk through my own solution for this problem. At the end of the article, for paid subscribers, I’ll link to a repository containing the code for an example generator that implements my solution. You could download it, build it, and use it out of the box, but I am providing the source code so that you’re also able to tweak it to your liking.
You could choose a number of input formats for your table-driven generation program. You might even want to just use CSV files, and use Excel (or a similar tool) for your editor. While I’m very fond of graphical editors, and while I’m acutely aware of the inadequacies of text files, I prefer to keep multiple levels of authorship within the same editor, because it’s very common to jump between them. Perhaps this is a shortcoming of operating system user interface paradigms, or of their window managers—perhaps the next operating system generation could make different programs compose much more easily. But, alas, that’s not the world we live in (yet?), so I digress… In any case, my preference is to use the text editor as the authorship tool, and thus to use a custom textual format for this second level of authorship.
For rapidly building custom “level of authorship formats”, I built the Metadesk format and library. This is a simple textual format, akin to JSON (in fact, it’s a superset of JSON), but designed to be easily handwritten, read, and parsed into a flexible data structure in custom code generation programs. Metadesk offers the ability to easily express hierarchies of uniformly-typed nodes, with the additional feature of “metatags” on each node. My solution uses Metadesk as the “base format” for encoding tables.
Tables, then, can be specified to my generator as such:
@table(name, str) MyEnumTable:
{
{ A "A" }
{ B "B" }
{ C "C" }
}
Then, similar to X-Macros, “generation nodes” can be expressed within the same file, to perform a transformation taking that table as input, and outputting some C code as output:
@table_gen_enum
MyEnum:
{
@expand(MyEnumTable a)
`MyEnum_$(a.name),`;
`MyEnum_COUNT`;
}
@table_gen_data(`char *`)
myenum_string_table:
{
@expand(MyEnumTable a)
`"$(a.str)",`;
}
The above will produce the following C code, in an output file (the path of which is determined from the input Metadesk file, with the rule of */x.mdesk
→ */generated/x.meta.c
and */generated/x.meta.h
:
// --- h file -----------------------------------
typedef enum MyEnum
{
MyEnum_A,
MyEnum_B,
MyEnum_C,
MyEnum_COUNT
}
MyEnum;
extern char *myenum_string_table[3];
// --- c file -----------------------------------
char *myenum_string_table[3] =
{
"A",
"B",
"C",
};
The essence of these “generator nodes” is that they generate list of strings, with the added twist that they can “expand” a given table for a given string. “Expanding” a table is akin to looping over the rows of a table, and attaching a label to that “expansion”. In other words, @expand(SomeTable a)
is akin to saying “loop over all rows of SomeTable
, and give this loop the label a
”. For a given expansion, then, “a
” can be used to reference the current row for the associated expansion. The string that is “expanding” the table is not simply textual data—it is treated as encoding rules for a simple string building language, where a column within the row of an expansion, as well as some operators, can be used within an occurrence of $(…)
within the string.
The reason why it is not simply expressed as @expand(SomeTable)
is because this “expansion” can be applied many times. So, if you have two tables, you can loop over both of them, akin to a nested for
-loop. Similarly, you can loop over the same table multiple times as well. This is why the label portion of the expansion is required, as it is the means by which you’re able to distinguish between multiple expansions.
This core generation feature is then wrapped in metatags expected by the generator program—table_gen_enum
, or table_gen_data
—which add a bit of sugar to the generator (to wrap the generations in the appropriate C enum
syntax, for example). There is also a purely generic table_gen
generator node metatag that adds no sugar.
Use-Case 1: UI Commands
These days, when designing programs with a user interface, I require the hard constraint that a “command lister” (also called a “command palette”), should always be easily accessible by users, because I find the traditional expectation that the user should search through a user interface hierarchy to find the operation they need to be unacceptable. In many cases, the user has some idea of what they want to do already, and where the right buttons are in the user interface design is not the relevant information they’re missing—instead, the missing information is how they inform the user interface that they’d like some operation to occur. For these cases, a “command lister” is handy.
There is, unsurprisingly, a lot of handwritten data to maintain for a uniform list of all user interface commands. That data must include some key for the commands (in simple cases, this might just be an enum
constant), but it also must include quite a lot of metadata:
The display string of the command (to refer to the command in the user interface)
The description string of the command (to show in the command lister as the user browses commands
Additional “search tags” to cover common synonyms or other words for fuzzy-matching that doesn’t hit the display string nor the description string
Some mapping to an associated user interface (e.g. the “open file” command should be associated with the “file browser” user interface, and that user interface should be visible when the user is completing an “open file” command)
Some information encoding the “fast-path” of the command—what does hitting the hotkey for the command, or activating it from the command lister, do? Does it just run the command? Or does it require completion, in the case of the “open file” command?
The canonical icon for the command
The default key-binding for the command
As you might imagine, there are a number of tables involved here—that for the commands, that for the user interfaces (as I call them, the “views”), that for icon metadata, that for the “fast-path” rules, and so on—and they all must map to each other in various ways. This table generator has been extraordinarily useful in maintaining these tables overtime. Unsurprisingly, this is all fairly coupled with details on specific projects I am working on, so I won’t provide any code examples—but, hopefully, you can infer how the table generator would be used for this problem.
Use-Case 2: Abstraction Backend V-Tables
In some cases, a project requires an abstraction over several backends, all providing some similar functionality. This arises in rendering, when you must abstract over supported GPU APIs for many platforms (D3D11, OpenGL, Vulkan, etc.). It also arises in font rasterization, where you may prefer native libraries (e.g. DirectWrite) when they are available, but where you might still want to fall back on something else when they’re not available (FreeType, stb_truetype
, etc.).
In such cases, “abstraction” means defining a common API that you’re able to implement for each backend. In a subset of those cases, you may want to dynamically link to a backend. In implementing such a system, you may find that many details—the name and type of a given API, notably—are duplicated.
Let’s take rendering for example. Recently, I wrote a rendering abstraction layer—I wanted the possibility to dynamically link to a (or possibly many) rendering backend(s). The v-table I had, at the end of the day, for the renderer backend abstraction looked like this:
typedef R_Handle R_EquipOSFunction(void);
typedef R_Handle R_EquipWindowFunction(R_Handle os_eqp,
OS_Handle window);
typedef void R_UnequipWindowFunction(R_Handle os_eqp,
R_Handle window_eqp);
typedef R_Handle R_ReserveTexture2DFunction(R_Handle os_eqp,
Vec2S64 size,
R_Texture2DFormat fmt);
typedef void R_ReleaseTexture2DFunction(R_Handle os_eqp,
R_Handle texture);
typedef void R_FillTexture2DFunction(R_Handle os_eqp,
R_Handle texture,
Rng2S64 subrect,
String8 data);
typedef Vec2F32 R_SizeFromTexture2DFunction(R_Handle texture);
typedef void R_StartFunction(R_Handle os_eqp,
R_Handle window_eqp,
Vec2S64 resolution);
typedef void R_Submit2DFunction(R_Handle os_eqp,
R_Handle window_eqp,
R2_CmdList commands);
typedef void R_FinishFunction(R_Handle os_eqp,
R_Handle window_eqp);
typedef struct R_Backend R_Backend;
struct R_Backend
{
R_EquipOSFunction *EquipOS;
R_EquipWindowFunction *EquipWindow;
R_UnequipWindowFunction *UnequipWindow;
R_ReserveTexture2DFunction *ReserveTexture2D;
R_ReleaseTexture2DFunction *ReleaseTexture2D;
R_FillTexture2DFunction *FillTexture2D;
R_SizeFromTexture2DFunction *SizeFromTexture2D;
R_StartFunction *Start;
R_Submit2DFunction *Submit2D;
R_FinishFunction *Finish;
};
I also wanted a missing implementation to gracefully fail, so I wrote some stub functions that log an error:
R_Handle R_EquipOSFunctionStub(void);
R_Handle R_EquipWindowFunctionStub(R_Handle os_eqp,
OS_Handle window);
void R_UnequipWindowFunctionStub(R_Handle os_eqp,
R_Handle window_eqp);
R_Handle R_ReserveTexture2DFunctionStub(R_Handle os_eqp,
Vec2S64 size,
R_Texture2DFormat fmt);
void R_ReleaseTexture2DFunctionStub(R_Handle os_eqp,
R_Handle texture);
void R_FillTexture2DFunctionStub(R_Handle os_eqp,
R_Handle texture,
Rng2S64 subrect,
String8 data);
Vec2F32 R_SizeFromTexture2DFunctionStub(R_Handle texture);
void R_StartFunctionStub(R_Handle os_eqp,
R_Handle window_eqp,
Vec2S64 resolution);
void R_Submit2DFunctionStub(R_Handle os_eqp,
R_Handle window_eqp,
R2_CmdList commands);
void R_FinishFunctionStub(R_Handle os_eqp,
R_Handle window_eqp);
But furthermore, for the vast majority of cases, I simply want the ability to “select” a backend, and call an abstract function, knowing that it’ll be routed to the correct backend. So, I wrote the following wrapper:
void R_SelectBackend(R_Backend backend);
R_Handle R_EquipOS(void);
R_Handle R_EquipWindow(R_Handle os_eqp, OS_Handle window);
void R_UnequipWindow(R_Handle os_eqp, R_Handle window_eqp);
R_Handle R_ReserveTexture2D(R_Handle os_eqp, Vec2S64 size,
R_Texture2DFormat fmt);
void R_ReleaseTexture2D(R_Handle os_eqp, R_Handle texture);
void R_FillTexture2D(R_Handle os_eqp, R_Handle texture,
Rng2S64 subrect, String8 data);
Vec2F32 R_SizeFromTexture2D(R_Handle texture);
void R_Start(R_Handle os_eqp, R_Handle window_eqp,
Vec2S64 resolution);
void R_Submit2D(R_Handle os_eqp, R_Handle window_eqp,
R2_CmdList commands);
void R_Finish(R_Handle os_eqp, R_Handle window_eqp);
You might notice a hint of duplication. :)
Unsurprisingly, all of this information can be succinctly encoded into a table:
@table(name, ret, params)
R_BackendHookTable:
{
{ EquipOS R_Handle `void` }
{ EquipWindow R_Handle `R_Handle os_eqp, OS_Handle window` }
// ...
}
With the generator nodes producing the list of wrappers, the list of stubs, the list of typedef
s, a list of macros that can be used to easily define an implementation (without duplicating the parameter list), and so on.
Closing Thoughts
I’ve spent a lot of free-time figuring out how to automate certain programming tasks over the years—this certainly isn’t the first code generator I’ve written. That being said, it is certainly the first one I’ve found that seems to have very broad applicability—it seems to constantly come in handy, so I wanted to write about and share it.
I hope this article was useful, or at least interesting!
If you enjoyed this post, please consider subscribing. Thanks for reading.
-Ryan
Paid Subscribers: Generator Repository
Code Depot Access Registration, for paid subscribers
If you aren’t a paid subscriber, you’ll need to become one to obtain access.
Great read, I was interested in how people deal with metaprogramming in C. Out of interest, how do you feel about compilers that do monomorphisation with generic types? It seems like an interesting world is one where that particular step of a compiler happens early and outputs to a file, rather than outputs to some chunk of assembly in the final result.
It's probably a result of too much enterprisey code distributed across too many physical systems, but I've come to very much appreciate being able to constrain the types of some data structure with some soft of behaviour (for instance it might implement some interface), which I think is easier done as a compilation step.
If nothing else I do thoroughly appreciate the ability to physically see outside of a debugger what the generated code looks like; I've interracted with far too many absolutely bizarre implementations that are significantly harder than necessary to debug.
Great article!
Here is my previous post a nice markdown view if you want (the code is more readable) : https://hackmd.io/@Mewily/HyLyk_wla