Converting a large mathematical software package written in C++ to C++20 modules

(arxiv.org)

141 points · vblanco · 10mo ago · 42 comments

npalli 10mo ago

Thanks to the author for doing some solid work in providing data points for modules. For those like me looking for the headline metric, here it is in the conclusion:

> While the evidence shown above is pretty clear that building a software package as a module provides the claimed benefits in terms of compile time (a reduction by around 10%, see Section 5.1.1) and perhaps better code structure (Section 5.1.4), the data shown in Section 5.1.2 also make clear that the effect on compile time of downstream projects is at best unclear.

So, alas, underwhelming in this iteration, and it perhaps speaks to the 'module-fication' of existing source code (deal.II dates from the '90s, I believe) rather than doing it from scratch. More work might be needed in structuring the source code into modules, as I have seen good speedups (more than 10%) with just PCH, forward declarations, etc. Good data point and rich analysis, nevertheless.

Someone 10mo ago

It wouldn’t surprise me if they could do better if they gave up on doing most of the work programmatically.

One part of me agrees with (both from the paper)

> For example, putting a specific piece of code into the right place in each file (or adding necessary header files, as mentioned in Section 5.2) might take 20-30 seconds per file – but doing this for all 1051 files of deal.II then will take approximately a full day of (extremely boring) work. Similarly, individually annotating every class or function we want to export from a module is not feasible for a project of this size, even if from a conceptual perspective it would perhaps be the right thing to do.

and

> Given the size and scope of the library, it is clear that a whole-sale rewrite – or even just substantial modifications to each of its 652 header and 399 implementation files – is not feasible

but another part knows that spending a few days on ‘boring’ copy-paste work like that often has unexpected benefits: you get to know the code better and may discover better ways to organize it.

Maybe this project is too large for that, since checking that you didn’t break anything by building the code and running the test suite simply takes too long. But even if so, isn’t that a good reason to try to get compile times down, so that working on the project becomes more enjoyable?
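
As a rough check of the figures quoted from the paper above: 652 headers plus 399 implementation files gives the 1051 files mentioned, and 20 to 30 seconds per file works out to roughly 5.8 to 8.8 hours, which is about one working day. A minimal sketch of that arithmetic follows; the identifiers are hypothetical names chosen for illustration, and only the numbers come from the quotes.

```cpp
#include <cassert>

// File counts and per-file times are the ones quoted from the paper above.
// All identifiers are hypothetical names used only for this sketch.
constexpr int header_files = 652;
constexpr int implementation_files = 399;
constexpr int total_files = header_files + implementation_files;

static_assert(total_files == 1051, "matches the 1051 files quoted above");

// Total manual effort in hours if every file takes `seconds_per_file`.
constexpr double manual_effort_hours(int files, double seconds_per_file) {
    return files * seconds_per_file / 3600.0;
}
```

Calling `manual_effort_hours(total_files, 20.0)` and `manual_effort_hours(total_files, 30.0)` gives the lower and upper ends of that range. It does not account for the rebuild-and-test time that the comment above worries about.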

jjmarr 10mo ago

This is a great task for LLMs, honestly.

trostaft 10mo ago

Oh, it’s Wolfgang. In computational math, he focuses on research software in a way few others are able to; he (and the deal.II team more generally) got an award for it at the last SIAM CSE. Generally a great writer, looking forward to reading this.

Asooka 10mo ago

I would like to see a comparison between modules and precompiled headers. I have a suspicion that using precompiled headers could provide the same build time gains with much less work.

pjmlp 10mo ago

According to the Office team, modules are much faster, especially if you also use the C++ standard library as a module, available since C++23.

See VC++ devblogs and CppCon/C++Now talks from the team.

Pre-compiled headers have only worked well on Windows, and OS/2 back in the day.

For whatever reason, UNIX compilers never had a great implementation of them.

With the exception of clang header maps, which are anyway one of the first approaches to C++ modules.

fpoling 10mo ago

This has been puzzling me for over 3 decades. My first experience with C++ was Borland C++ for DOS. It had precompiled headers and they worked extremely well.

Then around 1995 I got access to HP-UX, with the native compiler there and GCC. Nobody had heard of precompiled headers, and people thought the only way to speed up compilation was to get access to a computer with more CPUs and rely on make -j.

And then there was no interest from either free or proprietary vendors in implementing precompiled headers.

The only innovation was unity builds, where one includes multiple C++ sources into a single super-source. But then Google killed support for them in Chromium, claiming that with their build farm unity builds made things slower and that supporting them in the Chromium build system was an unbearable burden for Google.

dataflow 10mo ago

Precompiled headers are generally better for system/3rd-party headers. Modules are better than PCHs for headers you own, although in some cases you may be better off not using them at all. (I say this because the benefit depends on how frequently you need to recompile them, the relative coupling, etc.) Depending on how heavy each one is in your codebase, and how often you modify global build settings, you may have a different experience. And neither is a substitute for keeping headers lightweight and decoupled.

w4rh4wk5 10mo ago

From my experience, compile times ain't an issue if you pay a little attention. Precompiled headers, thoughtful forward declarations, and not abusing templates get you a long way.

We commonly work with games that come with a custom engine and tooling. Compiling everything from scratch (around 1M lines of modern C++ code) takes about 30-40 seconds on my desktop. Rebuilding 1 source file + linking typically comes in under 2 seconds (w/o LTO). We might get this even lower by introducing unity builds, but there's no need for that right now.
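
To illustrate the "thoughtful forward declarations" point: a header that only handles a type by pointer or reference can forward-declare it instead of including its full definition, so edits to that definition do not force every dependent file to recompile. The following is a minimal single-file sketch, not code from the commenter's engine; `Mesh` and `Renderer` are hypothetical names, and the comments mark where the header/source split would fall in a real project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// --- would live in renderer.h --------------------------------------------
class Mesh;  // forward declaration: no need to #include "mesh.h" here

class Renderer {
public:
    // Only a reference to Mesh appears in the interface, so an incomplete
    // type is sufficient at this point.
    void submit(const Mesh& mesh);
    std::size_t submitted_vertices() const { return vertex_total_; }

private:
    std::size_t vertex_total_ = 0;
};

// --- would live in mesh.h --------------------------------------------------
class Mesh {
public:
    explicit Mesh(std::size_t vertex_count) : vertices_(vertex_count) {}
    std::size_t vertex_count() const { return vertices_.size(); }

private:
    std::vector<float> vertices_;
};

// --- would live in renderer.cpp, which does #include "mesh.h" --------------
void Renderer::submit(const Mesh& mesh) {
    // The full definition of Mesh is needed only here, where it is used.
    vertex_total_ += mesh.vertex_count();
}
```

With this split, files that include only the hypothetical `renderer.h` never parse `<vector>` or the `Mesh` definition on its account, which is the kind of saving the comment describes.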

ttoinou 10mo ago

40 seconds for 1M lines seems super fast. Do you have a fast computer, and/or did you spend a lot of time optimizing the compilation pipeline?

barchar 10mo ago

So, clang's modules are quite similar to clang's precompiled headers, especially the "chained" PCHs. With PCH you have to wait on the serial PCH compilation step before you can get any parallelism; with modules you can compile each part of the "PCH" in parallel, and anything using some subset of your dependencies can get started without waiting on things it doesn't use.

Header units are basically chained PCHs. Sadly they are hard to build correctly at the moment.
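
The scheduling argument above can be made concrete with a toy model. This sketch computes when each build step finishes given its dependencies, assuming unlimited parallel workers. Every name and duration here is invented for illustration; none of it is a measurement or taken from any real build system.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One build step: how long it takes and which earlier steps it waits for.
// Steps must be listed so that every dependency has a smaller index.
struct Step {
    double seconds;
    std::vector<std::size_t> deps;
};

// Finish time of every step, assuming unlimited parallel workers.
std::vector<double> finish_times(const std::vector<Step>& steps) {
    std::vector<double> finish(steps.size(), 0.0);
    for (std::size_t i = 0; i < steps.size(); ++i) {
        double start = 0.0;
        for (std::size_t d : steps[i].deps) start = std::max(start, finish[d]);
        finish[i] = start + steps[i].seconds;
    }
    return finish;
}

// Time until the whole build is done (length of the critical path).
double makespan(const std::vector<Step>& steps) {
    const std::vector<double> finish = finish_times(steps);
    return finish.empty() ? 0.0 : *std::max_element(finish.begin(), finish.end());
}

// PCH-style: one monolithic precompile (A + B + C, 6 s) gates every source.
std::vector<Step> pch_build() {
    return {
        {6.0, {}},      // 0: precompiled header containing A, B and C
        {1.0, {0}},     // 1: x.cpp (really only needs A)
        {1.0, {0}},     // 2: y.cpp (needs B and C)
    };
}

// Module-style: A, B and C are separate steps; each source waits only on
// what it actually imports.
std::vector<Step> module_build() {
    return {
        {2.0, {}},      // 0: module A
        {2.0, {}},      // 1: module B
        {2.0, {1}},     // 2: module C, which imports B
        {1.0, {0}},     // 3: x.cpp imports A
        {1.0, {1, 2}},  // 4: y.cpp imports B and C
    };
}
```

In this made-up example `x.cpp` can start as soon as module A is done instead of waiting for the whole precompiled header. With a limited number of cores, or with modules that form one long dependency chain, the gap would shrink.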

barchar 10mo ago

A few points

1) Modules only really help address time spent parsing stuff, not time spent doing codegen. Actually, they can negatively impact codegen performance because they can make more definitions available for inlining/global opts, even in non-LTO builds. For this reason it's likely best to compare using ThinLTO in both cases.

2) When your dependencies aren't yet modularized you tend to get pretty big global module fragments, inflating both the size of your BMIs and the parsing time. Header units are supposed to partially address this but right now they are not supported in any build systems properly (except perhaps msbuild?). Also, clang is pretty bad at pruning the global module fragment of unused data, which makes this worse again.

boris 10mo ago

> Header units are supposed to partially address this but right now they are not supported in any build systems properly (except perhaps msbuild?).

They are supported in build2 when used with GCC (via the module mapper mechanism it offers). In fact, I would be surprised if they were supported by msbuild, provided by "properly" we mean without having to manually specify dependencies involving header units and without imposing non-standard limitations (like inability to use macros exported by header units to conditionally import other header units).

pjmlp 10mo ago

VC++ has supported header units for quite some time. In fact, I had to revert to global module fragments because CMake/clang still don't have a plan for how to support header units, and I wanted my demo code to work in more than just VC++.

KingLancelot 10mo ago

To be fair, C++’s modules make no sense, just like their namespaces that span multiple translation units.

It’s just more heavy clunky abstractions for the sake of abstractions.

MathMonkeyMan 10mo ago

Modules are an attempt to make part of the language what currently requires a convention:

- A component is a collection of related code.

- The component has an interface and an implementation.

- The interface is a header file (e.g. *.h) that is included (but at most once!) using a preprocessor directive in each dependent component.

- The header file contains only declarations, templates, and explicitly inline definitions.

- The implementation is one or more source files (e.g. *.cpp) that provide the definitions for what is declared in the header, and other unexposed implementation details.

- Component implementations are compiled separately (usually).

- The linker finds compiled definitions for everything a component depends upon, transitively, to produce the resulting program/dll.

So much can go wrong! If only there were a notion of components in the language itself. This way we could just write what we mean ("this is a component, here is what it exports, here are the definitions, here is what it imports"). Then compiler toolchains could implement it however they like, and hopefully optimize it.
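
The convention listed above can be sketched in one file. This is a minimal illustration with a hypothetical `geometry` component; the section comments mark which file each part would live in. The trailing comment block shows, under the same assumptions, how the component might be written as a C++20 module. It is left as comments because it needs separate files and module-aware build support.

```cpp
#include <cassert>

// --- would live in geometry.h (the interface) ------------------------------
#ifndef GEOMETRY_H   // include guard: the "at most once" rule, by convention
#define GEOMETRY_H

double circle_area(double radius);   // declaration only

inline double square(double x) {     // explicitly inline definition
    return x * x;
}

#endif  // GEOMETRY_H

// --- would live in geometry.cpp (the implementation) -----------------------
// #include "geometry.h"
namespace {
constexpr double pi = 3.14159265358979323846;  // unexposed implementation detail
}

double circle_area(double radius) {
    return pi * square(radius);
}

// --- the same component as a C++20 module (sketch, separate files) ---------
//   // geometry.cppm  (file extension varies by toolchain)
//   export module geometry;
//   export double circle_area(double radius);
//
//   // geometry.cpp
//   module geometry;
//   double circle_area(double radius) { /* same body as above */ }
//
//   // client.cpp
//   import geometry;
```

In the header version nothing in the language ties `geometry.h` to `geometry.cpp`; in the module version the `export` keyword states the interface directly.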

pjmlp 10mo ago

It makes lots of sense to anyone used to large scale software development.

It is no accident that Ada, Java, .NET, and oldies like Delphi, Eiffel, Modula-2 and Modula-3 have similar approaches.

Even the way D and Python modules and packages work, or the whole crates and modules approach in Rust.

Naturally, folks not used to Web scale don't get these kinds of features.

isatty 10mo ago

The code block styling is less than ideal.

nsoonhui 10mo ago

I really wonder whether LLMs are helpful in this case. This kind of task should be the forte of LLMs: well-defined syntax and requirements, abundant training material available, and outputs that are verifiable and validatable.

Perhaps we should use LLMs to convert all the legacy programs written in Fortran or COBOL into modern languages.

rsynnott 10mo ago

You are far from the first person to have this very, very bad idea.

No, LLMs are not good at refactoring.
