Good question. Only a few functions are implemented in .cc files and tagged with HWY_DLLEXPORT, notably memory allocation and detection of x86 CPU capabilities. The ops/intrinsics called from user code are inlined in headers. If it were necessary, we could likely strip out the .cc functions or provide a header-only library.
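To make the split concrete, here is a rough single-file sketch of the general pattern, not the library's actual code. Everything in it is hypothetical: `MYLIB_DLLEXPORT` stands in for an export macro such as HWY_DLLEXPORT, and `AllocateBytes`, `FreeBytes`, and `AddSaturated` are made-up names. It only illustrates that the few exported functions need a compiled definition, while inlined functions do not.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical export macro (an assumption for illustration only).
// It expands to nothing unless MYLIB_SHARED is defined.
#if defined(_WIN32) && defined(MYLIB_SHARED)
#ifdef MYLIB_BUILDING
#define MYLIB_DLLEXPORT __declspec(dllexport)
#else
#define MYLIB_DLLEXPORT __declspec(dllimport)
#endif
#elif defined(__GNUC__) && defined(MYLIB_SHARED)
#define MYLIB_DLLEXPORT __attribute__((visibility("default")))
#else
#define MYLIB_DLLEXPORT
#endif

namespace mylib {

// Would live in a header: declarations only. The bodies are compiled into
// the library, so these symbols must be exported from a shared library.
MYLIB_DLLEXPORT void* AllocateBytes(std::size_t bytes);
MYLIB_DLLEXPORT void FreeBytes(void* p);

// Would also live in a header: the body is visible to every caller and can
// be inlined, so no exported library symbol is needed.
inline int AddSaturated(int a, int b, int limit) {
  const long long sum = static_cast<long long>(a) + b;
  return sum > limit ? limit : static_cast<int>(sum);
}

// Would live in a .cc file: the out-of-line definitions.
void* AllocateBytes(std::size_t bytes) { return std::malloc(bytes); }
void FreeBytes(void* p) { std::free(p); }

}  // namespace mylib
```

In a layout like this, going header-only would mostly mean marking the last two definitions `inline` and moving them into the header, at which point the export macro is no longer needed.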