I agree: if you don't know anything about how convolution is implemented (filter packing, data packing, matrix multiplication, sum unpacking), you could be lost. But it's very shallow compared to a JIT or CUDA library scheme, and a knowledgeable ML performance engineer would have no difficulty
The inference function (at the end of the C file) is a series of blocks, each block corresponding to a convolution or other complex operation. It's straightforward to see which is which by looking at where the weights come from (a field in a struct that has the same name as the layer in your graph)
If you use perf top (for example) you can see which convolution was most expensive, and why. Does the shape of the tensor produce many small partial blocks around the edge, so the packing is inefficient (a lot of tile overhang)? You can see that by glancing at the code: there will be many specialized blocks around the edges. As a rule, if NN-512 generates little code for a tensor (few edge cases) you have chosen an efficient tensor shape, with respect to the tile
Or you might find that batch normalization is being done at inference time (as in DenseNet), instead of being integrated into the convolution weights (as in ResNet), because there's fanout from the source and a ReLU in between. You can see that easily in the generated code (the batch norm fmadd instructions will appear in the packing or unpacking code)
Is the matrix multiplication slow because there are too few channels per group (as in ResNeXt)? Easy to see in perf; make your groups bigger. Are you using an inefficient filter shape, so we have to fall back to a slower general-purpose convolution? You can easily see whether Winograd or Fourier was used
And so on