That Bluefield hardware looks neat, although it also sounds like a real project to program it :).
I can imagine two credible configurations for high efficiency:
1. A motherboard with a truly minimal CPU for bootstrapping but a fairly beefy PCIe root complex: 32 lanes to the DPU and a bunch of lanes for NVMe. The CPU doesn’t touch the data at all. I wonder if anyone makes a motherboard optimized like this. A 64-lane mobo with a Xeon in it would be quite wasteful, but fine for prototyping I suppose.
2. Wire up the NVMe ports directly to the Bluefield DPU, letting the DPU be the root complex. Presumably at least 28 of the lanes are usable for this, or maybe even all 32. It’s not entirely clear to me that the Bluefield DPU can operate without a host computer, though.
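
To put rough numbers on the lane budgets in both options, here is a back-of-the-envelope sketch. It assumes each NVMe drive takes an x4 link (typical for NVMe SSDs, but my assumption rather than anything from the Bluefield specs) and that in option 1 every lane not going to the DPU goes to NVMe. `nvme_drives` is just a made-up helper name.

```python
# Assumption: each NVMe drive uses a PCIe x4 link.
LANES_PER_NVME = 4


def nvme_drives(lanes_available, lanes_per_drive=LANES_PER_NVME):
    """Return how many whole drives fit in the given number of PCIe lanes."""
    return lanes_available // lanes_per_drive


# Option 1: 64-lane motherboard, 32 lanes reserved for the DPU.
option1 = nvme_drives(64 - 32)

# Option 2: DPU as root complex, with 28 or all 32 lanes usable for NVMe.
option2_low = nvme_drives(28)
option2_high = nvme_drives(32)

print("option 1 drives:", option1)
print("option 2 drives:", option2_low, "to", option2_high)
```

Under those assumptions the two options land in roughly the same place on drive count, so the real difference is whether a host CPU is needed at all.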