The organizations that release the models already publish (brag about) their models' performance. They could simply include in the same report the energy spent on training, fine-tuning, and inference, per X tokens.
This wouldn't have to measure every use, just give a "manufacturer's spec", the same as you get with, e.g., the energy class for household appliances (at least in the EU). Nobody goes around measuring refrigerator power usage, but when you're buying one, you get a rough indication of how "green" (or not) it is.