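A better approach is to split the model, with the MoE (mixture-of-experts) layers running on the CPU and the MLA (multi-head latent attention) layers running on the GPU. See the ktransformers project:
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
This takes advantage of the sparsity of MoE (only a few experts are active per token, so the CPU does relatively little compute per token even though it holds most of the weights) and of MLA's compact KV cache, which keeps the attention side's GPU memory footprint small.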
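To make the two effects concrete, here is a rough back-of-the-envelope sketch. The function names and all the numbers (8 of 256 experts per token, 128 heads of dimension 128, a 512-wide latent plus a 64-wide positional part) are illustrative assumptions of mine, not figures taken from the ktransformers docs, so plug in the real config of whatever model you are running.

```python
# Back-of-the-envelope arithmetic only; every number below is an
# illustrative assumption, not a value taken from ktransformers.

def active_expert_fraction(experts_per_token: int, total_experts: int) -> float:
    """Fraction of routed experts that actually run for a single token."""
    return experts_per_token / total_experts


def mha_cache_elems(num_heads: int, head_dim: int) -> int:
    """Cached elements per token per layer with standard multi-head
    attention: one key and one value vector for every head."""
    return 2 * num_heads * head_dim


def mla_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Cached elements per token per layer with MLA: one shared compressed
    latent plus a small positional (RoPE) component."""
    return latent_dim + rope_dim


# Hypothetical model shape, chosen only for illustration.
frac = active_expert_fraction(experts_per_token=8, total_experts=256)
mha = mha_cache_elems(num_heads=128, head_dim=128)
mla = mla_cache_elems(latent_dim=512, rope_dim=64)

print(f"experts active per token: {frac:.1%}")            # 3.1%
print(f"MHA cache elems per token per layer: {mha}")      # 32768
print(f"MLA cache elems per token per layer: {mla}")      # 576
print(f"KV-cache reduction: {mha / mla:.1f}x")            # 56.9x
```

With numbers in that ballpark, the experts are most of the weights but only a small slice of them is touched per token, which is why leaving them in CPU RAM is tolerable, while the attention state that has to live on the GPU shrinks by well over an order of magnitude.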