I'm not sure what you meant by "toy benchmarks", but to me the term has a very negative connotation. The SBCL compiler excels at complex code and regularly produces much better output than most Lisp compilers. CLOS is a very bad example for judging compiler quality, because its performance depends heavily on the CLOS implementation rather than on the compiler. To compare compilers, you would have to use an identical underlying CLOS implementation; a bad one can ruin your performance independently of the compiler used. For a meaningful comparison, one needs benchmarks where all of the code in question is compiled by each of the compilers being compared.
To my knowledge, SBCL switched to a precise GC some time ago, but I would still expect ACL's GC to outperform SBCL's.