Hmm, some further improvements:
typedef float F;typedef int I;
// saves 2 lines (but zero bytes)
#define R return
// OR
#define O(S,A,R) operator S(A){return R;}
#define E(F){E_((F||cudaPeekAtLastError)(),__FILE__,__LINE__);}
and so on - although you'd probably have to make semantic changes to get it onto a single card. (Or use a smaller font, but that's presumably cheating.)
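For what it's worth, here's a rough sketch of how the O macro might be used, in plain C++ so it can be tried without a GPU. The V struct, its constructor and the choice of % for the dot product are my own stand-ins for illustration, not taken from the code being discussed. Note that commas in the last argument have to sit inside parentheses (here, the constructor call) so the preprocessor doesn't split them into extra macro arguments.

```cpp
typedef float F;

// O(S,A,R) expands to: operator S(A){return R;}
#define O(S,A,R) operator S(A){return R;}

// Hypothetical 3-vector, only here to exercise the macro.
struct V {
    F x, y, z;
    V(F a, F b, F c) : x(a), y(b), z(c) {}
    V O(+, V r, V(x + r.x, y + r.y, z + r.z))  // V operator +(V r){return V(...);}
    V O(*, F s, V(x * s, y * s, z * s))        // scale by a scalar
    F O(%, V r, x * r.x + y * r.y + z * r.z)   // dot product
};
```

With this, `V(1,2,3) + V(4,5,6)` gives (5,7,9) and `V(1,2,3) % V(4,5,6)` gives 32. Each operator costs one short line, so the saving only pays off once there are enough operators to outweigh the #define itself.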