Also, FP tensor cores are way more expensive (in silicon area and power) than fixed-point tensor cores, and with some care it's very much practical to even train DNNs on the fixed-point ones.
E.g. it's common to have a full-width accumulator with s16 gradients, u8 activations, and s8 weights, where the MAC (FMA) chain of the tensor multiply is post-scaled by a learned u32 factor plus a follow-up learned shift. Together these act as a fixed-point multiplier whose point position is learned, re-scaling the result back into the u8 activation range.
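To make that concrete, here's a minimal sketch of that requantization step (function and parameter names are my own; the multiplier/shift values are made up for illustration): a wide s32 accumulator is multiplied by a u32 factor and arithmetically right-shifted, then clamped to the u8 range.

```python
import numpy as np

def requantize(acc, multiplier, shift):
    """Re-scale a wide s32 accumulator to the u8 activation range
    using a fixed-point multiplier (u32 factor + right shift).
    In the scheme described above, multiplier and shift would be learned."""
    # Widen to int64 so the multiply can't overflow, then shift down.
    scaled = (acc.astype(np.int64) * int(multiplier)) >> int(shift)
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Toy example: s8 weights times u8 activations, accumulated in s32.
w = np.array([[-3, 5], [2, -1]], dtype=np.int8)
x = np.array([10, 20], dtype=np.uint8)
acc = w.astype(np.int32) @ x.astype(np.int32)          # s32 accumulator: [70, 0]
y = requantize(acc, multiplier=3_000_000_000, shift=35)  # back to u8
```

The multiplier/shift pair is effectively a real-valued scale factor with the binary point in a learned position, which is why no FP hardware is needed on this path.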
By keeping the gradients sufficiently wide, it's practical to use a straight-through estimator for backpropagation.
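For reference, a straight-through estimator just means the backward pass treats the rounding step as identity. A minimal sketch (assuming the standard STE with clipping; names are mine):

```python
import numpy as np

def fake_quantize_u8(x, scale):
    """Forward pass: round x onto the u8 grid (non-differentiable step)."""
    return np.clip(np.round(x / scale), 0, 255) * scale

def ste_backward(upstream, x, scale):
    """Straight-through estimator backward pass: pass the (wider,
    e.g. s16) upstream gradient through the rounding step unchanged,
    zeroing only where x fell outside the representable u8 range."""
    inside = (x / scale >= 0) & (x / scale <= 255)
    return upstream * inside

# Usage: gradient flows through in-range values, is masked elsewhere.
x = np.array([-1.0, 10.0, 300.0])
g = ste_backward(np.ones_like(x), x, scale=1.0)  # [0., 1., 0.]
```

The wider gradient format matters here because the "pass it through unchanged" trick only works if the gradient itself hasn't already been crushed to the activation bit-width.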
I read a paper (well, two, actually) a few months ago that dealt with this (IIRC one was more about the hardware/ASIC aspects of fixed-point tensor cores, the other more about model-training experiments on existing low-precision integer-MAC chips, particularly with inference in mind). If requested, I can probably find them by digging through my system(s); I would have linked them already if a cursory search hadn't failed.