Better HN

0 points | pclmulqdq | 2y ago | 0 comments

They are putting the whole LLM into SRAM across multiple computing chips, IIRC. That is a very expensive way to go about serving a model, but should give pretty great speed at low batch size.
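To make the cost/speed trade-off concrete, here is a rough back-of-the-envelope sketch. Every number in it (parameter count, bytes per parameter, SRAM per chip) is a hypothetical placeholder, not a figure for any real chip or model, and it counts weights only, ignoring activations and KV cache.

```python
import math


def chips_needed(n_params: float, bytes_per_param: float,
                 sram_bytes_per_chip: float) -> int:
    """Minimum number of chips whose combined SRAM holds the weights alone."""
    return math.ceil(n_params * bytes_per_param / sram_bytes_per_chip)


def batch1_tokens_per_sec_bound(n_params: float, bytes_per_param: float,
                                aggregate_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on batch-size-1 decode speed, assuming every weight
    is read once per generated token and nothing else is a bottleneck."""
    return aggregate_bandwidth_bytes_per_sec / (n_params * bytes_per_param)


# Hypothetical inputs, chosen only for illustration.
N_PARAMS = 70e9          # assumed model size in parameters
BYTES_PER_PARAM = 1      # assumed 8-bit weights
SRAM_PER_CHIP = 200e6    # assumed 200 MB of on-chip SRAM per chip

print(chips_needed(N_PARAMS, BYTES_PER_PARAM, SRAM_PER_CHIP))
```

With on-chip SRAM measured in hundreds of megabytes and weights in tens of gigabytes, the chip count lands in the hundreds, which is where the expense comes from. The speed comes from the second function: at batch size 1 the decode rate is roughly bounded by memory bandwidth divided by model size, and SRAM bandwidth summed across that many chips is very high.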
