
deepnotderp 8y ago · 0 points

Just treat it as an infinite loop; there's no need to JIT in an optimized version that late.

grandmczeb 8y ago

One of the core operations of the transformer network[1] is a (LxL) x (LxE) matrix multiply (where L is the sentence length and E is the network width). Can you be more specific about how you would get good performance without specializing on L?

[1] https://arxiv.org/abs/1706.03762
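As a rough illustration of how strongly the cost of that multiply depends on L, here is a minimal sketch that counts the floating-point operations. The function name `matmul_flops` and the width E = 512 are assumptions made for the example, not something stated in the thread.

```python
def matmul_flops(L, E):
    """Approximate FLOPs for an (L x L) x (L x E) multiply.

    Each of the L * E outputs needs L multiplies and L adds,
    so the total is roughly 2 * L * L * E.
    """
    return 2 * L * L * E


E = 512  # assumed network width, for illustration only
for L in (1, 8, 64, 512):
    print(f"L={L:4d}  flops={matmul_flops(L, E):,}")
```

With these assumed numbers the work grows from about a thousand operations at L = 1 to hundreds of millions at L = 512, which is the range a single kernel would have to cover.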

deepnotderp (OP) 8y ago

You use the loop-based GEMM kernel and inject the input size as the loop bounds.
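A minimal sketch of what a size-generic, loop-based GEMM might look like, assuming row-major flat arrays. The name `gemm` and its parameter order are hypothetical, and Python is used only for readability; a real kernel would be compiled code.

```python
def gemm(a, b, m, k, n):
    """Compute C = A @ B with A (m x k) and B (k x n).

    Matrices are flat, row-major lists. The sizes m, k, n are ordinary
    run-time arguments, so the same code handles any input shape.
    """
    c = [0.0] * (m * n)
    for i in range(m):
        for p in range(k):
            a_ip = a[i * k + p]  # reused across the whole inner loop
            for j in range(n):
                c[i * n + j] += a_ip * b[p * n + j]
    return c


# Usage: a 2 x 2 by 2 x 2 product.
print(gemm([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2))
```

Nothing here is specialized on the sizes: the loop bounds are simply read from the arguments on every call.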

grandmczeb 8y ago

L can be as small as 1 or larger than 512. For small L it makes sense to apply different optimizations than for large L. A loop-based GEMM doesn't help with that.
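One way to picture size-dependent optimization is a dispatcher that picks a kernel based on L. The sketch below is only an illustration: the threshold `SMALL_L`, the block size, and all function names are assumptions, and the two Python kernels stand in for what would really be separately tuned compiled code.

```python
SMALL_L = 8  # hypothetical cutoff between the two strategies


def matmul_small(w, v):
    """Plain triple loop; for tiny L the blocking overhead is not worth it."""
    L, E = len(w), len(v[0])
    return [[sum(w[i][k] * v[k][j] for k in range(L)) for j in range(E)]
            for i in range(L)]


def matmul_blocked(w, v, block=32):
    """Tiled over i and k so each tile of w and v is reused while it is hot."""
    L, E = len(w), len(v[0])
    out = [[0.0] * E for _ in range(L)]
    for i0 in range(0, L, block):
        for k0 in range(0, L, block):
            for i in range(i0, min(i0 + block, L)):
                row_w, row_out = w[i], out[i]
                for k in range(k0, min(k0 + block, L)):
                    w_ik, row_v = row_w[k], v[k]
                    for j in range(E):
                        row_out[j] += w_ik * row_v[j]
    return out


def choose_kernel(L):
    """Select a kernel from the sentence length L."""
    return matmul_small if L <= SMALL_L else matmul_blocked
```

Both kernels compute the same (L x L) x (L x E) product; only the loop structure differs, which is the kind of choice a single generic loop nest cannot make on its own.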
