MTIA v1's specs: the accelerator is fabricated in a TSMC 7nm process and runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision. It has a thermal design power (TDP) of 25 W and up to 128 GB of LPDDR5 RAM.
Google's Cloud TPU v4: 275 TFLOPS (bf16 or int8), 90/170/192 W, 32 GiB of HBM2 RAM at 1200 GB/s. From here: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...
So it seems that the Google Cloud TPU v4 has the advantage in compute per chip and in RAM bandwidth, while the Meta chip is much more efficient (2x to 4x, it's hard to tell) and has more RAM, but slower RAM?
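For what it's worth, the figures above pencil out like this (a rough sketch; I'm assuming INT8 throughput for both chips and reading the three TPU v4 wattages as a min/typical/peak range):

```python
# Back-of-the-envelope perf/W from the figures quoted above.
mtia_tops, mtia_watts = 102.4, 25
tpu_tops = 275
tpu_watts = (90, 170, 192)  # the three power figures quoted

mtia_eff = mtia_tops / mtia_watts             # ~4.1 TOPS/W
tpu_eff = [tpu_tops / w for w in tpu_watts]   # ~3.1, ~1.6, ~1.4 TOPS/W
ratios = [mtia_eff / e for e in tpu_eff]      # ~1.3x, ~2.5x, ~2.9x
```

So the efficiency edge looks more like 1.3x to 2.9x, depending on which TPU power figure you compare against.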
>We found that GPUs were not always optimal for running Meta’s specific recommendation workloads at the levels of efficiency required at our scale. Our solution to this challenge was to design a family of recommendation-specific Meta Training and Inference Accelerator (MTIA) ASICs.
You come up with a clever ASIC that beats their current GPU on your workload… and by the time it comes out, they've released next year's chip with, say, 50% more memory bandwidth or something ridiculous like that, and it beats you by pure grunt.
“No replacement for displacement” actually seems to be true in compute.
Some companies definitely played games and mined with the ASICs themselves (and then shipped those used ASICs)... but in general, it was always a lot more profitable to sell the shovels than to mine the gold.
Is it primarily for inference, with training just an afterthought?
Each PE is equipped with two processor cores (one of them equipped with the vector extension) and a number of fixed-function units that are optimized for performing critical operations, such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V open instruction set architecture (ISA) and are heavily customized to perform necessary compute and control tasks.
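To make the quoted description a bit more concrete, here's a toy NumPy model of the kinds of operations those fixed-function units offload; the shapes and the ReLU-style nonlinearity are my assumptions for illustration, not anything from the MTIA spec:

```python
import numpy as np

# Illustrative only: matrix multiplication, accumulation, and a
# nonlinear function, i.e. the critical ops the quoted PE
# fixed-function units are said to accelerate.
def pe_matmul_accumulate(acc, a, b):
    # matmul + accumulate into a running accumulator
    return acc + a @ b

def pe_nonlinear(x):
    # a ReLU-style nonlinear function unit (assumed, for illustration)
    return np.maximum(x, 0.0)

a = np.ones((4, 4))
b = np.ones((4, 4))
acc = np.zeros((4, 4))
acc = pe_matmul_accumulate(acc, a, b)  # every entry becomes 4.0
out = pe_nonlinear(acc - 8.0)          # all entries clamp to 0.0
```

On real hardware these steps run in dedicated silicon rather than on the RISC-V cores, which mainly handle control and the less common compute.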
Side note: the chip says Korea on it, so I expected it was made by Samsung... but it's a TSMC-made chip? What's up with that?
So there are two process-node generations of immediate improvement available.
Meta is going to use it in its datacenters; it's much more efficient than Nvidia's general-purpose GPUs. They are serious about putting AI everywhere.
Amazing times! Private companies now have compute resources that previously only showed up in government labs, in many cases built from novel components like MTIA.
This feels like the start of a golden age, and in a few years we will have incredible results and breakthroughs.