Yes, the automated labelling that Tesla implemented (which replaced a large team they had doing manual labelling) consisted of a bunch of different things.
Generating a training set, training on it, and then running inference with the trained model are three different things.
1) Generating the auto-labelled training set was of course done on Tesla's supercomputer, based on data from thousands of cars.
2) Using the generated training set to train the in-car model would also be done offline.
3) The trained (and tested) model is then deployed to the car and used by the vision system to label image segments ("stop sign", "cyclist", etc.).
How could this be divided up any other way?!
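To make the split concrete, here is a toy sketch of the three stages in plain Python. Everything in it is made up for illustration and has nothing to do with Tesla's actual code: the function names, the majority-vote "auto-labeller", the two-number feature vectors and the nearest-centroid "model" are all assumptions. The only point is that stages 1 and 2 run offline and only stage 3 runs in the car.

```python
from collections import Counter

# Stage 1 (offline): build an auto-labelled training set from fleet data.
# Hypothetical stand-in: each clip arrives with several noisy label guesses,
# and the "auto-labeller" simply takes the majority vote.
def auto_label(fleet_clips):
    training_set = []
    for features, guesses in fleet_clips:
        label, _ = Counter(guesses).most_common(1)[0]
        training_set.append((features, label))
    return training_set

# Stage 2 (offline): train the in-car model on that training set.
# Hypothetical stand-in: a nearest-centroid classifier (one mean vector per label).
def train(training_set):
    sums = {}
    counts = Counter()
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

# Stage 3 (in the car): run inference with the deployed model.
def infer(model, features):
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# Made-up fleet data: (feature vector, noisy label guesses).
fleet_clips = [
    ((0.9, 0.1), ["stop sign", "stop sign", "cyclist"]),
    ((0.8, 0.2), ["stop sign", "stop sign", "stop sign"]),
    ((0.1, 0.9), ["cyclist", "cyclist", "stop sign"]),
    ((0.2, 0.8), ["cyclist", "cyclist", "cyclist"]),
]

training_set = auto_label(fleet_clips)   # stage 1, offline
model = train(training_set)              # stage 2, offline
prediction = infer(model, (0.85, 0.15))  # stage 3, in the car
print(prediction)
```

Note that the car only ever sees `model` and `infer`; the labelling and training code never needs to ship with it.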
Karpathy seems like a great guy, but honestly there seems to be little to nothing in his background that makes him stand out as an architecture guy or as especially creative. Maybe his thesis on image captioning is his most creative work, but at the end of the day it consisted of feeding the output of a CNN into an LSTM. That is conceptually very similar to the way language translation was being done at the time, by feeding the output of an encoder LSTM for language A into a decoder LSTM for language B, except that Karpathy used an image encoder (an off-the-shelf CNN) since he wanted to describe (caption) images. It was certainly at least somewhat innovative at the time, but what he was really famous/popular for at Stanford was teaching the CS 231n class on using CNNs, and this is what he continues to be best known for: explaining how things work.
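For anyone who hasn't seen the encoder/decoder wiring, here is a rough sketch of the idea in plain Python. It is not Karpathy's model: the "CNN" is just a per-row mean, the decoder is a plain tanh RNN standing in for the LSTM, the weights are random and untrained, and feeding the image features in as the decoder's initial state is only one assumed way of connecting the two. All names (`encode_image`, `decode`, `VOCAB`) are hypothetical.

```python
import math
import random

VOCAB = ["<start>", "a", "cyclist", "<end>"]
HIDDEN = 4

def encode_image(image):
    """Stand-in for an off-the-shelf CNN: one mean value per image row."""
    return [sum(row) / len(row) for row in image]

def make_matrix(rows, cols, rng):
    """Random (untrained) weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def decode(features, params, max_len=5):
    """Greedy decoding with a plain tanh RNN (standing in for the LSTM)."""
    w_init, w_in, w_h, w_out = params
    # The image features set the decoder's initial hidden state, the same
    # role the encoder's output plays in encoder/decoder translation.
    h = [math.tanh(x) for x in matvec(w_init, features)]
    token = 0  # index of "<start>"
    caption = []
    for _ in range(max_len):
        one_hot = [1.0 if i == token else 0.0 for i in range(len(VOCAB))]
        h = [math.tanh(a + b)
             for a, b in zip(matvec(w_in, one_hot), matvec(w_h, h))]
        scores = matvec(w_out, h)
        token = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return caption

rng = random.Random(0)
image = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # toy 3x2 "image"
params = (
    make_matrix(HIDDEN, 3, rng),           # image features -> initial state
    make_matrix(HIDDEN, len(VOCAB), rng),  # previous token -> hidden
    make_matrix(HIDDEN, HIDDEN, rng),      # hidden -> hidden
    make_matrix(len(VOCAB), HIDDEN, rng),  # hidden -> vocabulary scores
)
caption = decode(encode_image(image), params)
print(caption)
```

Since nothing is trained, the printed words are meaningless; the sketch only shows where the image encoder plugs into the decoder. Swap `encode_image` for an encoder RNN over source-language tokens and you have the translation setup.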