
0 points · HarHarVeryFunny · 6d ago · 0 comments

The auto-labeling work (which has been partially described at Tesla AI Day events) seems more like engineering than research: a grab bag of techniques that I would guess the whole team must have contributed to. For example, they auto-label low-resolution or indeterminate objects (image segments) using temporal continuity. Something that is a low-res blob in the distance becomes a high-res, easy-to-identify object when you drive by it, so by tracking objects backwards across frames you can learn to label the low-res blob more confidently. Things like this are useful, but it's the sort of stuff that engineers and developers come up with every day.
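A minimal sketch of that backward-propagation idea, to make it concrete. This is not Tesla's implementation: the `Detection` record, the `propagate_labels_backward` function and the `min_confidence` threshold are all hypothetical, and it assumes some tracker has already given the same `track_id` to the same object across frames.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    frame: int             # frame index within the clip
    track_id: int          # assumed to come from an upstream tracker
    label: Optional[str]   # None when the classifier could not decide
    confidence: float      # classifier confidence in [0, 1]


def propagate_labels_backward(
    detections: List[Detection], min_confidence: float = 0.9
) -> List[Detection]:
    """Copy confident labels back to earlier detections on the same track."""
    nearest_later_label = {}  # track_id -> label of nearest later confident hit
    result = []
    # Walk the clip from the last frame to the first.
    for det in sorted(detections, key=lambda d: d.frame, reverse=True):
        if det.label is not None and det.confidence >= min_confidence:
            # Confident detection: remember its label for earlier frames.
            nearest_later_label[det.track_id] = det.label
            result.append(det)
        elif det.track_id in nearest_later_label:
            # Low-res blob: borrow the label from the later, clearer view.
            result.append(
                Detection(det.frame, det.track_id,
                          nearest_later_label[det.track_id], det.confidence)
            )
        else:
            # No confident view of this object later in the clip.
            result.append(det)
    result.sort(key=lambda d: (d.frame, d.track_id))
    return result
```

Detections that never get a confident later view stay unlabelled, which is where a human labeller or some other technique would still be needed.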


ricardobeat · 6d ago

Not back in 2016.

HarHarVeryFunny (OP) · 6d ago

You don't think that tracking objects from frame to frame is obvious?!

I can guarantee you this was built in from day one.

I'm guessing you're not a developer if you don't automatically think of edge cases like "what if car #1 isn't in the preceding frame?" (then you look at some relevant test data and see it was there, unlabelled).

ricardobeat · 5d ago

Obvious in hindsight and obvious at the time are very different things.

You seem to have missed the main point anyway: using a larger model to generate labels for a smaller one is what the parent was highlighting, not the temporal labeling alone. The gold standard at the time was human labeling (e.g. Waymo). Deep learning was just having its moment, all of this stuff was cutting edge, and there is a lot of work between a published paper and actually applying it to production vehicles.

HarHarVeryFunny (OP) · 5d ago

Yes, the automated labelling (which replaced a large team they had doing manual labelling) that Tesla implemented consisted of a bunch of different things.

Generating a training set, training on it, and then running inference with the trained model are three different things.

1) Generating the auto-labelled training set was of course done on Tesla's supercomputer, based on data from 1000s of cars.

2) Using the generated training set to train the in-car model would also be done offline.

3) The trained (and tested) model is then deployed to the car and used by the vision system to label image segments ("stop sign", "cyclist", etc.).

How could this be divided up any other way?!
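The three stages above can be sketched as a toy pipeline. Everything here is an illustrative assumption rather than anything Tesla has described: `teacher_label` stands in for the large offline auto-labeller with a made-up rule, and the "in-car model" is a nearest-centroid classifier over small feature tuples.

```python
from collections import defaultdict


def teacher_label(feature):
    """Stage 1 stand-in: big offline labeller (hypothetical rule)."""
    return "stop sign" if feature[0] > 0.5 else "cyclist"


def generate_training_set(raw_features):
    """Stage 1: auto-label fleet data offline."""
    return [(f, teacher_label(f)) for f in raw_features]


def train_student(training_set):
    """Stage 2: train the small model offline (one centroid per label)."""
    grouped = defaultdict(list)
    for feature, label in training_set:
        grouped[label].append(feature)
    centroids = {}
    for label, feats in grouped.items():
        dim = len(feats[0])
        centroids[label] = tuple(
            sum(f[i] for f in feats) / len(feats) for i in range(dim)
        )
    return centroids


def infer(centroids, feature):
    """Stage 3: the deployed model labels a new image segment."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], feature))


if __name__ == "__main__":
    fleet_data = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.7)]
    student = train_student(generate_training_set(fleet_data))
    print(infer(student, (0.7, 0.3)))
```

Only `infer` and the learned centroids would need to run in the car; the teacher and the training step stay offline.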

Karpathy seems like a great guy, but honestly there seems to be little to nothing in his background that makes him stand out as an architecture guy or as being very creative. Maybe his thesis on image captioning is his most creative work, but at the end of the day it consisted of feeding the output of a CNN into an LSTM. That is conceptually very similar to the way language translation was being done at the time, by feeding the output of an encoder LSTM for language A into a decoder LSTM for language B, except that Karpathy used an image encoder (an off-the-shelf CNN) since he wanted to describe (caption) images. It was certainly at least somewhat innovative at the time, but what he was really famous for at Stanford was teaching the CS 231n class on using CNNs, and this is what he continues to be best known for: explaining how things work.
