undefined | Better HN

0 pointsabecedarius4y ago0 comments

> very difficult, if not impossible, thing for AI to do right now

Is that right? I thought computer vision was getting pretty good at object detection in the past few years.

0 comments

3 comments · 1 top-level

simonh4y ago· 2 in thread

Those are just image classifiers though. They associate this pixel patters with the character string "table". They might be able to recognise this image as relating to both table and person, but most likely as competing possible classifications. They generally don't have a concept of relatedness between classifications. They have no spacial conceptual model relating concepts like 'above', beside', 'below', distance, etc.

They also have no idea what a table is, no general concept of furniture, or even of a table as a physical or 3 dimensional object. They certainly have no idea how it relates to the human form or what "sitting" on something means. It's literally just pixel pattern => string("table"). That's it.

There may be some experimental models that attempt such things, but the image classifiers that get all the press these days are very good at classification but nothing else.

abecedariusOP4y ago

Image classifiers aren't what I said. BabyTalk in 2013 could take a photo and write out "This is a photo of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa." (All correct.) That's almost prehistoric in neural-network years.

heyitsguay4y ago

The problem is going from results on curated datasets - usually with at least a little implicit overfitting, since even if you reserve test data, it's typically chosen in the dataset creation process - to data in the wild. BabyTalk had that cool result in 2013, and there have been great papers following up in that area, but there's still not, e.g., a website where you can upload a photo and reliably get an accurate description. It's not because nobody's tried, it's because it's hard - neural network vision methods are still generally quite brittle to unexpected variations in data, and "accurate description" is actually a very loaded term that typically needs some sort of in-the-world contextualization to unpack.

j / k navigate · click thread line to collapse