A lot of processes downstream of inference benefit from a minimum of care put into the system design. We're talking 80/20-rule stuff here. It's a simple reorientation relative to a janky argmax classifier, but it results in your assumptions being obeyed broadly, in a max-entropy sense.
The key insight is that any prediction model y = f(x) can equally be framed as an energy-based model E = g(x, y), and that the job of ML is to estimate the joint distribution of x and y with a suitable max-entropy surrogate distribution, then perform MLE on this variational distribution against some training data. All the math in the theory follows from this (perhaps excluding causal stuff, but I'm not familiar enough with those techniques to say for sure). Things get a little more complicated once you consider e.g. autoencoders, but the above still holds.
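To make the reframing concrete, here's a minimal sketch (my own illustration, with made-up logits standing in for some classifier f(x)): define the energy as the negated logit, E(x, y) = -f(x)[y], and the max-entropy surrogate is the Boltzmann/softmax distribution over those energies. The argmax prediction is untouched by the change of view.

```python
import numpy as np

# Hypothetical logits f(x) from some classifier over 3 classes.
logits = np.array([2.0, 0.5, -1.0])

# Energy-based view: E(x, y) = -f(x)[y], so low energy = favored class.
energies = -logits

# Max-entropy surrogate over the energies: p(y | x) ∝ exp(-E(x, y)),
# i.e. the familiar softmax.
p = np.exp(-energies) / np.exp(-energies).sum()

# Same predictor, two framings: the argmax of the logits is the
# argmin of the energies is the mode of the surrogate distribution.
assert np.argmax(logits) == np.argmin(energies) == np.argmax(p)
```

Nothing about the model changed; the EBM view just makes the implied distribution explicit so you can reason about it.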
Obviously, with a poor choice of surrogate distribution, your predictions will be worse on average. Yes, even if you don't care about probabilities and just want max-likelihood predictions: your predictions will still be worse on average. By construction, the analysis proceeds by framing the problem this way and following through. A janky argmax classifier is not exempt from this; it, too, already implies a surrogate distribution, but, statistically speaking, it's probably a pretty bad one. So it makes sense to put in a tiny bit more effort to get much closer to representing the space your data actually lives in.
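A toy demonstration of the point (a sketch of my own, not from any particular library): two models trained identically by MLE on the same XOR-like data, differing only in the surrogate family. The misspecified one is worse even on bare argmax (thresholded) predictions, probabilities aside.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data whose structure is multiplicative: y = 1 iff x1 * x2 > 0.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def fit_logreg(feats, y, steps=2000, lr=0.5):
    """MLE for a Bernoulli surrogate with logits feats @ w (logistic regression)."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feats @ w))
        w += lr * feats.T @ (y - p) / len(y)  # gradient ascent on log-likelihood
    return w

# Surrogate 1: linear in x plus a bias — a poor match for the data's structure.
F1 = np.c_[X, np.ones(len(X))]
acc1 = np.mean(((F1 @ fit_logreg(F1, y)) > 0) == y)

# Surrogate 2: adds the x1*x2 interaction — matches the generative structure.
F2 = np.c_[X, X[:, 0] * X[:, 1], np.ones(len(X))]
acc2 = np.mean(((F2 @ fit_logreg(F2, y)) > 0) == y)

# Same data, same MLE training; only the surrogate family differs, yet the
# thresholded predictions of the misspecified model hover near chance
# while the well-specified one is nearly perfect.
```

Both models "just predict", but the one whose surrogate family can represent the data wins on raw prediction accuracy too.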
Naturally, you could easily find a janky model that outperforms a relatively unoptimized principled model on a specific use case, and many do get lucky this way. But the principled model has a lot more headroom, specifically in terms of the information it can hold: if the design is more or less correct for the problem specification, then the inductive bias built into the model closely matches the structure of the observed data.