
DuskStar7y ago

If the sample is "all the gas turbines I own", I don't particularly CARE about the bias...

TheCoelacanth7y ago

If the training data is all gas turbines that you own, why do you care about having the ML model at all? You already have complete knowledge of the state of all your gas turbines.

There's no point to having an ML model unless you are applying it to something outside of the training data.

If you plan on applying the model to different turbines, then there is potential for sample bias in which turbines you selected. If you apply it to the same turbines at some point in the future, then you have sampled points in time, so there is potential for sample bias in which points in time you selected.

There is no way of completely avoiding the potential for sample bias unless you completely abandon ML as a useful concept.

DuskStarOP7y ago

Well, I might care about predicting the next turbine to fail. If Siemens sensors are truly unrelated to the issues, that'll average out eventually - but I'd be highly skeptical of someone asserting that it's completely unrelated to the failures and not just covarying with something we're not using as a model input.

Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.
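To make the "covarying with something we're not using as a model input" point concrete, here is a minimal sketch. The records are entirely hypothetical, and the hidden input is assumed to be turbine age; nothing in the thread says age is the real driver. By construction, vendor has no effect on failures within an age band, yet the per-vendor rates differ because the Siemens-equipped turbines are mostly old.

```python
from collections import defaultdict

# Hypothetical records: (vendor, age_band, failed).
# Within each age band, both vendors fail at the same rate by construction.
RECORDS = (
    [("siemens", "old", True)] * 40 + [("siemens", "old", False)] * 40
    + [("siemens", "new", True)] * 2 + [("siemens", "new", False)] * 18
    + [("other", "old", True)] * 10 + [("other", "old", False)] * 10
    + [("other", "new", True)] * 8 + [("other", "new", False)] * 72
)


def failure_rate(records, key):
    """Fraction of failed turbines in each group defined by `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        failures[group] += record[2]  # True counts as 1, False as 0
    return {group: failures[group] / totals[group] for group in totals}


# Vendor alone looks predictive...
by_vendor = failure_rate(RECORDS, key=lambda r: r[0])
# ...but the gap disappears once the hidden variable is included.
by_vendor_and_age = failure_rate(RECORDS, key=lambda r: (r[0], r[1]))

print(by_vendor)
print(by_vendor_and_age)
```

Grouped by vendor alone, Siemens comes out at 0.42 against 0.18 for everything else. Grouped by vendor and age, both vendors sit at 0.5 for old turbines and 0.1 for new ones, so in this toy data the vendor signal is entirely a stand-in for age.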

TheCoelacanth7y ago

Predicting the next turbine to fail means you are sampling over points in time, so you could still have sample bias.

Say that turbines have an average lifespan of X years. From year 0 to 10 you bought 90% Siemens, and from year 10 to 20 you bought 10% Siemens. Then you measure failure rates from year X to year X+10.

Based on that data, you would predict that Siemens turbines are the most likely to fail next, but they are probably less likely to fail, because most of the ones that were likely to fail soon are already gone.
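The scenario above can be sketched as a small deterministic simulation. The numbers are assumptions made for illustration: X is taken to be 15 years, every turbine is assumed to fail exactly 15 years after purchase (no spread), and 10 turbines are bought per year. Vendor has no effect on lifespan by construction.

```python
from collections import Counter

LIFESPAN = 15  # assumption: every turbine fails exactly 15 years after purchase


def build_fleet():
    """Buy 10 turbines a year for 20 years: 90% Siemens at first, then 10%."""
    fleet = []
    for year in range(20):
        siemens_count = 9 if year < 10 else 1
        for i in range(10):
            vendor = "siemens" if i < siemens_count else "other"
            fleet.append((vendor, year))
    return fleet


def failure_rate_by_vendor(fleet, start, end):
    """Failures in [start, end) per turbine that had not failed before `start`."""
    at_risk = Counter()
    failed = Counter()
    for vendor, bought in fleet:
        fail_year = bought + LIFESPAN
        if fail_year >= start:
            at_risk[vendor] += 1
            if fail_year < end:
                failed[vendor] += 1
    return {vendor: failed[vendor] / at_risk[vendor] for vendor in at_risk}


fleet = build_fleet()
observed = failure_rate_by_vendor(fleet, 15, 25)   # the window you measured
following = failure_rate_by_vendor(fleet, 25, 35)  # the window you want to predict
print(observed)
print(following)
```

In the measured window, Siemens shows a failure rate of 0.9 against 0.1 for the rest, purely because the early purchases were mostly Siemens. In the following window the surviving turbines of both vendors fail at the same rate, and only 10 of the 100 failures are Siemens. With a fixed lifespan the sketch cannot show the remaining Siemens units being less likely to fail; that would need a spread of lifespans.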

throwawaymath7y ago

The spurious correlation between Siemens sensors and turbine failures will not average out eventually if you have a sampling bias in your dataset.

throwawaymath7y ago

You really should. If the sample is "all the gas turbines you own" and you disproportionately use Siemens sensors, your turbine failure forecast will (with high likelihood) reduce to a Siemens sensor forecast. This is entirely plausible even if your sample's correlation between Siemens sensors and turbine failures is completely spurious.

DuskStarOP7y ago

You can't have a sampling bias when 'sampling' the entire population, because the definition of 'sampling bias' includes 'some members are not included in the sample'.

throwawaymath7y ago

Precisely, yes. I'm talking about a sample including all representative gas turbine failures, across all sensor vendors.

TheCoelacanth7y ago

You can't make predictions when sampling the entire population.

chobeat7y ago

You should, because you might make worse decisions for the business, for the system, or for the people who are impacted by the system. If you don't have the right data to decide, don't decide using the data.
