Amazon Machine Learning – Make Data-Driven Decisions at Scale

(aws.amazon.com)

251 points | leef | 11y ago | 51 comments

51 comments

rm999 11y ago

Meh. The more I do machine learning in industry the more I realize how little the ML part matters compared to everything else. A typical project I've seen takes 3-6 months and contains thousands of lines of code, but the machine learning part will take a week or two and be 100 lines of code. What Amazon ML is doing would probably take an hour and 30 lines of R code you can easily find online.

And here's the not-too-hidden secret: the ML part is the fun part. It's a big reason we spend months creating banking.csv. Josh Willis did a very funny presentation at MLconf partly about this. It's like waiting in line at a theme park for an hour, and then paying someone to cut in line at the last minute and record the ride for you. https://www.youtube.com/watch?v=4Gwf5zsg4vI&feature=youtu.be...

sytelus 11y ago

The hardest part in machine learning is not training the model but debugging the model. How do you improve precision/recall after the first cut? Do you need more training data? Is some of your training data bad? Is it properly distributed? Does your feature have a bug? Are you missing features to cover some cases? Is your feature selection effective? Did you tune the parameters carefully?

All these scenarios are difficult to debug because it's "statistical debugging". There are no breakpoints to set or watch windows to look at. There is no stack trace and there are no exceptions. Any Joe can train a model given training data; it takes a fair bit of genius to debug these issues and push model performance to the next level. Unfortunately, all these new and old "frameworks" almost completely ignore this debugging part. I think the first framework that has great debugging tools will revolutionize ML like Borland revolutionized programming with its visual IDEs.
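
One low-tech aid for this kind of "statistical debugging" is to break precision and recall down by data segment, which can hint at where the training data is thin or mislabeled. A minimal Python sketch follows; the segment names and the example records are hypothetical, made up for illustration, and do not come from any framework mentioned in the thread:

```python
from collections import defaultdict


def precision_recall(pairs):
    """Return (precision, recall) for an iterable of (y_true, y_pred) 0/1 pairs."""
    tp = fp = fn = 0
    for y_true, y_pred in pairs:
        if y_pred == 1 and y_true == 1:
            tp += 1
        elif y_pred == 1 and y_true == 0:
            fp += 1
        elif y_pred == 0 and y_true == 1:
            fn += 1
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def per_segment(records):
    """Group (segment, y_true, y_pred) records and score each segment separately."""
    groups = defaultdict(list)
    for segment, y_true, y_pred in records:
        groups[segment].append((y_true, y_pred))
    return {seg: precision_recall(pairs) for seg, pairs in groups.items()}


# Hypothetical records: (segment, true label, predicted label).
records = [
    ("mobile", 1, 1), ("mobile", 0, 0), ("mobile", 1, 1),
    ("desktop", 1, 0), ("desktop", 0, 1), ("desktop", 1, 1),
]
for segment, (p, r) in per_segment(records).items():
    print(f"{segment}: precision={p:.2f} recall={r:.2f}")
```

A segment that scores far below the others is a reasonable first place to look for bad labels, missing features, or too little data.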

sgt101 11y ago

This. The pity is that as soon as we get the results, after a week, the project is over and we move back to data wrangling hell!

benhamner 11y ago

You hit the nail on the head. This completely agrees with all my experience at Kaggle and applying machine learning across a broad range of industries.

firebones 11y ago

Maybe it's just me, but the "tedious" feature design and extraction IS the fun part. Am I the only one?

I mean, it's time consuming and frustrating, but it's also the essence of ML work and the place where I get to apply creativity and gain insight.

sputknick 11y ago

Agree 100%. In that light, does anyone know how far away we are from having data wrangling be more automated? I saw a demo for a product called Paxata a few weeks ago; it looked like a good start. Anyone know more about things like that?

sixdimensional 11y ago

There are lots of new attempts at data wrangling approaches/tools, each with different caveats: Datameer, Platfora, Trifacta...

PaulHoule 11y ago

I can say that this is my day job now.

fsloth 11y ago

I think the "1 part fun 9 parts of perspiration" ratio is typical of most software fields - especially fields working in established industries. That's why dealing with software in a professional context is called a job and not an enjoyable hobby which it otherwise would be :)

crypto5 11y ago

I think this is one of the advertised advantages of deep learning: it will find useful and unobvious features in your data corpus without much effort on your part.

kevinskii 11y ago

I think that works in theory, but in many real world cases it actually takes a human to map the data into a subset of salient features. It's not simply a matter of excluding irrelevant dimensions.

Gimpei 11y ago

Isn't the point here that you can do it on huge datasets that don't work nicely with R?

noelsusman 11y ago

There are plenty of tools for that already. The point here is to make it as easy as possible.

I guess this could be useful for some people, but it seems rudimentary to me. If I'm reading their FAQ right they're just fitting a logistic regression to everything. I'm hoping this is just a starting point. Also, not being able to export the actual model seems like a huge dealbreaker to me.

etrain 11y ago

My guess is they're using LIBLINEAR or Vowpal Wabbit under the hood. Both support SGD-based learning and work well in a streaming setting where data could be on disk or in memory.
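
To make the "SGD-based learning in a streaming setting" idea concrete, here is a minimal pure-Python sketch of logistic regression trained one example at a time from an iterator. It illustrates the general technique only; it is not how Amazon ML, LIBLINEAR, or Vowpal Wabbit are actually implemented, and the toy data, function names, and learning rate are assumptions chosen for illustration:

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, x):
    """Predicted probability that the label is 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def sgd_logistic(stream, n_features, learning_rate=0.5):
    """Fit logistic regression in one pass over an iterable of (x, y) examples."""
    weights = [0.0] * n_features
    for x, y in stream:
        error = predict(weights, x) - y  # gradient of the log loss w.r.t. the logit
        for i, xi in enumerate(x):
            weights[i] -= learning_rate * error * xi
    return weights


def toy_stream(repeats=50):
    """Hypothetical data: x = [bias, value], label is 1 when value > 0."""
    for _ in range(repeats):
        for value in (-2.0, -1.0, 1.0, 2.0):
            yield [1.0, value], 1 if value > 0 else 0


weights = sgd_logistic(toy_stream(), n_features=2)
print(predict(weights, [1.0, 2.0]), predict(weights, [1.0, -2.0]))
```

Because the trainer only ever holds one example and the weight vector in memory, the same loop works whether the stream comes from a list, a file on disk, or a network source.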

curiously 11y ago

Do you mean that the work surrounding the machine learning, like gathering data to build a dataset, takes months and other resources, whereas the fun part is actually very short and easy to do?

I smell commoditization.

blumkvist 11y ago

Can you elaborate?

vmarsy 11y ago

Is this just Amazon catching up with Azure ML, launched last year? (And cutting prices by 80%)

Azure ML also supports R and Python custom code, which can be dropped directly into your workspace.

And this was even before Microsoft acquired Revolution Analytics. Amazon ML seems to be less flexible with regard to importing your own models:

Q: Can I export my models out of Amazon Machine Learning?

No.

Q: Can I import existing models into Amazon Machine Learning?

No.

http://blogs.microsoft.com/blog/2014/06/16/microsoft-azure-m...

https://aws.amazon.com/machine-learning/faqs/

http://azure.microsoft.com/en-us/services/machine-learning/

aficionado 11y ago

No... it's Amazon ML and Azure ML trying to catch up with BigML. They copied many things from our service but forgot to copy the ease of use. Services like Azure ML, Amazon ML and even Google Predict API work like a black box, and lock your model away, making you extremely dependent on their proprietary service. With BigML, you can easily export your models and use them anywhere for free. If the goal is to democratize machine learning, then the ability to extract your models and use them as you see fit is essential, and only BigML offers that level of freedom.

okisan 11y ago

I just tried out BigML and it looks awesome. I use the Google Prediction API to fill in a value on a form during a web request, so I need the result immediately. Why does BigML require two web requests and take so long to get a prediction from a trained model?

ris 11y ago

Yeah sure, why not make your business process depend on a closed proprietary cloud-based product?

(in all fairness, Amazon is better than many when it comes to not unexpectedly withdrawing products)

psaintla 11y ago

I would be less worried about that and more worried about cost. I know of two different startups that aren't profitable but would be if they hadn't put their entire platform on Amazon services. One of those startups was lucky enough to be acquired, but it's going to take them many unprofitable years to migrate away.

minimaxir 11y ago

So the pricing is $100 per million data points, at minimum. That doesn't seem like it scales well for big data at all.

However, that's 5x cheaper than what BigML is offering (https://bigml.com/pricing/credits) for its ad hoc service, so I might be wrong.

aficionado 11y ago

BigML cofounder here. Most BigML customers doing machine learning at scale use either BigML subscriptions (starting at $30/mo) or private deployments – both of which provide unlimited model training and predictions and are suitable for developers and large enterprises alike. In addition, with BigML you can export your models (for cluster analysis and anomaly detection and not just classification/regression) to run locally and/or to be incorporated in related systems and services.

discardorama 11y ago

Did they basically just put a wrapper around VW [1]?

[1] https://github.com/JohnLangford/vowpal_wabbit

mturmon 11y ago

No -- see https://aws.amazon.com/machine-learning/faqs/ --

"Q: What algorithm does Amazon Machine Learning use to generate models?

Amazon Machine Learning currently uses an industry-standard logistic regression algorithm to generate models."

But disappointingly:

"Q: Can I export my models out of Amazon Machine Learning?

No.

Q: Can I import existing models into Amazon Machine Learning?

No."

Note that they are doing classification and regression on iid feature vectors. Of course, ML is much larger than this setting, but this setting is generic enough that it has some applicability to lots of problems.

etrain 11y ago

This does not mean they are not using Vowpal Wabbit. It is very easy to run Vowpal Wabbit with a logistic loss function.

Also, vw is what I'd consider "industry standard."

Xorlev 11y ago

Or Weka.

BenoitP 11y ago

> at scale

I'd say Apache Spark

pinkunicorn 11y ago

I am really amazed at the kind of things Amazon turns into a service. And this ML service is just wowing. I have fiddled with basic SVMs before, but this takes away the part of writing code and makes it sort of an end-user product (you are still expected to know the basics of ML). On the other hand, I also don't think this will take off very well. Maybe a few companies/startups who have cash in their pocket will use it/try it out, but the audience is really limited beyond that in my opinion.

gallamine 11y ago

> Maybe a few companies/startups who have cash in their pocket will use it/try it out

Honestly, I'd see it the other way around. Small companies without a DS team might be drawn to this. I don't see how any company with a lick of sense would lock down their prediction model into AWS. They very clearly won't let you export your model once the training is done.

minimaxir 11y ago

Small companies without a DS team will likely fall into the ML pitfalls which make the resulting analysis invalid.

cfeduke 11y ago

> Maybe a few companies/startups who have cash in their pocket will use it/try it out

This would be really nice to use at my startup, but it's cost-prohibitive even on a very large budget.

I am setting up Spark Streaming to handle model creation and updates for recommendations based on what a user interacts with. If I were to even attempt something similar with this AWS service, it's $10 for every 1 million predictions, which isn't sustainable (not including the costs to create and update the model).

> but the audience is really limited beyond that in my opinion.

Definitely, largely as a result of cost. I would love to not have to worry about Spark in my infrastructure (it's another piece...) but at this price the AWS service is just too expensive.

Xorlev 11y ago

Worse than that, isn't it $0.10 per 1000? So $100 for 1 million predictions.
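
The arithmetic is quick to check. A small Python sketch, taking the $0.10 per 1,000 predictions rate quoted above as an assumption (it is not verified against Amazon's price list) and excluding any model training costs:

```python
def prediction_cost_cents(n_predictions, cents_per_thousand=10):
    """Cost in whole cents (rounded down), using integer math to avoid float rounding."""
    return n_predictions * cents_per_thousand // 1000


for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} predictions -> ${prediction_cost_cents(n) / 100:,.2f}")
```

At that rate, one million predictions come to $100, ten times the $10 figure mentioned in the parent comment.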

addisonj 11y ago

At first glance, this looks to go somewhat beyond Google's Prediction API, which (at least from my experience) is pretty limited in its usefulness.

It's nice to see tools for analyzing your data as well as multi-class classification and some tunable parameters, but this doesn't seem to bring anything 'new' to the game.

All the hard parts (feature selection, noise, unlabeled data, etc.) are still up to the end user, which makes me wonder how many people will try this out and get poor results.

It would be nice to get an idea of what sort of model they are using on the backend or even having a choice of models.

gallamine 11y ago

The system also uses logistic regression and is limited to a 100 GB dataset. Prediction with LR isn't that expensive, and training can be done online with something like stochastic gradient descent. That can be done on a single computer. Given that the models aren't exportable and you can't import a model, I'm hard pressed to see the immediate value. Long term, though, I'm sure there's plenty of growth.

huac 11y ago

It's kind of unclear, but it looks from the screenshots as if AWS is doing feature selection behind the scenes. But it seems that unless AWS does feature selection or model selection really efficiently behind the scenes, the cost of that extra work time is placed on the user.

alooPotato 11y ago

What differences did you notice beyond Google's Prediction API?

addisonj 11y ago

This may be different now, but when I used the Prediction API a few years ago, I don't remember it having any data analysis tools or multi-class classification. The UI was also pretty lacking. I haven't looked at it in a while, but perhaps it has gotten better?

aficionado 11y ago

Did anyone actually give it a try? I only get this error with any dataset (even a humble Iris): Amazon ML cannot create an ML model: 1 validation error detected: Value null at 'predictiveModelType' failed to satisfy constraint: Member must not be null

mloudon 11y ago

Go to the datasources tab and see if there's an error message from data source creation. I had the same error due to an issue with variable names.

saurabhtandon 11y ago

I like the "Introduction to Machine Learning" which sort of briefly outlines the basics of machine learning for people who don't know about it.

orionblastar 11y ago

I predict we will see more cloud-based machine learning services. Since machine learning is hard for the average person to learn and write, providing the services will greatly help them.

It would be good if there were an open source tool like LibreOffice that does machine learning in its spreadsheet app. It would be a good feature to add, and then the competitors would have to add it to their software as well.

chrischen 11y ago

Google's competing product: https://cloud.google.com/prediction/docs

sandstrom 11y ago

Cannot find it (in N. Virginia)? Is that only me?

(if anyone has the direct link for the console, please share :)

dbarlett 11y ago

https://console.aws.amazon.com/machinelearning/home?region=u...

sandstrom 11y ago

Thanks!

(weird, still doesn't show in the menu)

rcpt 11y ago

Some have already taken this kinda thing a few steps further:

http://www.automaticstatistician.com/
