There's a concept in animal behaviour called a 'home range' which is more or less the same thing - GPS collars attached to tigers in the wild etc. Some of the algorithms there are quite interesting, from simply drawing a bounding box around the data points, to working out the probability density, to things like LoCoH, which sort of recursively builds up convex hulls from nearest neighbours.
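The simplest of those, the bounding box, is easy to sketch. Here's a minimal version in Python with made-up GPS fixes (the coordinates are purely illustrative):

```python
# Minimal sketch of the simplest home-range estimate: an axis-aligned
# bounding box around a set of GPS fixes (lat, lon pairs).
def bounding_box(points):
    """Return ((min_lat, min_lon), (max_lat, max_lon)) for a list of fixes."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (min(lats), min(lons)), (max(lats), max(lons))

# Hypothetical fixes for one animal.
fixes = [(51.50, -0.12), (51.52, -0.10), (51.49, -0.15), (51.51, -0.11)]
print(bounding_box(fixes))  # ((51.49, -0.15), (51.52, -0.1))
```

The density and LoCoH approaches build on the same input but are considerably more involved.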
All of these things are pretty much possible in SQL to one degree of performance or another. But ultimately I'm fascinated by things like SQL Server's R support - you can get far simpler, more natural implementations of these things in R (or indeed in custom aggregates or functions in other languages). I think in the long term, database engines that offer this sort of extensibility are going to thrive for analytics work, be they SQL based or otherwise.
* They call every user a "data scientist". This is fun and no doubt deserved for R experts, but it's going to scare a hell of a lot of potential users away.
* The installation notes give you a lot of steps but zero insight into why you're doing them. Install these two packages; okay, but what exactly do they do? We need to know this kind of thing. R people might know...
* There's very little troubleshooting documentation, either for the installation/initial setup (and there are a LOT of bugs in getting it going; trust me, I spent most of last week digging through them) or for how, once it's operational, it's going to be tracked and managed within the context of everything else going on in the SQL instance: the memory pools, wait states, resource governor pools, etc. We need to know this stuff. Otherwise we see a server that gets reported as slow, we know it uses R, but we don't know where to look to determine whether R is the cause or what we can do about it.
* AFAIK it's single-threaded. Considering most places have super-downsized their CPUs and are going for massively threaded servers these days (which is arguable given Microsoft's 2012+ per-core licensing model; but I'm talking about servers in very big enterprises that have unlimited licensing agreements), that's not going to end well. It's very possible this will be fixed before RTM though.
* And finally, because they bought Revolution Analytics and rebranded it as Microsoft R only a few days ago, and it's now split into SQL and non-SQL products, there's going to be a lot of confusion and a long wait for appropriate training materials to catch up for all of us non-R users.
* I haven't been impressed with the current tutorial materials. As non-R users we're going to need a lot more to really understand how to use it. I'm sure there are fantastic R resources available, but if you want to understand it coming from a Microsoft SQL Server background it's a slightly different story...
We used a custom PowerBI visualization (d3 + TypeScript). For dev purposes I just averaged data with some simple SQL queries, but the real thing used (I believe) 10k datapoints for each player over the course of the match.
Also, none of those queries will work on MySQL.
jplitz and ajones are right. As of now we support MySQL, Postgres, Amazon Redshift, Vertica, SQL Server, Oracle, MemSQL, Sybase, Exasol and Google BigQuery. We add more all the time.
I've never used Periscope but have followed and read the blog for a while. Always great content.
The code would work on most engines that support CTEs, with some changes to the type casting syntax.
Thus most except MySQL.
The blog content is good though.
That said, I do have one major gripe with their sales process: Their free trial lasts 7 days. There is no way a company can evaluate a completely new BI system in a week. I begged and pleaded and they would not extend the trial.
I'm not just saying this: I really do think we would have bought it if we could have had 30 days. I just can't build enough organizational momentum to spend money on a new system in 7 days.
What are credit card companies doing? What's the best way to combine multiple variables that are predictive of an event for outlier detection? Is there a simple framework to automate reporting of these events in real-time?
One way to model this might be to treat the outcome ("Is this event an outlier?") as a 0/1 variable and use one of the many ways to model that type of data: random forests, logistic regression, neural networks, etc. The problem is that this isn't really "outlier" detection anymore.
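To make that 0/1 framing concrete, here's a toy sketch: a tiny logistic regression fit by gradient descent on labelled examples. The "transaction amounts" and labels are entirely made up, and in practice you'd reach for a real library rather than hand-rolling this:

```python
import math

# Sketch of the "treat outlier-ness as a 0/1 label" framing:
# a one-feature logistic regression fit by stochastic gradient descent.
def fit_logistic(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x   # gradient step on the log-loss
            b += lr * (y - p)
    return w, b

def predict(w, b, x):
    """Probability that x is an 'outlier' under the fitted model."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Hypothetical data: small amounts labelled normal (0), large ones outlier (1).
xs = [1.0, 2.0, 3.0, 10.0, 12.0, 15.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(predict(w, b, 2.0), predict(w, b, 11.0))
```

The catch, as above, is that this needs labelled outliers up front, which is exactly what you usually don't have.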
I don't know what you mean by combining multiple variables. Do you mean analysis methods that work with multiple variables (instead of 1-dimensional z-scores), or methods that combine multiple variables into one to reduce input dimensions (e.g. principal component analysis)?
Because data and data reporting platforms are so different across companies, there's no 'simple framework' to do reporting. You probably want something like https://github.com/etsy/skyline.
You also describe an ensemble method for outlier detection, which is what Skyline uses. I want to note that there is no reason to consider ensembles "not" outlier detection.
The big caveat with the standard deviation technique is that it assumes a normal distribution. Many datasets are not actually distributed normally (power-law, Poisson, beta, etc, etc) and so the technique won't work. It's a much harder problem to 'generically' detect outliers without knowledge of the underlying distribution.
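For reference, the standard-deviation technique being discussed looks something like this (the data and the threshold `k` are arbitrary; the point is that the rule only behaves sensibly when the data are roughly normal):

```python
import statistics

# The classic z-score rule: flag anything more than k standard
# deviations from the mean. Implicitly assumes a normal distribution.
def zscore_outliers(values, k=3.0):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]  # hypothetical
print(zscore_outliers(data, k=2.0))  # [95]
```

On a power-law or heavily skewed dataset the mean and standard deviation are themselves dragged around by the extreme values, so this rule either misses outliers or flags perfectly normal points.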
I don't have any idea how to do it (though a former colleague came up with nice idea of building a histogram and searching for values that occurred after some number of empty bins, implying an outlier). Is there an accepted state-of-the-art for general-purpose outlier detection? Or is that such a broad question as to be meaningless?
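That histogram-gap idea can be sketched roughly like this (bin width, gap size, and the data are all arbitrary choices here, not anything from the original suggestion):

```python
# Sketch of the histogram-gap heuristic: bucket the values, then treat
# anything that appears after a run of empty buckets as an outlier.
def gap_outliers(values, bin_width=1.0, min_gap=3):
    lo = min(values)
    bins = {}
    for v in values:
        bins.setdefault(int((v - lo) // bin_width), []).append(v)
    empty_run = 0
    for i in range(max(bins) + 1):
        if i in bins:
            if empty_run >= min_gap:
                # Everything from this bin onward sits past a big gap.
                return sorted(x for j, vs in bins.items() if j >= i for x in vs)
            empty_run = 0
        else:
            empty_run += 1
    return []

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]  # hypothetical
print(gap_outliers(data))  # [95]
```

It's distribution-free in the sense that it never computes a mean or variance, though the bin width and gap threshold still have to be tuned to the data.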
Here's an example - the expected variance for the number of SWIFT payments processed during non-banking hours is 0. A transaction count greater than 0 is an outlier.
I think the article should start by providing code that validates that the values you're going to feed the function actually fit a normal distribution well enough to make the outlier detection worthwhile. Is that possible?
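One rough, library-free way to do that kind of pre-check is to look at sample skewness and excess kurtosis, both of which should be near zero for normal data. The thresholds below are arbitrary, and a proper test (Shapiro-Wilk, D'Agostino's K², etc.) would be more rigorous; this is just a sketch of the idea:

```python
import statistics

# Crude normality pre-check: a normal sample should have skewness
# near 0 and excess kurtosis near 0. The cutoffs here are arbitrary.
def looks_normalish(values, max_skew=1.0, max_kurt=2.0):
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3.0
    return abs(skew) <= max_skew and abs(kurt) <= max_kurt

symmetric = [8, 9, 9, 10, 10, 10, 11, 11, 12]   # hypothetical
skewed = [1, 1, 1, 1, 1, 2, 2, 3, 50]           # hypothetical
print(looks_normalish(symmetric), looks_normalish(skewed))  # True False
```

If the check fails, a standard-deviation-based outlier rule probably isn't the right tool for that dataset.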
Trying to pick out "bad" stats from the gigabytes of information I gather from the servers I manage is always a top-of-the-list thing that I want to do.
But when we're talking about 95%, well, it's fine for showing on a graph, but how can you run alerting on something like that?