The sliding window approach is not adding random noise to the clustering; it is adding redundantly similar data, which results in almost perfectly sine-wave-like cluster centers.
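This is easy to reproduce. Here is a minimal sketch (assuming numpy and scikit-learn; the parameter choices are arbitrary) that clusters the sliding windows of a structureless random walk and still gets smooth, sinusoidal centers:

    # k-means on sliding windows of a pure random walk: the input has no
    # repeated structure, yet the cluster centers come out sine-like.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    t = np.cumsum(rng.standard_normal(1024))  # random walk
    w = 32                                    # window length

    # Matrix of all overlapping windows; row p is the window starting at p.
    S = np.lib.stride_tricks.sliding_window_view(t, w)

    centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(S).cluster_centers_
    # Plot centers.T: smooth, roughly sinusoidal shapes despite the
    # structureless input.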
I come from a genomics background, so when I read about sliding windows on time series data, I think of the sliding window approach applied to regions of genome sequences.
The ultimate point of doing this clustering is to find repetitive sub-sequences in the series. This is also something we commonly do in genomics (the RepeatMasker/RepeatModeler software, for example).
Something we can look at in genomes is "k-mer coverage". A k-mer is just a length-k sub-sequence over the alphabet (A, T, G, C), analogous to a length-k sub-sequence of a time series.
By scanning the genome, we can tally up how many times each k-mer appears. With this tally, we can compute a k-mer coverage for a region of the genome, which gives us an idea of how many times we see this same region in the rest of the genome.
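As a toy illustration, the tally might look like this in Python (the function name is my own, not from any particular tool):

    # Count every k-mer in the genome, then report, for each position,
    # how often its k-mer occurs genome-wide.
    from collections import Counter

    def kmer_coverage(genome: str, k: int) -> list[int]:
        kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
        counts = Counter(kmers)              # one pass tallies all k-mers
        return [counts[km] for km in kmers]  # per-position coverage

    # Coverage >> 1 means the sub-sequence at that position repeats
    # elsewhere in the genome.
    print(kmer_coverage("ATGATGATGC", 3))    # [3, 2, 2, 3, 2, 2, 3, 1]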
Maybe this approach can be adapted somehow? The only problem is that in genomics we have four discrete classes (A, G, T, C), which makes exact matching relatively easy. In time series we have continuous data, which makes matches much harder both to find and to define.
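One standard adaptation is to quantize the continuous values into a small alphabet so that exact matching applies again (SAX does something like this with Gaussian breakpoints; the quantile binning below is a cruder sketch of the same idea):

    # Quantize a continuous series into symbols so k-mer counting works.
    import numpy as np

    def discretize(series, alphabet="abcd"):
        # Quantile breakpoints give each symbol roughly equal frequency.
        edges = np.quantile(series, np.linspace(0, 1, len(alphabet) + 1)[1:-1])
        return "".join(alphabet[i] for i in np.digitize(series, edges))

    rng = np.random.default_rng(1)
    print(discretize(rng.standard_normal(20)))  # a 20-symbol string; exact
                                                # k-mer matching now applies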
The difference, I think, is that the sequence in a genome is lexicographic rather than temporal: i.e. which end of a particular strand you read from is irrelevant. A time series, on the other hand, has an inherent ordering. By analogy, a time series is a directed graph; a genome is undirected.
That is, the algorithm for finding a subsequence in a time series can be used for finding a subsequence in a genome, but the meaning of a subsequence is different. It's:
A - G - A - C
versus
C -> A -> G -> A.
Time series data contains implication and causality. Any claim about time series data is implicitly a claim about causality: e.g. we might claim a time series appears to be random. We don't really talk about a genome's nucleotide sequence being random, because the causality (other than the trivial case of analogues) lies outside the sequence: the reason for T - A - C is assumed to lie outside of T - A - C. [1]

[1]: http://plato.stanford.edu/entries/kant-hume-causality/#TimDe...
"As the sliding window passes by, the datapoint first appears as the rightmost value in the window, then it goes on to appear exactly once in every possible location within the sliding window. So the t_i datapoint contribution to the overall shape is the same everywhere..."
"Another way to look at it is that every value v_i in the mean vector, 1 ≤ i ≤ w, is computed by averaging essentially every value in the original time series; more precisely, from t_i to t_m-w+i . So for a time series of m = 1024 and w = 32, the first value in the mean vector is the average of t[1..993]; the second value is the average of t[2…994], and so forth. Again, the only datapoints not being included in every computation are the ones at the very beginning and at the very end, and their effects are negligible asymptotically."
(he does have interesting stuff here: http://www.cs.ucr.edu/~eamonn/selected_publications.htm)
I was under the impression this work was better known, though. It's surprising not to see it mentioned, considering how long DTW has been around.
Like assuming your data come from a normal distribution? Box guides us: 'all models are wrong, but some are useful.'
> While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.
Ok, I would love to see a simple illustration here :)
For example, think about using regression instead of clustering. Your design matrix is still going to be S from the paper, where the p-th row of S is the p-th "iteration" of the window (as it is slid from the start to the end of the data).
So column 1 of the p=1 window will trivially be perfectly correlated with column 2 of the p=0 window and column 0 of the p=2 window, and so on until that data point has slid out of the window. So in some sense, past observations (earlier columns in S) "cause" later observations (later columns in S) purely through the sliding mechanism.
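A sketch of that collinearity, assuming numpy (column j of row p is datapoint t[p+j], so shifted column pairs are identical):

    # In the sliding-window design matrix S, S[p, 1] and S[p+1, 0] are
    # literally the same datapoint t[p+1]: textbook multicollinearity.
    import numpy as np

    t = np.random.default_rng(3).standard_normal(256)
    S = np.lib.stride_tricks.sliding_window_view(t, 8)  # rows are windows

    assert np.array_equal(S[:-1, 1], S[1:, 0])          # same datapoints
    print(np.corrcoef(S[:-1, 1], S[1:, 0])[0, 1])       # exactly 1.0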
In regression this is a well-explored problem: multicollinearity, and also distorted conditional effect sizes when causal relationships are not separated into separate equations or corrected by adding dummy variables for the levels of one covariate conditional on others.
It's not at all surprising that something similar would happen when clustering.
Sadly, it's also not that surprising that the research community at large doesn't care too much and is happy to crank out papers with this issue anyway. It's similar to data mining for statistical significance, and indeed machine learning is not at all a silver bullet for getting around such problems.
One approach that might help reduce the problem is to randomly sub-sample windows. It would be a lot of work, and probably computationally costly, but you could in principle devise a sampling scheme where you jointly sample all of the sub-windows at once, in such a way as to ensure certain bounds on the total amount of overlap between them, probably using some sort of Metropolis-like scheme. It's not clear to me whether this is worthwhile, and I tend to prefer the idea of quantizing the signal in some way to make it a lexicographic problem instead.
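A joint Metropolis-style sampler is beyond a comment, but a greedy rejection filter makes the idea concrete (the function name and bound are mine; just a sketch):

    # Draw window start positions at random, rejecting any window that
    # overlaps an already-accepted one by more than `max_overlap` points.
    import numpy as np

    def sample_starts(m, w, n, max_overlap, rng):
        accepted = []
        for p in rng.permutation(m - w + 1):
            # Overlap of [p, p+w) with window [q, q+w) is max(0, w - |p-q|).
            if all(w - abs(p - q) <= max_overlap for q in accepted):
                accepted.append(p)
            if len(accepted) == n:
                break
        return sorted(accepted)

    rng = np.random.default_rng(4)
    print(sample_starts(m=1024, w=32, n=20, max_overlap=8, rng=rng))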
Also note that the problem is a lot less worrisome in supervised learning, like training a decision tree or SVM on overlapping sub-windows. The reason is that you (presumably) have control over the labels. So you can ensure that two highly correlated sub-samples (like window p and window p+1) get the same label most of the time when it matters (e.g. when you're right on top of the interesting phenomenon you want to classify).

Crucially, this helps because you can ensure not only that two windows which differ by a slight slide get the same label, but also that they get the same label as other instances of the same phenomenon elsewhere in the signal, perhaps in a sub-window far away. It means you are clamping the positive training samples to reflect the signal properties you want, rather than hoping the clustering will intrinsically pick that as the notion of similarity it uses when classifying samples (which, as the paper points out, you have no reason to expect it will).
But it will still suffer somewhat from subjective choices about when, exactly, the window has slid far enough away from the labeled phenomenon to have "rolled off" enough that it should get the other label (in a binary case). What's bad is that however this gets codified into the training set, it needs to be treated as part of the experimental design and accounted for when reporting anything like statistical significance later on. But because this choice, effectively a "roll-off" parameter, is not written down or made an explicit part of the model anywhere, most people just ignore it, or don't think about the effect it might have on the result.
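One way to at least surface the choice is to write the labeling rule down as code with the roll-off as a named parameter; a hypothetical sketch (the names and the threshold rule are mine):

    # Label a window positive only if it overlaps the annotated event by at
    # least `rolloff` (a fraction of the window length). The threshold is
    # now an explicit, reportable part of the experimental design.
    import numpy as np

    def label_windows(starts, w, event_start, event_end, rolloff=0.5):
        starts = np.asarray(starts)
        overlap = (np.minimum(starts + w, event_end)
                   - np.maximum(starts, event_start)).clip(min=0)
        return (overlap >= rolloff * w).astype(int)

    # Windows sliding across an event spanning [100, 140), w = 32:
    print(label_windows(np.arange(60, 160, 8), 32, 100, 140))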