Significant Pattern Mining for Time Series

In the midst of the deep learning hype, $p$-values might not be the hottest topic in data science. However, association mapping remains a fundamental tool to justify and underpin scientific conclusions. Inspired by an approach for time series classification based on predictive subsequences (i.e shapelets [1]), we developed S3M, a method that identifies short time series subsequences that are statistically associated with a class or phenotype while tackling the multiple hypothesis problem.

Why care?

Assume you want to understand the physiological response to artificial sweeteners in terms of insulin levels. You hypothesize that the metabolic response of people with high BMI differs from subjects with a low BMI. More precisely, you conjecture observing a sharp insulin increase followed by a slow decrease back to normal in one cohort, and a sharp increase followed by a steep decrease in the second group. In other words, your null hypothesis is: There is no relation between insulin clearance rate and BMI. Given the insulin concentration measured over time and the patient’s BMI group (high or low) our method can identify insulin trajectories (e.g. sharp drops) that are statistically association with the BMI.

A different biomedical motivation can be found in our publication.

Time series (TS) and their subsequences

Let us briefly formalize our problem: We are given a data set $\mathcal{D}=\{(T_0, y_0), \dots, (T_n, y_n)\}$, with $T_i \in \mathbb{R}^{1 \times m}$ and $y_i \in \{0,1\}$. In other words, our data set contains $n+1$ patients and their real–valued measurements of length $m$. Each patient belongs either to class $y=0$ (low BMI) or $y=1$ (high BMI).

Given a fixed subsequence length $w$, we can now extract $m-w+1$ subsequences from a single time series. A subsequence $S$, which originates from time series $T=\{t_0, …, t_{m-1} \}$, consists of the set of $w$ contiguous measurements $S=\{t_p, \dots, t_{p+w-1}\}$ where $p$ satisfies $0 \leq p \leq m-w$. Simply put, if we are interested in subsequences of length 10 and our measurements are 100 time points long, we cut our TS into 91 pieces, each of which is of length 10.

Time series distances

Let’s assume we have collected a set of subsequences from all patients in our data set. In order to set up a contingency table (which we will need for our statistical test), a measure of occurrence of a subsequence is needed. To do so, we first define the distance of two TS of length $m=|A|=|B|$ as their squared Euclidean distance:

$$dist_{euc}(A, B)=\sum_{i=0}^{m-1}|A_{i}-B_{i}|^2$$

Since we want to make a statement about whether a given subsequence $S$ of length $w$ occurs in a – possibly much longer – TS $T$, we need to define a distance $dist(S,T)$ where $|S| < |T|$. A straightforward approach to do so, is to slide $S$ over $T$ and at each position calculate $dist_{euc}$ over the aligning parts. We end up with a bunch of distances of which we choose the smallest one as our subsequence distance:

$$dist(S,T)=\min_{j}dist_{euc}(S, \{t_j, \dots, t_{j+w}\})$$

Figure 1 illustrates the distances we obtain when sliding a subsequence over a TS. Since the shown subsequence originates from the blue TS, their distance is $0$. To determine whether a subsequence is present in a time series, we can now set a distance threshold $\Theta$ and say

$$S \text{ occurs in }T\text{ if }dist(S,T)\leq \Theta.$$

Figure 1: Illustration of subsequence distances between a red subsequence and a much longer TS. At each position $dist_{euc}$ is calculated. If $dist_{euc}$ at any position is smaller than a given threshold, we say $S$ occurs in $T$. The subsequence turns green if this is the case.

Contingency tables

Now that we have a notion of subsequence distance and occurrence, we can set up a contingency table, and perform hypothesis tests such as $\chi^2$ or Fisher’s exact test. Table 1 illustrates the contingency table for a subsequence whose statistical association we want to test. Given a specific threshold $\Theta$, cells $a_S$ and $d_S$ count the occurrences of $S$ in all time series from subjects that exhibit the phenotype ($y=1$) and ones that do not ($y=0$), respectively. Vice versa, $b_S$ and $c_S$ count the non-occurrences.

Class label $dist(S,T) \leq \Theta$ $dist(S,T) > \Theta$ Row totals
$y=1$ $a_S$ $b_S$ $n_1$
$y=0$ $d_S$ $c_S$ $n_0$
Column totals $r_S$ $q_S$ $n$
Table 1: Contingency table as used in S3M.

For the sake of being self-contained, we can now calculate the $\chi^2$ test statistic: $$T_{\chi^2}(n,a_S, b_S, c_S, d_S)=\frac{n(a_Sc_S-b_Sd_S)^2}{(a_S+b_S)(c_S+d_S)(a_S+d_S)(b_S+c_S)}$$

and its $p$-value, using the CDF of a $\chi^2$-distribution with one degree of freedom $F_{\chi^2}$: $$p(S,\Theta)=1-F_{\chi^2}(T_{\chi^2}(n,a_S, b_S, c_S, d_S))$$

Choosing $\Theta$

As of now, we have not thought about how to choose a threshold at which we want to construct the contingency table. In theory, all real numbers are potential thresholds. To make our lives easier, we restrict ourselves to thresholds that have the potential of changing the cells of the contingency table and therefore the respective $p$-value. A simple way of thinking about this problem is to consider the number line below. Each rectangle represents a TS, its position the distance to a given subsequence, and its color class membership. The thresholds for each of which we construct and evaluate a contingency table are visualized as small black circles. We can think of the thresholds as cutoffs. Each cutoff will lead to a different contingency table, and potentially to a different $p$-value as shown in Figure 2.

Figure 2: Schematic illustration of time series distances along the number line. Each black dot represents a threshold at which we construct a contingency table and calculate a $p$-value. The entries of the contingency table are shown above the number line.

Multiple hypothesis testing

As we can see from Figure 2, for a single subsequence and a data set containing 13 TS, we perform 12 hypothesis tests. Going back to the scenario with TS of length 100, subsequence length $w=10$, and assuming a data set with 300 patients, we extract $300 \times 91=27{,}300$ subsequences. For each subsequence, a statistical test will be performed at $299$ thresholds. Hence, we end up with $299 \times 27{,}300=8{,}162{,}700$ hypotheses to be tested. Under a typical significance threshold of $\alpha=0.05$, we would expect to see $408{,}135$ incorrect rejections of the null hypothesis, a.k.a. false positives. For finding biomedically relevant temporal biomarkers, this is unacceptable.

One popular way of counteracting this problem is Bonferroni correction where we divide the significance threshold $\alpha$ by the number of performed tests. This, however, can lead to extremely conservative thresholds and dramatically reduce statistical power. We employ an iterative and less conservative approach to correct for this problem using Tarone’s trick [2] for discrete test statistics. This approach is based on the concept of the “minimal attainable $p$-value” $p_{min}$:

“If we know $r_S$ (e.g. how often $S$ occurs in the data set), $n_1$, and $n$, we can construct the contingency table that will lead to the smallest $p$-value this subsequence can ever reach $\rightarrow p_{min}$.”

Note, that we can calculate $p_{min}$ without knowing the actual occurrence in both classes ($a_S$ and $d_S$) and it can be calculated in a closed form. Now, if $p_{min}$ is larger than a given significance threshold, we can be certain that this subsequence will never be significant under this (and any lower) threshold and hence, can never lead to a false positive and does not contribute to the FWER. This means we do not have to correct for it. We call this subsequence-threshold pair an untestable hypothesis.

The difference this approach can make for identifying temporal biomarkers of vital signs in patients with sepsis is shown in Table 2 of our manuscript.

Wrapping it up

  • We developed a method that allows for association mapping of time series subsequences and clinical/biological phenotypes.
  • We maintain statistical power by utilizing an iterative approach to correct for the problem of multiple hypothesis testing.
  • While we were motivated by a clinical problem, S3M can be applied to time series from any domain.

One last remark: In our initial experiments, we observed that choosing subsequences of high statistical significance for classification can also yield good predictive performance. However, the motivation for studying statistical significance vs. optimizing classification performance is driven by fundamentally different inducements.

References

I only list the most fundamental publications in this blog post, other relevant manuscripts are cited in our publication.

  1. Mueen, A. et al. (2011). Logical-Shapelets: An expressive primitive for time series classification. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, USA, pp. 1154–1162.
  2. Tarone, R.E. (1990) A modified Bonferroni method for discrete data. Biometrics, 46, 515–522.