Project CEMAPRE internal
Title | Classification and clustering of time series with data-driven fragmented statistics |
Participants | Jorge Caiado (Principal Investigator), Nuno Crato |
Summary | The classification and clustering of time series involve defining a relevant metric, employing machine learning algorithms or specific clustering techniques, and interpreting the results to gain insight into the commonalities, structures, and patterns underlying the time series data. Many feature-based methods have been developed to address the problem of clustering noisy raw time series data. Methods based on features extracted in the time domain, in the frequency domain, and from wavelet decompositions of the time series are discussed in the literature (Maharaj, D’Urso and Caiado, 2019). These involve extracting autocorrelation, partial autocorrelation, cross-correlation, and periodogram-ordinate features from the time series in order to compute distance metrics. Both the autocorrelation function (ACF) and the periodogram of a given time series describe its linear dependence structure, and hence they are a good representation of the dynamics of many real time series. For this purpose, though, it is crucial to identify the relevant autocorrelation lags or the determinant frequencies that contribute to the discriminative power for classifying different time series. Along these lines, two successful approaches are the fragmented periodogram method proposed by Caiado, Crato, and Poncela (2020) and the fragmented autocorrelation method proposed by Albino, Caiado, and Crato (2024). The first uses the periodogram only around the main driving frequencies of the time series; the second uses the ACF around specific lags of interest for clustering. While effective when the data-generating process is known, these methods may be less reliable when the structure of the time series is unknown. To overcome this limitation, we propose to develop metrics that calculate the distance between time series using only their significant periodogram ordinates or significant autocorrelations. 
This entails defining a significance threshold to retain the relevant frequencies and autocorrelations and to filter out the noise. For this purpose, we propose to extend the theory of both methods to incorporate data-driven fragmentation, to conduct a simulation study with time series generated by linear models (ARMA, ARIMA, and SARIMA), and to illustrate the concept using real economic and financial time series. References: Albino, A., Caiado, J. and Crato, N. (2024). “Big-data time series clustering using fragmented autocorrelations”. Working paper. Caiado, J., Crato, N. and Poncela, P. (2020). “A fragmented-periodogram approach for clustering big data time series”. Advances in Data Analysis and Classification, Vol. 14, pp. 117–146. Maharaj, E.A., D'Urso, P. and Caiado, J. (2019). Time Series Classification and Clustering. CRC Press, Taylor & Francis Group, United States. |
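
The idea of a distance built only from significant autocorrelations can be sketched as follows. This is a minimal illustration under stated assumptions, not the method the project will develop: the function names (`sample_acf`, `significant_acf_distance`, `simulate_ar1`) are hypothetical, the threshold is the usual approximate 95% white-noise band of 1.96/sqrt(n), and a lag is retained when it is significant in at least one of the two series. The AR(1) series are simulated only to have something to compare.

```python
import math
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a sequence."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    denom = sum(v * v for v in d)
    return [sum(d[t] * d[t + k] for t in range(n - k)) / denom
            for k in range(1, max_lag + 1)]


def significant_acf_distance(x, y, max_lag=20, z=1.96):
    """Euclidean distance between two ACFs restricted to significant lags.

    Assumption: a lag counts as significant when |r_k| exceeds z / sqrt(n)
    in at least one of the two series (approximate white-noise band).
    Returns the distance and the list of retained lags.
    """
    rx = sample_acf(x, max_lag)
    ry = sample_acf(y, max_lag)
    bound_x = z / math.sqrt(len(x))
    bound_y = z / math.sqrt(len(y))
    lags = [k + 1 for k in range(max_lag)
            if abs(rx[k]) > bound_x or abs(ry[k]) > bound_y]
    dist = math.sqrt(sum((rx[k - 1] - ry[k - 1]) ** 2 for k in lags))
    return dist, lags


def simulate_ar1(phi, n, seed):
    """Simulate x_t = phi * x_{t-1} + e_t with standard normal noise."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x


# Two series from the same process and one from a different process.
a = simulate_ar1(0.8, 500, seed=1)
b = simulate_ar1(0.8, 500, seed=2)
c = simulate_ar1(-0.5, 500, seed=3)

d_ab, lags_ab = significant_acf_distance(a, b)
d_ac, lags_ac = significant_acf_distance(a, c)
print("distance a-b:", d_ab, "retained lags:", lags_ab)
print("distance a-c:", d_ac, "retained lags:", lags_ac)
```

The same pattern would apply to the periodogram by replacing the sample ACF with periodogram ordinates and the white-noise band with a suitable significance rule for those ordinates; the choice of that rule is exactly what the proposal leaves to be developed.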