Project CEMAPRE internal
|Title||Classification and clustering of big data time series with spectral measures|
|Participants||Jorge Caiado, Nuno Crato (Principal Investigator)|
|Summary||The big data revolution is now offering researchers and analysts new possibilities and new|
challenges. This is particularly true with time series, as for many domains we now have access to
very long time series and to many time series related to a given domain of interest. This happens in
areas as diverse as astronomy, geophysics, medicine, social media, and finance.
In astronomy, for instance, we now have long and diverse series of star magnitude and spectra,
radio-astronomy signals, asteroid position measurements, and other records. In medicine, we have
very long and multiple records of physical activity indicators, heart rate, and other biological
features. In social media and social studies, we have long records of human interactions, from
administrative data to internet activities. In finance, we have tick-by-tick data of asset prices from
many markets and firms.
The diversity and length of the data available to researchers lead to particular challenges when
comparing and clustering time series. For these tasks it is usually not feasible to use traditional
methods of analysing, estimating models, and comparing features, as these methods require computing
and inverting extremely large matrices.
We propose and will investigate a spectral method of synthesizing and comparing time series
characteristics which is nonparametric and focused on the cyclical features of the data. Instead of
using all the information available from the data, which is computationally very expensive, this
procedure uses regularization rules to select and summarize the most relevant information for
clustering purposes. The method does not require computing the full periodograms, but only
the periodogram components around the frequencies of interest. It then compares these
periodogram ordinates across the various time series and groups the series with common clustering
methods. We call it a fragmented-periodogram approach.
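
The idea can be illustrated with a minimal sketch in Python. It is not the project's implementation: the function names, the grid parameters (half_width, n_points), the normalisation of each fragment vector, and the choice of average-linkage hierarchical clustering are all assumptions made for illustration. The sketch evaluates the periodogram only at a few frequencies around each frequency of interest, then clusters the series on those ordinates.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def fragmented_periodogram(x, freqs, half_width=0.01, n_points=5):
    """Periodogram ordinates evaluated only near the frequencies of interest.

    freqs are in cycles per observation. half_width and n_points are
    hypothetical tuning parameters defining a small grid around each one.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the mean before transforming
    n = x.size
    t = np.arange(n)
    grid = np.concatenate(
        [np.linspace(f - half_width, f + half_width, n_points) for f in freqs]
    )
    # Direct Fourier transform at the selected frequencies only,
    # so the cost grows with n * len(grid) and no full periodogram is built.
    dft = np.exp(-2j * np.pi * np.outer(grid, t)) @ x
    return np.abs(dft) ** 2 / n


def cluster_series(series, freqs, n_clusters=2):
    """Group series by the similarity of their periodogram fragments."""
    feats = np.array([fragmented_periodogram(s, freqs) for s in series])
    # Assumed normalisation: each fragment vector sums to one, so that
    # differences in scale between series do not drive the distances.
    feats = feats / feats.sum(axis=1, keepdims=True)
    tree = linkage(feats, method="average", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(2000)
    # Simulated data: three series with a cycle at 0.10, three with a cycle at 0.25.
    demo = [np.sin(2 * np.pi * 0.10 * t) + 0.5 * rng.standard_normal(t.size) for _ in range(3)]
    demo += [np.sin(2 * np.pi * 0.25 * t) + 0.5 * rng.standard_normal(t.size) for _ in range(3)]
    print(cluster_series(demo, freqs=[0.10, 0.25], n_clusters=2))
```

Because each series is reduced to a short vector of ordinates, the clustering step works on a small feature matrix regardless of the length of the original series. Series of different lengths can be compared as long as the same frequencies of interest are used for all of them.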