Want to use AI in Time-Series? Avoid this common mistake!

Victor Silva, M. Sc.
Published in Analytics Vidhya
3 min read · Dec 30, 2020

In this post I will discuss the most common mistake made when performing Machine Learning tasks on time-series data. I will base my recommendations on the time-series literature and on my personal experience as an ML researcher.

Photo by Benjamin Ranger on Unsplash

Time-series data

Working with time-series data is challenging and requires quite specific skills. Besides mastering Machine Learning, you should grasp statistics and understand the domain that the time series belongs to. It is important to know whether the series exhibits seasonality, noise and trends. Some popular domains that produce time-series data are retail sales and finance [1]. For example, one might collect information about the sales of a given product (or lots of them!), or track the price of a given security in the financial markets.

When producing time-series forecasts, we can choose from a wide range of algorithms and techniques. One can use regression to forecast future values, or use binary classification to predict whether future values will go up or down. It is also possible to learn distributions, patterns and relationships between values at different timestamps.
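To make the difference between the two framings concrete, here is a minimal sketch of how the targets could be built from the same series. The function name make_targets, the horizon parameter and the sample prices are my own illustrative assumptions, not taken from the references.

```python
from typing import List, Tuple


def make_targets(series: List[float], horizon: int = 1) -> Tuple[List[float], List[int]]:
    """Build targets for every timestamp t that has a value at t + horizon.

    Hypothetical helper, for illustration only.
    Regression target: the future value series[t + horizon].
    Direction target: 1 if that future value is higher than series[t], else 0.
    """
    regression = []
    direction = []
    for t in range(len(series) - horizon):
        future = series[t + horizon]
        regression.append(future)
        direction.append(1 if future > series[t] else 0)
    return regression, direction


# Made-up prices, for illustration only.
prices = [10.0, 10.5, 10.2, 10.2, 11.0]
reg_y, dir_y = make_targets(prices, horizon=1)
print(reg_y)  # [10.5, 10.2, 10.2, 11.0]
print(dir_y)  # [1, 0, 0, 1]
```

Note that the last observation has no target, because its future value is not known yet. This sketch also treats an unchanged value as "not up", which is one possible convention among several.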

Don’t let your data leak! — Photo by Anandan Anandan on Unsplash

Machine Learning and Time-Series

A common practice when applying Machine Learning to any kind of data is to divide the data into training, validation and test sets [2]. With time-series data, this split is probably one of the main challenges. If it is not done properly, there is a risk of data leakage. Data leakage is the phenomenon of future data (test) permeating into the past data (training) and vice versa. For example, shuffling the observations before splitting places future observations in the training set. When data leakage happens, the Machine Learning process actually sees (at least partially) the data that is supposed to be reserved for the test phase. This invalidates the evaluation, since the forecasts appear more accurate than they would be on data the model has truly never seen.
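One simple way to avoid this kind of leak is to split by time instead of at random. The sketch below is only an illustration under my own assumptions: the helper chronological_split and its parameters n_val and n_test are hypothetical names, and the observations are assumed to be already sorted from oldest to newest.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    observations: Sequence[T], n_val: int, n_test: int
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered observations into train/validation/test sets.

    Hypothetical helper, for illustration only. Nothing is shuffled:
    the oldest observations go to training, the most recent to test.
    """
    n_train = len(observations) - n_val - n_test
    if n_val < 0 or n_test < 0 or n_train <= 0:
        raise ValueError("n_val and n_test must leave at least one training observation")
    train = list(observations[:n_train])
    val = list(observations[n_train:n_train + n_val])
    test = list(observations[n_train + n_val:])
    return train, val, test


# Ten consecutive days, oldest first (made-up example).
days = list(range(10))
train_set, val_set, test_set = chronological_split(days, n_val=2, n_test=2)
print(train_set)  # [0, 1, 2, 3, 4, 5]
print(val_set)    # [6, 7]
print(test_set)   # [8, 9]
```

Every training observation is older than every validation observation, which in turn is older than every test observation. If features are computed from windows of past values, the windows near each boundary may still overlap, so this split alone does not rule out every form of leakage.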

This concern is raised by López de Prado [3] in several parts of his book on financial Machine Learning. He highlights that several papers (and even books) have been published despite containing data leaks. When this error is corrected, many (or all) of the discoveries are found to be false. Next time you are working with time-series, prevent data leaks so that leaked data does not bias your model!
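To see how a leak can inflate accuracy, here is a small synthetic experiment of my own (it is not taken from the references). It assumes a Gaussian random walk as the series and a deliberately naive model that predicts each test value with the training value closest in time. The names nearest_neighbour_mae, leaky_mae and clean_mae are illustrative.

```python
import random
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


def nearest_neighbour_mae(train: List[Point], test: List[Point]) -> float:
    """Predict each test value with the training value closest in time.

    Returns the mean absolute error of those predictions.
    """
    total = 0.0
    for t, actual in test:
        _, predicted = min(train, key=lambda point: abs(point[0] - t))
        total += abs(actual - predicted)
    return total / len(test)


# Synthetic random walk (an assumption, for illustration only).
rng = random.Random(0)
value = 0.0
walk: List[Point] = []
for t in range(1000):
    value += rng.gauss(0.0, 1.0)
    walk.append((t, value))

# Leaky split: shuffling first lets future points land in the training set.
shuffled = walk[:]
rng.shuffle(shuffled)
leaky_train, leaky_test = shuffled[:800], shuffled[800:]

# Clean split: the last 200 points are held out and never seen in training.
clean_train, clean_test = walk[:800], walk[800:]

leaky_mae = nearest_neighbour_mae(leaky_train, leaky_test)
clean_mae = nearest_neighbour_mae(clean_train, clean_test)
print(f"MAE with shuffled (leaky) split: {leaky_mae:.2f}")
print(f"MAE with chronological split:    {clean_mae:.2f}")
```

With the shuffled split, almost every test point has a training neighbour one step away, so the error looks small. With the chronological split, the same model can only reuse the last training value, and the error is typically much larger. The model did not get worse; the first number was simply too optimistic.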

I am writing daily knowledge pills so you can keep your knowledge sharp! I write about Data Science and Machine Learning at their intersection with Finance. Feel free to connect and check out my other articles!

References

[1] Ahmed NK, Atiya AF, Gayar NE, El-Shishiny H. An empirical comparison of machine learning models for time series forecasting. Econometric Reviews. 2010 Aug 30;29(5-6):594-621.

[2] Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. New York: Springer series in statistics; 2001.

[3] De Prado ML. Advances in financial machine learning. John Wiley & Sons; 2018 Feb 21.

Victor Silva, M. Sc.

Data Science | Finance | Machine Learning | Ethics | I hold two M.Sc. in Computer Science and I’m a PhD Researcher at the University of Alberta.