The dangers of stock-price forecasting blogposts

Identify and avoid common mistakes

Victor Silva, M. Sc.
3 min read · Feb 8, 2021

I will keep this short and simple.

Be suspicious of every post you see about forecasting stock prices. It does not matter who wrote it or where it was published. If you do not understand it, do not follow it.

Allow me to explain my reasoning.

Beware of the copy-paste practitioner — Photo by Jason Briscoe on Unsplash

At this point I must have read a couple hundred papers, books and blogposts about time series, and especially about stock price forecasting. By far, blogs by unseasoned practitioners are where I find the most naïve mistakes. Here are some quick things you should look out for when reading such literature.

Financial Data is NOT like any other Machine Learning data set

Stock data has two essential characteristics that make it unique, and especially easy to make a mess of: time and memory [1]. If you, like me, have read a lot of posts that promise to predict stock prices, you will notice that, whatever the technique, the vast majority of the forecasts look like a lagged moving average. In some extreme posts, the forecasts look like the exact time series with a one-step lag.
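A quick way to check for this pattern is to compare a forecast against the naive "persistence" forecast, which simply repeats the previous observed value. The sketch below is only an illustration: the prices and the model_forecast values are made up, and the helper names are hypothetical, not taken from any library or from the posts discussed here.

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series):
    """Forecast each value as the previous observed value (one-step lag)."""
    return series[:-1]


# Made-up closing prices, for illustration only.
prices = [100.0, 101.5, 101.0, 102.2, 103.0, 102.5, 104.1]

# Hypothetical output of some model, one forecast for each of prices[1:].
model_forecast = [100.2, 101.3, 101.1, 102.0, 102.9, 102.6]

actual = prices[1:]
baseline = persistence_forecast(prices)

model_mae = mean_abs_error(actual, model_forecast)        # model vs. reality
baseline_mae = mean_abs_error(actual, baseline)           # lag-1 vs. reality
distance_to_lag = mean_abs_error(model_forecast, baseline)  # model vs. lag-1

print(f"model MAE:        {model_mae:.3f}")
print(f"persistence MAE:  {baseline_mae:.3f}")
print(f"model vs. lagged: {distance_to_lag:.3f}")
```

If the forecast sits much closer to the lagged series than to the actual values, as it does in this made-up example, it is probably little more than yesterday's price.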

Data Leakage

If you ever see a prediction that is too close to the actual series or resembles a moving average, you should suspect that data is leaking between the past and the future. Why is that? Many Machine Learning practitioners learn that they should split the data into training and testing sets. They do that by selecting data X_train up to time t, then data X_test from time t onward. The problem is that each sample is built from a window of several consecutive time steps, so the samples on either side of t contain data from each other. So, fundamentally, there is data from the test set in the train set and vice versa. The data set is impure.

How to fix?

If each sample has size N (that is, it spans N time steps), you have to discard at least N+1 steps from either side of the split (preferably both). Why? Because then you guarantee that data from one side does not interfere with the other side.
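To make the overlap concrete, here is a minimal sketch that builds sliding-window samples, splits them with and without a gap, and counts how many raw observations end up on both sides. It assumes each sample uses N consecutive values as features and the next value as its target; the function names and the toy series are hypothetical. Under that assumption a gap of N samples is already enough, so the N+1 used here follows the rule above with one step of margin.

```python
def make_windows(series, window):
    """Return (start, end) pairs: series[start:end] are the features of a
    sample and series[end] is its target."""
    return [(i, i + window) for i in range(len(series) - window)]


def touched(sample):
    """Indices of every raw observation a sample uses (features + target)."""
    start, end = sample
    return set(range(start, end + 1))


def split_with_gap(samples, n_train, gap):
    """Chronological split that discards `gap` samples after the train set."""
    return samples[:n_train], samples[n_train + gap:]


def shared_observations(train, test):
    """Raw observation indices that appear on both sides of the split."""
    train_idx = set().union(*(touched(s) for s in train))
    test_idx = set().union(*(touched(s) for s in test))
    return train_idx & test_idx


N = 5                                    # sample (window) size
series = list(range(30))                 # toy series; values stand in for prices
samples = make_windows(series, N)

# Naive split: the last train windows and first test windows overlap.
naive_train, naive_test = split_with_gap(samples, n_train=15, gap=0)
print("shared without gap:", sorted(shared_observations(naive_train, naive_test)))

# Split with N+1 discarded samples: no observation is on both sides.
train, test = split_with_gap(samples, n_train=15, gap=N + 1)
print("shared with gap:   ", sorted(shared_observations(train, test)))
```

The price of the gap is fewer test samples (4 instead of 10 in this toy case), which is the trade-off for a clean split.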

Data Leaks are unacceptable in Financial Machine Learning — Photo by Daan Mooij on Unsplash

Scaling data from future and past together

Another basic mistake is to scale the data X before splitting, or to use the same scaler for both X_train and X_test. Either way, future data influences how past data is scaled. This is data leakage: the scaling statistics (for example, the minimum and maximum) carry information about future values, so the model effectively already knows something about the future. This mistake is most commonly found in approaches that use neural networks.

How to fix?

Financial data must be split before scaling. If you see code that scales the data before splitting, or that reuses the scaler fitted on the training data to scale the test data, the scaling is biased.
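The effect of scaling before splitting can be seen with a few numbers. The sketch below uses a hand-written min-max scaler on made-up prices and compares the training data scaled with statistics from the full series against the same data scaled after the split. The helper names are hypothetical, and how the test portion should be scaled is deliberately left out.

```python
def min_max_fit(values):
    """Return the (min, max) statistics a min-max scaler would learn."""
    return min(values), max(values)


def min_max_apply(values, lo, hi):
    """Scale values into [0, 1] relative to the given statistics."""
    return [(v - lo) / (hi - lo) for v in values]


# Made-up prices that trend upward after the split point.
prices = [10.0, 12.0, 11.0, 13.0, 14.0, 18.0, 20.0, 19.0]
split = 5
train, test = prices[:split], prices[split:]

# Leaky: statistics come from the full series, including the future.
lo_all, hi_all = min_max_fit(prices)
leaky_train = min_max_apply(train, lo_all, hi_all)

# Split first: statistics come from the training period only.
lo_train, hi_train = min_max_fit(train)
clean_train = min_max_apply(train, lo_train, hi_train)

print("leaky train max:", max(leaky_train))
print("clean train max:", max(clean_train))
```

In the leaky version the scaled training data never reaches the top of its range, because the maximum was taken from the test period. That alone tells the model that higher prices are still to come.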

I am writing knowledge pills so you can keep your knowledge sharp! I write about Data Science and Machine Learning at their intersection with Finance. Feel free to connect and check out my other articles! Also check faidhwealth.com for my latest Financial and Machine Learning experiments.

References

[1] López de Prado, M. Advances in Financial Machine Learning. John Wiley & Sons, 2018.

Victor Silva, M. Sc.

Data Science | Finance | Machine Learning | Ethics | I hold two M.Sc. degrees in Computer Science and I’m a PhD Researcher at the University of Alberta.