'20 M.S. in Financial Engineering at Columbia University
Seeking full-time opportunities in quantitative finance/fintech/analytics
View My LinkedIn Profile
Project Overview: This project was part of exploratory research during my internship at a securities company in China. The final conclusion is that combining models of different configurations enhances performance. For ethical reasons, I will not go into great detail. Instead, I will briefly describe the task and list several empirical findings that I personally find quite interesting. This project was mostly done by myself, and this article represents only my personal point of view.
I was given a set of stock factors that had proved effective in the China A-share market. My task was to explore whether machine learning algorithms could incorporate all these factors into stock selection in a way that consistently outperformed conventional methods. It was essentially a multifactor stock selection problem that employed machine learning to add a flavor of non-linearity.
The stock selection was performed on a walk-forward basis. At each rebalance date, a new machine learning model was trained to incorporate the latest data. The model was then used to score the entire universe of stocks, and the top-ranked stocks and their corresponding weights were saved. Finally, the stocks and weights were fed into a portfolio backtesting program.
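For concreteness, here is a minimal sketch of that walk-forward loop. It is not the actual internship code: the long-format data layout, the `model_factory` argument, and the score-proportional weighting are all my own illustrative assumptions.

```python
import pandas as pd

def walk_forward_selection(data: pd.DataFrame, feature_cols: list, label_col: str,
                           rebalance_dates: list, window: int, model_factory,
                           top_n: int = 100) -> dict:
    """data: long-format frame with columns [date, ticker, <factor columns>, <label>]."""
    holdings = {}
    for date in rebalance_dates:
        # Train only on the most recent `window` trading days strictly before
        # the rebalance date, so no look-ahead information leaks in.
        past = data[data["date"] < date]
        recent_days = sorted(past["date"].unique())[-window:]
        train = past[past["date"].isin(recent_days)].dropna(subset=[label_col])
        model = model_factory()
        model.fit(train[feature_cols], train[label_col])
        # Score the whole universe as of the rebalance date.
        today = data[data["date"] == date].set_index("ticker")
        scores = pd.Series(model.predict_proba(today[feature_cols])[:, 1],
                           index=today.index)
        top = scores.nlargest(top_n)
        holdings[date] = top / top.sum()   # normalize scores into portfolio weights
    return holdings
```

Any classifier with a scikit-learn-style interface (for instance `lightgbm.LGBMClassifier`) can serve as `model_factory` here.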
We can also think of the time series of scores as a composite factor derived from multiple stock factors, and test its efficiency by sorting stocks into different groups and performing the backtest separately on each group.
As we can see in the Machine Learning configurations section, there are many alternative choices. In fact, in my tests, I failed to find any single best configuration. Some were better in terms of absolute return, while others exhibited lower volatility and drawdown. This reminded me of Ensemble Learning, a machine learning paradigm that aggregates a group of weak learners to achieve better predictive performance. During my internship, I figured out a way to combine the results from models varying in learning algorithm, objective and dataset, which obtained better predictive performance and backtesting outcomes than any single configuration alone. In this report, the results I present were achieved by fixing classification as the default learning objective and altering algorithms and datasets (different time windows/subsamples). The algorithms I chose include LightGBM (LGB), vanilla feed-forward neural networks (ANN) and networks with LSTM layers (LSTM). I used two time windows, namely 2 years and 1 year, denoted as 100 and 50 respectively because they corresponded to data from 100 and 50 trading days. I labeled the data in three ways, which corresponded to three subsamples: 1) label the top 20% of stocks as positive and the bottom 20% as negative (TB); 2) the top 20% as positive and a randomly selected 20% as negative (TR); 3) the bottom 20% as negative and a randomly selected 20% as positive (RB). Taken together, I set up 3 x 2 x 3 = 18 different configurations.
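The configuration grid and the three labeling schemes can be made concrete with a short sketch. How exactly the "randomly selected 20%" was drawn is not spelled out above; sampling without replacement from the stocks outside the fixed tail is my own assumption.

```python
from itertools import product
import numpy as np
import pandas as pd

ALGORITHMS = ["LGB", "ANN", "LSTM"]
WINDOWS = [100, 50]              # trading days of history per training window
LABELINGS = ["TB", "TR", "RB"]
CONFIGS = list(product(ALGORITHMS, WINDOWS, LABELINGS))   # 3 x 2 x 3 = 18

def make_labels(fwd_returns: pd.Series, scheme: str, seed: int = 0) -> pd.Series:
    """Label one cross-section of forward returns under TB, TR or RB."""
    rng = np.random.default_rng(seed)
    ranks = fwd_returns.rank(pct=True)
    n20 = int(len(fwd_returns) * 0.2)
    if scheme == "TB":        # top 20% positive, bottom 20% negative
        pos = fwd_returns[ranks >= 0.8].index
        neg = fwd_returns[ranks <= 0.2].index
    elif scheme == "TR":      # top 20% positive, random 20% negative
        pos = fwd_returns[ranks >= 0.8].index
        neg = rng.choice(fwd_returns[ranks < 0.8].index, size=n20, replace=False)
    else:                     # "RB": random 20% positive, bottom 20% negative
        neg = fwd_returns[ranks <= 0.2].index
        pos = rng.choice(fwd_returns[ranks > 0.2].index, size=n20, replace=False)
    labels = pd.Series(np.nan, index=fwd_returns.index)
    labels[pos] = 1.0
    labels[neg] = 0.0
    return labels.dropna()    # only labeled stocks enter the training subsample
```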
In particular, the predicted scores of TR models are negatively correlated with those of TB and RB models. Models that differ in the underlying algorithm also have limited correlation, ranging from 0.5 to 0.8. This is to some extent favorable to the ensemble paradigm, because the variance reduction of ensemble learning relies on limited correlation between the constituent weak learners.
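One simple way to measure such dependence is the cross-sectional Spearman rank correlation of two models' scores, averaged over rebalance dates. The `(rebalance date x ticker)` score layout below is a hypothetical convention, not the original code.

```python
import pandas as pd

def mean_rank_corr(scores_a: pd.DataFrame, scores_b: pd.DataFrame) -> float:
    """Average cross-sectional Spearman correlation between two score panels."""
    common_dates = scores_a.index.intersection(scores_b.index)
    corrs = [scores_a.loc[d].corr(scores_b.loc[d], method="spearman")
             for d in common_dates]
    return float(pd.Series(corrs).mean())
```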
The figures below present the results of the factor backtest when different subsamples are used. Here the "factor" comes from the time series of ML-predicted scores: by using a set of stock factors as features for machine learning, we are essentially converting multiple factors into one single factor. Stocks are sorted by this factor into 10 groups on a regular basis, and a portfolio backtest is conducted within each group. In each figure, the upper part presents the equity curves of the ten portfolios, and the lower part presents the annualized return of each portfolio.
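A minimal sketch of this decile sort, assuming scores and forward returns live in aligned `(rebalance date x ticker)` DataFrames; the real backtesting program also handled weighting, transaction costs and the like.

```python
import pandas as pd

def decile_group_returns(scores: pd.DataFrame, fwd_returns: pd.DataFrame) -> pd.DataFrame:
    rows = {}
    for date in scores.index:
        s = scores.loc[date].dropna()
        # Rank-then-qcut guarantees 10 equal-sized buckets even with tied scores.
        groups = pd.qcut(s.rank(method="first"), 10, labels=list(range(1, 11)))
        rows[date] = fwd_returns.loc[date, s.index].groupby(groups).mean()
    # One column per decile; cumulating each column gives the equity curves.
    return pd.DataFrame(rows).T
```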
The TB model, which uses the data of stocks with the top 20% and bottom 20% cross-sectional performance, performs best in terms of the magnitude of long and short returns. This is not surprising, because this way of using data yields a relatively high signal-to-noise ratio and is more or less the default setting practitioners use. The test on the RB model shows that the short portfolio (the 10th group) has a -38.69% yearly return, which is impressive. What's more, the monotonicity across the ten portfolios is excellent. The TR model does not give an equally good performance, but the monotonicity of the ten portfolios still holds with few exceptions. What is surprising is that, when the factors from TR and RB are combined into one, the backtesting outcome is very similar to that of the TB model alone. This is strong evidence that aggregating models of different configurations tends to enhance performance.
Figure 1 TB—long return: 46%, short return: -43.56%
Figure 2 TR—long return: 26.20%, short return: 7.06%
Figure 3 RB—long return: 36.44%, short return: -38.69%
Figure 4 Combining TR with RB—long return: 46.51%, short return: -34.11%
Not only does the choice of subsample make a difference, but the length of the time window also matters. Table 1 reports the results of the portfolio test in which the 100 stocks with the highest scores were selected at each portfolio rebalance date. Section 2.4 points out that I tested 18 models of distinct configurations, differing in algorithm, subsample or time window. XXX_Composite thus corresponds to combining the scores of all models with the XXX configuration into one composite score and selecting stocks accordingly (a sketch of one possible aggregation rule is given after Table 1). While I will analyze the table again later, here we focus on the results for 100_composite and 50_composite. As I expected, using data from fewer trading days makes the model more sensitive to changes in market style, thus reducing the max drawdown, which usually happens when the market regime shifts. On the other hand, using more data tends to train a more robust model that works better, especially in stable market conditions, which is usually the case. Consequently, 100_composite still beats 50_composite in annualized return despite a larger max drawdown.
As is shown in Table 1 below, the composite model that aggregates all 18 models achieves the best annualized return, and ranks 2nd on the other metrics except the turnover ratio.
Table 1 Performance metrics of different models. Note: Maximum Relative Drawdown uses the CSI 500 index as the benchmark.
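On how such composites can be built: the exact aggregation rule is not something I can disclose, but a minimal sketch assuming equal-weight averaging of cross-sectionally standardized (z-scored) scores would look like this. Standardizing each cross-section first keeps one dominant model from swamping the rest.

```python
import pandas as pd

def composite_score(score_frames: list) -> pd.DataFrame:
    """score_frames: list of (rebalance date x ticker) score DataFrames."""
    standardized = [f.sub(f.mean(axis=1), axis=0).div(f.std(axis=1), axis=0)
                    for f in score_frames]       # z-score each cross-section
    return sum(standardized) / len(standardized)
```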
It is not surprising that the composite model has the best return performance, because all the labels are generated taking only future return into consideration. I believe that if, when labeling, we used the future return divided by stock volatility as the sorting criterion, which is essentially analogous to a Sharpe ratio, then the composite model would achieve the best Sharpe ratio, and probably the best max relative drawdown at the same time. This argument leaves open the possibility of adding another variation of the model configuration to further enhance performance.
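A minimal sketch of that proposed variation, reusing the `make_labels` sketch from earlier; estimating volatility as the trailing standard deviation of daily returns is my own assumption.

```python
import pandas as pd

def sharpe_like_labels(fwd_returns: pd.Series, daily_returns: pd.DataFrame,
                       scheme: str = "TB", vol_window: int = 100) -> pd.Series:
    """daily_returns: (day x ticker) return history ending at the labeling date."""
    vol = daily_returns.tail(vol_window).std()        # trailing volatility per ticker
    criterion = fwd_returns / vol.reindex(fwd_returns.index)
    return make_labels(criterion.dropna(), scheme)    # reuse the labeling sketch above
```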
Finally, Figure 5 shows the equity curve corresponding to the portfolio test of the composite model. The red line is the relative equity curve benchmarked against the CSI 500; it trends upward most of the time but dips during 2017, a year in which the China A-share market experienced a major regime shift. That drawdown, however, is more a problem of the set of factors I used than of the modeling approach itself.
Figure 5 Backtest equity curve of the composite model. Note: black: portfolio equity curve; green: CSI 500 index equity curve; red: portfolio equity curve relative to the CSI 500.