20' M.S. in Financial Engineering at Columbia University
Seeking full time opportunities in quantitative finance/fintech/analytics
View My LinkedIn Profile
Project Overview: This is the abridged version of my graduate thesis at Renmin University of China, completed during the spring of 2019. I built a new industry classification for China’s listed companies using product & services descriptions disclosed in annual/semi-annual reports. This classification performed well as compared to conventional classifications. Consistent with a behavioral finance perspective, the new industry momentum factor, calculated as the average past return of industy peers, was proved to have much stronger predictive power for future stock returns than conventional industry momentum factor.
Conventional industry classifications typically categorize a company into one industry. However, in China a large proportion of listed companies have their business spanned across several industries or sectors. Investors who rely on conventional classifications thus tend to ignore industry-level information relevant to a focal company’s secondary business, resulting in delayed incorporation of such less invisible information into stock prices — a phenomenon called limited attention in behavioral finance literature. Ideally, capturing the information that investors overlook would help better predict future stock returns.
My idea of using text data for industry reclassification originated from my internship at an NLP tech startup. However, when I did my graduate thesis, Hoberg and Phillips (2016) already proposed a solution that had been widely accepted in academia. Hence, the following steps are based on their paper, with some steps adjusted or newly added to be compatible with Chinese language.
Starting from the 2015 annual report (and the 2017 semi-annual report), the China Securities Regulatory Commission(henceforth CSRC) requires listed companies to disclose their business profiles. This section contains a large number of vocabulary related to the company’s products and services. If the product & services descriptions of two companies are similar, they probably belong to the same industries. Based on this observation, my study uses text information to redo the industry classification. The specific steps are as follows:
The definition of industry peers here makes it more like a network than a classification. In fact, the classification proposed by Hoberg and Phillips (2016) was named Text-based Network Industries (TNIC for short).
SWS Ind is the most popular industry classification among finance practioners in China. This study thus uses SWS Ind as the major benchmark. CSRC has also published industry classification (denoted as CSRC Ind for short), which is the default choice in academic research. But as you will see later, CSRC Ind performs worse than SWS Ind.
When working on this thesis, the annual report for fiscal year 2018 was not completely availble. Consequently, I gathered five periods of reports that had product & services information disclosed. Industry reclassification, namely the process described above on how to define industry peers, was conducted on Apr 30 (Aug 31) when most annual(semi-annual) reports had been made public. To further avoid look-ahead bias, the text for companies that failed to issue reports before the expected date was not updated.
Table 1 number of companies with at least one industry peer:
2015A | 2016A | 2017H | 2017A | 2018H |
---|---|---|---|---|
2687 | 3035 | 3248 | 3431 | 3501 |
As shown in Table 1, text-based industry covers 95% ~ 99% of the entire A-share universe.
Table 2 descriptive statistics: number of industry peers for a focal company
2015A | 2016A | 2017H | 2017A | 2018H | |
---|---|---|---|---|---|
mean | 51.95 | 57.62 | 61.61 | 65.91 | 67.03 |
std | 36.97 | 43.72 | 43.89 | 52.96 | 52.41 |
min | 1 | 1 | 1 | 1 | 1 |
25% | 23 | 23 | 27 | 23 | 26 |
50% | 43 | 46 | 51 | 51 | 53 |
75% | 72 | 81 | 86 | 95 | 94 |
max | 215 | 225 | 264 | 284 | 292 |
Table 2 shows the distribution of the number of industry peers a focal company has under the text-based industry classification. For comparison, Table 3 lists the same statistics when using SWS Ind, for the same universe of stocks. Obviously, the order statistics are of comparable scale, which results directly from using SWS Ind as a benchmark.
Table 3 descriptive statistics: number of industry peers for a focal company defined by SWS Ind
2015A | 2016A | 2017H | 2017A | 2018H | |
---|---|---|---|---|---|
mean | 54.54 | 60.90 | 62.91 | 67.03 | 67.43 |
std | 43.02 | 47.41 | 47.93 | 51.40 | 51.77 |
min | 0 | 0 | 0 | 0 | 0 |
25% | 22 | 27 | 29 | 29 | 31 |
50% | 41 | 44 | 47 | 49 | 50 |
75% | 69 | 81 | 87 | 96 | 100 |
max | 159 | 177 | 177 | 193 | 194 |
In this section, I try to demonstrate the desirable characteristics of text-based classification using both qualitative and quantitative methods.
The table below shows two randomly selected focal companies and the SWS Inds corresponding to their industry peers. In both cases, the majority of peer firms have their SWS Inds highly consistent with the focal company. For example, among the 178 peers of Renfu Medicine, which is a pharmaceutical manufacturer, 98 are from the traditional Chinese medicine industry, the pharmaceutical production industry and the biological products industry. It is of course not by coincidence. Comapnies with closely related Shenwan Industries tend to use similar vocabulary when describing their businesses, which is captured by my text-based industry classification.
Table 4
Stock Name and Ticker | SWS Ind | Frequencies of SWS Inds among Industry Peers |
---|---|---|
Phoenix Media 601928.SH | cultural media | cultural media [36], computer applications [17], internet media [10], communication equipment [6], marketing communications [4], general retail [3], computer equipment [3], chemicals [2], professional retail [2], commercial property management [2], real estate development [1], optical optoelectronics [1], hotel [1], clothing home textiles [1], textile manufacturing [1], household light industry [1], papermaking [1], packaging and printing [1], audio-visual equipment [1], food processing [1], comprehensive [1] |
Renfu Medicine 600079.SH | pharmaceutical production | traditional Chinese medicine [44], pharmaceutical production [40], biological products [14], securities [13], banks [9], pharmaceutical business [8], medical services [6], chemicals [5], medical devices [5] , comprehensive [5], animal health [3], special equipment [3], clothing home textile [2], infrastructure [2], food processing [2], chemical fiber [2], real estate development [2], cultural media [1], semiconductor [1], automotive parts [1], park development [1], cement manufacturing [1], water [1], high and low voltage equipment [1], trade [1], motor [1], plastic [1], coal mining [1], metal products [1], internet media [1] |
Below is another interesting table showing two randomly selected focal companies and the top 30 high-frequency words in the product & services descriptions of their industry peers. To some extent, we can guess the industry of the focal company simply by reading the most frequent words used by its industry peers. For example, in the case of Northeast Securities almost all words are directly related to finance. This finding further demonstrates the feasibility of text-based industry classification.
Table5
Stock Name and Ticker | SWS Ind | Top 30 High-frequency Words in the Product & Services Descriptions of Industry Peers |
---|---|---|
Northeast Securities 000686.SZ | securities | banks [84], securities [82], financing [81], funds [79], transactions [71], supervision [69], bonds [66], income [65], finance [64], credit [63], Consultation [58], wealth [58], transformation [56], private placement [55], entity [54], compliance [54], RMB [54], feature [53], stock [53], consultant [51] brokerage [51], futures [51], regulation [50], individual [49], asset management [48], underwriting [48], structure [46], insurance [46], equity [45], investors [45] |
Hainan Mining 601969.SH | mining | reserves [30], minerals [29], mining [29], mines [28], metals [27], mining [26], mines [26], smelting [24], grades [23], yields [23], Mining area [23], exchange [23], ore [22], mining [22], raw materials [21], exploration [21], copper [20], use [20], quantity [20], geology [20] , mineral processing [20], trade [19], solid [19], non-ferrous metals [18], exploration [18], securities [17], utilization [17], production capacity [17], production capacity [16], chemical industry [16] |
Note: duplicated words are due to translation
Here comes my favorite part. Although we can somehow know the degree of association between any two SWS Inds simply according to their names, from now on we pretend to have no prior knowledge about these industries at all. Instead, by assuming that text-based industry classifcation is a good classification, I demonstrate a way to pick out the most related SWS Ind pairs.
As we see in 3.1, industry peers of a focal company come from different SWS Inds. Intuitively, the more related two SWS Inds are, the more likely that they appear simultaneously. There are at least two supporting arguments. First, as mentioned, companies from related Shenwan Industries tend to share similar vocabulary in their reports, and are thus more likely to appear as the peers of the same focal company. Second, for a focal company with diversified businesses, concentric diversification is more likely to happen, in which the company develops products or services closely related with its current core business. If text-based classifcation captures the diversification of the focal company, then different but highly related SWS Inds are more likely to coappear. Based on this observation, I implemented the following steps to quantify the degree of associations between two SWS Ind pairs.
There are 5,356 potential SWS Ind pairs. Listed below are the top 50 SWS Ind pairs ranked by Normalized Coappear Score. The result is pretty impressive: we identify highly related industry pairs without using prior knowledge. The result further lends credence to text-based industry classification.
Table 6 Top 50 Most Related SWS Ind Pairs
1. airport & air freight | 2. port & rail transport | 3. port & shipping | 4. food & beverage manufacturing |
5. livestock and poultry farming & feed | 6. tourism & attractions | 7. oil exploitation & Mining service | 8. shipping & rail transport |
9. airport & port | 10. insurance & securities | 11. rare metal & gold | 12. fishery & livestock and poultry farming |
13. animal health & livestock and poultry farming | 14. airport & shipping | 15. ground equipment & aerospace equipment | 16. securities & bank |
17. other mining & gold | 18. livestock and poultry farming & crop farming | 19. diversified finance & securities | 20. insurance & bank |
21. bus & airport | 22. water & environmental engineering and services | 23. attractions & hotel | 24. animal health & feed |
25. other mining & rare metal | 26. agricultural integration & crop farming | 27. insurance & diversified finance | 28. diversified finance & bank |
29. airport & rail transport | 30. traditional Chinese medicine & chemical and pharmaceutical | 31. infrastructure & housing construction | 32. industrial metal & gold |
33. chemical and pharmaceutical & biological product | 34. bus & rail transport | 35. tourism & hotel | 36. traditional Chinese medicine & biological product |
37. bus & air freight | 38. ground equipment & aviation equipment | 39. fishery & feed | 40. industrial metal & rare metal |
41. coal mining & rail transport | 42. park development & real estate development | 43. garden engineering & water | 44. general retail & clothing |
45. fishery & crop farming | 46. aerospace equipment & aviation equipment | 47. Internet media & marketing communications | |
48. animal health & fishery | 49. general retail & commercial property management | 50. medical service & biological product |
Companies in the same industry should also exhibit similar patterns in a wide variety of variables, including valuation multiples (PE ratio, book to market ratio), profitability (return on asset, return on equity), operating efficiency (asset turnover ratio), growth potential (annual sales growth), debt capacity (debt ratio), R&D expense ratio and analyst forecasted growth. If a classification is indeed well-defined, then an industry-level variable aggregated from the composite companies should have great explanatory power on company-level variables, hence a relatively high regression R-squared.
I used the following company-wise regression to compare CSRC Ind, SWS Ind and text-based industry classification. can be any of the variables mentioned above, and is the corresponding industry average excluding company :
Table 7 summarizes all the adjusted R-squared obtained from the regressions. SWS Ind and TNIC both win 4 battles, and tie in one. CSRC Ind fails to win even one battle. Notably, although the SWS Ind is the most popular one to financial analysts, it falls behind text-based industry classification in the test where analyst forecasted growth is the variable of interest. Such a result seems to indicate that analysts will identify comparable companies based on their own understanding, rather than fully relying on established industry classifications.
Table 7 Regress Company-level Variable on Industry-level Variable: R-squared results
Variables | CSRC Ind | SWS Ind | TNIC | # Samples |
---|---|---|---|---|
EP | 18.09% | 18.25% | 11.17% | 9139 |
BM | 26.55% | 27.80% | 27.80% | 9139 |
ROA | 8.71% | 8.38% | 12.38% | 9139 |
ROE | 6.53% | 6.38% | 7.42% | 9113 |
TURNOVER | 27.21% | 29.83% | 27.93% | 9139 |
GROWTH | 5.68% | 6.00% | 3.63% | 9139 |
DEBT | 22.72% | 24.20% | 27.61% | 9139 |
RD EXPENSE | 41.98% | 44.20% | 43.08% | 9139 |
FORECAST GROWTH | 5.68% | 7.00% | 8.61% | 6314 |
Note: 1. EP takes the reciprocal of PE ratio to account for negative values; 2. Bold format indicates the best among three
I have demonstrated above that text-based industry classification is reliable. Obviously, it can capture some of the industry characteristics which is not directly reflected in a conventional classification scheme. To illustrate, a real estate company with food manufacturing taking up 25% percent of its yearly revenue is categorized to real estate industry, but its text-based industry peers will include many food manufactuers. Investors, subjected to limited attention from a behavioral finance perspective, tend to ignore this fact.
The momentum effect is a common market phenomenon by which asset prices follow an upward or downward trend for a long time. As a result, the past return of the asset, usually named momentum factor, can predict future return. Likewise, industry momentum can be defined as the industry-level past return, and (in this study) its predictive power for stock future returns is called industry momentum effect. Formally, we can define the industry momentum of company during period as the average stock return of its industry peers during period . where . Here I extend the concept of to conventional classifications so that two comapnies belonging to the same industry are automatically industry peers.
As industry momentum can proxy for industry-level information, and the text-based industry classification captures industry associations that investors overlook, the text-based industry momentum will contain relevant industry-level information that slowly affects stock prices, resulting in a stronger industry momentum effect. This section presents my investigation.
First, I use Fama-Macbeth regressions to test different industry momentums on their explanatory power for stock returns. The two regressions are listed below. All return related variables are weekly returns, because in China researchers have concluded that momentum effect only exists in the short term —— in the long term it is reversal effect that predominates. is the weekly return of company at week , which controls for individual stock momentum. is the logarithmic of market value and is the logarithmic of book-to-market ratio. In the first regression, I only include the industry momentum that lagged for 1 week, while in the second one industry momentum lagged for 1 to 4 weeks are all included, with the same set of control variables. To ensure that different coefficients are directly comparable, all independent variables are normalized to the corresponding z-scores.
Table 8 summarizes the results from the first regression. Ceteris paribus, the text-based industry momentum generates the largest coefficient, t-statistics and adjusted R-squared. To my expectation, CSRC Industry Momentum is the least significant. Notably, controlling for industry momentum, the coefficients for individual stock momentum are all significantly negative, meaning that reversal is stronger than momentum.
Table 8 Fama-Macbeth Regression: including one industry momentum
Table 9 shows the results from the second regression. Industry momentums lagged for 1 week are still significantly positive, but the other industry momentums are not. Text-based industry momentum is still the strongest.
Table 9 Fama-Macbeth Regression: including four industry momentums
In this section, I implement portfolio backtests based on industry momentum factors. The test procedure originates from Jegadeesh and Titman (1993) which was the first paper to consolidate the findings of momentum effect. Sort all the stocks by their industry momentums during the past weeks, namely the formation weeks, and divide them into ten groups. For a newly formed group ranging from 1 to 10, we buy all the stocks with equal weights and hold for K weeks as the holding period. This process repeats at the begining of each week, so that at week , each portfolios consists of K sub-portfolios which are formed during week to week . By convention, the first group which corresponds to the smallest industry momentum is called the Short Portfolio, and the 10th group is called the Long Portfolio.
The above backtest is conducted for the CSRC, SWS as well as text-based industry momentums, respectively, with the formation weeks and the holding weeks varying between 1 to 12. The image below compares the backtesting results, where the vertical axis is the difference of annualized returns between the Long and the Short Portfolios. As we can see, the text-based industry momentum (right most) is the absolute winner.
Table 10 compares the backtesting outcomes of text-based industry momentum with SWS momentum, with a 3-week formation period and various holding periods. “Annualized excess return/Long(Short)” is the annualized return of the Long(Short) Portfolio minus equal-weighted annualized return of the whole stock universe. “Long - Short” is the difference of annualized returns between the Long and the Short Portfolio. “FF 3 factor alpha” is the intercept term obtained when regressing monthly returns of Long-short portfolio on monthly Fama-French 3 factors, which represents the risk premium after controling for systematic risks. “Alpha t-statistics” is the t-stat of the intercept term. The last metric, factor monotonicity, is the Spearman correlation coefficient between each portfolio’s group number (1 to 10) and the final netvalue. This metric is based on my argument that an ideal stock factor should have considerable postive/negative correlation with future stock returns. Hence, the desirbale factor monotonicity should have an absolute value close or equal to 1.
Table 10 Performance Comparison in Jegadeesh and Titman Test Note: ***, ** and * indicate significance levels of 1%, 5% and 10%
As exhibited in Table 10, the text-based industry momentum is 100% dominant over SWS Ind momentum. The factor monotonicity score is the most impressive —— the monotonicity of 10 portfolios does not decay even when the holding period is raised to 10 weeks. Also worth mentioning is the FF 3 factor alpha. The monthly alpha when the holding period is 1-week is 1.04% which is significant at 10% level. The fact that it is not quite significant may be due to the lack of enough samples in regression. Recall that we use monthly return for this regression, which means the number of available observations reduces to no more than a quater of the original. Therefore, a 10% significance level here is somewhat acceptable.
The last image shows the equity curves for different types of industry momentums, with the formation period and the holding period set to be optimal. Again, the text-based industry momentum wins. The lowest curve corresponds to the backtest that sorts stocks by stock-level momentum. Its negative slope casts doubts on the existence of stock momentum effect in China’s A-share Market.
To my expectation, the text-based industry momentum effect is stronger. I have argued that it is because of investors’ inattention to relevant industry information, which is caused by the conventional classification that categorizes a multi-sector company to a single industry. Since my text-based classification is designed to overcome this drawback, it captures the information less visible investors, and the corresponding industry momentum can thus better predict future stock returns.
Here I give an indirect test of the proposed mechanism. I decompose the text-based industry momentum into two parts. The first part is the visible momentum, which is the average past return of text-based industry peers that are also peers under SWS Ind. The second part is the invisible momentum, which is defined using text-based industry peers that are not identified as peers under SWS Ind. I use these two industry momentums as the independent variables in place of the orginal industry momentum, and repeat the two regressions listed in section 4.1. The result shows that the coefficient of invisible momentum is 50% larger, which to some extent supports the behaviroal finance perspective.
Table 11 Fama-Macbeth Regression: Invisible/Visible Momentum
As corporate reports will continue to be released, we can collect more data to further test the robustness of these findings.
I only list publications that are directly related to this post. My actual graudate thesis presents much more items.