Multiple Major Shareholders and Corporate Financing Constraints — Empirical Evidence from Text Analysis

This paper was published in Management World (the leading management journal in China) in December 2017. [link][PDF (in Chinese)] The main finding is that companies with more than one major shareholder (a shareholder owning more than 10% of the shares outstanding is defined as a major shareholder) tend to face fewer financing constraints. My contribution was to quantify the extent of corporate financing constraints using text analytics.

Here I briefly describe the process used to quantify corporate financing constraints, with some explanation where necessary:

Collect the raw text of the Management Discussion & Analysis (henceforth MD&A) sections from the annual reports of China’s listed companies, starting from 2001, and clean the text.
Use regular expressions to specify the patterns in which companies are likely to describe being financially constrained. An example pattern is something like [Any Words] + face a hard time + [raising/acquiring] + [money/capital] + [Any Words].
Write a script that combines the set of regular expressions we define with the proper logic; use the script to scan all the MD&As; label an MD&A as a financially constrained MD&A if the script detects sentences in it that match the prespecified patterns.
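The pattern-matching step can be sketched as follows. This is only a minimal illustration: the two English patterns, the naive sentence splitter, and the names `CONSTRAINT_PATTERNS`, `split_sentences`, and `is_constrained` are all assumptions made for the example. The study itself works on Chinese MD&A text, and its actual pattern set and matching logic are not reproduced here.

```python
import re

# Hypothetical English patterns, loosely following the example pattern in the text.
CONSTRAINT_PATTERNS = [
    re.compile(
        r"\b(?:face|faces|facing|faced)\s+a\s+hard\s+time\s+"
        r"(?:raising|acquiring)\s+(?:money|capital)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(?:difficult|difficulty)\s+(?:to\s+|in\s+)?"
        r"(?:raise|raising|obtain|obtaining)\s+(?:funds|financing|capital)\b",
        re.IGNORECASE,
    ),
]


def split_sentences(text):
    """Naive sentence splitter on . ! ? ; (an assumption for illustration)."""
    return [s.strip() for s in re.split(r"[.!?;]+", text) if s.strip()]


def is_constrained(mdna_text):
    """Return True if any sentence matches any of the prespecified patterns."""
    return any(
        pattern.search(sentence)
        for sentence in split_sentences(mdna_text)
        for pattern in CONSTRAINT_PATTERNS
    )


print(is_constrained("Sales grew. The company faces a hard time raising capital this year."))
print(is_constrained("Revenue increased and cash flow was strong."))
```

Matching sentence by sentence keeps a pattern from spanning two unrelated sentences, which is one plausible reading of the "proper logic" mentioned above.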
For each year $t$, we have a group of companies whose MD&As are identified as financially constrained MD&As. For any MD&A disclosed in that fiscal year, the more similar it is to the constrained MD&As, the more likely it is that the corresponding company faces financing constraints. To calculate the similarity score, I map the text into a vector space using the Bag-of-Words model and normalize each vector to unit length. The vector corresponding to the cluster of constrained MD&As in year $t$ is the average of the text vectors of the constituent MD&As, again normalized to unit length. Since all vectors are normalized, the similarity score between an MD&A and the constrained MD&As is simply the dot product of the two vectors, which equals their cosine similarity. We denote the similarity score of company $i$ in year $t$ as $ConstrainedScore_{i,t}$.
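A minimal sketch of this similarity computation for a single year, using only the Python standard library. The toy documents, the whitespace tokenization, and the helper names `unit_bow`, `unit_mean`, and `dot` are assumptions for illustration; real Chinese text would need word segmentation first.

```python
import math
from collections import Counter


def unit_bow(text):
    """Bag-of-words count vector (stored as a dict), scaled to unit length."""
    counts = Counter(text.split())  # whitespace tokens: an assumption
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def unit_mean(vectors):
    """Sum the sparse vectors, then rescale to unit length.

    Normalizing the sum gives the same direction as normalizing the average.
    """
    total = Counter()
    for v in vectors:
        for w, x in v.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {w: x / norm for w, x in total.items()} if norm else {}


def dot(u, v):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())


# Toy MD&As for one year; "A" and "B" were flagged as constrained by the regex step.
docs = {
    "A": "financing difficult capital shortage",
    "B": "capital shortage loans tight",
    "C": "sales growth strong profit",
}
constrained = {"A", "B"}

vecs = {k: unit_bow(t) for k, t in docs.items()}
centroid = unit_mean([vecs[k] for k in constrained])
scores = {k: dot(v, centroid) for k, v in vecs.items()}
print(scores)
```

Company "C" shares no vocabulary with the constrained group, so its score is zero, while "A" and "B" score high because they contribute to the centroid.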
The raw constrained score contains confounding factors that come from two major sources. First, almost every MD&A has boilerplate content, largely due to the disclosure rules stipulated by the China Securities Regulatory Commission. Second, companies from the same industry usually share similar vocabulary and expressions (terminology or jargon) in their writing. Therefore, without saying anything about financing constraints, an MD&A can still be quite similar to a group of constrained MD&As. We need a way to get rid of these confounding factors. To measure the amount of boilerplate text, I calculate the normalized average of all text vectors, and then calculate the cosine similarity score between each MD&A vector and the average vector, on a yearly basis. To measure the amount of jargon or terminology, I calculate the average vector for each industry; the remaining steps are the same. Denote these two similarity scores as BoilerplateScore and IndustryScore.
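Calculate the final constrained score. To get rid of the confounding factors, simply regress ConstrainedScore on BoilerplateScore and IndustryScore. The final constrained score is thus the residual (the estimated error term $\hat{\varepsilon}_{i,t}$) for company $i$ in year $t$ obtained from the following regression:

$$ConstrainedScore_{i,t} = \beta_0 + \beta_1 \, BoilerplateScore_{i,t} + \beta_2 \, IndustryScore_{i,t} + \varepsilon_{i,t}$$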
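The residual step can be sketched with a small ordinary least squares routine. This is an illustration under stated assumptions: the regression includes an intercept and is run on a pooled sample, the numbers are made up, and the helpers `solve` and `ols_residuals` are hypothetical names; in practice a statistics package would do the fitting.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def ols_residuals(y, regressors):
    """Residuals of y regressed on an intercept plus the given regressor columns."""
    rows = [[1.0] + list(vals) for vals in zip(*regressors)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return [yi - sum(bj * xj for bj, xj in zip(beta, r)) for r, yi in zip(rows, y)]


# Made-up scores for five company-year observations.
constrained_score = [0.62, 0.55, 0.71, 0.48, 0.66]
boilerplate_score = [0.50, 0.45, 0.52, 0.40, 0.58]
industry_score = [0.55, 0.52, 0.60, 0.41, 0.57]

final_score = ols_residuals(constrained_score, [boilerplate_score, industry_score])
print(final_score)
```

By construction the residuals are uncorrelated with both BoilerplateScore and IndustryScore, which is what makes them usable as a constrained score purged of those two confounders.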