
A University of Chicago paper shows GPT-4 achieving up to 60% accuracy in stock selection. Will human stock analysts be replaced? AI experts raise concerns about data contamination.

Tue, May 28 2024 07:56 AM EST

New Wisdom Times Report

Editor: Editorial Department

New Wisdom Times Summary: GPT-4 has surprisingly outperformed most human analysts at stock selection, even surpassing specialized models trained for finance. Without any narrative context, it was able to analyze financial statements successfully, leaving many industry experts astonished. However, the good times didn't last long, as an AI expert pointed out a flaw in the research: the training data may have been contaminated.

Recently, industry experts were taken aback by a paper from the University of Chicago.

Researchers found that stocks selected with the help of GPT-4 outperformed those picked by humans, and GPT-4 also beat many machine learning models trained specifically for finance.

What surprised them most was that the LLM could successfully analyze the numbers in financial statements without any narrative context at all.

Paper link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311

In predicting the direction of earnings changes, the LLM outperforms experienced financial analysts. Its advantage is largest precisely in the situations where human analysts struggle and tend to produce biased or inefficient forecasts.

Moreover, the LLM's forecasts do not rest on mere recollection of its training data: GPT-4 generates genuinely useful narrative insights about a company's likely future performance.

GPT-4 also excels in trading terms, surpassing other models with higher Sharpe ratios and alphas.
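For readers less familiar with these metrics, here is a minimal sketch of how a Sharpe ratio and a regression alpha are typically computed from a strategy's periodic returns. The numbers and helper names are hypothetical, not the paper's actual backtest.

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from a series of periodic returns."""
    excess = np.asarray(returns) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def alpha_vs_market(strategy_returns, market_returns):
    """Intercept of a simple CAPM-style fit: strategy = alpha + beta * market."""
    beta, alpha = np.polyfit(market_returns, strategy_returns, deg=1)
    return alpha

# Hypothetical monthly returns for a model-selected portfolio vs. the market
strategy = np.array([0.021, -0.004, 0.015, 0.032, -0.011, 0.018])
market = np.array([0.012, 0.001, 0.009, 0.020, -0.015, 0.010])

print(sharpe_ratio(strategy))             # annualized Sharpe ratio
print(alpha_vs_market(strategy, market))  # per-period alpha (regression intercept)
```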

Ethan Mollick, a professor at the Wharton School, called it "a much-anticipated paper."

Some netizens also remarked that, in the future, it is hard to say whether it will be humans or AI doing the trading in the stock market.

However, just as the excitement was building, careful observers poured cold water on the study: the result could very well be due to contamination of the training data.

AI expert Yuandong Tian also noted that GPT-4's strong performance does not rule out the possibility that its training data already contained future stock prices, in which case the model could simply be recalling them; one way to check would be to use stock samples from 2021 onwards.

Testing whether GPT-4 is cheating is, in principle, not that complicated: take a stock's historical records, relabel them under a new ticker, and feed them in to see how the model does.
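A minimal sketch of the kind of blinding check Tian describes, assuming a generic price-history table and a hypothetical `ask_llm` helper (neither comes from the paper):

```python
import pandas as pd

def build_blinded_prompt(prices: pd.DataFrame, fake_ticker: str = "XYZ") -> str:
    """Relabel a real stock's history so the model cannot recall its known future path.

    `prices` is assumed to have columns ['date', 'close']; the real ticker and
    calendar dates are replaced with a fake ticker and relative period labels.
    """
    blinded = prices.copy()
    n = len(blinded)
    blinded["period"] = ["t" if i == n - 1 else f"t-{n - 1 - i}" for i in range(n)]
    history = "\n".join(f"{row.period}: close={row.close:.2f}" for row in blinded.itertuples())
    return (
        f"Stock {fake_ticker} has the following (relabeled) price history:\n"
        f"{history}\n"
        "Will the price rise or fall over the next period? Answer 'rise' or 'fall'."
    )

# prompt = build_blinded_prompt(real_history_df)   # real_history_df: hypothetical price data
# answer = ask_llm(prompt)                         # compare answers with and without blinding
```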

Research Content

How can the role of an LLM in forward-looking decisions be measured? In this study, the researchers use financial statement analysis (FSA) as the yardstick.

The reason for conducting FSA is primarily to understand the financial health of a company and determine if its performance is sustainable.

FSA is not simple; it is a quantitative task that requires extensive trend and ratio analysis, as well as critical thinking, reasoning, and complex judgment. Typically, this task is carried out by financial analysts and investment professionals.

In the study, the researchers provide two standard financial statements, the balance sheet and the income statement, to GPT-4 Turbo. Its task is to determine whether the company's future earnings will increase or decrease.

A key design choice of the study is that the LLM is never given any textual information; it can only work from the bare financial statements.

Going in, the researchers expected LLM performance to be worse than that of professional human analysts.

The reason for this is that analyzing financial statements is a highly complex task involving many ambiguous elements, requiring a great deal of common sense, intuition, and the flexibility of human thinking.

Furthermore, LLMs are thought to lack sufficient reasoning and judgment capabilities, as well as an understanding of industries and the macroeconomy.

Additionally, the researchers expected LLMs to perform worse than specialized machine learning models, such as artificial neural networks (ANNs) trained for earnings forecasting.

This is because an ANN can learn deep interactions among variables that carry important signals, interactions a general-purpose model can capture only if it is able to reason intuitively or form hypotheses from incomplete information and unseen scenarios.

However, the experimental results surprised them: the LLM actually outperformed most human analysts and the specialized neural networks.

Experimental Procedure

To evaluate the LLM's performance, the researchers followed two steps.

First, they anonymize and standardize the companies' financial statements so that the LLM cannot infer a company's identity from memory.

Specifically, they removed company names from the balance sheets and income statements and replaced the calendar years with relative labels (such as t and t-1).

Furthermore, they standardized the format of the balance sheets and income statements according to Compustat's balancing model.

This ensures that the statement format is consistent across all firm-years, so the LLM does not know which company or time period it is analyzing.
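A minimal sketch of this anonymization step, with illustrative field names rather than the paper's actual Compustat pipeline:

```python
def anonymize_statement(statement: dict, fiscal_years: list[int]) -> dict:
    """Drop identifying fields and relabel calendar years as t, t-1, t-2, ...

    `statement` maps line items to {year: value}; the field names are illustrative.
    """
    relabel = {year: f"t-{i}" if i else "t"
               for i, year in enumerate(sorted(fiscal_years, reverse=True))}
    anonymized = {}
    for item, by_year in statement.items():
        if item in {"company_name", "ticker", "cik"}:   # identifying fields are removed
            continue
        anonymized[item] = {relabel[y]: v for y, v in by_year.items() if y in relabel}
    return anonymized

# Example: revenue for 2020/2021 becomes revenue for t-1 / t
raw = {"company_name": "Acme Corp",
       "revenue": {2021: 1250.0, 2020: 1100.0},
       "net_income": {2021: 140.0, 2020: 95.0}}
print(anonymize_statement(raw, [2020, 2021]))
```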

In the second step, the researchers designed prompts to guide the LLM in analyzing the financial statements and determining the direction of future earnings.

In addition to a simple prompt, they developed a chain-of-thought (CoT) prompt, which essentially "teaches" the LLM to analyze the statements the way a human financial analyst would.

Specifically, financial analysts identify significant trends in financial statements, calculate key financial ratios (such as operational efficiency, liquidity, and leverage ratios), synthesize this information, and form expectations for future earnings.

The CoT prompt created by the researchers walks the model through this thinking process as a series of explicit steps.
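A paraphrased sketch of what such a chain-of-thought prompt could look like (the wording is illustrative, not the paper's exact instruction):

```python
COT_PROMPT = """You are a financial analyst. You are given a standardized, anonymized
balance sheet and income statement for years t and t-1.

Step 1: Identify notable trends in the statements (revenue, costs, assets, liabilities).
Step 2: Compute key ratios such as operating margin, current ratio, and leverage.
Step 3: Interpret what these trends and ratios imply about the firm's operations.
Step 4: Based on the above, will earnings increase or decrease in year t+1?
Answer 'increase' or 'decrease', then give a one-paragraph rationale.
"""

# prompt = COT_PROMPT + "\n" + formatted_statements   # statements anonymized as above
# answer = ask_llm(prompt)                            # hypothetical LLM call
```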

For the dataset, the researchers used the Compustat database to test the model's performance, cross-referencing the IBES database where necessary.

The sample covers annual data from 15,401 companies between 1968 and 2021, totaling 150,678 firm-year observations.

The analyst sample spans 1983 to 2021, covering 3,152 companies and 39,533 observations.

Why is the LLM so successful?

To explain this outcome, the researchers proposed two hypotheses.

The first hypothesis suggests that LLM's performance is entirely driven by near-perfect memorization.

Under this hypothesis, the LLM infers the company's identity and the year from the numbers, then matches that with the sentiment about the company it absorbed from news during training.

The researchers tried to rule this possibility out, and they also replicated the results on entirely new data from outside GPT-4's training window.

The second hypothesis is that the LLM can predict the direction of future earnings because it generates genuinely useful insights.

For example, the model often computes the kinds of ratios that financial analysts calculate and then generates narratives analyzing those ratios in light of the surrounding context.

To test this, the researchers collected the narratives the model generated for each firm-year, encoded them into 768-dimensional vectors using BERT, and fed those vectors into an ANN trained to predict the direction of future earnings.
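A minimal sketch of this narrative-embedding pipeline, assuming the Hugging Face transformers and scikit-learn libraries; the model choice, classifier size, and toy data are illustrative, not the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neural_network import MLPClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(narrative: str) -> list[float]:
    """Encode one GPT-generated narrative into a 768-dimensional vector (CLS token)."""
    inputs = tokenizer(narrative, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0).tolist()

# Toy stand-ins for GPT's per-firm-year narratives and next-year earnings labels
narratives = ["Operating margin improved while leverage declined, suggesting stronger earnings.",
              "Revenue growth stalled and liquidity tightened, pointing to weaker earnings."]
labels = [1, 0]   # 1 = next-year earnings rose, 0 = fell

X = [embed(text) for text in narratives]
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, labels)
print(clf.predict(X))   # in practice, train on thousands of firm-years and evaluate out of sample
```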

As a result, the ANN trained on GPT's narrative insights achieved an accuracy of 59%, nearly as high as GPT's own prediction accuracy of 60%.

This directly demonstrates that the narrative insights generated by the model are informative about future performance.

Furthermore, there is a 94% correlation between GPT's predictions and the ANN's predictions based on GPT's narratives, indicating that the information encoded in these narratives is the basis of GPT's predictions. Narratives related to ratio analysis matter most for explaining the direction of future profits.

In conclusion, the superior performance of the model stems from the narratives generated based on CoT reasoning.

Experimental Results

The experimental evaluation results from the latest research can be summarized into three key highlights.

GPT outperforms human financial analysts

To assess the accuracy of analysts' predictions, researchers calculated the "consensus forecast" (the median of analysts' predictions within one month after financial statement release) and used it as the expectation for next year's earnings.

This ensured comparability between analyst forecasts and model predictions.

Additionally, the researchers also used the "consensus forecast" for the next three and six months as alternative benchmark expectations.

These benchmarks were unfavorable to the LLM because they incorporate information that arrives over the course of the year. However, since analysts can be slow to fold new information into their forecasts, the researchers reported these benchmarks for comparison.
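A minimal sketch of how such a consensus could be computed from individual analyst forecasts, with hypothetical DataFrame columns rather than the paper's actual IBES processing:

```python
import pandas as pd

def consensus_forecast(forecasts: pd.DataFrame, release_date, window_days: int = 30) -> float:
    """Median of analyst EPS forecasts issued within `window_days` after the filing.

    `forecasts` is assumed to have columns ['forecast_date', 'eps_forecast'].
    """
    start = pd.Timestamp(release_date)
    end = start + pd.Timedelta(days=window_days)
    in_window = forecasts[(forecasts["forecast_date"] >= start) &
                          (forecasts["forecast_date"] <= end)]
    return in_window["eps_forecast"].median()

# consensus_1m = consensus_forecast(analyst_df, "2021-02-15", window_days=30)   # analyst_df: hypothetical
# consensus_3m = consensus_forecast(analyst_df, "2021-02-15", window_days=90)
```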

Researchers first analyzed GPT's performance in predicting future "profit directions" and compared it to that of securities analysts.

They noted that predicting changes in earnings per share (EPS) is a highly complex task, as the EPS time series approximates a random walk and contains a large unpredictable component.

A random-walk benchmark here means predicting the direction of future earnings solely from the change in current earnings relative to past earnings.
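An illustrative sketch of such a naive extrapolation benchmark (not the paper's exact implementation): predict that next year's earnings move in the same direction as this year's change.

```python
def naive_direction_forecast(eps_t: float, eps_t_minus_1: float) -> str:
    """Random-walk-style benchmark: extrapolate this year's earnings change forward."""
    return "increase" if eps_t >= eps_t_minus_1 else "decrease"

# If EPS rose from 2.10 to 2.35 this year, the naive call for next year is 'increase'.
print(naive_direction_forecast(2.35, 2.10))  # -> 'increase'
```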

Comparing the predictive performance of GPT and human financial analysts: in the first month, analysts' forecasts called the direction of future profits correctly 53% of the time, beating the simple model (extrapolating the previous year's change) at 49%.

Analysts' accuracy at three and six months rose to 56% and 57%, respectively, which is reasonable since those forecasts incorporate more timely information.

With the "simple" (non-CoT) prompt, GPT's accuracy was 52%, below the human analyst benchmark, as the researchers expected.

However, when simulating human reasoning using CoT, they found that GPT achieved an accuracy of 60%, significantly outperforming the analysts.

Further verification using the F1 score, an alternative metric that combines precision and recall, yielded similar conclusions.
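For reference, the F1 score is the harmonic mean of precision and recall; a minimal computation with made-up counts looks like this:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts for illustration: 60 correct 'increase' calls, 25 false alarms, 15 misses
print(round(f1_score(tp=60, fp=25, fn=15), 3))  # -> 0.75
```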

This indicates that when analyzing financial statements to determine where a company is headed, GPT clearly outperformed the median financial analyst.

To be fair, human analysts may draw on soft information or broader context that models cannot access, and thereby add value.

Indeed, researchers have also found that analysts' forecasts contain useful insights about future performance that GPT did not capture.

Furthermore, the study shows that GPT's insights are most valuable precisely when humans struggle to make predictions.

Similarly, where human forecasts are prone to bias or inefficiency (that is, they do not incorporate information rationally), GPT's predictions are more useful for forecasting future returns.

GPT is on par with specialized neural networks

Researchers also compared the predictive accuracy of GPT with various ML models.

They employed three predictive models.

The first model, "Stepwise Logistic," following the Ou and Penman framework, used 59 financial indicators as predictive variables.

The second model was an ANN that used the same 59 predictive variables but could also capture their nonlinearities and interactions.

Third, to ensure comparability with GPT, the researchers trained an ANN on the same information set (income statement and balance sheet) given to GPT.

Importantly, the researchers trained these models on historical Compustat observations, retraining every five years, so that all predictions were out of sample.
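A minimal sketch of these two kinds of benchmark model in scikit-learn; the pipeline, hyperparameters, and variable names are illustrative, and plain logistic regression stands in for the stepwise variable selection:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: matrix of financial predictors per firm-year (e.g., 59 Ou-Penman-style ratios)
# y: 1 if next year's earnings increased, 0 otherwise -- both hypothetical here
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500))

# Rolling out-of-sample evaluation: fit on past years, predict the next year
# logit.fit(X_train, y_train); print(logit.score(X_test, y_test))
# ann.fit(X_train, y_train);   print(ann.score(X_test, y_test))
```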

Using the entire Compustat sample, the study found that the accuracy (F1 score) of "Stepwise Logistic" was 52.94% (57.23%), comparable to human analysts' performance and consistent with previous research.

In contrast, the ANN trained on the same data achieved a higher accuracy of 60.45% (F1 score 61.62%), placing it within the realm of state-of-the-art earnings prediction models.

When predicting with GPT (with CoT), the model's accuracy across the sample was 60.31%, very close to the accuracy of ANN.

In fact, GPT's F1 score was notably higher than the ANN's (63.45% vs. 61.62%).

In addition, when the researchers trained an ANN using only the data from the two financial statements given to GPT, its predictive ability was slightly lower, with an accuracy (F1 score) of 59.02% (60.66%).

Overall, these results suggest that GPT's accuracy is comparable to that of state-of-the-art specialized machine learning models, and in some cases slightly higher.

ANN and GPT predictions complement each other

Researchers have further observed that the predictions of Artificial Neural Networks (ANN) and Generative Pre-trained Transformers (GPT) are complementary, as they both contain valuable incremental information.

There are indications that when ANN performs poorly, GPT often excels.

In particular, ANN predicts profits based on training examples seen in past data. Given that many examples are highly complex and multidimensional, its learning capacity may be limited.

In contrast, GPT makes relatively fewer errors when predicting the profits of small or loss-making companies, possibly because it benefits from human-like reasoning and broad knowledge.

In addition, the researchers ran several further experiments, partitioning the sample by GPT's confidence in its answers and trying different LLM families.

Predictions made with higher confidence are, on average, more accurate than those made with lower confidence.

Furthermore, the study showed that the result generalizes to other large models: in particular, Google's recently released Gemini Pro achieves accuracy on par with GPT-4.

Source of Prediction: Growth and Operating Margin

A figure in the paper illustrates the frequency distribution of bigrams and unigrams in GPT's responses.

Here, bigrams are pairs of consecutive words in the text, while unigrams are individual words.

The left panel shows the top ten most common bigrams in GPT responses related to financial ratio analysis.

The right panel shows the top ten most frequent unigrams in GPT responses for the binary earnings prediction.

The analysis identifies the terms and phrases GPT uses most often across different financial-analysis contexts.
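A minimal sketch of this kind of n-gram frequency count over model responses, using toy inputs rather than the paper's actual data:

```python
from collections import Counter
import re

def ngram_counts(texts: list[str], n: int) -> Counter:
    """Count n-grams (n=1 for unigrams, n=2 for bigrams) across a list of responses."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

responses = ["Operating margin improved while revenue growth slowed.",
             "Gross margin and operating margin both declined."]   # toy stand-ins for GPT responses
print(ngram_counts(responses, 2).most_common(10))   # top bigrams
print(ngram_counts(responses, 1).most_common(10))   # top unigrams
```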

Interestingly, the terms "Operating Margin" and "Growth" demonstrate the highest predictive power.

It appears that GPT has internalized the "Rule of 40," the heuristic that a healthy software company's revenue growth rate plus its operating margin should reach at least 40%.
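An illustrative check of that heuristic as code (not from the paper):

```python
def rule_of_40(revenue_growth_pct: float, operating_margin_pct: float) -> bool:
    """Heuristic check: growth rate plus operating margin should reach at least 40%."""
    return revenue_growth_pct + operating_margin_pct >= 40.0

# A firm growing revenue 25% with an 18% operating margin clears the bar (25 + 18 = 43)
print(rule_of_40(25.0, 18.0))   # -> True
print(rule_of_40(10.0, 12.0))   # -> False (10 + 12 = 22)
```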

In conclusion, all indications suggest that as AI advances, the role of financial analysts will evolve.

It is undeniable that human expertise and judgment are unlikely to be completely replaced in the short term.

However, powerful AI tools like GPT-4 could significantly enhance and streamline the work of analysts, potentially reshaping the field of financial statement analysis in the coming years.

References:

https://www.newsletter.datadrivenvc.io/p/financial-statement-analysis-with

https://x.com/tydsh/status/1794137012532081112

https://x.com/emollick/status/1794056462349861273

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311