Fintool outperforms the best LLM

Based on FinanceBench top 100 questions

2.5x

Fintool answers roughly 2.5 times as many questions correctly as ChatGPT-4o + internet search (77% vs. 31%)

No hallucinations

Fintool's search engine dives deep into filings and verifies the accuracy of its answers

Cite exact sources

Fintool cites exact sources from SEC filings and earnings call transcripts

Benchmarking LLMs on Financial Questions

In this paper, we benchmark three state-of-the-art applications and models. The first is GPT-4o, the second is GPT-4o combined with the new OpenAI search capabilities, and the third is Fintool, an AI equity research analyst built on top of SEC filings and earnings call transcripts.


For the benchmark, we use the top 100 questions from FinanceBench, the industry-leading standard for evaluating LLM performance on financial questions. It was developed by AI researchers at Patronus and Stanford together with 15 financial industry domain experts. It is a high-quality dataset of questions and answers based on publicly available financial documents such as SEC 10-K, 10-Q, and 8-K filings, earnings reports, and earnings call transcripts.


Here are some examples of questions: 

  • What is the FY2018 capital expenditure amount (in USD millions) for 3M? 

  • Is 3M a capital-intensive business based on FY2022 data?

  • Does Adobe have an improving Free cash flow conversion as of FY2022?

  • What is Amazon's FY2017 days payable outstanding (DPO)? 

  • What was the key agenda of the AMCOR's 8k filing dated 1st July 2022?

  • How much was the Real change in Sales for AMCOR in FY 2023 vs FY 2022, if we exclude the impact of FX movement, passthrough costs and one-off items?


Getting a correct financial answer from an LLM is hard. LLMs aren’t trained on domain-specific knowledge, which limits their ability to provide accurate financial answers. Additionally, models need up-to-date financial information to offer precise insights. Financial questions often involve numerical reasoning too, adding an extra layer of complexity for text models. Answering these questions requires models to handle both unstructured inputs, like qualitative questions in free-text form, and structured inputs, such as tabular financial metrics. The model also needs to parse long passages of text, which is more challenging than reasoning about short strings from a single source. For these reasons, even an advanced model like GPT-4o, on its own, gets most financial questions wrong.


These factors collectively make providing accurate financial answers difficult for large language models. However, leveraging Retrieval-Augmented Generation (RAG) techniques can significantly enhance their performance. Using a longer context window to incorporate relevant evidence like SEC filings, models can draw on more extensive data, including historical trends, recent financial news, and detailed company reports. This not only improves the precision and relevance of their answers but also ensures coherence and consistency when processing extended passages of text, allowing them to reason more effectively about intricate financial scenarios.
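
To make the retrieval step concrete, here is a minimal sketch of the RAG idea described above. It is not Fintool's or OpenAI's implementation: the passages belong to a fictional company (`ExampleCo`) with placeholder figures, the keyword-overlap scoring stands in for a real search engine, and the helper names (`retrieve`, `build_prompt`) are hypothetical. The sketch only assembles the prompt; sending it to a model is left out.

```python
import re
from collections import Counter

# Hypothetical evidence passages for a fictional company.
# The figures are placeholders, not real financial data.
PASSAGES = [
    {"source": "ExampleCo 10-K FY2018, cash flow statement",
     "text": "Purchases of property, plant and equipment (capital expenditure) "
             "were 1,234 million USD in FY2018."},
    {"source": "ExampleCo 10-K FY2018, risk factors",
     "text": "The company faces competition in all of its markets."},
    {"source": "ExampleCo earnings call Q4 FY2018",
     "text": "Management expects capital expenditure to rise next year."},
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question: str, passage_text: str) -> int:
    """Count the tokens shared by the question and the passage."""
    q = Counter(tokenize(question))
    p = Counter(tokenize(passage_text))
    return sum(min(q[token], p[token]) for token in q)


def retrieve(question: str, passages: list[dict], k: int = 2) -> list[dict]:
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages,
                    key=lambda p: overlap_score(question, p["text"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(question: str, evidence: list[dict]) -> str:
    """Put the retrieved evidence, labelled by source, ahead of the question."""
    context = "\n".join(f"[{e['source']}] {e['text']}" for e in evidence)
    return (
        "Answer using only the evidence below and cite the source.\n\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}"
    )


question = "What is the FY2018 capital expenditure amount for ExampleCo?"
evidence = retrieve(question, PASSAGES, k=2)
prompt = build_prompt(question, evidence)
print(prompt)
```

Because each passage carries its source label into the prompt, the model can be asked to cite it, which is what makes an answer checkable against the filing.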

GPT-4o without RAG

GPT-4o, the most advanced LLM, answered 28% of the questions correctly and 72% incorrectly. GPT-4o has seen so much data during training that, even without additional context, it can answer basic questions such as: “Has CVS Health reported any materially important ongoing legal battles from 2022, 2021 and 2020?”. Unfortunately, due to its May 2023 knowledge cut-off, GPT-4o can’t answer questions about more recent periods, such as: “Which Best Buy product category performed the best (by top line) in the domestic (USA) Market during Q2 of FY2024?”.

Without search capabilities, GPT-4o kept hallucinating, inventing both numbers and sources. It also refused to answer some specific questions because it had neither the data nor the knowledge to answer them.

ChatGPT-4o + Search

ChatGPT-4o with access to the internet got 31% of answers right and 69% wrong. The unfortunate part for ChatGPT-4o is that it doesn’t know when to search the internet or which websites to consult. It managed to get some answers right by looking at company websites or specialized publications like SeekingAlpha or WallStreetMine.


The main drawback is that ChatGPT, while linking to a web page, doesn’t indicate where on that page it found the information. Because right and wrong answers look exactly the same, it’s impossible to check the accuracy of the data without opening the filings on the side as a source of truth.


In this example, ChatGPT-4o searched the right company website but retrieved the wrong number. It looks almost right but the answer is wrong!

Fintool

Fintool got 77% of answers right and only 23% wrong. Almost all of Fintool's mistakes happened when calculating numbers such as “$AMD quick ratio” or “$AMZN days payable outstanding.” We are confident Fintool can get close to 100% once it gains the ability to compute numbers this summer.
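
To show the kind of arithmetic these questions involve, here is a minimal sketch of the two ratios named above. It uses common textbook definitions, which is an assumption: FinanceBench or a given analyst may use a different variant (for example, adjusting cost of goods sold for inventory changes in DPO). The input figures are made up for illustration and do not come from any filing.

```python
def quick_ratio(cash: float, marketable_securities: float,
                receivables: float, current_liabilities: float) -> float:
    """Liquid assets divided by current liabilities (textbook definition)."""
    return (cash + marketable_securities + receivables) / current_liabilities


def days_payable_outstanding(payables_start: float, payables_end: float,
                             cost_of_goods_sold: float, days: int = 365) -> float:
    """Average accounts payable divided by COGS, scaled to days."""
    average_payables = (payables_start + payables_end) / 2
    return days * average_payables / cost_of_goods_sold


# Made-up inputs in USD millions, for illustration only.
example_quick_ratio = quick_ratio(cash=500, marketable_securities=300,
                                  receivables=400, current_liabilities=800)
example_dpo = days_payable_outstanding(payables_start=90, payables_end=110,
                                       cost_of_goods_sold=730)

print(f"Quick ratio: {example_quick_ratio:.2f}")
print(f"DPO: {example_dpo:.1f} days")
```

Each ratio combines several line items from different statements, so a single wrong retrieval or a different choice of definition changes the final number.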



The contrast with ChatGPT-4o is remarkable. Fintool not only retrieves accurate numbers but also explains the calculations in detail and provides precise citations from SEC filings. These citations are crucial for determining the reliability of the information presented.



The cherry on the cake is that Fintool cites the exact paragraphs and numbers in filings, making it the only solution in this benchmark whose answers can be checked directly against the source. The difference in answer quality is also night and day: Fintool provides more context and creates tables that you can export to CSV.



Fintool's responses are significantly superior, offering greater detail, comprehensive sources from SEC filings, and a richer context with nuanced explanations. This level of detail and citation ensures that users have access to reliable and well-substantiated information, making Fintool an invaluable tool for thorough equity analysis.

Summary

The results from the FinanceBench evaluation reveal that Fintool significantly outperforms both ChatGPT-4o with search and GPT-4o without RAG. Fintool answers 77% of the top 100 financial questions correctly, compared with 31% and 28% respectively, demonstrating its superior accuracy and reliability in providing detailed, well-sourced, and context-rich responses. This clearly establishes Fintool as the industry leader in leveraging AI for financial analysis.
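
For readers who want to reproduce the comparison, the gaps between the three systems can be computed from the accuracy figures reported above. This sketch assumes that, with 100 questions, each reported percentage equals the number of correct answers; the dictionary and function names are illustrative.

```python
# Correct answers out of 100 questions, taken from the percentages reported
# in this write-up (assumption: percentage == count because there are 100).
TOTAL_QUESTIONS = 100
CORRECT = {
    "GPT-4o without RAG": 28,
    "ChatGPT-4o + Search": 31,
    "Fintool": 77,
}


def point_gap(system: str, baseline: str) -> float:
    """Difference in accuracy, in percentage points."""
    return 100 * (CORRECT[system] - CORRECT[baseline]) / TOTAL_QUESTIONS


def accuracy_ratio(system: str, baseline: str) -> float:
    """How many times more questions the system gets right than the baseline."""
    return CORRECT[system] / CORRECT[baseline]


for baseline in ("GPT-4o without RAG", "ChatGPT-4o + Search"):
    gap = point_gap("Fintool", baseline)
    ratio = accuracy_ratio("Fintool", baseline)
    print(f"Fintool vs {baseline}: +{gap:.0f} points, {ratio:.2f}x as many correct")
```

Reporting both the percentage-point gap and the ratio avoids ambiguity, since the two measures answer different questions about the same results.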