
As large language models (LLMs) are increasingly integrated into financial workflows, including document analysis, research, compliance, and agent-based systems, the question of how to evaluate them is becoming a central challenge for both research and real-world applications.
In a new study, "Evaluating AI in Finance: A Comprehensive Review and Taxonomy of LLM Benchmarks", Louis Bertucci and Murad Nuriyev (Institut Louis Bachelier) systematically review the landscape of benchmarks for AI applied to finance.
The authors analyze more than fifty public benchmarks used to evaluate language models in financial contexts. To structure this rapidly expanding literature, they introduce two complementary frameworks.
This structuring helps clarify the objectives, implicit assumptions, and limitations of current evaluation approaches in the field.
The aggregated results reveal a clear contrast across task types. State-of-the-art models perform strongly on simpler textual tasks such as sentiment analysis and entity extraction.
However, they remain significantly below expert performance on tasks requiring contextual numerical reasoning or complex agent-like financial workflows. The study also highlights that information access and quality remain major bottlenecks across many use cases.
Beyond model performance, the authors identify seven recurring structural limitations in the current financial benchmarking literature.
The study highlights a significant gap between existing academic benchmarks and the requirements of real-world financial applications. It therefore calls for more robust evaluation frameworks that better capture the diversity of actual financial workflows.
This work, led by Louis Bertucci and Murad Nuriyev at the Institut Louis Bachelier, contributes to a broader research effort on evaluating and ensuring the reliability of AI models in finance, at the intersection of academic research and industrial applications.
Read the paper: Evaluating AI in Finance: A Comprehensive Review and Taxonomy of LLM Benchmarks (SSRN): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6815399