AI in finance is advancing rapidly — but are our benchmarks keeping up?

A study by L. Bertucci and M. Nuriyev provides a systematic review of the landscape of benchmarks dedicated to AI applied to finance.

Jun 15, 2026 15:24

As large language models (LLMs) are increasingly integrated into financial workflows — including document analysis, research, compliance, and agent-based systems — the question of how to evaluate them is becoming a central challenge for both research and real-world applications.

In a new study, Evaluating AI in Finance: A Comprehensive Review and Taxonomy of LLM Benchmarks, Louis Bertucci and Murad Nuriyev (Institut Louis Bachelier) provide a systematic review of the landscape of benchmarks dedicated to AI applied to finance.

Mapping a landscape of more than 50 benchmarks

The authors analyze more than fifty public benchmarks used to evaluate language models in financial contexts. To structure this rapidly expanding literature, they introduce two complementary frameworks:

a taxonomy of eleven financial task categories;
a two-dimensional framework distinguishing what benchmarks measure and how they measure it.

This structuring helps clarify the objectives, implicit assumptions, and limitations of current evaluation approaches in the field.
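As a rough illustration of how a catalogue organised along these lines could be represented, the sketch below tags each benchmark with a task category, a "what is measured" label, and a "how it is measured" label, then groups the catalogue along the two axes. Every name and label in it is a hypothetical placeholder: this article does not list the paper's eleven categories or the values of its two dimensions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str           # hypothetical benchmark name
    task_category: str  # taxonomy category (placeholder label)
    measures: str       # "what" axis: capability assessed (placeholder label)
    method: str         # "how" axis: evaluation protocol (placeholder label)


def group_by_axes(benchmarks):
    """Map each (measures, method) cell to the sorted names of its benchmarks."""
    grid = defaultdict(list)
    for b in benchmarks:
        grid[(b.measures, b.method)].append(b.name)
    return {cell: sorted(names) for cell, names in grid.items()}


# Placeholder entries, not taken from the paper.
catalogue = [
    Benchmark("bench_a", "sentiment analysis", "text understanding", "exact match"),
    Benchmark("bench_b", "entity extraction", "text understanding", "exact match"),
    Benchmark("bench_c", "numerical reasoning", "quantitative reasoning", "expert grading"),
]

grid = group_by_axes(catalogue)
```

Grouping by both axes at once makes empty or crowded cells visible, which is the kind of coverage question a two-dimensional framework is meant to surface.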

Strong performance on text, but limitations in financial reasoning

The aggregated results reveal a clear contrast across task types. State-of-the-art models perform strongly on simpler textual tasks such as sentiment analysis and entity extraction.

However, they remain significantly below expert performance on tasks requiring contextual numerical reasoning or complex agent-like financial workflows. The study also highlights that information access and quality remain major bottlenecks across many use cases.

Structural limitations in current benchmarks

Beyond model performance, Louis Bertucci and Murad Nuriyev identify seven recurring structural limitations in the current financial benchmarking literature. Key issues include:

a widening gap between the growing number of benchmarks and how well they reflect real-world financial use cases;
a strong concentration of datasets on US-listed companies;
frequent underestimation of inference costs, despite their importance in production settings.
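
To make the last point concrete, a benchmark report could state what a run cost alongside its accuracy. The sketch below shows one hypothetical way to do that. The function name, token counts, and prices are illustrative assumptions in arbitrary currency units, not figures or methods from the study.

```python
def cost_per_correct_answer(n_correct, input_tokens, output_tokens,
                            price_in_per_1k, price_out_per_1k):
    """Return total inference cost divided by the number of correct answers."""
    if n_correct <= 0:
        raise ValueError("n_correct must be positive")
    total_cost = ((input_tokens / 1000) * price_in_per_1k
                  + (output_tokens / 1000) * price_out_per_1k)
    return total_cost / n_correct


# Illustrative numbers only: 80 correct answers, 2,000,000 input tokens and
# 200,000 output tokens, at made-up prices per 1,000 tokens.
example = cost_per_correct_answer(80, 2_000_000, 200_000, 0.5, 1.5)
```

Reporting a figure like this next to accuracy lets two models with similar scores be compared on what they would cost to run in production.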

A key challenge for deploying AI in finance

The study highlights a significant gap between existing academic benchmarks and the requirements of real-world financial applications. It therefore calls for more robust evaluation frameworks that better capture the diversity of actual financial workflows.

This work, led by Louis Bertucci and Murad Nuriyev at the Institut Louis Bachelier, contributes to a broader research effort on evaluating and ensuring the reliability of AI models in finance, at the intersection of academic research and industrial applications.

Read the paper: Evaluating AI in Finance: A Comprehensive Review and Taxonomy of LLM Benchmarks (SSRN): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6815399
