Penrose Labs research suggests that when asked to balance a set of accounts, this generation of AI models slowly becomes less accurate over time.
At a glance
- AI demos sparked hype about automating accounting, attracting huge venture capital funding.
- Large language models are good with words but struggle with numerical accuracy.
- Startups promising full AI accounting automation have largely failed to deliver.
- The future is customised AI for specific tasks, not general-purpose chatbots.
In a slick demo that rippled across accounting circles in March 2023, OpenAI co-founder Greg Brockman showed ChatGPT smoothly calculating a tax deduction.
The model breezed through 16 pages of US tax code, explained its reasoning, and correctly calculated the deduction. For many accountants watching, it felt like a glimpse of the future – with generative AI as a harbinger of professional doom.
A growing number of startups are betting on that future arriving sooner than expected. US-based Digits, which raised over US$97 million (A$147 million) in venture capital, says it’s using generative AI to “automate the entire accounting process”. In fact, it seems money is pouring into any brand-new company promising “AI-native accounting”. Witness Zeni (US$34m), Puzzle (US$50m) and Rillet (US$100m).
The shared pitch? Hook up your bank feed and payroll, and let the model handle categorisation, reconciliation and reporting, all in real time.
It is a seductive vision. For small businesses, it hints at freedom from spreadsheets and bank recs. For software vendors, it promises margin-rich automation at scale. For the accounting profession, it suggests liberation from the tedium of transactions – and perhaps liberation from a career in accounting altogether.
In theory, it’s not far-fetched. Large language models are already parsing contracts, summarising board minutes, and generating working papers. If they can do all that with words, why not with numbers?
The reality check for AI accountants
The most popular generative AI platforms are built on large language models (LLMs), which write well but count badly. Early ChatGPT struggled with basic sums. Newer models are better at arithmetic, especially when paired with a calculator, but accuracy falls as the data gets longer and the steps multiply.
Ask a model to turn several pages of bank statements into a clean CSV file (a function at the top of many accountants’ AI wishlists) and you see the cracks: dropped rows, transposed digits, missing minus signs. Pairing an LLM with proper OCR and a strict schema helps, yet the error rate is still too high to run unaided. And accounting doesn’t accept “roughly right”.
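To make that concrete, here is a minimal sketch of what a “strict schema” check might look like, assuming the extraction itself is done by some hypothetical OCR/LLM step (represented here only by the rows it hands back). The running-balance check is the part that catches dropped rows, transposed digits and missing minus signs before they reach the ledger.

```python
# Minimal sketch, assuming the OCR/LLM extraction step is hypothetical and
# has already produced candidate rows; only the validation logic is shown.
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class BankRow:
    posted: date
    description: str
    amount: Decimal   # signed: negative for money out
    balance: Decimal  # running balance as printed on the statement


def validate_rows(rows: list[BankRow], opening_balance: Decimal) -> list[str]:
    """Return a list of problems rather than silently accepting bad data."""
    problems = []
    running = opening_balance
    for i, row in enumerate(rows, start=1):
        running += row.amount
        # A dropped row, transposed digit or missing minus sign breaks the
        # running-balance reconciliation straight away.
        if running != row.balance:
            problems.append(
                f"row {i} ({row.description}): expected {running}, "
                f"statement shows {row.balance}"
            )
            running = row.balance  # resync so later rows are still checked
    return problems
```

In practice, a team might route any page that fails this check back to a human rather than letting the model’s output flow into the books unaided.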
Accordingly, there are plenty of AI sceptics who doubt how well LLMs can work with numbers.
“I think to date [AI] has zero value for quantitative analysis,” veteran technology analyst Benedict Evans said in an interview in September 2025.
“Do the numbers need to be right or roughly right? Because what all of these things do is give you something that is roughly right, and ‘roughly’ is on a spectrum. If it’s wrong once in a billion years, it doesn’t matter. But the problem today is it’s not wrong once in a billion years. It’s wrong a dozen times in a page.”
AI accountants fail an empirical test
Given that LLM stands for “large language model”, it’s not surprising that they perform better with words than numbers. An LLM is essentially a large probability machine trained to predict the next most likely word in a sequence, based on context clues from its vast training data.
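A toy illustration of that mechanism (the scores below are invented, not drawn from any real model) shows why fluency and numerical correctness are different things: the model emits whichever continuation scores highest, not whichever figure is arithmetically right.

```python
# Toy illustration only: invented scores, not a real model.
import math


def next_token(scores: dict[str, float]) -> str:
    # Softmax turns raw scores into probabilities...
    total = sum(math.exp(s) for s in scores.values())
    probs = {token: math.exp(s) / total for token, s in scores.items()}
    # ...and the likeliest continuation wins, right or wrong.
    return max(probs, key=probs.get)


# After "The invoice total is ...", the most plausible-looking figure is
# chosen, whether or not it is the arithmetically correct one.
print(next_token({"$1,200": 2.1, "$1,300": 2.0, "payable": 1.4}))  # $1,200
```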
This may be why empirical studies of popular LLMs suggest that they are not suited to performing accounting tasks accurately. Penrose Labs, a team of AI researchers based in New York, created a benchmark, AccountingBench, that measured the most popular models’ ability to close the books for a real business. (See the graph at the top of this page.)
The benchmark was built on one year’s financial data from a real SaaS company with millions of dollars in revenue. AccountingBench compared the accuracy of the models, with a human CPA providing a baseline.
“The strongest models make forward progress and perform reasonably well after the first month’s close, with balances typically within 1 percent of those calculated by human CPAs,” the researchers found.
“However, as time progresses, all models accumulate compounding errors and exhibit erratic behaviour, causing significant deviations – overall balances diverge by over 15 percent (approximately half a million dollars).”
“The problem today is [AI]’s not wrong once in a billion years. It’s wrong a dozen times in a page.”
Benedict Evans
LLMs’ accounting performance slowly deteriorates
The researchers found that general-purpose models sometimes miscategorised transactions. At other times they recorded transactions incorrectly before fully understanding the source data, then struggled to correct those mistakes afterwards.
Once errors crept into the dataset, the models became confused and gave up on trying to correct or adjust the data. This introduced more errors, and the inaccuracy quickly compounded.
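A toy calculation, using invented error rates rather than the AccountingBench figures, shows how quickly this kind of drift adds up when each month’s uncorrected mistakes carry forward into the next close.

```python
# Toy sketch with invented numbers; not the AccountingBench methodology.
def simulate_drift(months: int, monthly_error_rate: float) -> float:
    """Fraction of the balance that has drifted after N closes when each
    month's uncorrected errors carry forward into the next."""
    drift = 0.0
    for _ in range(months):
        # this month's new mistakes land on top of everything already wrong
        drift = drift + monthly_error_rate * (1 + drift)
    return drift


# Around 1 percent of new errors per close compounds to roughly 13 percent
# of drift over a twelve-month year.
print(f"{simulate_drift(12, 0.01):.1%}")  # 12.7%
```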
AI accountants fail a market test
Accounting technology startups have tried to blend software and services to automate the accounting process, with little success. One of the most well-known examples in the US, Bench, raised US$100 million and acquired nearly 40,000 customers – before suddenly shuttering its platform in December 2024.
Bench had promised to back its bookkeepers with software engineers who would create automations for every client request, eventually building a system that would reconcile all on its own. After 13 years of trying, however, it concluded that automating accounting was simply too hard.
A similar concept, ScaleFactor, had closed several years earlier, despite also raising US$100 million. Forbes alleged that the company had been hiring cheap Filipino bookkeepers and pretending their work was done by AI.
A bigger, better model?
While AI struggles to meet the hype of automating an entire set of books, it is still making a real difference in specific accounting tasks. These include pulling data from documents, suggesting codes, matching transactions, flagging anomalies, and drafting notes for management packs.
A Stanford–MIT study of firms using AI-enabled bookkeeping tools found that accountants using the software supported more clients and closed their accounts faster. On average they finalised month-end 7.5 days sooner, reallocated 8.5 percent of their time from data entry to client communication and QA, and produced richer ledgers with a 12 percent lift in detail. Crucially, quality didn’t fall. In many cases it improved.
Debating whether generic chatbots can keep a ledger may be the wrong fight. The industry is already moving to customised models with guardrails, checks and domain context. If you can constrain the problem, you can potentially cut errors to a level a finance team can live with.
Digits’ latest white paper makes that case in numbers. It benchmarked 19 frontier models on 17,792 real transactions from more than 100 small businesses and found a ceiling: no general-purpose LLM cleared 70 percent accuracy on transaction classification. Digits’ own system, which included workflows to block categories outside the client’s chart of accounts, scored higher and ran faster.
Two practical tricks from that paper matter for accountants:
- First, source awareness: identical merchants behave differently depending on the funding account, so you need models that treat “Uber” on the corporate card differently from “Uber” booked to cost of goods sold – a blind spot for generic LLMs.
- Second, strict chart-of-accounts (CoA) controls: preventing suggestions outside the client’s chart reduced the number of incorrectly coded transactions. A rough sketch of both ideas follows below.
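The sketch below is not Digits’ actual pipeline: suggest_category() is a hypothetical stand-in for whatever model produces the guess, and the chart of accounts is invented. The point is simply that the suggestion is keyed on the funding account and rejected outright if it falls outside the client’s chart.

```python
# Not Digits' actual pipeline: suggest_category() is a hypothetical stand-in
# for the model, and the chart of accounts below is invented.
ALLOWED_CHART = {"Travel", "Cost of goods sold", "Software", "Meals"}


def suggest_category(merchant: str, funding_account: str) -> str:
    """Source awareness: the same merchant maps differently per account."""
    if merchant == "Uber" and funding_account == "Corporate card":
        return "Travel"
    if merchant == "Uber" and funding_account == "Delivery clearing":
        return "Cost of goods sold"
    return "Uncategorised"


def classify(merchant: str, funding_account: str) -> str:
    guess = suggest_category(merchant, funding_account)
    # Strict CoA control: anything outside the client's chart is never
    # booked automatically; it goes to a human for review instead.
    return guess if guess in ALLOWED_CHART else "REVIEW"


print(classify("Uber", "Corporate card"))     # Travel
print(classify("Uber", "Delivery clearing"))  # Cost of goods sold
print(classify("Uber", "Unknown account"))    # REVIEW
```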
AI learns tool use
There’s also a separate, under-appreciated angle to AI’s potential accounting future. LLMs can use tools. If you force them to write and run code – a calculator, a regex parser, a VAT routine – you can get exact maths and repeatable steps. After all, to an LLM, computer code is just another language.
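Here is a generic sketch of that pattern, not tied to any particular vendor’s tool-calling API: the model chooses the tool and its arguments, while the arithmetic itself runs in deterministic code, with an assumed 10 percent GST/VAT rate used purely for illustration.

```python
# Generic sketch of tool use, not any vendor's API. The model would pick the
# tool and its arguments; the maths itself is exact and repeatable.
from decimal import Decimal, ROUND_HALF_UP


def vat_inclusive_total(net: str, rate: str = "0.10") -> Decimal:
    """Deterministic VAT/GST routine exposed to the model as a tool.
    The 10 percent default rate is an assumption for illustration."""
    gross = Decimal(net) * (Decimal("1") + Decimal(rate))
    return gross.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Rather than asking the model "what is $1,234.56 plus 10 percent GST?",
# the runtime routes the question to the tool and gets the same answer
# every time.
print(vat_inclusive_total("1234.56"))  # 1358.02
```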
The strongest signal for AI accounting’s immediate future comes from enterprise software. Anthropic launched Claude for Financial Services in July 2025. It takes Claude’s enterprise model, which has a much larger context window (in effect, a longer memory) than its consumer versions, and customises it for financial research.
The platform can analyse market feeds and financial data stored in enterprise data platforms such as Databricks and Snowflake. Anthropic claims analysts can also modernise trading systems, develop proprietary models, automate compliance, and run complex analyses.
It has been beta tested by major financial institutions, including NBIM, which manages Norway’s trillion-dollar sovereign wealth fund, and global insurer AIG, with apparently impressive results.
“Claude has fundamentally transformed the way we work at NBIM,” says NBIM CEO Nicolai Tangen. “With Claude, we estimate that we have achieved 20 percent productivity gains, equivalent to 213,000 hours.”
“Our portfolio managers and risk department can now seamlessly query our Snowflake data warehouse and analyse earnings calls with unprecedented efficiency. From automating monitoring of newsflow for 9,000 companies to enabling more efficient voting, Claude has become indispensable.”
It’s too early to say whether Anthropic’s finance platform will meet its ambitions. It is aimed at large institutions; most smaller firms won’t see tools like this for a while. Even so, the early reception challenges the idea that LLMs can’t do accounting.
The jury will stay out until the profession sees more evidence. It may be that the computing power needed for acceptable accuracy remains affordable only to large enterprises.
But given what we have seen so far, and the breakneck pace of improvement, you’d be rash to write off LLMs in accounting.