TL;DR
Italy’s Minerva LLM, built from scratch with extensive Italian data, underperformed on academic benchmarks despite impressive technical results. This challenges assumptions about scale and native-language focus in European sovereign AI strategies.
Italy’s Minerva is a European sovereign language model family trained from scratch, with its largest model trained on 2.5 trillion tokens of approximately 50% Italian content. Yet Minerva-3B scored only 4.9% on the INVALSI Italian school-exam benchmark, revealing significant limitations despite substantial investment and infrastructure.
The Minerva project, led by Sapienza University of Rome with support from Italy’s national research and supercomputing institutions, trained a 7-billion-parameter model from scratch using a dataset of 2.5 trillion tokens, half of which was Italian. The project aimed to demonstrate that a country-specific, from-scratch approach could produce high-quality language models tailored to national needs.
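The figures in this paragraph can be put in proportion with a quick back-of-the-envelope calculation. The sketch below is illustrative only: the variable names are hypothetical, and the inputs are the rounded numbers quoted in this article (2.5 trillion tokens, roughly half Italian, 7 billion parameters), so the outputs are approximations rather than official project statistics.

```python
# Rounded figures as quoted in the article (assumptions, not official statistics).
TOTAL_TOKENS = 2.5e12      # 2.5 trillion training tokens
ITALIAN_SHARE = 0.5        # "approximately 50%" Italian content
PARAMETERS = 7e9           # 7-billion-parameter model

# Approximate number of Italian tokens in the training mix.
italian_tokens = TOTAL_TOKENS * ITALIAN_SHARE

# Training tokens seen per model parameter.
tokens_per_parameter = TOTAL_TOKENS / PARAMETERS

print(f"Italian tokens: {italian_tokens:.3e}")
print(f"Tokens per parameter: {tokens_per_parameter:.0f}")
```

Under these rounded inputs, that is about 1.25 trillion Italian tokens and roughly 357 training tokens per parameter.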
However, when evaluated on the INVALSI Italian academic content test, Minerva-3B scored only 4.9%, a near-chance result that contrasts sharply with its technical metrics and its performance on other benchmarks. Researchers attribute this to scale: despite the large dataset, the amount of data and the number of parameters may still be insufficient for complex language-understanding tasks. The evaluation underscores that simply increasing data and model size does not guarantee proficiency in nuanced, academic language.
This outcome indicates that the European sovereign-LLM movement, which includes projects like Italy’s Minerva and Portugal’s AMÁLIA, must reconsider the scale of native-language investment needed to achieve meaningful country-specific language understanding. Italy’s larger investment resulted in a technically advanced model but did not translate into high academic performance, raising questions about the optimal approach for national AI strategies.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code as truly open from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding the press coverage has been quiet about, and it has implications that extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose training from scratch on a substantial native-language foundation. Portugal chose continued pre-training of a multilingual model. The structural comparison surfaces what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.

4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the Italian school-exam benchmark. The result was unambiguous. This is not a critique of Minerva — it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.

350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table: training corpora by Minerva parameter tier. Recoverable entries: Italian + English; 100B English; ~50% English; + 200B code.]

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.

Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-LLM Strategies
The findings from Minerva suggest that European countries may need to significantly scale their native-language data and model parameters beyond current levels to develop truly effective, country-specific language models. This challenges the narrative that smaller, focused models can suffice for national AI needs and underscores the importance of investment in both data and infrastructure. The results also highlight that technical excellence alone does not guarantee functional proficiency in complex language tasks, which has implications for national AI policies and future development priorities.
Scaling and Strategy in European Sovereign AI Projects
Italy’s Minerva project is part of a broader European effort to develop sovereign language models, aiming to reduce dependence on global tech giants and tailor AI systems to national languages and contexts. Unlike Portugal’s AMÁLIA, which layered specialization onto a multilingual foundation, Minerva trained from scratch on a large dataset, reflecting a different strategic choice. Despite significant institutional backing, including Italy’s PNRR funding, the project has faced challenges in translating technical scale into linguistic proficiency, as evidenced by its poor performance on academic benchmarks.
This development occurs amid ongoing debates about the optimal architecture, data scale, and investment levels necessary for effective national AI systems, with Minerva serving as a critical case study in the limitations of current approaches.
“Minerva’s performance on the INVALSI test exposes a fundamental scaling limitation in sovereign-LLM development.”
— Thorsten Meyer, AI researcher
Unresolved Questions About Model Scaling and Performance
It remains unclear whether increasing the size of datasets and models beyond current levels will lead to improved performance on complex, academic language tasks. The specific threshold of data and parameters needed for high proficiency in national languages has not yet been established, and ongoing research is required to determine optimal strategies. Additionally, the potential for alternative architectures or training methods to overcome these limitations is still under investigation.
Future Research and Policy Directions for Sovereign LLMs
Researchers and policymakers will likely focus on exploring larger-scale models, improved training techniques, and more diverse, high-quality datasets tailored to national languages. The Minerva team continues to iterate on methodology, with upcoming case studies planned for 2025 that may shed light on how to overcome current scaling limitations. The broader European community will monitor these developments to refine strategies for effective, country-specific AI systems.
Key Questions
Why did Minerva perform poorly on the Italian academic test?
Despite extensive data and infrastructure, the model’s size and data quality may still be insufficient for complex language understanding, especially in academic contexts. The evaluation suggests that simply increasing scale does not automatically lead to proficiency in nuanced tasks.
Does this mean training from scratch is ineffective?
Not necessarily. The results highlight that scale alone isn’t enough; training quality, data diversity, and model architecture are also critical. Further research is needed to optimize these factors.
What are the implications for European AI policy?
The findings suggest that European countries may need to commit to larger investments in native-language data and model size to achieve meaningful AI capabilities, challenging current assumptions about cost-effectiveness of smaller models.
Will the results affect future European sovereign AI projects?
Yes. The results will likely influence strategic decisions, emphasizing the importance of scale and data quality, and guiding future investments and research directions.
Source: ThorstenMeyerAI.com