The World's Biggest AI Models Still Can't Speak Most of the World's Languages
The most advanced AI systems on Earth — the ones writing code, passing bar exams, and diagnosing diseases — share a common weakness: they are overwhelmingly built for English speakers, and their performance degrades sharply for the vast majority of humanity that communicates in other languages.
A series of recent studies, benchmarks, and scaling experiments has quantified this gap with uncomfortable precision. The MMLU-ProX benchmark, which tested 36 leading AI models across 29 languages with 11,829 identical questions per language, found performance drops of up to 24.3% in low-resource languages compared to English [1]. For African languages like Swahili, the gap widens further — up to 38 percentage points, representing a relative performance decline of 43–54% [2]. Google DeepMind's ATLAS study, the largest public multilingual pre-training experiment ever conducted, confirmed that the problem is structural: doubling the number of languages a model supports requires 18% more model capacity and 66% more training data [3].
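The two Swahili figures reported above, a 38-point absolute gap and a 43–54% relative decline, reconcile once you note that the relative figure depends on how well a given model scores in English to begin with. A minimal sketch of the arithmetic (the English baseline scores below are illustrative, not taken from the benchmark):

```python
def relative_decline(english_score: float, gap_points: float) -> float:
    """Relative decline (%) implied by an absolute gap against an English baseline."""
    return gap_points / english_score * 100

# The same fixed 38-point gap implies different relative declines
# depending on the model's English baseline (illustrative values):
for baseline in (88.0, 80.0, 70.0):
    print(f"English {baseline:.0f}% -> relative decline {relative_decline(baseline, 38):.1f}%")
```

A baseline of 88% yields roughly a 43% relative decline and a baseline of 70% roughly 54%, which matches the range the benchmark reports.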
These are not marginal deficiencies. They mean that a farmer in Senegal querying an AI assistant in Wolof, a nurse in Myanmar checking drug interactions in Burmese, or a student in Bangladesh researching history in Bengali will receive substantially worse answers than an English speaker asking the same questions. As AI becomes embedded in healthcare, education, finance, and government services worldwide, this performance gap is becoming a civil rights issue.
The numbers tell the story
The Artificial Analysis multilingual benchmark, which evaluates leading commercial models across 16 languages, illustrates the hierarchy clearly. In English, top models like Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.1 all score between 94 and 95 out of 100. But performance erodes as languages move further from the high-resource center [4].
In Hindi, the best model (Gemini 3.1 Pro) scores 94, but many models drop several points. In Yoruba, one of Africa's most widely spoken languages with over 45 million speakers, performance falls off more dramatically. In Burmese, spoken by roughly 33 million people, the picture is worse still.
The MMLU-ProX data tells a similar story at the research level. DeepSeek-R1, one of the top performers overall with a 75.5% average accuracy, scores 79.5% in English but drops to 66.6% in Bengali — a language spoken by 230 million people [1]. GPT-4.1 scores 79.8% in English but falls to 72.2% in Bengali and 74.1% in Arabic [1].
Why AI speaks English best
The root cause is straightforward: AI models learn from data, and the internet is not linguistically democratic. English accounts for roughly 44–55% of all web content, depending on the measurement method, despite being the first language of only about 380 million people [5][6]. German, French, and Japanese each claim 4–5% of web content. After that, the numbers plummet.
Of the world's approximately 7,000 languages, only about 20 qualify as "high-resource" — meaning there is sufficient digital text to train effective AI systems [7]. According to research compiled by Sebastian Ruder, 88% of the world's languages fall into "resource group 0," meaning they have virtually no usable text data for training [8]. Another 5% have only very limited data available.
This scarcity compounds itself. With less training data comes worse model performance, which means fewer people use AI tools in those languages, which means less user-generated data is produced, which means the next generation of models is trained on an equally impoverished corpus. Researchers call this the "digital language divide," and it is self-reinforcing [7].
The commercial incentive structure makes the problem worse. Major AI companies — OpenAI, Anthropic, Google, Meta — are headquartered in English-speaking countries and derive the vast majority of their revenue from English-speaking markets. Building robust models for Quechua or Tigrinya has no immediate commercial return.
More than bad translations: AI models give different facts in different languages
The performance gap extends beyond simple accuracy scores. Research from Johns Hopkins University, led by PhD student Nikhil Sharma and colleagues Kenton Murray and Ziang Xiao, found that multilingual AI models don't just perform worse in non-English languages — they produce fundamentally different information depending on which language a user employs [9].
When researchers posed identical questions about the India-China border dispute in different languages, the models returned different narratives tailored to the linguistic perspective rather than presenting consistent facts. An English query yielded American-centric framing; a Hindi query produced Indian-centric framing; a Chinese query returned Chinese-centric framing [9].
"If you're asking about Person X in Sanskrit... the model will default to information pulled from English articles," Sharma explained, describing how models fill gaps in low-resource languages by borrowing from English-language sources — often stripping away local context and perspectives in the process [9].
The researchers characterized current multilingual AI systems as "faux polyglots" that reinforce what they call linguistic imperialism. Rather than democratizing access to information, the systems create what the Johns Hopkins team calls "information cocoons" — isolated environments where the language you speak determines not just the quality but the substance of the information you receive [9].
A separate study published in Scientific Reports found a similar pattern in medical contexts. When comparing ChatGPT, Gemini, and Claude on medical examination questions in both English and Polish, performance consistently dropped in the non-English language — raising concerns about deploying AI in clinical settings where the stakes of a wrong answer can be lethal [10].
The AI detector problem: non-native speakers flagged as robots
The language bias extends into AI detection tools as well. Stanford researchers found that AI content detectors — the tools used by universities, publishers, and employers to identify machine-generated text — achieved 100% accuracy when evaluating essays by native English speakers but falsely flagged the majority of TOEFL essays written by non-native English speakers as AI-generated [7].
This means that a student from Vietnam or Nigeria writing in perfectly legitimate but non-native English is more likely to be accused of cheating by the very tools designed to catch AI use. The irony is pointed: the systems built to detect AI are biased in the same way as the AI itself.
The cost of multilinguality
Google DeepMind's ATLAS study, presented at ICLR 2026, offers the most rigorous accounting yet of what it actually costs to build AI that works across languages. The research team ran 774 training experiments spanning models from 10 million to 8 billion parameters, with data from more than 400 languages and evaluations in 48 [3].
Their central finding formalizes what researchers have long suspected: supporting more languages in a single model comes at a per-language performance cost — a phenomenon known as the "curse of multilinguality." To support twice as many languages at the same performance level, you need to increase model size by 1.18x and training data by 1.66x [3].
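Because the ATLAS factors apply per doubling, they compound geometrically as language coverage grows. A rough back-of-the-envelope sketch, assuming the per-doubling factors extrapolate cleanly across multiple doublings (the 25-to-400 language range is illustrative; the paper's exact functional form may differ):

```python
import math

def multilingual_cost(base_languages: int, target_languages: int,
                      size_factor: float = 1.18,
                      data_factor: float = 1.66) -> tuple[float, float]:
    """Model-size and training-data multipliers implied by the ATLAS doubling rule.

    Assumes the per-doubling factors compound geometrically:
    multiplier = factor ** log2(target / base).
    """
    doublings = math.log2(target_languages / base_languages)
    return size_factor ** doublings, data_factor ** doublings

# Going from 25 to 400 supported languages is four doublings:
size_mult, data_mult = multilingual_cost(25, 400)
print(f"model size x{size_mult:.2f}, training data x{data_mult:.2f}")
# -> model size x1.94, training data x7.59
```

The asymmetry is the point: under this extrapolation, data requirements grow far faster than model size, which is why data scarcity, not compute, is the binding constraint for most languages.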
The study also mapped cross-lingual transfer — the degree to which training on one language helps performance in another. Script and language family are the strongest predictors of positive transfer. Scandinavian languages help each other. Malay and Indonesian form a high-transfer pair. English, French, and Spanish offer the broadest cross-lingual benefits, likely because of their data scale and diversity [3]. But these transfer effects are asymmetrical: Language A may help Language B far more than Language B helps Language A, meaning low-resource languages benefit from high-resource ones but cannot return the favor.
Who is trying to close the gap
Several initiatives are attempting to address the disparity, though the scale of the problem dwarfs current efforts.
Google has invested heavily in multilingual capability for its Gemini models, which currently lead commercial benchmarks across the broadest range of languages. Gemini 3.1 Pro tops the Artificial Analysis multilingual leaderboard with a 93/100 average across 16 languages [4].
InkubaLM, developed by the South African AI company Lelapa, is a small language model specifically designed for low-resource African languages. Despite its modest size, it outperformed four of six larger models on AfriMMLU, an African-language benchmark [11].
Latam-GPT, an open-source initiative, aims to incorporate Indigenous Latin American languages including Mapudungun, Náhuatl, Quechua, and Aymara — languages with millions of speakers but negligible digital presence [12].
Qwen models from Alibaba have expanded support to over 100 languages, with particular attention to Asian languages including Mandarin, Japanese, and Korean [8].
At the academic level, conferences like EMNLP and ICLR have increasingly spotlighted multilingual research, though a persistent imbalance remains: approximately 70% of papers at the Association for Computational Linguistics' main conference still evaluate only on English [8].
What happens next
The AI industry is at an inflection point. As large language models become infrastructure — embedded in search engines, customer service systems, medical records, legal databases, and government portals — the question of which languages they serve well is no longer a technical curiosity. It is a question about who gets access to functioning technology and who does not.
More than 5 billion people speak a language other than English as their first language. Many of them live in countries where AI-powered services are being deployed fastest — India, Indonesia, Nigeria, Brazil. If the current trajectory holds, these populations will receive AI services that are measurably worse than what English speakers get, widening existing economic and informational inequalities rather than narrowing them.
The technical solutions exist. Google's ATLAS study demonstrates that thoughtful data mixing and scaling can meaningfully improve multilingual performance. Specialized models like InkubaLM prove that smaller, focused efforts can outperform larger general-purpose systems for specific language communities. But the commercial incentives still point overwhelmingly toward English, and the data scarcity problem for thousands of languages cannot be solved by any single company or research lab.
As AI researcher Sebastian Ruder has documented, the state of multilingual AI reflects a broader truth: the technology industry builds for its most profitable customers first, and everyone else waits [8]. For the speakers of the world's 6,980 non-high-resource languages, the wait continues.
Sources (12)
- [1] MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation (arxiv.org)
  Benchmark covering 29 languages with 11,829 identical questions per language found performance gaps of up to 24.3% in low-resource languages across 36 LLMs.
- [2] MMLU-ProX Benchmark Results and Leaderboard (mmluprox.github.io)
  English-Swahili gaps of up to 38 points (a relative 43–54% drop) demonstrated across all tested models, with performance degradation tracking resource availability.
- [3] ATLAS: Practical Scaling Laws for Multilingual Models — Google Research (research.google)
  774 training runs across 10M–8B parameter models found that doubling supported languages requires 1.18x model size and 1.66x training data to maintain performance.
- [4] Multilingual AI Model Benchmark — Artificial Analysis (artificialanalysis.ai)
  Commercial model benchmark across 16 languages shows Gemini 3.1 Pro leading at 93/100 average, with all models showing performance decline in low-resource languages.
- [5] Usage Statistics of Content Languages for Websites — W3Techs (w3techs.com)
  English accounts for approximately 55% of all website content, with Spanish at 5% and Russian at 4.9%.
- [6] English Accounts for 49.40% of Internet Content in 2024 (intelpoint.co)
  English vastly outpaces the combined share of the next three languages in internet content, creating a self-reinforcing data advantage for AI training.
- [7] How Language Gaps Constrain Generative AI Development — Brookings Institution (brookings.edu)
  Only 20 of the world's 7,000+ languages are considered high-resource, with LLMs underperforming for billions of non-English speakers.
- [8] The State of Multilingual AI — Sebastian Ruder (ruder.io)
  88% of the world's languages have virtually no text data; 70% of ACL papers evaluate only on English; data quality in multilingual corpora remains critically low.
- [9] Multilingual Artificial Intelligence Often Reinforces Bias — Johns Hopkins University (hub.jhu.edu)
  Research found AI models give contradictory information based on query language, creating "information cocoons" that reinforce linguistic imperialism.
- [10] Comparing the Performance of ChatGPT, Gemini, and Claude in English and Polish on Medical Examinations (nature.com)
  Study found consistent performance drops when AI models were tested in Polish versus English on medical examination questions.
- [11] InkubaLM: A Small Language Model for Low-Resource African Languages — Lelapa AI (lelapa.ai)
  Specialized small model for African languages outperformed four of six larger models on the AfriMMLU benchmark despite a modest parameter count.
- [12] Studies Explore Challenges of AI for Low-Resource Languages — Tech Brew (techbrew.com)
  Coverage of Stanford and other research on how AI systems fail speakers of low-resource languages, including initiatives like Latam-GPT for Indigenous languages.