Beyond English
Multilingual models like mBERT, XLM-RoBERTa, and multilingual LLMs are trained on text from 100+ languages simultaneously. They develop cross-lingual representations: words with similar meanings in different languages receive similar vectors, even without explicit translation data. This enables zero-shot cross-lingual transfer: fine-tune on English NER data, then deploy on German NER with no German training data at all. Performance is typically 70–85% of what a monolingual model achieves.

But multilingual NLP faces significant challenges. Low-resource languages, which account for most of the world's roughly 7,000 languages, have little training data and perform poorly. Typological diversity is another obstacle: languages differ in word order, morphology, and writing systems. Script differences compound this: Chinese, Arabic, and Devanagari each demand different tokenization strategies. The field is making progress but remains heavily biased toward high-resource languages such as English, Chinese, and the major European languages.
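The idea of a shared cross-lingual space can be made concrete with cosine similarity: in a well-aligned multilingual model, a word and its translation sit close together. A minimal sketch, using made-up toy vectors rather than real model outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings from a shared multilingual space (illustrative only).
emb = {
    "dog":   [0.90, 0.10, 0.30],  # English
    "Hund":  [0.88, 0.12, 0.28],  # German translation: a nearby vector
    "table": [0.10, 0.90, 0.20],  # unrelated English word
}

print(cosine(emb["dog"], emb["Hund"]))   # high, close to 1.0
print(cosine(emb["dog"], emb["table"]))  # much lower
```

With real models, the vectors would come from a multilingual encoder such as XLM-RoBERTa, but the alignment property this illustrates is the same.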
Multilingual Models
Key models:
mBERT: 104 languages, 110M params
XLM-RoBERTa: 100 languages, 550M params
Multilingual LLMs: GPT-4, Gemini, etc.
Cross-lingual transfer:
Train on English NER data
Test on German NER (zero-shot)
Performance: 70–85% of monolingual
Challenges:
Low-resource languages: poor performance
7,000 languages, <100 well-served
Typological diversity (word order, morphology)
Script differences (tokenization)
Bias toward high-resource languages
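The script-difference challenge is visible at the character level: Unicode groups characters by script, which is one reason a single whitespace-based tokenizer cannot serve Chinese (no spaces), Arabic, and Devanagari alike. A small sketch using only the standard library's unicodedata module; the heuristic of taking the first word of a character's Unicode name as its script is a simplification for illustration:

```python
import unicodedata

def dominant_script(text):
    """Guess a string's script from Unicode character names (rough heuristic)."""
    counts = {}
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        # First word of the name, e.g. 'LATIN', 'CJK', 'ARABIC', 'DEVANAGARI'.
        script = name.split()[0] if name else "UNKNOWN"
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get)

print(dominant_script("natural language"))  # LATIN
print(dominant_script("自然语言"))           # CJK
print(dominant_script("لغة طبيعية"))         # ARABIC
print(dominant_script("प्राकृतिक भाषा"))      # DEVANAGARI
```

Real multilingual tokenizers sidestep this with subword methods (e.g. SentencePiece) trained jointly over all scripts, but vocabulary allocation across scripts remains a design problem.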
Progress:
AfroLM, IndicBERT: regional models
Language-adaptive pre-training
Community-driven data collection
Key insight: Multilingual NLP is one of the field's biggest equity challenges. Most NLP research and tools serve English speakers. Making NLP work for the world's 7,000 languages requires not just better models but better data, evaluation, and community engagement.