The link to the paper: TBA
This repository contains the analysis results and supplementary materials for our RANLP 2025 paper investigating how well current LLMs handle fact verification across different languages and writing systems. Our research reveals significant performance disparities based on script systems and identifies systematic cross-lingual instruction-following failures.
- Script System Impact: Latin script languages consistently outperform others across all models
- Cross-lingual Transfer Patterns: Norwegian achieves exceptional zero-shot performance (0.34 macro-F1), while South Asian languages struggle significantly
- Official Language Support: Languages officially supported by models (e.g., Indonesian and Polish for Qwen 2.5) often outperform traditionally high-resource languages like German and Spanish
- Instruction Following Failures: Systematic failures in producing requested English labels, particularly affecting non-Latin script languages
├── model_biases/ # Analysis of model prediction biases across labels
├── model_performance/ # Detailed performance results
└── prompt/ # Prompt templates used in the experiments
- Llama 3.1 (8B): Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
- Qwen 2.5 (7B): Broadest coverage with 29 languages, with particularly strong support for Asian languages
- Mistral Nemo (12B): Largest of the three models, supporting 11 languages spanning European and some Asian languages
- Languages: 25 languages across 11 language families
- Claims: 31,189 total claims
- Categories: 7 fine-grained veracity labels: `true`, `mostly_true`, `partly_true`, `mostly_false`, `false`, `complicated`, `other` (see the label-mapping sketch after this list)
- Evaluation splits: In-domain (test), out-of-domain, zero-shot cross-lingual
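
For reference, the label set above can be expressed as a small Python mapping. This is a minimal sketch: the label names come from the list above, while the integer ids are illustrative assumptions, not values used by the repository.

```python
# Seven fine-grained veracity labels from the dataset description above.
VERACITY_LABELS = [
    "true", "mostly_true", "partly_true",
    "mostly_false", "false", "complicated", "other",
]

# Illustrative label <-> id mappings (the ids themselves are assumptions).
LABEL2ID = {label: i for i, label in enumerate(VERACITY_LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}
```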
See the training data distribution across languages and veracity labels below:
- Few-shot prompting: 7 examples balanced across languages and veracity categories
- Fine-tuning: LoRA with rank 16, alpha 32, targeting attention and feed-forward components (see the configuration sketch below)
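
Below is a minimal sketch of the LoRA setup described above using Hugging Face `peft`. The rank (16) and alpha (32) come from the setup above; the base-model id, dropout value, and the exact attention/feed-forward module names are assumptions for illustration, not the repository's verified configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model id is an assumption; any of the evaluated models could be used here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                      # rank, from the setup above
    lora_alpha=32,             # alpha, from the setup above
    lora_dropout=0.05,         # assumed value, not stated in the setup
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[           # attention + feed-forward projections (Llama-style names, assumed)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```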
- Latin: German, Spanish, Indonesian, Italian, Polish, Portuguese, etc.
- Arabic: Arabic, Persian
- Devanagari: Hindi, Marathi
- Other: Georgian, Tamil, Bengali, Gujarati, Punjabi, Russian, Sinhala
- Well-represented: German, Spanish, French, Arabic
- Moderately-represented: Portuguese, Italian, Dutch, Polish, Turkish, Persian, Hindi, Russian, Serbian
- Under-represented: Indonesian, Romanian, Georgian, Tamil, Bengali, Punjabi, Marathi, Albanian, Azerbaijani, Gujarati, Norwegian, Sinhala
Comparative performance of the Mistral Nemo, Llama 3.1, and Qwen 2.5 models on the test and zero-shot subsets, measured by macro-F1 in the LoRA fine-tuning and few-shot prompting settings.
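
As a quick reference for the metric, here is a minimal sketch of computing macro-F1 over the seven veracity labels with scikit-learn; the `gold` and `predicted` lists are hypothetical placeholders, not data from the paper.

```python
from sklearn.metrics import f1_score

LABELS = [
    "true", "mostly_true", "partly_true",
    "mostly_false", "false", "complicated", "other",
]

# Hypothetical gold and predicted label lists, for illustration only.
gold = ["true", "false", "partly_true", "mostly_false"]
predicted = ["true", "mostly_false", "partly_true", "mostly_false"]

# Macro-F1 averages per-label F1 scores with equal weight for each label.
macro_f1 = f1_score(gold, predicted, labels=LABELS, average="macro", zero_division=0)
print(f"macro-F1: {macro_f1:.2f}")
```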
TBA
- X-Fact Paper: https://aclanthology.org/2021.acl-short.86.pdf
For any questions or collaboration:
- Hanna Shcharbakova - Saarland University, aniezka.sherbakova@gmail.com

