IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
- David Ifeoluwa Adelani,
- Jessica Ojo,
- Israel Abebe Azime,
- Jian Yun Zhuang,
- Jesujoba O. Alabi,
- Xuanli He,
- Millicent Ochieng,
- Sara Hooker,
- Andiswa Bukula,
- En-Shiun Annie Lee,
- Chiamaka Chukwuneke,
- Happy Buzaaba,
- Blessing Sibanda,
- Godson Kalipe,
- Jonathan Mukiibi,
- Salomon Kabongo,
- Foutse Yuehgoh,
- Mmasibidi Setaka,
- Lolwethu Ndolela,
- Nkiruka Odu,
- Rooweither Mabuya,
- Shamsuddeen Hassan Muhammad,
- Salomey Osei,
- Sokhar Samb,
- Tadesse Kebede Guge,
- Tombekai Vangoni Sherman,
- Pontus Stenetorp
Outstanding Paper Award at NAACL 2025
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based question answering (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and 6 proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant gap between open and proprietary models: the best-performing open model, Gemma 2 27B, reaches only 63% of the performance of the best-performing proprietary model, GPT-4o. In addition, machine-translating the test set into English before evaluation helped close the gap for larger English-centric models such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
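To make the translate-test setting concrete, here is a minimal illustrative sketch (not the authors' code). The idea is simply that each non-English test item is machine-translated into English before the model is queried, so an English-centric LLM sees English input. The `translate` and `answer` functions below are toy stand-ins for a real machine-translation system and a real LLM; the lexicon, function names, and the Yoruba example item are all hypothetical placeholders for illustration only.

```python
def translate(text: str, src: str, tgt: str = "en") -> str:
    """Toy MT stand-in: a real pipeline would call an MT model/API here."""
    toy_lexicon = {("yo", "en"): {"Kini 2 + 2?": "What is 2 + 2?"}}
    return toy_lexicon.get((src, tgt), {}).get(text, text)


def answer(question: str) -> str:
    """Toy LLM stand-in: only 'understands' the English question."""
    return "4" if question == "What is 2 + 2?" else "?"


def translate_test_accuracy(items, src_lang: str) -> float:
    """Score (question, gold) pairs after translating questions into English."""
    correct = 0
    for question, gold in items:
        english_q = translate(question, src=src_lang)  # translate-test step
        correct += answer(english_q) == gold
    return correct / len(items)


items = [("Kini 2 + 2?", "4")]  # hypothetical Yoruba test item
print(translate_test_accuracy(items, "yo"))  # → 1.0
```

In a direct zero-shot evaluation the toy model would fail this item (it never sees English), which mirrors why translate-test narrows the gap for English-centric models in the paper's results.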