IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
Article Status
Published
Authors/contributors
- Adelani, David Ifeoluwa (Author)
- Ojo, Jessica (Author)
- Azime, Israel Abebe (Author)
- Zhuang, Jian Yun (Author)
- Alabi, Jesujoba O. (Author)
- He, Xuanli (Author)
- Ochieng, Millicent (Author)
- Hooker, Sara (Author)
- Bukula, Andiswa (Author)
- Lee, En-Shiun Annie (Author)
- Chukwuneke, Chiamaka (Author)
- Buzaaba, Happy (Author)
- Sibanda, Blessing (Author)
- Kalipe, Godson (Author)
- Mukiibi, Jonathan (Author)
- Kabongo, Salomon (Author)
- Yuehgoh, Foutse (Author)
- Setaka, Mmasibidi (Author)
- Ndolela, Lolwethu (Author)
- Odu, Nkiruka (Author)
- Mabuya, Rooweither (Author)
- Muhammad, Shamsuddeen Hassan (Author)
- Osei, Salomey (Author)
- Samb, Sokhar (Author)
- Guge, Tadesse Kebede (Author)
- Stenetorp, Pontus (Author)
Title
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
Abstract
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and 4 proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant gap between open and proprietary models, with the best-performing open model, Aya-101, reaching only 58% of the performance of the best-performing proprietary model, GPT-4o. Machine-translating the test sets into English before evaluation helped close the gap for larger English-centric models such as LLaMa 3 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
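For readers unfamiliar with the "translate-test" setting mentioned in the abstract, the sketch below illustrates the idea: each test item is machine-translated into English before being scored with an (English-centric) LLM. This is a minimal illustration only; the function names, stubs, and item format are assumptions for demonstration, not the authors' actual code or the IrokoBench data format.

```python
from typing import Dict, List


def translate_to_english(text: str) -> str:
    """Placeholder for a machine-translation step (hypothetical stub)."""
    return text  # identity stub for illustration only


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns an option letter (hypothetical stub)."""
    return "A"  # stub answer for illustration only


def evaluate_translate_test(items: List[Dict]) -> float:
    """Accuracy on multiple-choice items after translating them into English."""
    correct = 0
    for item in items:
        question_en = translate_to_english(item["question"])
        options_en = [translate_to_english(o) for o in item["options"]]
        prompt = (
            question_en
            + "\n"
            + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options_en))
            + "\nAnswer with the letter of the correct option."
        )
        prediction = query_llm(prompt)
        correct += int(prediction.strip().upper().startswith(item["answer"]))
    return correct / len(items)


# Example usage with a single made-up multiple-choice item.
sample = [{"question": "…", "options": ["…", "…", "…", "…"], "answer": "A"}]
print(f"translate-test accuracy: {evaluate_translate_test(sample):.2f}")
```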
Repository
arXiv
Archive ID
arXiv:2406.03368
Date
June 5th, 2024
Accessed
13/06/2024, 08:39
Short Title
IrokoBench
Library Catalogue
Extra
arXiv:2406.03368 [cs]
<AI Smry>: IrokoBench, a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks -- natural language inference, mathematical reasoning, and multi-choice knowledge-based QA (AfriMMLU) -- is introduced.
Citation
Adelani, D. I., Ojo, J., Azime, I. A., Zhuang, J. Y., Alabi, J. O., He, X., Ochieng, M., Hooker, S., Bukula, A., Lee, E.-S. A., Chukwuneke, C., Buzaaba, H., Sibanda, B., Kalipe, G., Mukiibi, J., Kabongo, S., Yuehgoh, F., Setaka, M., Ndolela, L., … Stenetorp, P. (2024). IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models (arXiv:2406.03368). arXiv. https://doi.org/10.48550/arXiv.2406.03368
Link to this record