AI models are neglecting African languages — scientists want to change that

More than 2,000 languages spoken in Africa are being neglected in the artificial intelligence (AI) era. For example, ChatGPT recognizes only 10–20% of sentences written in Hausa, a language spoken by 94 million people in Nigeria. These languages are under-represented in large language models (LLMs) because of a lack of training data. But researchers across Africa are changing that.

Language specialists have recorded 9,000 hours of people speaking different African languages and transformed the recordings into digitized language data sets. The researchers, who are part of a research project called African Next Voices, released the first tranche of data this month from what is the largest AI-ready language-data-creation initiative for multiple African languages.

The data will be open access and available for developers to incorporate into LLMs, such as those that convert speech into text or provide automatic language translation.

“It’s really exciting to see the improvements this is going to bring to the modelling of these specific languages, and how it’s also going to help the entire community that is working across language technologies for Africa,” says Ife Adebara, chief technology officer at the non-profit organization Data Science Nigeria, based in Lagos, who is co-leading the Nigerian arm of the project. Languages in Nigeria being recorded include Hausa, Yoruba, Igbo and Naijá.

“Under-representation of local languages in AI models remains a key challenge in scaling the most-promising artificial intelligence tools,” says Sanjay Jain, director for digital public infrastructure at the Gates Foundation, based in Seattle, Washington, which has funded the project with a US$2.2-million grant.

18 languages

The African Next Voices project involves recording 18 languages spoken in 3 countries: South Africa, Kenya and Nigeria. The recordings are then transcribed and translated by people, reviewed and quality checked.

Researchers take part in a transcription workshop at Dedan Kimathi University of Technology in Nyeri, Kenya. Somali, Kikuyu and Maasai speakers were represented in the training. Credit: African Next Voices: Pilot Data Collection in Kenya

The researchers showed individuals from diverse communities images and asked them to describe what they saw, explains Lilian Wanzare, a computational linguist at Maseno University in Kenya and the Next Voices project lead for Kenya, where the languages being recorded include Dholuo, Kikuyu, Kalenjin, Maasai and Somali.

The focus has been to generate databases of everyday language, she says. “There’s a huge push towards localised data sets, because the impact is in capturing the people within their local settings.” For example, “if you build a model for farmers to help with decision-making, it relies on local data”, such as soil conditions and pesticides that work in the area, Wanzare explains.

Although the principal investigators in each country chose the subject areas for their data sets, the projects needed to focus on key development sectors, such as health, agriculture and education, says Jain.

Vukosi Marivate, a computer scientist at the University of Pretoria and the project lead for South Africa, says that his team is working with a consortium of organizations to create AI language models with the data. He hopes that technology businesses can then improve on those models. South Africa is collecting data for Setswana, isiZulu, isiXhosa, Sesotho, Sepedi, isiNdebele and Tshivenda.
