Technology

ChatGP-Twi: how AI could bring African languages online

Big Tech has prioritised a few, mainly western languages, making translation tools useless to millions. Now AI is giving local developers the means to build their own

April 11, 2024
On the internet, western languages are dominant, crowding out even those such as Twi, which are spoken by millions

Outside the largest public university in Ghana’s Ashanti region, two linguists are wide-eyed. They are watching an app they have never heard of do something they have never seen before: rapid translation of a few lines of Twi spoken into the device. The app, called Khaya, uses artificial intelligence to perform the task, and has achieved in moments what Google Translate has failed to do in nearly 20 years. “I know of translation from English to French and other international languages,” says Mavis Antiri Kodua, an assistant lecturer at the Kwame Nkrumah University of Science and Technology. “But this is the first time I’m seeing translation for local languages. This is superb.”

Africa is home to around 2,000 languages, a third of the planet’s linguistic diversity. Just 75 of those languages have more than a million speakers each; the others vary wildly from hundreds of thousands of speakers to just a few hundred in highly interconnected communities who have passed down rich oral traditions for generations. 

But this linguistic abundance is neither reflected in nor served by the world’s dominant technology. W3Techs, a platform that tracks which languages are most used on websites, reports that 50.8 per cent of the world’s websites with known content languages use English; the next most common language is Spanish, at just 5.7 per cent. More websites with known content languages use Norwegian, which has four million speakers, than use Swahili, which has 200m. Swahili, like every other African language, appears on less than 0.1 per cent of websites. Apple’s App Store is available in 40 languages and Google Play offers 51; it quickly becomes obvious that many people from African nations are going to struggle to navigate information online unless they speak Arabic, English or French. As it stands, many are also unlikely to be able to translate it.

Here in Ghana, where there are more than 80 languages, Google supports only two local languages on Google Translate: Twi and Ewe. It announced these new languages in 2022 alongside other indigenous languages from the Americas, Africa and India, bringing the total number of languages it supports to 133. But with many of them, there is still a catch. Try to speak into Google Translate or to listen to the phrases it translates for you, and you quickly find that a number of languages don’t support audio translation. In sub-Saharan Africa alone, one in three adults cannot read; an app that is limited to texting and reading will only go so far. But in the age of large language models such as ChatGPT, Gemini and Claude, local developers are taking advantage of community efforts and generative AI to build translation tools of their own.

Felix Akwerh, a machine learning engineer, is part of the active local developer community that has helped create Khaya, the app that offers automatic speech recognition in Twi, as well as Ga and Dagbani, and is continuing to build its capabilities in other Ghanaian languages such as Ewe. It has even expanded into other African languages such as Yoruba, Kikuyu and Luo. Ghana NLP, the group behind Khaya, is an open-source initiative and entirely volunteer-led; Felix is motivated by the possible use the app could have in hospitals where doctors regularly treat patients who speak totally different languages, or in courtrooms where translators are in short supply. When I met him for a YouTube documentary I made (supported by the Lloyd’s Register Foundation) about feelings towards AI in the country, and introduced him to the linguists I had just interviewed, he didn’t once mention the fact that he does this as a hobby. “He’s doing all of this unpaid!” I declared to the lecturers, whose jaws dropped. But as well as building revolutionary apps in his spare time, he is also a pastor; perhaps his pride is well humbled.

One problem is the lack of resources: fewer than 0.1 per cent of websites have content in an African language, and it takes a terabyte of data to train a large language model, which equates to about one million sentences in Twi. “Even with that it’s not good enough,” Felix acknowledged. So how is an app like Khaya supposed to gather text to learn from? Felix told me they had two sources: Bible translations and real people. I met some of those real people nearly 400km further north, in Tamale, where Wikipedia editors who work on articles in the Dagbani language contributed hours of audio to Ghana NLP. Wikipedia editors, too, are volunteers. “I always wanted to see the Dagbani language visible in the digital space,” says Alhassan Mohammed Awal, a local teacher and activist who took part in the recordings. “But then anytime I went online to search about Dagbani I couldn’t find anything. If information about a language is not online, people who don’t understand English can’t find any information.”

These frustrations are echoed in other countries, and many level their concerns at Big Tech. Michael Leventhal, head of AI projects at RobotsMali, says that while Google has researchers in natural language processing, it lacks expertise in African languages and structured outreach programmes with developers. This means that when Google does build translation systems, “the only responsibility is to create something new, publish the work, make source code and data available and leave it to others to use or not use the results. I personally do not believe that this model is responsible when we are talking about something that, in my opinion, should be viewed as a fundamental human right.” 

Leventhal believes the fate of languages such as Bambara, a Malian lingua franca, could be transformed if the writing systems of every language could work on the internet; it is believed that around 100 scripts from Africa, Asia and the Americas are still unencoded, meaning they cannot be used to write online. In 1991, the non-profit Unicode Consortium was founded with the aim of enabling people to use computers in any language. “It required participation from every language group on earth,” says Leventhal, “comparable to the need today for each language community to ensure that there is data to train NLP systems to handle their languages. This is exactly the time to put this initiative into place… it must bring the behemoths and the RobotsMalis of the world together to make it happen.”

The fact that only a few Ghanaian languages are represented by Google Translate is all the more surprising when Accra is home to Google’s first AI Research Center in Africa. Yossi Matias, VP of Research and Engineering, appeared via video link at a news conference there, and I asked him when more African languages might be added to Google Translate. He spoke generally about the importance of language, translation and how AI could be helpful but neither named a new language nor described a timeline. Perhaps for Google, there is no rush. But for anyone in a country like Mali, where 31 per cent of adults are literate and where most Malians do not speak a European language, AI is ready to meet a serious need. 

In many ways, it is positive that communities are making tools for themselves. In Ghana, confidence in AI appeared overwhelmed by fears that it can cause more harm than good, and much of that is connected to how Big Tech has failed to protect people, or how governments have failed to regulate artificial intelligence. Reclaiming power from tech giants is wonderful, but the problem is that more autonomy only goes so far. Grassroots AI translation can truly support communities (in places such as hospitals, it could mean life or death), but it needs funding. The next terabyte of data may not come so freely.