TOWARDSDATASCIENCE.COM
A Review of AccentFold: One of the Most Important Papers on African ASR
I really enjoyed reading this paper, not because I’ve met some of the authors before, but because it felt necessary. Most of the papers I’ve written about so far have made waves in the broader ML community, which is great. This one, though, is unapologetically African (i.e. it solves a very African problem), and I think every African ML researcher, especially those interested in speech, needs to read it.
AccentFold tackles a specific issue many of us can relate to: current ASR systems just don’t work well for African-accented English. And it’s not for lack of trying.
Most existing approaches use techniques like multitask learning, domain adaptation, or fine-tuning with limited data, but they all hit the same wall: African accents are underrepresented in datasets, and gathering enough data for every accent is expensive and unrealistic.
Take Nigeria, for example. We have hundreds of local languages, and many people grow up speaking more than one. So when we speak English, the accent is shaped by how our local languages interact with it — through pronunciation, rhythm, or even switching mid-sentence. Across Africa, this only gets more complex.
Instead of chasing more data, this paper offers a smarter workaround: it introduces AccentFold, a method that learns accent embeddings from over 100 African accents. These embeddings capture deep linguistic relationships (phonological, syntactic, morphological) and help ASR systems generalize to accents they have never seen.
That idea alone makes this paper such an important contribution.
Related Work
One thing I found interesting in this section is how the authors positioned their work within recent advances in probing language models. Previous research has shown that pre-trained speech models like DeepSpeech and XLSR already capture linguistic or accent-specific information in their embeddings, even without being explicitly trained for it. Researchers have used this to analyze language variation, detect dialects, and improve ASR systems with limited labeled data.
AccentFold builds on that idea but takes it further. The most closely related work also used model embeddings to support accented ASR, but AccentFold differs in two important ways.
First, rather than just analyzing embeddings, the authors use them to guide the selection of training subsets. This helps the model generalize to accents it has not seen before.
Second, they operate at a much larger scale, working with 41 African English accents. This is nearly twice the size of previous efforts.
The Dataset
Figure 1. Venn diagram showing how the 120 accents in AfriSpeech-200 are split across train, dev, and test sets. Notably, 41 accents appear only in the test set, which is ideal for evaluating zero-shot generalization. Image from Owodunni et al. (2024).
The authors used AfriSpeech-200, a Pan-African speech corpus with over 200 hours of audio, 120 accents, and more than 2,000 unique speakers. One of the authors of this paper also helped build the dataset, which I think is really cool. According to them, it is the most diverse dataset of African-accented English available for ASR so far.
What stood out to me was how the dataset is split. Out of the 120 accents, 41 appear only in the test set. This makes it ideal for evaluating zero-shot generalization. Since the model is never trained on those accents, the test results give a clear picture of how well it adapts to unseen accents.
What AccentFold Is
As I mentioned earlier, AccentFold is built on the idea of using learned accent embeddings to guide adaptation. Before going further, it helps to explain what embeddings are. Embeddings are vector representations of complex data. They capture structure, patterns, and relationships in a way that lets us compare different inputs — in this case, different accents. Each accent is represented as a point in a high-dimensional space, and accents that are linguistically or geographically related tend to be close together.
What makes this useful is that AccentFold does not need explicit labels to know which accents are similar. The model learns that through the embeddings, which allows it to generalize even to accents it has not seen during training.
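To make the idea of "closeness" in embedding space concrete, here is a minimal sketch using cosine similarity. The vectors below are toy values I made up for illustration; they are not the paper's actual embeddings, which are high-dimensional outputs of the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings (illustrative values, not from the paper)
yoruba = np.array([0.9, 0.1, 0.3, 0.2])
igbo   = np.array([0.8, 0.2, 0.4, 0.1])
zulu   = np.array([0.1, 0.9, 0.2, 0.7])

# Linguistically related accents should score closer to 1.0 with each other
print(cosine_similarity(yoruba, igbo) > cosine_similarity(yoruba, zulu))  # True
```

In a real pipeline the same comparison runs over the learned embeddings, so "similar accent" becomes a measurable quantity rather than a label.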
How AccentFold Works
The way it works is fairly straightforward. AccentFold is built on top of a large pre-trained speech model called XLS-R. Instead of training it on just one task, the authors use multitask learning, which means the model is trained to do a few different things at once using the same input. It has three heads:
An ASR head for speech recognition, converting speech to text. This is trained using CTC loss, which helps match audio to the correct word sequence.
An accent classification head for predicting the speaker’s accent, trained with cross-entropy loss.
A domain classification head for identifying whether the audio is clinical or general, also trained with cross-entropy but in a binary setting.
Each task helps the model learn better accent representations. For example, trying to classify accents teaches the model to recognize how people speak differently, which is essential for adapting to new accents.
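The multitask setup boils down to minimizing a combined loss across the three heads. Here is a rough sketch of that idea; the numbers and the unweighted sum are my own simplifications (the CTC value is a stand-in, since real CTC loss requires a full alignment computation over the audio frames).

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class."""
    return -np.log(probs[target_idx])

# Hypothetical per-example losses from the three heads
loss_asr    = 2.31  # stand-in for a CTC loss value
loss_accent = cross_entropy(np.array([0.1, 0.7, 0.2]), target_idx=1)  # accent head
loss_domain = cross_entropy(np.array([0.8, 0.2]), target_idx=0)       # clinical vs general

# Multitask training minimizes a (possibly weighted) sum over all heads,
# so gradients from accent classification shape the shared encoder too.
total_loss = loss_asr + loss_accent + loss_domain
print(total_loss)
```

Because all three heads share one encoder, the accent and domain signals push the encoder toward representations that separate accents, which is exactly what the embeddings later exploit.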
After training, the model creates a vector for each accent by averaging the encoder output. This is called mean pooling, and the result is the accent embedding.
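Mean pooling itself is a one-liner: average the encoder's per-frame hidden states over time to get one fixed-size vector per utterance. The shapes below (120 frames, 768 dimensions) are illustrative stand-ins, not values taken from the paper.

```python
import numpy as np

# Suppose the encoder produced one 768-dim hidden vector per audio frame
# (random values stand in for real encoder outputs).
rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 768))   # (time_steps, hidden_dim)

# Mean pooling: average over the time axis to get one fixed-size vector.
# Averaging these per-utterance vectors across a speaker's clips (or an
# accent's speakers) yields the accent embedding.
accent_embedding = frames.mean(axis=0)

print(accent_embedding.shape)  # (768,)
```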
When the model is asked to transcribe speech from a new accent it has not seen before, it finds accents with similar embeddings and uses their data to fine-tune the ASR system. So even without any labeled data from the target accent, the model can still adapt. That is what makes AccentFold work in zero-shot settings.
What Information Does AccentFold Capture
This section of the paper looks at what the accent embeddings are actually learning. Using a series of t-SNE plots, the authors explore whether AccentFold captures linguistic, geographical, and sociolinguistic structure. And honestly, the visuals speak for themselves.
1. Clusters Form, But Not Randomly
Figure 2. t-SNE visualization of accent embeddings in AccentFold, colored by region. Distinct clusters emerge, especially for West African and Southern African accents, suggesting that the model captures regional similarities. Image from Owodunni et al. (2024).
In Figure 2, each point is an accent embedding, colored by region. You immediately notice that the points are not scattered randomly. Accents from the same region tend to cluster. For example, the pinkish cluster on the left represents West African accents like Yoruba, Igbo, Hausa, and Twi. On the upper right, the orange cluster represents Southern African accents like Zulu, Xhosa, and Tswana.
What matters is not just that clusters form, but how tightly they do. Some are dense and compact, suggesting internal similarity. Others are more spread out. South African Bantu accents are grouped very closely, which suggests strong internal consistency. West African clusters are broader, likely reflecting the variation in how West African English is spoken, even within a single country like Nigeria.
2. Geography Is Not Just Visual. It Is Spatial
Figure 3. t-SNE visualization of accent embeddings by country. Nigerian accents (orange) form a dense core, while Kenyan, Ugandan, and Ghanaian accents cluster separately. The positioning reflects underlying geographic and linguistic relationships. Image from Owodunni et al. (2024).
Figure 3 shows embeddings labeled by country. Nigerian accents, shown in orange, form a dense core. Ghanaian accents in blue are nearby, while Kenyan and Ugandan accents appear far from them in vector space.
There is nuance too. Rwanda, which has both Francophone and Anglophone influences, falls between clusters. It does not fully align with East or West African embeddings. This reflects its mixed linguistic identity, and shows the model is learning something real.
3. Dual Accents Fall Between
Figure 4. Dual accent embeddings fall between single-accent clusters. For example, speakers with both Igbo and Yoruba accents are positioned between the Igbo (blue) and Yoruba (orange) clusters. This demonstrates that AccentFold captures gradient relationships, not just discrete classes. Image from Owodunni et al. (2024).
Figure 4 shows embeddings for speakers who reported dual accents. Speakers who identified as Igbo and Yoruba fall between the Igbo cluster in blue and the Yoruba cluster in orange. Even more distinct combinations like Yoruba and Hausa land in between.
This shows that AccentFold is not just classifying accents. It is learning how they relate. The model treats accent as something continuous and relational, which is what a good embedding should do.
4. Linguistic Families Are Reinforced and Sometimes Challenged
In Figure 9, the embeddings are colored by language family. Most Niger-Congo languages form one large cluster, as expected. But in Figure 10, where accents are grouped by family and region, something unexpected appears. Ghanaian Kwa accents are placed near South African Bantu accents.
This challenges common assumptions in classification systems like Ethnologue. AccentFold may be picking up on phonological or morphological similarities that are not captured by traditional labels.
5. Accent Embeddings Can Help Fix Labels
The authors also show that the embeddings can clean up mislabeled or ambiguous data. For example:
Eleven Nigerian speakers labeled their accent as English, but their embeddings clustered with Berom, a local accent.
Twenty speakers labeled their accent as Pidgin, but were placed closer to Ijaw, Ibibio, and Efik.
This means AccentFold is not only learning which accents exist, but also flagging noisy or vague labels. That is especially useful for real-world datasets where users often self-report inconsistently.
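One simple way to flag such mislabels, sketched below under my own assumptions (the paper does not publish its exact procedure), is to assign each speaker's embedding to the nearest accent centroid and compare that with the self-reported label. The 2-D vectors here are toy values.

```python
import numpy as np

# Toy accent centroids in a 2-D embedding space (illustrative, not real data)
centroids = {
    "berom":   np.array([0.9, 0.1]),
    "english": np.array([0.1, 0.9]),
}

def nearest_accent(embedding, centroids):
    """Return the accent whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda name: np.linalg.norm(embedding - centroids[name]))

# A speaker self-labeled "english" whose embedding sits in the Berom cluster
speaker = {"label": "english", "embedding": np.array([0.85, 0.2])}
predicted = nearest_accent(speaker["embedding"], centroids)
if predicted != speaker["label"]:
    print(f"possible mislabel: tagged {speaker['label']}, embeds near {predicted}")
```

A disagreement between the self-report and the nearest cluster is not proof of error, but it is a cheap signal for which labels deserve a second look.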
Evaluating AccentFold: Which Accents Should You Pick
This section is one of my favorites because it frames a very practical problem. If you want to build an ASR system for a new accent but do not have data for that accent, which accents should you use to train your model?
Let’s say you are targeting the Afante accent. You have no labeled data from Afante speakers, but you do have a pool of speech data from other accents. Let’s call that pool A. Due to resource constraints like time, budget, and compute, you can only select s accents from A to build your fine-tuning dataset. In their experiments, the authors fix s at 20, meaning 20 accents are used to train for each target accent. So the question becomes: which 20 accents should you choose to help your model perform well on Afante?
Setup: How They Evaluate
To test this, the authors simulate the setup using 41 target accents from the AfriSpeech-200 dataset. These accents do not appear in the training or development sets. For each target accent, they:
Select a subset of s accents from A using one of three strategies
Fine-tune the pre-trained XLS-R model using only data from those s accents
Evaluate the model on a test set for that target accent
Report the Word Error Rate, or WER, averaged over 10 epochs
The test set is the same across all experiments and includes 108 accents from the AfriSpeech-200 test split. This ensures a fair comparison of how well each strategy generalizes to new accents.
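Since WER is the metric everything is scored on, here is a compact reference implementation: word-level Levenshtein distance (substitutions, deletions, insertions) divided by the reference length.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why lower is always better but the scale is not a percentage of words "gotten right".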
The authors test three strategies for selecting training accents:
Random Sampling: Pick s accents randomly from A. It is simple but unguided.
GeoProx: Select accents based on geographical proximity. They use geopy to find countries closest to the target and choose accents from there.
AccentFold: Use the learned accent embeddings to select the s accents most similar to the target in representation space.
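The AccentFold strategy reduces to a nearest-neighbor lookup in embedding space. Here is a sketch under my own assumptions (toy 2-D vectors, cosine similarity as the distance; the paper works with real learned embeddings and s = 20):

```python
import numpy as np

def accentfold_select(target, pool, s):
    """Pick the s accents whose embeddings are most similar to the target.

    target: (d,) embedding of the unseen accent
    pool:   dict mapping accent name -> (d,) embedding
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    ranked = sorted(pool, key=lambda name: cos(target, pool[name]), reverse=True)
    return ranked[:s]

# Toy pool (illustrative vectors only)
pool = {
    "igbo":  np.array([0.8, 0.2]),
    "hausa": np.array([0.7, 0.3]),
    "zulu":  np.array([0.1, 0.9]),
}
target = np.array([0.9, 0.1])  # hypothetical unseen accent
print(accentfold_select(target, pool, s=2))  # ['igbo', 'hausa']
```

The fine-tuning set is then assembled from the speech data of the returned accents, with no labeled audio from the target accent required.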
Table 1 shows that AccentFold outperforms both GeoProx and Random sampling across all 41 target accents.
Table 1. Test Word Error Rate (WER) for 41 out-of-distribution accents. AccentFold outperforms both GeoProx and Random sampling, with lower error and less variance, highlighting its reliability and effectiveness for zero-shot ASR. Table from Owodunni et al. (2024).
This results in about a 3.5 percent absolute improvement in WER compared to random selection, which is meaningful for low resource ASR. AccentFold also has lower variance, meaning it performs more consistently. Random sampling has the highest variance, making it less reliable.
Does More Data Help
The paper asks a classic machine learning question: does performance keep improving as you add more training accents?
Figure 5. Test WER across different training subset sizes. Performance improves with more accents but plateaus after around 25, showing that smart selection is more important than quantity alone. Image from Owodunni et al. (2024).
Figure 5 shows that WER improves as s increases, but only up to a point. After about 20 to 25 accents, the performance levels off.
So more data helps, but only to a point. What matters most is using the right data.
Key Takeaways
AccentFold addresses a real African problem: ASR systems often fail on African-accented English due to limited and imbalanced datasets.
The paper introduces accent embeddings that capture linguistic and geographic similarities without needing labeled data from the target accent.
It formalizes a subset selection problem: given a new accent with no data, which other accents should you train on to get the best results?
Three strategies are tested: random sampling, geographical proximity, and AccentFold using embedding similarity.
AccentFold outperforms both baselines, with lower Word Error Rates and more consistent results.
Embedding similarity beats geography. The closest accents in embedding space are not always geographically close, but they are more helpful.
More data helps only up to a point. Performance improves at first, but levels off. You do not need all the data, just the right accents.
Embeddings can help clean up noisy or mislabeled data, improving dataset quality.
Limitation: results are based on one pre-trained model. Generalization to other models or languages is not tested.
While this work focuses on African accents, the core method — learning from what models already know — could inspire more general approaches to adaptation in low-resource settings.
Source Note: This article summarizes findings from the paper AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents by Owodunni et al. (2024). Figures and insights are sourced from the original paper, available at https://arxiv.org/abs/2402.01152.