Lost in Translation: AI’s Struggles with Scanian – A Study on Language Models’ Attempts to Conquer Swedish and its Dialects
Author
Summary, in English
The recent surge in artificial intelligence development has greatly improved the accuracy of automatic speech recognition. However, as with any abrupt technical advance, it has weaknesses, in this case a lack of dialectal training data. This thesis investigates whether AI has an unevenly distributed understanding of regional speech, focusing on its neglect of the southern Swedish dialects spoken in the region of Scania. It does so by comparing the word error rate of AI-generated transcriptions of spontaneous standard Swedish speech with that of AI-generated transcriptions of spontaneous Scanian speech, and then analyzing the results further to identify potential reasons why Scanian could be hard for AI to understand. The results show that AI is significantly worse at understanding regional Scanian dialects than standard Swedish. The study highlights how AI is disproportionately trained on skewed data that favors standardized speech.
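For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcription into the reference, divided by the number of words in the reference. The summary does not say which tooling or text normalization the thesis used, so the following is only a minimal sketch under assumptions: the function name `word_error_rate`, the lowercasing step, and the sample sentences are hypothetical and not taken from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; the thesis may normalize differently.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped by the recognizer.
print(word_error_rate("jag bor i malmö", "jag bor malmö"))  # 0.25
```

Computing this rate separately for the standard Swedish and the Scanian recordings gives the two figures that a comparison of the kind described above would rest on.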
Publishing year
2025
Language
English
Full text
- Available as PDF - 908 kB
Document type
Student publication for Bachelor's degree
Topic
- Languages and Literatures
Keywords
- Artificial intelligence
- automatic speech recognition
- dialect
- Scanian
- Swedish
- transcription
Supervisor
- Johan Frid