Lost in Translation: AI’s Struggles with Scanian – A Study on Language Models’ Attempts to Conquer Swedish and its Dialects

Author

  • Kajsa Vesterberg

Summary, in English

The recent surge in the development of artificial intelligence has greatly improved the accuracy of automatic speech recognition. However, as with all abrupt technical improvements, it has its weaknesses, in this case a lack of dialectal training data. This thesis investigates whether AI has an unevenly distributed understanding of regional speech, focusing on the neglect of the southern Swedish dialects spoken in the region of Scania. This is done by comparing the word error rate of AI-generated transcriptions of spontaneous standard Swedish speech with that of AI-generated transcriptions of spontaneous Scanian speech. The transcriptions are then analyzed further to find potential reasons why Scanian could be hard for AI to understand. The results show that AI is significantly worse at understanding regional Scanian dialects than standard Swedish. This study highlights how AI is disproportionately trained on skewed data that favors standardized speech.
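
The summary relies on word error rate (WER) as its measure of transcription quality. As a rough illustration of that metric, WER is commonly defined as the number of word substitutions, deletions and insertions needed to turn the transcription into the reference, divided by the number of words in the reference. The sketch below is not taken from the thesis: the function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example sentences are all assumptions made for illustration, and the thesis may have normalized text or computed the metric differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is lowercased and split on whitespace; punctuation
    is not handled specially.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example sentences, not data from the thesis.
reference = "jag bor i skåne"
hypothesis = "jag bor skåne"
print(word_error_rate(reference, hypothesis))
```

Because the divisor is the reference length, a transcription with many inserted words can yield a WER above 1.0, which is worth keeping in mind when comparing rates across recordings.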

Publishing year

2025

Language

English

Document type

Student publication for Bachelor's degree

Topic

  • Languages and Literatures

Keywords

  • Artificial intelligence
  • automatic speech recognition
  • dialect
  • Scanian
  • Swedish
  • transcription

Supervisor

  • Johan Frid