5-DIALECTS-BN: Unmasking the Impact of Transliteration on Bangla Dialectal LLMs
Md Mahir Jawad, Galib Mahmud Jim, Rafid Ahmed, Mir Sazzat Hossain, Md Fahim, Md Farhad Alam Bhuiyan
Large Language Models (LLMs) have achieved remarkable progress across natural language processing (NLP) tasks, yet their capabilities degrade sharply for low-resource languages and dialectally diverse settings. Bangla, the world's sixth most spoken language, exemplifies this gap: existing resources overwhelmingly target Standard Bangla, leaving its regional dialects without the benchmarks needed to develop or evaluate dialect-aware systems. We address this gap with 5-Dialects-BN, the first multi-faceted benchmark dataset for Bangla dialectal NLP. The dataset comprises 6,000 manually annotated entries spanning five major dialects (Chittagong, Barisal, Noakhali, Sylhet, and Rangpur), enriched with Romanized transliterations, English and Standard Bangla translations, and subjectivity labels. The resource supports dialect identification, dialect-to-standard normalization, machine translation, subjectivity classification, and parameter-efficient fine-tuning of multilingual LLMs.