SH_ArabicIraqiAccent: A Speech Corpus of Iraqi Arabic Speech in Baghdad’s Northern Suburbs

Sarah I. Hasan; Hasan M. Kadhim

doi:10.32792/utq/utjes.16.1.742

Submitted

December 25, 2025

Accepted

February 23, 2026

Published

June 1, 2026


Abstract

This paper presents SH_ArabicIraqiAccent, a newly developed speech corpus of spontaneous Iraqi Arabic from the northern suburbs of Baghdad. The corpus comprises speech from 47 speakers (28 females aged 9–59 and 19 males aged 20–63) discussing personal and general community topics relevant to the region. The recordings were made in a controlled environment at a sampling rate of 16 kHz, with 16-bit resolution and a single (mono) channel, to ensure clarity and fidelity. The corpus totals approximately 2.5 hours of professionally recorded speech with minimal background noise. It addresses the need for dialect-specific resources, as Modern Standard Arabic corpora do not adequately represent regional spoken varieties. The dataset was designed around key parameters, including speaker diversity (age and gender), phoneme coverage, and recording quality, and it supports applications in natural language processing (NLP), automatic speech recognition (ASR), text-to-speech (TTS), and speaker identification. The corpus complements existing resources by focusing on underrepresented Iraqi Arabic dialects and is accompanied by detailed acoustic analyses, including fundamental frequency (F0) and intensity measurements. This work situates SH_ArabicIraqiAccent alongside other major Arabic linguistic resources and highlights its potential to improve understanding of Arabic dialects and to support model development. The corpus and the associated statistical analyses are publicly available and contribute to Arabic speech and language research, particularly dialectal NLP tasks.
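The recording format stated above (16 kHz, 16-bit, mono) can be checked programmatically before a file is used for ASR or acoustic analysis. The following is a minimal sketch, not the authors' tooling: it assumes the recordings are distributed as uncompressed PCM WAV files, which the abstract does not state, and the function names describe_wav, matches_stated_format, and make_silent_wav are hypothetical helpers introduced only for illustration. It uses Python's standard wave module and a synthetic silent file in place of a real corpus recording.

```python
import io
import wave

# Format stated in the abstract: 16 kHz sampling rate, 16-bit samples, mono.
EXPECTED_RATE_HZ = 16000
EXPECTED_BITS_PER_SAMPLE = 16
EXPECTED_CHANNELS = 1


def describe_wav(source):
    """Return basic format information for a WAV path (str) or binary file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def matches_stated_format(info):
    """True if the file uses the 16 kHz / 16-bit / mono format from the abstract."""
    return (
        info["sample_rate_hz"] == EXPECTED_RATE_HZ
        and info["bits_per_sample"] == EXPECTED_BITS_PER_SAMPLE
        and info["channels"] == EXPECTED_CHANNELS
    )


def make_silent_wav(seconds=1.0):
    """Build an in-memory silent WAV in the stated format (stand-in for a real file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(EXPECTED_CHANNELS)
        wav.setsampwidth(EXPECTED_BITS_PER_SAMPLE // 8)
        wav.setframerate(EXPECTED_RATE_HZ)
        # Each 16-bit mono frame is two zero bytes.
        wav.writeframes(b"\x00\x00" * int(seconds * EXPECTED_RATE_HZ))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    info = describe_wav(make_silent_wav(1.0))
    print(info, matches_stated_format(info))
```

To apply the same check to an actual recording, pass its file path to describe_wav in place of the synthetic buffer. Summing duration_s over all files gives a total that can be compared with the approximately 2.5 hours reported above.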

Keywords

Speech Corpus, Iraqi Accents, Arabic Language, F0

License

Copyright (c) 2026 The Author(s), under exclusive license to the University of Thi-Qar. UTJES is an open-access journal whose contents are all free of charge. Articles in this journal are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0): licensees may, without restriction, search, download, share, distribute, print, or link to the full texts of the articles, crawl them for indexing, and reproduce them in any medium, provided that they give the author(s) proper credit (citation). The authors hold the copyright for their work published on the UTJES website, while UTJES is responsible for ensuring appropriate citation of their work. The work is released under CC BY 4.0, which permits unrestricted use, distribution, and reproduction of an article in any medium, provided that the original work is properly cited.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

References

Abbas, N. F., Qasim, T. A., & Jasim, H. A.-S. (2023). Request Constructions in Classical Arabic versus Modern Arabic. Journal of Ethnic and Cultural Studies, 10(5), 1-15. https://doi.org/10.29333/ejecs/1598
Al-Haff, K., Jarrar, M., Hammouda, T., & Zaraket, F. (2022). Curras+ baladi: Towards a levantine corpus. Proceedings of the Thirteenth Language Resources and Evaluation Conference. https://doi.org/10.48550/arXiv.2205.09692
Al-Thubaity, A. O. (2015). A 700M+ Arabic corpus: KACST Arabic corpus design and construction. Language Resources and Evaluation, 49(3), 721-751. https://doi.org/10.1007/s10579-014-9284-1
Alotaibi, H. M. (2016). AEPC: Designing an Arabic/English parallel corpus. Research in Corpus Linguistics, 1-7. https://ricl.aelinco.es/index.php/ricl/article/view/36/22
Alrayzah, A., Alsolami, F., & Saleh, M. (2024). AraFast: Developing and evaluating a comprehensive modern standard arabic corpus for enhanced natural language processing. Applied Sciences, 14(12), 5294. https://doi.org/10.3390/app14125294
Alyafeai, Z., Masoud, M., Ghaleb, M., & Al-shaibani, M. S. (2022). Masader: Metadata sourcing for arabic text and speech data resources. Proceedings of the Thirteenth Language Resources and Evaluation Conference. https://doi.org/10.48550/arXiv.2110.06744
Bordonaba-Plou, D., & Jreis-Navarro, L. M. (2023). Linguistic Injustice in Multilingual Technologies: The TenTen Corpus Family as a Case Study. In Multilingual digital humanities (pp. 129-144). Routledge. https://doi.org/10.4324/9781003393696
Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Eryani, F., & Erdmann, A. (2018). The MADAR Arabic dialect corpus and lexicon. Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). https://aclanthology.org/L18-1535.pdf
Darwish, K., Habash, N., Abbas, M., Al-Khalifa, H., Al-Natsheh, H. T., Bouamor, H., Bouzoubaa, K., Cavalli-Sforza, V., El-Beltagy, S. R., & El-Hajj, W. (2021). A panoramic survey of natural language processing in the Arab world. Communications of the ACM, 64(4), 72-81. https://doi.org/10.48550/arXiv.2011.12631
El-Khair, I. A. (2016). Abu el-khair corpus: A modern standard arabic corpus. International Journal of Recent Trends in Engineering & Research (IJRTER), 2(11), 5-13. https://www.researchgate.net/publication/310321022_Abu_El-Khair_Corpus_A_Modern_Standard_Arabic_Corpus
Hasan, S. I., & Kadhim, H. M. (2024). SH_ArabicIraqiAccent Corpus. https://github.com/SaraEsHassan/SH_ArabicIraqiAccent
Hasan, S. I., & Kadhim, H. M. (2025). Statistical analysis of F0 for Iraqi Arabic accent corpus. IET Conference Proceedings CP959. https://doi.org/10.1049/icp.2025.4386
Ibrahim, H. S., Abdou, S. M., & Gheith, M. (2015). Sentiment analysis for modern standard arabic and colloquial. arXiv preprint arXiv:1505.03105. https://doi.org/10.48550/arXiv.1505.03105
Jarrar, M., Zaraket, F. A., Hammouda, T., Alavi, D. M., & Wählisch, M. (2023). Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic dialect corpora with morphological annotations. 2023 20th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA). https://doi.org/10.1109/AICCSA59173.2023.10479250
Khalifa, S., Habash, N., Abdulrahim, D., & Hassan, S. (2016). A large scale corpus of Gulf Arabic. arXiv preprint arXiv:1609.02960. https://doi.org/10.48550/arXiv.1609.02960
Madisetti, V. (2018). Video, speech, and audio signal processing and associated standards. CRC Press. https://doi.org/10.1201/9781315219820
Meftouh, K., Harrat, S., Jamoussi, S., Abbas, M., & Smaili, K. (2015). Machine translation experiments on PADIC: A parallel Arabic dialect corpus. Proceedings of the 29th Pacific Asia conference on language, information and computation. https://aclanthology.org/Y15-1004.pdf
Meyer, C. F., & Nelson, G. (2020). Data collection. The handbook of English linguistics, 81-101. https://doi.org/10.1002/9781119540618.ch6
Rabiner, L. R., & Schafer, R. W. (2007). Introduction to digital speech processing. Foundations and Trends® in Signal Processing, 1(1–2), 1-194. https://doi.org/10.1561/2000000001
Sachdeva, P., Barreto, R., Bacon, G., Sahn, A., Von Vacano, C., & Kennedy, C. (2022). The measuring hate speech corpus: Leveraging rasch measurement theory for data perspectivism. Proceedings of the 1st Workshop on Perspectivist Approaches to NLP@ LREC2022. https://aclanthology.org/2022.nlperspectives-1.11.pdf
Sherif, M. A., & Ngonga Ngomo, A.-C. (2015). Semantic quran: A multilingual resource for natural-language processing. Semantic Web, 6(4), 339-345. https://doi.org/10.3233/SW-140137
Zaidan, O. F., & Callison-Burch, C. (2014). Arabic dialect identification. Computational Linguistics, 40(1), 171-202. https://doi.org/10.1162/COLI_a_00169

