
Abstract

This paper presents the SH_ArabicIraqiAccent corpus, a newly developed speech corpus that captures spontaneous Iraqi Arabic from the northern suburbs of Baghdad. The corpus comprises spontaneous speech from 47 speakers (28 females aged 9–59 and 19 males aged 20–63) discussing personal and general community topics relevant to the region. The recordings were made professionally in a controlled environment with minimal background noise, at a sampling rate of 16 kHz with 16-bit resolution and a single (mono) channel, to ensure clarity and fidelity. The total duration is approximately 2.5 hours. The corpus addresses the need for dialect-specific resources, as Modern Standard Arabic corpora do not adequately represent regional spoken varieties. The dataset was designed around key parameters, including speaker diversity (age and gender), phoneme coverage, and recording quality, and it supports applications in natural language processing (NLP), automatic speech recognition (ASR), text-to-speech (TTS), and speaker identification. The corpus complements existing resources by focusing on underrepresented Iraqi Arabic dialects and offers detailed acoustic analyses, including fundamental frequency (F0) and intensity measurements. This work situates the SH_ArabicIraqiAccent corpus alongside other primary Arabic linguistic resources and highlights its potential to enhance understanding of Arabic dialects and to support model development. The corpus and the relevant statistical analyses are publicly available and contribute to Arabic speech and language research, particularly for dialectal NLP tasks.
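As an illustration of the recording format stated above, the following minimal Python sketch checks whether a WAV file is 16 kHz, 16-bit, mono, and reports its duration and a simple RMS level. It is not the authors' analysis pipeline. The function names (`describe_wav`, `write_test_tone`) and the file name `tone.wav` are hypothetical, the level is expressed in dB relative to digital full scale (dBFS) rather than any calibrated intensity measure, and the input is a synthetic tone, not corpus data.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

# Recording format stated in the abstract: 16 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2
EXPECTED_CHANNELS = 1


def describe_wav(path):
    """Return format check, duration, and RMS level (dBFS) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)

    matches = (rate, width, channels) == (
        EXPECTED_RATE_HZ,
        EXPECTED_SAMPLE_WIDTH_BYTES,
        EXPECTED_CHANNELS,
    )

    # The RMS level is only computed for files in the expected format,
    # because the unpacking below assumes 16-bit little-endian mono samples.
    rms_dbfs = None
    if matches and n_frames > 0:
        samples = struct.unpack(f"<{n_frames}h", raw)
        rms = math.sqrt(sum(s * s for s in samples) / n_frames)
        rms_dbfs = 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")

    return {
        "matches_expected_format": matches,
        "duration_s": n_frames / rate,
        "rms_dbfs": rms_dbfs,
    }


def write_test_tone(path, seconds=1.0, freq_hz=120.0, amplitude=0.5):
    """Write a synthetic sine tone in the expected format (illustration only)."""
    n_frames = int(seconds * EXPECTED_RATE_HZ)
    frames = bytearray()
    for n in range(n_frames):
        value = amplitude * 32767 * math.sin(2 * math.pi * freq_hz * n / EXPECTED_RATE_HZ)
        frames += struct.pack("<h", int(round(value)))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(EXPECTED_CHANNELS)
        wav.setsampwidth(EXPECTED_SAMPLE_WIDTH_BYTES)
        wav.setframerate(EXPECTED_RATE_HZ)
        wav.writeframes(bytes(frames))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        tone_path = Path(tmp_dir) / "tone.wav"  # hypothetical file name
        write_test_tone(tone_path)
        print(describe_wav(tone_path))
```

Summing `duration_s` over all files in a local copy of a corpus would give a quick check against a stated total duration. Fundamental frequency estimation needs a dedicated pitch tracker and is outside the scope of this sketch.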


 


