Enriching Odia Language Resources: Our Collective Effort

On Jul 21, 2020

A multilingual country like India needs language corpora for its low-resource languages, not only to give its citizens access to the natural language processing (NLP) technologies readily available in other countries, but also to support their educational and cultural needs.

Odia (also spelled Oriya) is an Indo-Aryan language spoken in the Indian state of Odisha. Beyond Odisha, there are significant Odia-speaking populations in five neighbouring states (Andhra Pradesh, Madhya Pradesh, West Bengal, Jharkhand, and Chhattisgarh) and in the neighbouring country of Bangladesh. Odia is recognized as a classical Indian language (the sixth Indian language to receive this prestigious status, out of 23 official languages) with a literary history of more than 1,000 years. Today, Odia is spoken by around 50 million people.

We will always be grateful to the volunteers who supported the development of the corpora (OdiEnCorp 1.0 and OdiEnCorp 2.0). The released OdiEnCorp 2.0 is freely available for research and non-commercial purposes, and it helps NLP researchers develop machine translation systems and pursue further research in this direction using our data. As more researchers show interest and join us, we expect to enrich the available resources and build new NLP resources for the Odia language.

References:

[1] Parida, S., Dash, S. R., Bojar, O., Motlicek, P., Pattnaik, P., & Mallick, D. K. (2020). OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation. In LREC 2020 Workshop, Language Resources and Evaluation Conference, 11–16 May 2020, Marseille, France (p. 14).

[2] Parida, S., Bojar, O., & Dash, S. R. (2019, September). OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation. In Smart Intelligent Computing and Applications: Proceedings of the Third International Conference on Smart Computing and Informatics (Vol. 1, p. 495). Springer Nature.

The members involved in OdiEnCorp 1.0 and OdiEnCorp 2.0 are listed in the attachment.