Wikipedia Parallel Corpus

Parallel corpora are essential resources for many learning-based cross-lingual and multilingual applications, such as statistical machine translation (SMT) and cross-language information retrieval (CLIR) systems. The "Wikipedia Parallel Corpus" is an automatically extracted parallel corpus built from Wikipedia articles. It is available for the English-Persian language pair. Statistics of the corpus are reported in Table 1.

 

This parallel corpus was extracted using a novel approach based on integer linear programming (ILP). It has been extensively evaluated, and the results indicate that its quality is high compared to existing English-Persian parallel corpora. More details are available in [1].

 

Table 1. Statistics of the Wikipedia Parallel Corpus

Statistic                             Persian      English
Total number of alignments            283,486      283,486
Total number of words                 2,345,762    2,342,786
Average number of words per sentence  8.2747       8.2642
Total number of unique words          117,794      127,793
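
The statistics in Table 1 are simple counts over the aligned sentences, and the average is the word total divided by the number of alignments (for example, 2,345,762 / 283,486 ≈ 8.2747 for Persian). The following is a minimal sketch of how such figures could be recomputed. It assumes whitespace tokenization and sentences held in Python lists; the corpus's actual file format and tokenization are not described here, and the function name corpus_stats and the toy sentences are hypothetical.

```python
def corpus_stats(sentences):
    """Compute Table 1 style statistics for one side of a parallel corpus.

    Assumes each sentence is a string and that splitting on whitespace
    is an acceptable tokenization (an assumption, not the authors' method).
    """
    total_words = 0
    vocabulary = set()
    for sentence in sentences:
        tokens = sentence.split()
        total_words += len(tokens)
        vocabulary.update(tokens)
    n = len(sentences)
    return {
        "alignments": n,
        "words": total_words,
        "avg_words_per_sentence": total_words / n if n else 0.0,
        "unique_words": len(vocabulary),
    }


# Toy, hypothetical data: the i-th English sentence is aligned
# with the i-th Persian sentence.
english = ["the cat sleeps", "the dog barks loudly"]
persian = ["گربه خوابیده است", "سگ پارس کرد"]

for name, side in (("English", english), ("Persian", persian)):
    print(name, corpus_stats(side))
```

Because both sides are aligned one-to-one, the number of alignments is the same for both languages, as in the first row of Table 1.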

 

 

This corpus is freely available for research purposes. The corpus can be downloaded from here (mirror link).

 

Two probabilistic dictionaries (English-to-Persian and Persian-to-English) were learned from the Wikipedia parallel corpus with IBM Model 1, using the Moses toolkit. These dictionaries are also available here (mirror link).
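
To illustrate what a probabilistic dictionary based on IBM Model 1 contains, the following is a small pure-Python sketch of the standard EM training procedure for the translation probabilities t(f | e). It is not the Moses pipeline used to build the released dictionaries; the function names train_ibm_model1 and best_translation, the "NULL" token string, and the three toy sentence pairs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) by EM from (source_tokens, target_tokens) pairs.

    Returns a dict keyed by (f, e), where e is a source word and f a
    target word. Illustrative sketch only.
    """
    # Prepend a NULL source token so target words may align to nothing.
    pairs = [(["NULL"] + list(src), list(tgt)) for src, tgt in pairs]
    target_vocab = {f for _, tgt in pairs for f in tgt}

    # Uniform initialization of t(f | e).
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e)
        total = defaultdict(float)  # expected counts of e
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts per source word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return dict(t)


def best_translation(t, e):
    """Return the target word with the highest t(f | e) for source word e."""
    candidates = {f: p for (f, src), p in t.items() if src == e}
    return max(candidates, key=candidates.get)


# Toy English-to-Persian data (hypothetical).
toy_pairs = [
    ("big house".split(), "خانه بزرگ".split()),
    ("big book".split(), "کتاب بزرگ".split()),
    ("small book".split(), "کتاب کوچک".split()),
]

t = train_ibm_model1(toy_pairs)
print(best_translation(t, "book"))
```

Swapping the two sides of each pair gives the opposite direction (Persian-to-English) in the same way. Each entry t[(f, e)] is a probability, and for a fixed source word e the entries sum to 1.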

 

Please cite the following article if you find this corpus useful in your research.

 

[1] Hamed Zamani, Heshaam Faili, Azadeh Shakery, "Sentence alignment using local and global information", Computer Speech & Language. Link: http://www.sciencedirect.com/science/article/pii/S0885230816300572

 

Please do not hesitate to contact us if you have any questions regarding this corpus.

 

Project Provider

Hamed Zamani
Email: h.zamani@ut.ac.ir
Former Member

Supervising Professors

Heshaam Faili
Associate Professor
Tel Number: 61119717
Email: hfaili [AT] ut.ac.ir
Azadeh Shakery
Assistant Professor
Tel Number: 61119722
Email: shakery [AT] ut.ac.ir