We aim to develop large-Scale PARallel Corpora to study LINGuistic variation (SPARCLING). These can be used to study variation in linguistic patterns across different languages. The focus in the linguistic part of the project is on variable article use. We use contexts where one language in a language pair uses an article to retrieve instances where the other language does not require an article. Such zero contexts are notoriously difficult to access in monolingual corpora.
Translated documents in multiple languages are a valuable resource for various tasks in natural language processing and linguistic research. Their usefulness for contrastive linguistics, in particular, has increased tremendously with the possibility to automatically align the texts on both the sentence and the word level. We will align and annotate (PoS-Tagging and Parsing) a large parallel corpus for several language pairs.
Linguistic variation at times involves the choice between the use of an element and its omission. Zero elements are difficult to retrieve, however. We use parallel corpora to target constructions with variable optional elements in one of the languages. As a case in point we will investigate variable article use in these languages, and zero articles in English, in particular.
Studying articles in English is of interest and importance because of the growing influence of non-native English speakers whose first languages do not have articles or use them differently. The aim of the project is to arrive at a detailed description of variable article use. This will prove useful for purposes of language teaching and machine translation.
The challenge for computational linguistics lies in the high-quality alignment and annotation of large corpora and the construction of an efficient and powerful corpus query tool that is able to handle these corpora. Efficient query tools for large monolingual corpora exist but the development of such a tool for parallel corpora is highly innovative.