Vol 1, No 1 (2011)

Special Issue on Parallel Corpora: Annotation, Exploitation, Evaluation



Stella Neumann, Oliver Čulo, Silvia Hansen-Schirra

Exchange between the translation studies and computational linguistics communities has traditionally not been very intense. Among other things, this is reflected in their different views on parallel corpora. While computational linguistics does not always pay strict attention to the translation direction (e.g. when translation rules are extracted from (sub)corpora which actually consist only of translations), translation studies is, among other things, concerned precisely with comparing source and target texts (e.g. to draw conclusions on interference and standardization effects). However, there has recently been more exchange between the two fields – especially when it comes to the annotation of parallel corpora. This special issue brings together the different research perspectives. Its contributions show – from both perspectives – how the communities have come to interact in recent years.

Keywords: translation studies; parallel corpora

Building and Querying Parallel Treebanks

Martin Volk, Torsten Marek, Yvonne Samuelsson

This paper describes our work on building a trilingual parallel treebank. We have annotated constituent structure trees from three text genres (a philosophy novel, economy reports and a technical user manual). Our parallel treebank includes word and phrase alignments. The alignment information was manually checked using a graphical tool that allows the annotator to view a pair of trees from parallel sentences. This tool comes with a powerful search facility which surpasses the expressivity of previous popular treebank query engines.

Enriching Slovene WordNet with domain-specific terms

Špela Vintar, Darja Fišer

The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting multiple resources. In the experiment described here, we use a large monolingual Slovene corpus of texts from the domain of informatics as a source of terminology, and a parallel English-Slovene corpus and an online dictionary as bilingual resources to facilitate the mapping of terms to the Slovene Wordnet. We first identify the core terms of the domain in English using the Princeton Wordnet, and then translate them into Slovene using a bilingual lexicon produced from the parallel corpus. In the next step we extract multi-word terms from the Slovene domain-specific corpus using a hybrid approach, and finally match the term candidates to existing Wordnet synsets. The proposed method appears to be a successful way to improve the domain coverage of Wordnet, as it yields abundant term candidates and exploits various multilingual resources.

Keywords: Wordnet construction; multi-word expressions; parallel corpora; term extraction; Slovene Wordnet
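
The translation step described in this abstract can be illustrated with a minimal sketch. It assumes that the English core terms are already linked to synset identifiers and that the bilingual lexicon is a simple mapping from English terms to Slovene translations. The function name, the synset identifiers and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def map_terms_to_synsets(
    core_terms: Dict[str, str],
    lexicon: Dict[str, List[str]],
) -> Dict[str, Set[str]]:
    """Propose Slovene literals for synsets via a bilingual lexicon.

    core_terms: English term -> synset identifier (hypothetical ids).
    lexicon:    English term -> Slovene translations (assumed format).
    """
    candidates: Dict[str, Set[str]] = {}
    for english_term, synset_id in core_terms.items():
        # Terms missing from the lexicon simply yield no candidates.
        for slovene_term in lexicon.get(english_term, []):
            candidates.setdefault(synset_id, set()).add(slovene_term)
    return candidates


# Toy data, invented for illustration only.
core_terms = {"computer": "syn-001", "network": "syn-002", "compiler": "syn-003"}
lexicon = {"computer": ["računalnik"], "network": ["omrežje", "mreža"]}

candidates = map_terms_to_synsets(core_terms, lexicon)
```

In this toy run, "compiler" receives no candidate because the lexicon has no entry for it. Under these assumptions, such gaps are where the multi-word term extraction from the monolingual corpus would have to supply further candidates.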

Empty links and crossing lines: querying multi-layer annotation and alignment in parallel corpora

Oliver Čulo, Silvia Hansen-Schirra, Karin Maksymski, Stella Neumann

Translation shifts can be informative in various ways. Amongst other things, they can point to typological differences between languages or be indicators of properties of translated text such as explicitation or normalisation. Detecting translation shifts in parallel corpora is thus a major task from the viewpoint of translation studies. This paper presents an analysis of translation shifts in a parallel corpus (English-German). It offers an operationalisation of queries which can exploit multi-layer annotation and alignment in order to detect various kinds of translation shifts across category boundary lines and empty alignment links. The paper furthermore discusses the shifts and links them to certain translation properties.

Keywords: parallel corpora; multi-layer annotation and alignment; corpus query; translation studies; translation shifts; translation properties
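
One possible reading of the two query types named in the title can be sketched as follows. The sketch assumes that word alignment is stored as index pairs, that each word belongs to one higher-level unit (e.g. a chunk), and that these units are aligned separately. An "empty link" is then a word without an alignment partner, and a "crossing line" is a word link whose containing units are not aligned with each other. The data model, the function names and the toy sentence pair are assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

WordLink = Tuple[int, int]  # (source word index, target word index)


def find_empty_links(
    n_src: int, n_tgt: int, word_links: List[WordLink]
) -> Tuple[List[int], List[int]]:
    """Return source and target word indices without any alignment."""
    aligned_src = {s for s, _ in word_links}
    aligned_tgt = {t for _, t in word_links}
    return (
        [i for i in range(n_src) if i not in aligned_src],
        [j for j in range(n_tgt) if j not in aligned_tgt],
    )


def find_crossing_lines(
    word_links: List[WordLink],
    src_unit: Dict[int, str],
    tgt_unit: Dict[int, str],
    unit_links: Set[Tuple[str, str]],
) -> List[WordLink]:
    """Return word links whose containing units are not aligned."""
    return [
        (s, t)
        for s, t in word_links
        if (src_unit[s], tgt_unit[t]) not in unit_links
    ]


# Toy example (invented): "She gave him a book" / "Ihm wurde ein Buch gegeben"
word_links = [(1, 4), (2, 0), (3, 2), (4, 3)]
src_unit = {0: "s1", 1: "s2", 2: "s3", 3: "s4", 4: "s4"}
tgt_unit = {0: "t1", 1: "t2", 2: "t3", 3: "t3", 4: "t2"}
unit_links = {("s2", "t2"), ("s3", "t1")}  # s4 and t3 are left unaligned

empty_src, empty_tgt = find_empty_links(5, 5, word_links)
crossing = find_crossing_lines(word_links, src_unit, tgt_unit, unit_links)
```

Here "She" and "wurde" come out as empty links, and the two word links inside the unaligned units s4 and t3 come out as crossing lines. Whether such hits count as translation shifts would still need manual interpretation.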

On drafting and revision in translation: a corpus linguistics oriented analysis of translation process data

Fabio Alves, Daniel Couto Vale

This paper reports on a study which investigates prototypical characteristics of the drafting and revision phases of the translation process, mapped onto the sequential unfolding of micro translation units into macro translation units (MTUs). Using Litterae, an annotation and search tool designed to mark, annotate and extract XML files of key-logged translation process data, the paper analyses the performance of 12 professional translators and classifies their output as MTUs grouped into three categories: MTUs containing micro units which are processed solely during the drafting phase (P1 type), MTUs containing micro units which are processed once in the drafting phase and finalized in the revision phase (P2 type), and MTUs containing micro units which are processed during the drafting phase and taken up again during the revision phase (P3 type). The analysis points to a hierarchical structure in which P1 is more frequent than P2 which, in turn, is more frequent than P3.

Keywords: drafting and revision patterns in translation; micro and macro translation units; semi-automatic analysis of translation process data; corpus linguistics oriented approach
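
The three-way classification can be made concrete with a small sketch. The abstract does not fully separate P2 from P3, so the sketch assumes that P2 means exactly one drafting-phase micro unit followed by work in revision, and P3 means several drafting-phase micro units plus work in revision. This decision rule, the phase labels and the function name are assumptions, and the sketch does not reproduce the Litterae tool.

```python
from collections import Counter
from typing import List, Optional


def classify_mtu(phases: List[str]) -> Optional[str]:
    """Classify one MTU from the phases of its micro units.

    phases: one entry per micro unit, "drafting" or "revision"
            (labels assumed for this sketch).
    """
    drafting = phases.count("drafting")
    revision = phases.count("revision")
    if drafting == 0:
        return None  # not covered by the three types in the abstract
    if revision == 0:
        return "P1"  # processed solely during drafting
    # Assumed rule: one drafting pass -> P2, repeated drafting -> P3.
    return "P2" if drafting == 1 else "P3"


# Toy data, invented for illustration only.
mtus = [
    ["drafting"],
    ["drafting", "drafting"],
    ["drafting", "revision"],
    ["drafting", "drafting", "revision"],
]
distribution = Counter(classify_mtu(m) for m in mtus)
```

Counting the types over all MTUs of a translator, as `distribution` does for the toy data, gives the kind of frequency ranking the abstract reports.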

Altera Lingua: Computerlinguistik in der Dolmetschpraxis unter besonderer Berücksichtigung der Korpusanalyse [Computational linguistics in interpreting practice, with special emphasis on corpus analysis]

Claudio Fantinuoli

This contribution demonstrates an even more practice-oriented exploitation of corpora, both monolingual and parallel. It describes the design of a software tool, InterpretBank, which supports conference interpreters in all stages of their work. Based on Baroni and Bernardini’s (2004) BootCaT mechanism, the program harvests the web for domain-specific documents given a set of search terms, performs term extraction on them and uses additional resources, e.g. Wikipedia or bilingual online dictionaries, to propose definitions, translations, collocations and keyword-in-context information. All available modules for harvesting, management and retrieval are adapted to the specific needs of interpreters, reducing the time needed for preparation and allowing for efficient retrieval while interpreting. A pilot module adds the possibility to include parallel resources, e.g. translation memories or the OPUS corpora, in the preparation phase.

ISSN: 2193-6986