Align-it: Texts alignment

Category: Services

Two languages can be more informative than one. A corpus of properly aligned texts is an extremely valuable source of information, not only for researchers in lexicography and terminology but also for a wide range of applications, including translation memories, bilingual dictionaries and lexicons.

Alignment is the process of establishing correspondences between text units, thereby producing corpora of parallel texts. The level of resolution depends on the application: a coarse alignment establishes correspondences between paragraphs, while a very fine one shows correspondences at the word level. The most common and useful level is sentence alignment.

Aligning sentences by hand is not only time-consuming and costly, but also requires people with a good knowledge of both languages in every pair to be aligned. For that reason, software applications have been developed that produce reliable alignments at minimal cost.

Align-it produces high-quality sentence alignments for an entire bilingual corpus. This output is useful both as a text pre-processing step for building other specialized tools (for example, translation memories) and as an aid to translation revision and evaluation.

Furthermore, Align-it provides the user with the opportunity to see a text and its translation side-by-side, with explicit connections between text elements. It can also facilitate deeper automatic analysis of translations; for example, it can be used to mark possible omissions in a translation, or to signal common translation mistakes, such as terminological inconsistencies.
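As a rough illustration of this kind of analysis, the following Python sketch flags possible omissions and terminological inconsistencies in an aligned text. The representation of an alignment as a list of (source sentences, target sentences) pairs, the function names, the sample sentences and the glossary are all assumptions made for this example; they do not describe Align-it's internal data model, and the substring matching is deliberately naive.

```python
from typing import Dict, List, Tuple

# Assumed representation: one aligned unit = (source sentences, target sentences).
Bead = Tuple[List[str], List[str]]


def find_omissions(beads: List[Bead]) -> List[str]:
    """Return source sentences that are aligned with no target sentence."""
    return [s for src, tgt in beads if not tgt for s in src]


def find_term_inconsistencies(
    beads: List[Bead], glossary: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (term, source text) pairs where the source uses a glossary term
    but the aligned target does not contain the expected translation."""
    issues = []
    for src, tgt in beads:
        if not tgt:  # omissions are reported separately
            continue
        src_text = " ".join(src)
        tgt_text = " ".join(tgt).lower()
        for term, expected in glossary.items():
            if term.lower() in src_text.lower() and expected.lower() not in tgt_text:
                issues.append((term, src_text))
    return issues


# Illustrative data only.
beads = [
    (["The earth was without form."], ["La terre était informe."]),
    (["Darkness was upon the deep."], []),
    (["God created the earth."], ["Dieu créa le monde."]),
]
glossary = {"earth": "terre"}  # hypothetical glossary entry

omissions = find_omissions(beads)
inconsistencies = find_term_inconsistencies(beads, glossary)
```

A real checker would need lemmatization and proper term matching; the point here is only that both checks become simple once sentence-level correspondences are available.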

Align-it treats six aligning cases:

  • one sentence translates into one
  • a sentence is not translated at all
  • a new sentence that has no equivalent in the source text is introduced by the translator
  • two consecutive sentences of the source text translate into one
  • one source sentence translates into two
  • two source sentences translate into two

For Western languages, most sentences in one language match exactly one sentence in the other language; however, a sentence may also match two or more sentences in the other language, or none at all.
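The alignment cases can be summarized by the number of source and target sentences in each aligned unit. The short Python sketch below names a case from those two counts and tallies how often each occurs; the function names and labels are assumptions for illustration, not Align-it terminology.

```python
from collections import Counter
from typing import List, Tuple


def describe(n_source: int, n_target: int) -> str:
    """Name an alignment case from its source and target sentence counts."""
    if n_source == 0 and n_target == 0:
        raise ValueError("an aligned unit must contain at least one sentence")
    if n_target == 0:
        return "omission"    # source sentence left untranslated
    if n_source == 0:
        return "insertion"   # target sentence with no source equivalent
    return f"{n_source}-to-{n_target}"


def shape_counts(shapes: List[Tuple[int, int]]) -> Counter:
    """Count how often each alignment case occurs in an alignment."""
    return Counter(describe(s, t) for s, t in shapes)


# Illustrative alignment: two 1-to-1 units, one 2-to-1 unit, one omission.
counts = shape_counts([(1, 1), (1, 1), (2, 1), (1, 0)])
```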

The sentence alignment task is to identify correspondences between sentences in one language and sentences in the other language. This is the first step toward the more ambitious task of finding correspondences between words.
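This page does not say which algorithm Align-it uses. Purely to make the task concrete, here is a minimal Python sketch of a length-based dynamic-programming aligner, a simplified stand-in for the well-known family of length-based methods. The cost function (absolute difference in character length plus a fixed penalty per case) and the penalty values are assumptions chosen for the example, not tuned or taken from Align-it.

```python
from typing import List, Tuple

# Allowed cases as (source count, target count) with assumed, hand-picked penalties.
CASES = {(1, 1): 0, (1, 0): 40, (0, 1): 40, (2, 1): 10, (1, 2): 10}


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the lowest-cost sequence of aligned units covering both texts."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in CASES.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the aligned units.
    units = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        units.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    units.reverse()
    return units


english = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
    "and darkness was upon the face of the deep.",
    "And the spirit of God moved upon the face of the waters.",
]
french = [
    "Au commencement, Dieu créa les cieux et la terre.",
    "La terre était informe et vide:",
    "il y avait des ténèbres à la surface de l'abîme, "
    "et l'esprit de Dieu se mouvait au-dessus des eaux.",
]
units = align(english, french)
shapes = [(len(s), len(t)) for s, t in units]
```

On this Genesis extract, pre-split into four English and three French segments, the sketch yields two one-to-one units followed by one unit that joins the last two English segments with the last French segment.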

The input to Align-it is a pair of texts, such as the one in the following figure (the extract is from Genesis in the Bible corpus):

English: In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the spirit of God moved upon the face of the waters.

French: Au commencement, Dieu créa les cieux et la terre. La terre était informe et vide: il y avait des ténèbres à la surface de l'abîme, et l'esprit de Dieu se mouvait au-dessus des eaux.

The output identifies the alignment between sentences. The first two alignments in the next figure illustrate the typical case where one English sentence aligns with exactly one French sentence, while the third illustrates a two-to-one alignment, in which two English sentences align with a single French sentence.

Pair 1
  English: In the beginning God created the heaven and the earth.
  French:  Au commencement, Dieu créa les cieux et la terre.

Pair 2
  English: And the earth was without form, and void;
  French:  La terre était informe et vide:

Pair 3
  English: and darkness was upon the face of the deep.
           And the spirit of God moved upon the face of the waters.
  French:  il y avait des ténèbres à la surface de l'abîme, et l'esprit de Dieu se mouvait au-dessus des eaux.
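A side-by-side view of an aligned pair can be produced with very little code. The Python sketch below wraps the two sides into fixed-width columns; the function name, column width and separator are assumptions for illustration and say nothing about how Align-it renders its display.

```python
import textwrap


def side_by_side(source: str, target: str, width: int = 38) -> str:
    """Render a source/target pair as two wrapped columns separated by ' | '."""
    left = textwrap.wrap(source, width) or [""]
    right = textwrap.wrap(target, width) or [""]
    rows = max(len(left), len(right))
    # Pad the shorter column so both have the same number of rows.
    left += [""] * (rows - len(left))
    right += [""] * (rows - len(right))
    return "\n".join(f"{l:<{width}} | {r}".rstrip() for l, r in zip(left, right))


print(side_by_side(
    "In the beginning God created the heaven and the earth.",
    "Au commencement, Dieu créa les cieux et la terre.",
))
```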

Align-it further offers:

  • Recognition of extra-linguistic elements (abbreviations and acronyms, dates and numbers)
  • Disambiguation of punctuation marks
  • Misalignment correction facilities

Align-it runs under Windows and is operated through a graphical user interface with mouse and keyboard. The program provides an editor that supports basic editing facilities, such as creating a new text or opening an existing one, cutting and pasting, saving and printing. The editor can also open a corpus file that lists parallel texts, so that the user can select a pair from the list and align the respective texts. The corpus file also contains the delimiters used to identify paragraph and sentence boundaries within the corpus. When two separate files are aligned, the user can enter these delimiters in a dedicated delimiter dialog box.
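To show what delimiter-based segmentation amounts to, the following Python sketch splits a text into paragraphs and sentences using two user-supplied delimiter strings. The delimiter values "<p>" and "<s>" and the function name are hypothetical; the actual corpus file format and delimiter syntax of Align-it are not described here.

```python
from typing import List


def split_units(text: str, paragraph_delim: str, sentence_delim: str) -> List[List[str]]:
    """Split text into paragraphs, each a list of sentences, using the given delimiters."""
    paragraphs = []
    for para in text.split(paragraph_delim):
        sentences = [s.strip() for s in para.split(sentence_delim) if s.strip()]
        if sentences:  # skip empty paragraphs, e.g. after a trailing delimiter
            paragraphs.append(sentences)
    return paragraphs


# Hypothetical delimiters chosen for the example.
marked = "In the beginning.<s>And the earth.<p>And God said.<s>"
units = split_units(marked, paragraph_delim="<p>", sentence_delim="<s>")
```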

The alignment output can be presented in several ways. If a misalignment occurs, the user can intervene and correct it manually. The final alignment can be saved to an output file containing either the aligned chunks of text or the offsets of those chunks.
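The difference between storing aligned chunks and storing offsets can be sketched as follows. This Python example computes character offsets of sentences within a text and the span covered by one aligned unit; the (start, end) convention with an exclusive end, and the function names, are assumptions, since the output file format is not specified here.

```python
from typing import List, Optional, Tuple


def sentence_offsets(text: str, sentences: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each sentence, end exclusive.

    Sentences must occur in the text in the given order; str.index raises
    ValueError if one cannot be found.
    """
    offsets, cursor = [], 0
    for sentence in sentences:
        start = text.index(sentence, cursor)
        end = start + len(sentence)
        offsets.append((start, end))
        cursor = end
    return offsets


def unit_span(offsets: List[Tuple[int, int]], indices: List[int]) -> Optional[Tuple[int, int]]:
    """Span covered by consecutive sentences of one aligned unit, or None if empty."""
    if not indices:
        return None
    return (offsets[indices[0]][0], offsets[indices[-1]][1])


text = "Hello world. How are you?"
offsets = sentence_offsets(text, ["Hello world.", "How are you?"])
span = unit_span(offsets, [0, 1])
```

Offsets keep the output file small and leave the original texts untouched, while stored chunks are easier to read without the source files at hand.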

Align-it also integrates a pre-alignment handler that automatically recognizes sentence boundaries and marks them appropriately.
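Sentence boundary recognition has to decide, among other things, whether a period ends a sentence or belongs to an abbreviation or a number. The Python sketch below shows one simple rule-based approach; the abbreviation list, the "<s>" marker and the heuristics are assumptions for illustration and are not the rules used by the Align-it handler.

```python
import re
from typing import List

# Assumed, illustrative abbreviation list (lowercase, with trailing period).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text: str) -> List[str]:
    """Split text at ., ! or ? when followed by whitespace and a capital, or by end of text."""
    sentences, start = [], 0
    # A period inside a number such as 3.5 is not followed by whitespace, so it never matches.
    for match in re.finditer(r"[.!?](?=\s+[A-Z]|\s*$)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


def mark_boundaries(text: str, marker: str = "<s>") -> str:
    """Return the text with a (hypothetical) marker appended to every sentence."""
    return "".join(sentence + marker for sentence in split_sentences(text))


sample = "Dr. Smith paid 3.5 dollars. He left."
sentences = split_sentences(sample)
marked = mark_boundaries(sample)
```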

 

Contact person: Stelios Piperidis

 
 
