The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task

Research Area: Other topics in Computer Science
Type: In Proceedings
Year: 2016
Authors: Vassilis Papavassiliou; Prokopis Prokopidis; Stelios Piperidis
Book title: Proceedings of the First Conference on Machine Translation
Pages: 733-739
Address: Berlin, Germany
Date: August
Abstract:
This paper describes ILSP-ARC-pv42, the Institute for Language and Speech Processing/Athena Research and Innovation Center submission for the WMT 2016 Bilingual Document Alignment shared task. We describe several document- and collection-aware features that our system explored in the context of the task. On the test dataset, our submission achieved a recall of 84.93%, even though it does not make use of any language-specific resources such as bilingual lexica or MT output. Instead, our system is based on shallow features (including links to documents in the same web domain, URLs, digits, image filenames and HTML structure) that can be easily extracted from web documents. We also present examples to show that when de-duplication issues in the test dataset are properly addressed, our system reaches a significantly higher recall of 92.5%.
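
To illustrate what "shallow features" can look like in practice, the following is a minimal Python sketch that compares two web pages by their image filenames, digit sequences and HTML tag sequence. It is not the ILSP-ARC-pv42 implementation: the names `ShallowFeatureParser`, `extract`, `jaccard` and `shallow_similarity`, the use of Jaccard overlap, and the exact-match test on tag sequences are all assumptions made for illustration only.

```python
import re
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlparse


class ShallowFeatureParser(HTMLParser):
    """Hypothetical helper: collects language-independent cues from one page."""

    def __init__(self):
        super().__init__()
        self.tags = []        # sequence of opening tags (HTML structure)
        self.images = set()   # image filenames, without directories
        self.digits = set()   # digit runs found in the visible text

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Keep only the filename so that /img/a.png and /fr/img/a.png match.
                self.images.add(basename(urlparse(src).path))

    def handle_data(self, data):
        self.digits.update(re.findall(r"\d+", data))


def extract(html):
    """Parse an HTML string and return the populated feature collector."""
    parser = ShallowFeatureParser()
    parser.feed(html)
    parser.close()
    return parser


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def shallow_similarity(html_a, html_b):
    """Compare two pages using only features that need no language resources."""
    fa, fb = extract(html_a), extract(html_b)
    return {
        "images": jaccard(fa.images, fb.images),
        "digits": jaccard(fa.digits, fb.digits),
        "same_structure": fa.tags == fb.tags,
    }
```

In this sketch, an English page and its French counterpart that share the same images, numbers and markup would score highly on all three cues even though no word of text is compared. A real system would presumably combine such signals with URL and link evidence and with collection-level statistics, as the abstract indicates.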