The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task
Other topics in Computer Science
|Authors:||Vassilis Papavassiliou; Prokopis Prokopidis; Stelios Piperidis|
|Book title:||Proceedings of the First Conference on Machine Translation|
This paper describes ILSP-ARC-pv42, the Institute for Language and Speech Processing/Athena Research and Innovation Center submission for the WMT 2016 Bilingual Document Alignment shared task. We describe several document and collection-aware features that our system explored in the context of the task. On the test dataset, our submission achieved a recall of 84.93%, even though it does not make use of any language-specific resources like bilingual lexica or MT output. Instead, our system is based on shallow features (including links to documents in the same webdomain, URLs, digits, image filenames and HTML structure) that can be easily extracted from web documents. We also present examples to show that when de-duplication issues in the test dataset are properly addressed, our system reaches a significantly higher recall of 92.5%.