PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification

Dominique Brunato1, Andrea Cimino2, Felice Dell'Orletta2, Giulia Venturi2
1Institute of Computational Linguistics "A. Zampolli" (ILC-CNR), Pisa, 2Institute of Computational Linguistics "A. Zampolli", ILC-CNR, Pisa


Abstract

In this paper we present PaCCSS-IT, a Parallel Corpus of Complex-Simple Sentences for ITalian. To build the resource, we develop a new method for automatically acquiring a corpus of complex-simple sentence pairs that captures structural transformations and is particularly suitable for text simplification. The method requires a large amount of text, which can be easily extracted from the web, making it suitable also for less-resourced languages. We test it on Italian and make available the largest Italian corpus for automatic text simplification.