OATAO - Open Archive Toulouse Archive Ouverte

Splitting Arabic Texts into Elementary Discourse Units

Keskes, Iskandar and Benamara Zitoune, Farah and Hadrich Belguith, Lamia. Splitting Arabic Texts into Elementary Discourse Units. (2014) ACM Transactions on Asian Language Information Processing, 13 (2). 1-23. ISSN 1530-0226

Full text: PDF (Author's version), 1MB, document in English. Access restricted to depositor and staff.

Official URL: http://dx.doi.org/10.1145/2601401

Abstract

This article presents the first investigation of the feasibility of segmenting Arabic discourse into elementary discourse units within the Segmented Discourse Representation Theory (SDRT) framework. We first describe our annotation scheme, which defines a set of principles to guide the segmentation process. Two corpora have been annotated according to this scheme: elementary school textbooks and newspaper documents extracted from the syntactically annotated Arabic Treebank. We then propose a multiclass supervised learning approach that predicts nested units. Our approach combines punctuation, morphological, lexical, and shallow syntactic features, and we investigate how each feature type contributes to the learning process. We show that an extensive morphological analysis is crucial to achieving good results on both corpora, and that adding chunks does not improve the performance of our system.
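
As a rough illustration of how nested units can be cast as a multiclass prediction problem, the sketch below converts a set of possibly nested units into one boundary label per token; a classifier could then be trained to predict these labels from token features. The label names (BEGIN, END, BOTH, NONE), the function `boundary_labels`, and the English example tokens are assumptions made for illustration only and are not taken from the article's annotation scheme.

```python
from typing import List, Tuple


def boundary_labels(n_tokens: int, units: List[Tuple[int, int]]) -> List[str]:
    """Map (possibly nested) units to one label per token.

    Each unit is an inclusive (start, end) pair of token indices.
    The four-way label set is a hypothetical choice for this sketch.
    """
    opens = [False] * n_tokens   # token starts at least one unit
    closes = [False] * n_tokens  # token ends at least one unit
    for start, end in units:
        if not (0 <= start <= end < n_tokens):
            raise ValueError(f"invalid unit span: {(start, end)}")
        opens[start] = True
        closes[end] = True

    labels = []
    for is_open, is_close in zip(opens, closes):
        if is_open and is_close:
            labels.append("BOTH")
        elif is_open:
            labels.append("BEGIN")
        elif is_close:
            labels.append("END")
        else:
            labels.append("NONE")
    return labels


# Illustrative English tokens (the article itself works on Arabic text).
tokens = ["The", "boy", ",", "who", "was", "late", ",", "ran", "."]
# One outer unit covering the sentence and one unit nested inside it.
units = [(0, 8), (3, 5)]

for token, label in zip(tokens, boundary_labels(len(tokens), units)):
    print(f"{token}\t{label}")
```

Encoding boundaries per token keeps the nesting recoverable in simple cases while letting a standard multiclass classifier be used; a richer label set would be needed when several units open or close on the same token.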

Item Type: Article
Additional Information: Thanks to ACM editor. The definitive version is available at http://dl.acm.org/citation.cfm?doid=2636326.2601401
HAL Id: hal-01120621
Audience (journal): International peer-reviewed journal
Institution: French research institutions > Centre National de la Recherche Scientifique - CNRS (FRANCE)
Université de Toulouse > Institut National Polytechnique de Toulouse - INPT (FRANCE)
Université de Toulouse > Université Toulouse III - Paul Sabatier - UPS (FRANCE)
Université de Toulouse > Université Toulouse - Jean Jaurès - UT2J (FRANCE)
Université de Toulouse > Université Toulouse 1 Capitole - UT1 (FRANCE)
Other partners > Université de Sfax (TUNISIA)
Deposited By: IRIT IRIT
Deposited On: 26 Feb 2015 10:34