Authors:
Khadim Dramé 1,2; Gorgoumack Sambe 1,2 and Gayo Diallo 3
Affiliations:
1 Laboratoire d’Informatique et d’Ingénierie pour l’Innovation, Ziguinchor, Senegal; 2 Université Assane Seck de Ziguinchor, Ziguinchor, Senegal; 3 SISTM - INRIA, BPH INSERM 1219, Univ. Bordeaux, Bordeaux, France
Keyword(s):
Text Duplication, Semantic Sentence Similarity, Multilayer Perceptron, French Clinical Notes.
Abstract:
Detecting similar sentences or paragraphs is a key issue when dealing with text duplication. This is particularly true in the clinical domain, for instance when identifying multiple occurrences of the same event. Owing to a lack of resources, the task is especially challenging for French clinical documents. In this paper, we introduce CONCORDIA, an approach based on supervised machine learning for computing the semantic similarity between sentences in French clinical texts. After briefly reviewing semantic textual similarity measures reported in the literature, we describe the approach, which relies on Random Forest, Multilayer Perceptron and Linear Regression algorithms to build supervised models. These models are then used to determine the degree of semantic similarity between clinical sentences. CONCORDIA is evaluated with two standard metrics, the Spearman correlation and EDRM, on the benchmarks provided for the DEFT 2020 text mining challenge. According to the official DEFT 2020 results, the Multilayer Perceptron based CONCORDIA model outperforms all other participating systems, reaching an EDRM of 0.8217.
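
To make the pipeline summarized above concrete, the following is a minimal sketch of supervised sentence similarity scoring evaluated with the Spearman correlation. It is not the CONCORDIA implementation: the single token-overlap feature, the one-variable least-squares fit (standing in for the Linear Regression model), the toy sentence pairs, their gold scores and the 0 to 5 scale are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import re


def tokens(sentence):
    """Return the set of lowercase word tokens (accented letters are kept)."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Token-level Jaccard overlap, used here as one illustrative feature."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Invented sentence pairs with invented gold scores (assumed 0 to 5 scale).
TOY_PAIRS = [
    ("Le patient présente une fièvre élevée.",
     "Le patient présente une forte fièvre.", 4.5),
    ("Prendre un comprimé matin et soir.",
     "Prendre un comprimé le matin.", 3.0),
    ("Le traitement a été bien toléré.",
     "La radiographie est normale.", 0.5),
]


def evaluate(pairs):
    """Fit on the pairs, predict on the same pairs, return the Spearman score."""
    features = [jaccard(a, b) for a, b, _ in pairs]
    gold = [score for _, _, score in pairs]
    slope, intercept = fit_line(features, gold)
    predicted = [slope * x + intercept for x in features]
    return spearman(predicted, gold)


if __name__ == "__main__":
    print("Spearman on toy data:", evaluate(TOY_PAIRS))
```

The sketch fits and scores on the same three pairs only to stay short; a realistic setup would use separate training and test sets, a richer feature set, and a model such as a Multilayer Perceptron or Random Forest in place of the single-feature fit.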