CLARA-MeD corpus

Name: CLARA-MeD corpus
Creator: Agencia Estatal Consejo Superior de Investigaciones Científicas
License: https://creativecommons.org/licenses/by-nc-sa/4.0/
Keywords: None

Publisher: Agencia Estatal Consejo Superior de Investigaciones Científicas
Administration level: State Administration
Entity: Public

Description

A collection of 24 298 pairs of professional and simplified texts (>96 million tokens): 1) drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) clinical trial announcements (5748 pairs of texts, 451 690 tokens). The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens). The corpus is intended as a benchmark for medical text simplification. The files were last downloaded in February 2022.

Distributions(2)

Identifier	https://digital.csic.es/bitstream/10261/269887/1/CLARA-MeD-corpus.zip
Access point URL	https://digital.csic.es/bitstream/10261/269887/1/CLARA-MeD-corpus.zip

Format	ZIP
Size	196.13 MB

Identifier	https://digital.csic.es/bitstream/10261/269887/4/README.txt
Access point URL	https://digital.csic.es/bitstream/10261/269887/4/README.txt

Format	plain
Size	8.1 KB
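
The ZIP distribution listed above can be inspected programmatically. The following is a minimal Python sketch, assuming the access point URL returns the archive directly over HTTPS. The helper names `fetch_bytes` and `list_archive` are hypothetical, and the internal layout of the archive is not described in this record, so the sketch only lists whatever files it finds.

```python
import io
import urllib.request
import zipfile
from typing import List, Tuple

# URL copied from the ZIP distribution's access point in this record.
CORPUS_URL = "https://digital.csic.es/bitstream/10261/269887/1/CLARA-MeD-corpus.zip"


def fetch_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Download a URL fully into memory (hypothetical helper).

    The archive is listed as 196.13 MB, so this needs a comparable
    amount of free memory.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def list_archive(data: bytes) -> List[Tuple[str, int]]:
    """Return (member name, uncompressed size in bytes) for each file in a ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]


# Example usage (performs a large network download, so it is left commented out):
# for name, size in list_archive(fetch_bytes(CORPUS_URL)):
#     print(f"{size:>12}  {name}")
```

Keeping the download and the listing in separate functions means the listing step can also be run on a local copy of the archive, read with `open(path, "rb").read()`.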

Keywords
Tags	Biomedical natural ..., Comparable corpus, Medical text simpli..., Parallel sentences
Categories
Categories	Science and technology, Healthcare
Coverage
Geographic coverage	Spain
Time coverage	From 14/05/2022 22:00 (UTC) to 14/05/2022 22:00 (UTC)
Language
Languages	Spanish, English

Identification
Identifier	http://hdl.handle.net/10261/269887
Last updated	18/05/2022 22:00 (UTC)
Creation date	18/05/2022 22:00 (UTC)
References
Other resources	https://github.com/lcampillos/CLARA-MeD/corpus
