CLARA-MeD corpus

Description

A collection of 24 298 pairs of professional and simplified texts (>96 million tokens): 1) drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) clinical trial announcements (5748 pairs of texts, 451 690 tokens). The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens). The corpus is intended as a benchmark for medical text simplification. The source files were last downloaded in February 2022.

Distributions

  • CLARA-MeD-corpus.zip
    application/x-zip-compressed
    ZIP
    205657210 Bytes
  • README.txt
    text/plain
    plain
    8294 Bytes

Additional information

Creation date 18/05/2022 22:00 (UTC)
Last updated 18/05/2022 22:00 (UTC)
Temporal coverage
  • From 14/05/2022 22:00 (UTC) to 14/05/2022 22:00 (UTC)
Geographic coverage Spain
Languages
  • Spanish
  • English
Other resources