Critical data studies and living review

Datasets are crucial to medical imaging research, yet issues such as poor dataset management, shortcuts, and biases are often overlooked, undermining algorithm generalizability and potentially affecting patient outcomes. In [JA24], I investigated dataset management practices by analyzing the 30 most-cited datasets across computer vision, NLP, and medical imaging. The study focused on key aspects such as data sharing, documentation, and distribution. This analysis revealed systematic issues: vague licensing terms, missing metadata (e.g., on patient or scanner characteristics, which raises the risk of data leakage), and duplicate entries on platforms like Kaggle and HuggingFace. These issues hinder reproducibility and may raise ethical concerns that can lead to dataset retraction. Tracking dataset versions and citations also remains difficult [ST24]. Our recommendations aim to strengthen data governance and support fair, reliable ML in healthcare.

Despite the critical importance of data, current efforts do not consider the evolving nature of datasets and fail to incorporate emerging evidence (e.g., shortcuts [JA23b], biases, or new annotations [CV25]). We refer to these emerging findings as research artifacts. In [JA25], I proposed a framework for a living review that continuously tracks public datasets and their evolving artifacts across diverse medical imaging tasks. A public demo is available at http://inthepicture.itu.dk/. This work grew out of a year-long collaborative webinar series and an in-person workshop, both of which I co-organized. These events brought together around 50 researchers and clinicians from academia, industry, and clinical practice, with research experience from 10+ countries across five continents.

Through these recent works, I have developed expertise in mixed methods from the social sciences, combining qualitative and quantitative approaches. This socio-technical perspective will inform the design of more robust, data-centric ML methods.

Publications

[JA25] In the picture: medical imaging datasets, artifacts, and their living review
Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Sarah de Boer, Víctor M. Campello, Aasa Feragen, Enzo Ferrante, Melanie Ganz, Judy Wawira Gichoya, Camila González, Steff Groefsema, Alessa Hering, Adam Hulman, Leo Joskowicz, Dovile Juodelyte, Melih Kandemir, Thijs Kooi, Jorge del Pozo Lérida, Livie Yumeng Li, Andre Pacheco, Tim Rädsch, Mauricio Reyes, Théo Sourget, Bram van Ginneken, David Wen, Nina Weng, Jack Junchi Xu, Hubert Dariusz Zając, Maria A. Zuluaga, Veronika Cheplygina
FAccT 2025
PDF   Demo

[CV25] Augmenting chest x-ray datasets with non-expert annotations
Veronika Cheplygina, Cathrine Damgaard, Trine Naja Eriksen, Dovile Juodelyte, Amelia Jiménez-Sánchez
MIUA 2025
PDF Code Dataset

[JA24] Copycats: the many lives of a publicly available medical imaging dataset
Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Hubert Dariusz Zając, Anna Rogers, Veronika Cheplygina
NeurIPS 2024 Track on Datasets and Benchmarks
PDF   Slides   Poster

[ST24] [Citation needed] Data usage and citation practices in medical imaging conferences
Théo Sourget, Ahmet Akkoç, Stinna Winther, Christine Lyngbye Galsgaard, Amelia Jiménez-Sánchez, Dovile Juodelyte, Caroline Petitjean, Veronika Cheplygina
MIDL 2024 (oral)
PDF   Code   Tool

[JA23b] Detecting shortcuts in medical images — a case study in chest x-rays
Amelia Jiménez-Sánchez, Dovile Juodelyte, Bethany Chamberlain, Veronika Cheplygina
ISBI 2023
PDF Code

Webinar

We are organizing a webinar series, Datasets through the L👀king-Glass, to better understand what researchers are doing with their (meta-)data.

Workshop

In September 2024, we organized a two-day workshop at Nyborg Strand (DK), In the Picture: Medical Imaging Datasets, focused on the challenges within medical imaging datasets that hinder the development of fair and robust AI algorithms. The program included several invited talks but consisted mostly of group sessions centered on engagement and collaboration.

Funding

  • DFF (Independent Research Fund Denmark) Inge Lehmann. Grant ID: 1134-00017B.
  • DDSA (Danish Data Science Academy) Large Event. Grant ID: 2024-2324.