Making MetaData Count

Machine learning has shown promising results in medical image diagnosis, at times with claims of expert-level performance. The availability of large public datasets have shifted the interest of the medical community to high-performance algorithms. However, little attention is paid to the quality of the data or annotations. Algorithms with high reported performances have been shown to suffer from overfitting or shortcuts, i.e. spurious correlations between artifacts in images and diagnostic labels. Examples include pen marks in skin lesion classification, patient position in detection of COVID-19, and chest drains in pneumothorax classification. Performance may appear high when training and evaluating on data with shortcuts, but degraded when the shortcut is removed. This happens because the algorithm cannot generalize based on the actual features related to the diagnosis.

Our goal is to redefine how meta-data is used and thus improve the robustness of algorithms. We plan to:

investigate what kind of different shortcuts (based on demographics or image artefacts) might occur and how these affect the performance and fairness of the algorithms ⚖️.
investigate meta-data-aware methods to try to avoid learning biases or shortcuts ⚔️🛡.

Publications

Augmenting Chest X-ray Datasets with Non-Expert Annotations
Veronika Cheplygina, Cathrine Damgaard, Trine Naja Eriksen, Dovile Juodelyte, Amelia Jiménez-Sánchez
MIUA 2025

In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review
Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Sarah de Boer, Víctor M. Campello, Aasa Feragen, Enzo Ferrante, Melanie Ganz, Judy Wawira Gichoya, Camila González, Steff Groefsema, Alessa Hering, Adam Hulman, Leo Joskowicz, Dovile Juodelyte, Melih Kandemir, Thijs Kooi, Jorge del Pozo Lérida, Livie Yumeng Li, Andre Pacheco, Tim Rädsch, Mauricio Reyes, Théo Sourget, Bram van Ginneken, David Wen, Nina Weng, Jack Junchi Xu, Hubert Dariusz Zając, Maria A. Zuluaga, Veronika Cheplygina
FAccT 2025

Copycats: the many lives of a publicly available medical imaging dataset
Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Hubert Dariusz Zając, Anna Rogers, Veronika Cheplygina
NeurIPS 2024 Track on Datasets and Benchmarks

Detecting Shortcuts in Medical Images — A Case Study in Chest X-rays
Amelia Jiménez-Sánchez, Dovile Juodelyte, Bethany Chamberlain, Veronika Cheplygina
ISBI 2023

Webinar

We are organizing a webinar series: Datasets through the L👀king-Glass to better understand what researchers are doing with their (meta-) data.

Workshop

We organized a 2-days workshop in Nyborg Strand (DK) In the Picture: Medical Imaging Datasets focused on the challenges within medical imaging datasets that hinder the development of fair and robust AI algorithms. We had several invited talks, and mostly group sessions that focused on engagement and collaboration.

Funding

DFF (Independent Research Council Denmark) Inge Lehmann 1134-00017B.
DDSA (Danish Data Science Academy) Large Event. Grant ID: 2024-2324.

Share on

Twitter Facebook LinkedIn