Is this good enough? On expert perception of brain tumor segmentation quality

Katharina Hoebel; Christopher P. Bridge; Sara Ahmed; Oluwatosin Akintola; Caroline Chung; Raymond Huang; Jason Johnson; Albert Kim; K. Ina Ly; Ken Chang; Jay Patel; Marco Pinho; Tracy T. Batchelor; Bruce Rosen; Elizabeth Gerstner; Jayashree Kalpathy-Cramer

doi:10.1117/12.2611810

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The performance of Deep Learning (DL) segmentation algorithms is routinely determined using quantitative metrics like the Dice score and Hausdorff distance. However, these metrics show a low concordance with humans' perception of segmentation quality. The successful collaboration of health care professionals with DL segmentation algorithms will require a detailed understanding of experts' assessment of segmentation quality. Here, we present the results of a study on expert quality perception of brain tumor segmentations of brain MR images generated by a DL segmentation algorithm. Eight expert medical professionals were asked to grade the quality of segmentations on a scale from 1 (worst) to 4 (best). To this end, we collected four ratings for a dataset of 60 cases. We observed a low inter-rater agreement among all raters (Krippendorff's alpha: 0.34), which potentially is a result of different internal cutoffs for the quality ratings. Several factors, including the volume of the segmentation and model uncertainty, were associated with high disagreement between raters. Furthermore, the correlations between the ratings and commonly used quantitative segmentation quality metrics ranged from no to moderate correlation. We conclude that, similar to the inter-rater variability observed for manual brain tumor segmentation, segmentation quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences. Clearer guidelines for quality evaluation could help to mitigate these differences. Importantly, existing technical metrics do not capture clinical perception of segmentation quality. A better understanding of expert quality perception is expected to support the design of more human-centered DL algorithms for integration into the clinical workflow.
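
For readers unfamiliar with the two quantities named in the abstract, the following is a minimal illustrative sketch of how a Dice score and Krippendorff's alpha can be computed. It is not the authors' code. The function names, the toy ratings, and the choice of a squared-difference (interval) distance are assumptions made here for illustration; the abstract does not state which difference function the study used.

```python
from itertools import permutations


def dice_score(pred, ref):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Convention assumed here: two empty masks count as perfect overlap.
    return 1.0 if total == 0 else 2.0 * intersection / total


def krippendorff_alpha(units, distance=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha for a list of cases, each a list of ratings.

    Cases may have different numbers of ratings; cases with fewer than
    two ratings are not pairable and are ignored. The default distance
    is the squared difference (interval metric), an assumption.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: within-case ordered pairs, weighted by 1/(m_u - 1).
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: all ordered pairs of pairable ratings.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Hypothetical toy data: four ratings per case on the 1 (worst) to 4 (best) scale.
    toy_ratings = [[4, 4, 3, 4], [2, 3, 1, 2], [3, 3, 4, 2]]
    print("alpha:", krippendorff_alpha(toy_ratings))
    print("dice:", dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

An alpha of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.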

Original language	English (US)
Title of host publication	Medical Imaging 2022
Subtitle of host publication	Image Perception, Observer Performance, and Technology Assessment
Editors	Claudia R. Mello-Thoms, Sian Taylor-Phillips
Publisher	SPIE
ISBN (Electronic)	9781510649453
DOIs	https://doi.org/10.1117/12.2611810
State	Published - 2022
Event	Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment - Virtual, Online. Duration: Mar 21 2022 → Mar 27 2022

Publication series

Name	Progress in Biomedical Optics and Imaging - Proceedings of SPIE
Volume	12035
ISSN (Print)	1605-7422

Conference

Conference	Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment
City	Virtual, Online
Period	3/21/22 → 3/27/22

Keywords

deep learning
inter-rater variability
quality assessment
segmentation

ASJC Scopus subject areas

Electronic, Optical and Magnetic Materials
Atomic and Molecular Physics, and Optics
Biomaterials
Radiology, Nuclear Medicine and Imaging

Cite this

Hoebel, K., Bridge, C. P., Ahmed, S., Akintola, O., Chung, C., Huang, R., Johnson, J., Kim, A., Ly, K. I., Chang, K., Patel, J., Pinho, M., Batchelor, T. T., Rosen, B., Gerstner, E., & Kalpathy-Cramer, J. (2022). Is this good enough? On expert perception of brain tumor segmentation quality. In C. R. Mello-Thoms & S. Taylor-Phillips (Eds.), Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment, Article 120350P (Progress in Biomedical Optics and Imaging - Proceedings of SPIE; Vol. 12035). SPIE. https://doi.org/10.1117/12.2611810

Is this good enough? On expert perception of brain tumor segmentation quality. / Hoebel, Katharina; Bridge, Christopher P.; Ahmed, Sara et al.
Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment. ed. / Claudia R. Mello-Thoms; Sian Taylor-Phillips. SPIE, 2022. 120350P (Progress in Biomedical Optics and Imaging - Proceedings of SPIE; Vol. 12035).

Hoebel, K, Bridge, CP, Ahmed, S, Akintola, O, Chung, C, Huang, R, Johnson, J, Kim, A, Ly, KI, Chang, K, Patel, J, Pinho, M, Batchelor, TT, Rosen, B, Gerstner, E & Kalpathy-Cramer, J 2022, Is this good enough? On expert perception of brain tumor segmentation quality. in CR Mello-Thoms & S Taylor-Phillips (eds), Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment., 120350P, Progress in Biomedical Optics and Imaging - Proceedings of SPIE, vol. 12035, SPIE, Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment, Virtual, Online, 3/21/22. https://doi.org/10.1117/12.2611810

Hoebel K, Bridge CP, Ahmed S, Akintola O, Chung C, Huang R et al. Is this good enough? On expert perception of brain tumor segmentation quality. In Mello-Thoms CR, Taylor-Phillips S, editors, Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment. SPIE. 2022. 120350P. (Progress in Biomedical Optics and Imaging - Proceedings of SPIE). doi: 10.1117/12.2611810

Hoebel, Katharina ; Bridge, Christopher P. ; Ahmed, Sara et al. / Is this good enough? On expert perception of brain tumor segmentation quality. Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment. editor / Claudia R. Mello-Thoms ; Sian Taylor-Phillips. SPIE, 2022. (Progress in Biomedical Optics and Imaging - Proceedings of SPIE).

@inproceedings{703099b5475c460094c20c86944597e9,

title = "Is this good enough? On expert perception of brain tumor segmentation quality",

abstract = "The performance of Deep Learning (DL) segmentation algorithms is routinely determined using quantitative metrics like the Dice score and Hausdorff distance. However, these metrics show a low concordance with humans' perception of segmentation quality. The successful collaboration of health care professionals with DL segmentation algorithms will require a detailed understanding of experts' assessment of segmentation quality. Here, we present the results of a study on expert quality perception of brain tumor segmentations of brain MR images generated by a DL segmentation algorithm. Eight expert medical professionals were asked to grade the quality of segmentations on a scale from 1 (worst) to 4 (best). To this end, we collected four ratings for a dataset of 60 cases. We observed a low inter-rater agreement among all raters (Krippendorff's alpha: 0.34), which potentially is a result of different internal cutoffs for the quality ratings. Several factors, including the volume of the segmentation and model uncertainty, were associated with high disagreement between raters. Furthermore, the correlations between the ratings and commonly used quantitative segmentation quality metrics ranged from no to moderate correlation. We conclude that, similar to the inter-rater variability observed for manual brain tumor segmentation, segmentation quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences. Clearer guidelines for quality evaluation could help to mitigate these differences. Importantly, existing technical metrics do not capture clinical perception of segmentation quality. A better understanding of expert quality perception is expected to support the design of more human-centered DL algorithms for integration into the clinical workflow.",

keywords = "deep learning, inter-rater variability, quality assessment, segmentation",

author = "Katharina Hoebel and Bridge, {Christopher P.} and Sara Ahmed and Oluwatosin Akintola and Caroline Chung and Raymond Huang and Jason Johnson and Albert Kim and Ly, {K. Ina} and Ken Chang and Jay Patel and Marco Pinho and Batchelor, {Tracy T.} and Bruce Rosen and Elizabeth Gerstner and Jayashree Kalpathy-Cramer",

note = "Publisher Copyright: {\textcopyright} 2022 SPIE. All rights reserved.; Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment ; Conference date: 21-03-2022 Through 27-03-2022",

year = "2022",

doi = "10.1117/12.2611810",

language = "English (US)",

series = "Progress in Biomedical Optics and Imaging - Proceedings of SPIE",

publisher = "SPIE",

editor = "Mello-Thoms, {Claudia R.} and Sian Taylor-Phillips",

booktitle = "Medical Imaging 2022",

}

TY - GEN

T1 - Is this good enough? On expert perception of brain tumor segmentation quality

AU - Hoebel, Katharina

AU - Bridge, Christopher P.

AU - Ahmed, Sara

AU - Akintola, Oluwatosin

AU - Chung, Caroline

AU - Huang, Raymond

AU - Johnson, Jason

AU - Kim, Albert

AU - Ly, K. Ina

AU - Chang, Ken

AU - Patel, Jay

AU - Pinho, Marco

AU - Batchelor, Tracy T.

AU - Rosen, Bruce

AU - Gerstner, Elizabeth

AU - Kalpathy-Cramer, Jayashree

PY - 2022

Y1 - 2022

N2 - The performance of Deep Learning (DL) segmentation algorithms is routinely determined using quantitative metrics like the Dice score and Hausdorff distance. However, these metrics show a low concordance with humans' perception of segmentation quality. The successful collaboration of health care professionals with DL segmentation algorithms will require a detailed understanding of experts' assessment of segmentation quality. Here, we present the results of a study on expert quality perception of brain tumor segmentations of brain MR images generated by a DL segmentation algorithm. Eight expert medical professionals were asked to grade the quality of segmentations on a scale from 1 (worst) to 4 (best). To this end, we collected four ratings for a dataset of 60 cases. We observed a low inter-rater agreement among all raters (Krippendorff's alpha: 0.34), which potentially is a result of different internal cutoffs for the quality ratings. Several factors, including the volume of the segmentation and model uncertainty, were associated with high disagreement between raters. Furthermore, the correlations between the ratings and commonly used quantitative segmentation quality metrics ranged from no to moderate correlation. We conclude that, similar to the inter-rater variability observed for manual brain tumor segmentation, segmentation quality ratings are prone to variability due to the ambiguity of tumor boundaries and individual perceptual differences. Clearer guidelines for quality evaluation could help to mitigate these differences. Importantly, existing technical metrics do not capture clinical perception of segmentation quality. A better understanding of expert quality perception is expected to support the design of more human-centered DL algorithms for integration into the clinical workflow.

KW - deep learning

KW - inter-rater variability

KW - quality assessment

KW - segmentation

UR - http://www.scopus.com/inward/record.url?scp=85131881240&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85131881240&partnerID=8YFLogxK

U2 - 10.1117/12.2611810

DO - 10.1117/12.2611810

M3 - Conference contribution

AN - SCOPUS:85131881240

T3 - Progress in Biomedical Optics and Imaging - Proceedings of SPIE

BT - Medical Imaging 2022

A2 - Mello-Thoms, Claudia R.

A2 - Taylor-Phillips, Sian

PB - SPIE

T2 - Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment

Y2 - 21 March 2022 through 27 March 2022

ER -
