Cystoscopy is a common diagnostic procedure for hematuria evaluation, bladder cancer surveillance, and guidance during resection of suspicious bladder lesions. As with other endoscopic procedures, standard white-light cystoscopy is limited in diagnostic accuracy and image quality and is highly operator dependent. Artificial intelligence is emerging as a potential tool for clinical decision support in endoscopy and endoscopic surgery. However, current deep learning methods for cystoscopy operate on highly curated individual images and do not consider the temporal correlation among frames within a video sequence, limiting their performance in real-world clinical settings. Herein we propose SEQ-3D, a sequential deep learning model for cystoscopic video analysis that captures both the short- and long-range relationships among frames and the spatial information within each frame. Model validation was performed on a benchmark cystoscopy video dataset derived from 60 patients and 163 pathologically confirmed bladder regions of interest (ROIs), consisting of representative cancerous bladder tumors and cancer-mimicking benign lesions. The full-length videos (216,870 frames) were annotated by expert clinicians and divided into distinct frame sequences, defined as scenes. SEQ-3D outperformed existing sequential deep learning methods, achieving a per-ROI accuracy of 100%, a per-scene sensitivity of 93.1%, and a per-scene specificity of 83.3%, demonstrating balanced performance in detecting a wide variety of bladder ROIs. These findings support deployment in prospective clinical settings and can be extrapolated to endoscopic applications in other organ systems.
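The abstract reports results at three granularities: per-frame predictions are grouped into scenes, and scenes into ROIs, before sensitivity, specificity, and accuracy are computed. As a minimal sketch of how such hierarchical aggregation can work, the snippet below rolls frame-level binary predictions up to per-scene and per-ROI labels by majority vote and computes sensitivity/specificity. The majority-vote rule, the threshold, and all identifiers here are illustrative assumptions, not the paper's actual aggregation scheme.

```python
# Hedged sketch: aggregating frame-level predictions into per-scene and
# per-ROI decisions. The majority-vote rule is an assumption for
# illustration; SEQ-3D's actual aggregation may differ.

def aggregate(labels, threshold=0.5):
    """Call a group positive if the positive fraction exceeds threshold."""
    return 1 if sum(labels) / len(labels) > threshold else 0

def scene_and_roi_labels(frame_preds_by_scene, scene_to_roi):
    """frame_preds_by_scene: {scene_id: [0/1 frame predictions]}
       scene_to_roi: {scene_id: roi_id} (each scene maps to one ROI)."""
    scene_labels = {s: aggregate(p) for s, p in frame_preds_by_scene.items()}
    roi_scenes = {}
    for s, r in scene_to_roi.items():
        roi_scenes.setdefault(r, []).append(scene_labels[s])
    roi_labels = {r: aggregate(v) for r, v in roi_scenes.items()}
    return scene_labels, roi_labels

def sensitivity_specificity(pred, truth):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for k in truth if truth[k] == 1 and pred[k] == 1)
    fn = sum(1 for k in truth if truth[k] == 1 and pred[k] == 0)
    tn = sum(1 for k in truth if truth[k] == 0 and pred[k] == 0)
    fp = sum(1 for k in truth if truth[k] == 0 and pred[k] == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

Aggregating before scoring is what distinguishes the per-scene metrics (93.1% sensitivity, 83.3% specificity) from the per-ROI accuracy (100%): a single noisy frame need not flip a scene, and a single noisy scene need not flip an ROI.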