3D Multi-Modal Foundation Model Advances OCT Imaging


Vision loss caused by retinal diseases remains a pervasive global health challenge, profoundly affecting millions and ranking among the leading causes of blindness and visual impairment worldwide. The complexity of retinal diseases demands sophisticated diagnostic tools capable of capturing the intricate structural abnormalities occurring within the retina’s layered architecture. Optical coherence tomography (OCT), a non-invasive imaging modality, has emerged as an indispensable technology in ophthalmology by providing high-resolution, cross-sectional images that reveal the three-dimensional microstructure of the retina. Despite the wealth of information OCT provides, the challenge remains to efficiently and comprehensively analyze this volumetric data to advance early detection, diagnosis, and prognosis of retinal disorders.

OCT imaging excels in revealing detailed retinal morphology, but conventional computational models often treat these volumetric datasets as collections of isolated two-dimensional slices or neglect inter-slice contextual information. This piecemeal approach can lead to information loss and insufficient exploitation of the intrinsic three-dimensional continuity of OCT volumes. Additionally, retinal diagnostics commonly rely on multiple complementary imaging modalities beyond OCT, including fundus autofluorescence (FAF) and infrared retinal imaging (IR), which provide diverse functional and structural perspectives. Until recently, integrating these diverse yet interrelated imaging sources into unified analytical frameworks has been an unfulfilled frontier, constraining the ability to generate holistic and accurate diagnostic models.

In a study reported in Nature Biomedical Engineering, Liu et al. introduce OCTCube-M, a pioneering three-dimensional multi-modal foundation model designed to capitalize on the full spatial information contained within OCT volumes while seamlessly integrating additional retinal imaging modalities. This framework signifies a paradigm shift in retinal image analysis by harnessing a multi-modal contrastive learning technique dubbed COEP, optimized to align and synergize 3D OCT data with two-dimensional en face (EF) images and other imaging formats. Through this approach, the authors address critical gaps in current retinal imaging analytics, laying the foundation for robust automated systems with unprecedented diagnostic and prognostic capabilities.

The architecture of OCTCube-M is centered around three model variants of increasing complexity and data scope. OCTCube represents the foundational uni-modal model, pretrained on an extensive cohort of 26,605 volumetric OCT scans comprising approximately 1.62 million individual 2D slices. This vast amount of training data enables the model to learn rich multi-scale spatial features that underpin the structural heterogeneity of healthy and diseased retinas. Moving beyond single-modality analysis, OCTCube-IR incorporates paired infrared retinal images, leveraging 26,685 matched OCT-IR pairs to refine cross-modality representations and facilitate integrated interpretation. Finally, OCTCube-EF further expands the framework with tri-modal learning by including over 400,000 2D en face retinal images alongside more than 4 million OCT slices, targeting complex prognostic applications such as quantifying the growth rate of geographic atrophy.

One of the crowning achievements of OCTCube is its consistent state-of-the-art performance across eight major retinal diseases, including diabetic retinopathy, age-related macular degeneration (AMD), and retinal vein occlusion. The model's 3D native learning approach preserves spatial continuity across slices, enabling it to detect subtle pathologies that often elude 2D slice-based methods. More significantly, OCTCube demonstrates strong generalization, proving robust when deployed across different clinical cohorts, imaging devices, and even disparate imaging modalities. Such robustness is critical for real-world clinical applicability, ensuring that AI-driven diagnostic tools maintain reliability beyond controlled training environments.
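To make the "3D native" idea concrete, the sketch below shows one common way (used by ViT-style volumetric models generally; the exact tokenization in OCTCube is not specified here) to turn an OCT volume into tokens that span several adjacent B-scans, so that attention can operate across slices rather than on isolated 2D images. The patch sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def patchify_3d(volume, patch=(4, 16, 16)):
    """Split an OCT volume (slices, height, width) into non-overlapping
    3D patches ("cubes"), each flattened into one token vector.

    Because each token spans several adjacent B-scans, a transformer
    over these tokens sees inter-slice context that 2D slice-based
    pipelines discard.
    """
    d, h, w = volume.shape
    pd, ph, pw = patch
    assert d % pd == 0 and h % ph == 0 and w % pw == 0
    tokens = (volume
              .reshape(d // pd, pd, h // ph, ph, w // pw, pw)
              .transpose(0, 2, 4, 1, 3, 5)   # group the 3 block axes first
              .reshape(-1, pd * ph * pw))
    return tokens  # shape: (num_cubes, patch_volume)

# Toy volume: 8 B-scans of 32x32 pixels.
vol = np.random.rand(8, 32, 32)
tok = patchify_3d(vol)
print(tok.shape)  # (2 * 2 * 2, 4 * 16 * 16) = (8, 1024)
```

Each row of `tok` would then be linearly projected to an embedding and fed to the transformer encoder, exactly as 2D patches are in a standard ViT.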

The OCTCube-IR model capitalizes on the synergy between OCT’s detailed volumetric data and IR imaging’s enhanced visualization of retinal vasculature and pigmentation. By jointly analyzing these modalities, OCTCube-IR can perform accurate cross-modal retrieval, enabling seamless matching of patient data even when one modality is missing or imperfect. This integration enhances diagnostic confidence and opens pathways for novel clinical workflows that leverage multi-dimensional imaging data. Moreover, the combined analysis paves the way for detecting mixed phenotypes and subtle disease markers that are better characterized when viewed through multiple optical lenses.
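Cross-modal retrieval of the kind described above reduces to nearest-neighbor search in a shared embedding space: once OCT volumes and IR images are encoded by aligned encoders, a query from one modality can rank candidates from the other by cosine similarity. The sketch below illustrates only that retrieval step, with random vectors standing in for real encoder outputs.

```python
import numpy as np

def retrieve(query_emb, gallery_embs):
    """Rank gallery embeddings by cosine similarity to the query.

    In an aligned embedding space, an OCT-volume embedding can retrieve
    its paired IR image (and vice versa), even when one modality is
    missing or degraded at acquisition time.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery item
    return np.argsort(-sims)          # best match first

# Toy example: 3 IR embeddings; the OCT query is a noisy copy of entry 2,
# mimicking two well-aligned encoders viewing the same patient.
rng = np.random.default_rng(0)
ir = rng.normal(size=(3, 8))
oct_query = ir[2] + 0.01 * rng.normal(size=8)
ranking = retrieve(oct_query, ir)
print(ranking[0])  # index of the best-matching IR image
```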

OCTCube-EF represents the zenith of multi-modal integration by combining volumetric OCT with en face imaging to tackle the demanding challenge of predicting geographic atrophy progression—a key vision-threatening feature of advanced AMD. Trained on an unparalleled dataset pool derived from six multicenter clinical trials spanning 23 countries, OCTCube-EF excels in quantifying and forecasting disease progression rates across diverse patient populations. This capability holds promise for personalized medicine, enabling clinicians to tailor interventions and monitor therapeutic efficacy more precisely in clinical trial contexts and routine care.

The development of OCTCube-M vividly illustrates the transformative power of contrastive learning-based multimodal fusion strategies in medical imaging. By learning aligned feature representations across divergent data types, COEP facilitates a common embedding space that respects individual modality strengths while enabling cross-talk and integrated reasoning. This technical advancement not only improves performance metrics but also enhances interpretability—a crucial factor in gaining clinical trust as it aids ophthalmologists in correlating AI insights with known pathophysiological bases seen across modalities.
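The alignment mechanism described above is the core of CLIP-style contrastive pretraining: matched cross-modal pairs are pulled together and mismatched pairs pushed apart via a symmetric InfoNCE objective. The sketch below is a minimal NumPy version of that generic loss, not the specific COEP formulation, whose details are in the paper; the temperature value is an illustrative assumption.

```python
import numpy as np

def clip_style_loss(oct_embs, ef_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of each matrix is assumed to come from the same patient/scan,
    so the i-th OCT embedding should match the i-th en-face embedding
    and mismatch all others, carving out a shared embedding space.
    """
    a = oct_embs / np.linalg.norm(oct_embs, axis=1, keepdims=True)
    b = ef_embs / np.linalg.norm(ef_embs, axis=1, keepdims=True)
    logits = a @ b.T / temperature        # pairwise cosine similarities
    labels = np.arange(len(a))            # pair i matches pair i

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[np.arange(len(l)), labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))        # both directions

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))
loss_aligned = clip_style_loss(x, x)                     # perfectly paired
loss_random = clip_style_loss(x, rng.normal(size=(4, 16)))
print(loss_aligned < loss_random)  # aligned pairs give a lower loss
```

Minimizing this objective is what produces the common embedding space the article refers to, in which embeddings from different modalities become directly comparable.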

Furthermore, the sheer scale and diversity of the training data underpinning OCTCube-M constitute one of the largest and most comprehensive retinal imaging repositories ever assembled. This extensive dataset diversity undergirds the models’ generalizability and resilience to variations introduced by demographic, device, or protocol differences, thus catalyzing the translation of research prototypes into clinically deployable tools. The foundation model philosophy embodied here—emphasizing pretraining on vast heterogeneous datasets before fine-tuning—mirrors successful strategies in natural language processing and computer vision, marking a pivotal step in ophthalmic AI.

In addition to its clinical implications, OCTCube-M sets elevated standards for future research in retinal imaging and computational ophthalmology. By demonstrating effective strategies for integrating volumetric and planar imaging data, it invites the exploration of other combinations of retinal image modalities, such as fluorescein angiography or adaptive optics scanning laser ophthalmoscopy. Moreover, the multi-modal contrastive learning framework is broadly applicable beyond ophthalmology, suggesting pathways to revolutionize imaging diagnostics across medical specialties reliant on heterogeneous imaging data.

The potential impact of OCTCube-M extends beyond diagnostic accuracy to inform clinical decision-making, patient stratification, and trial design. The ability to accurately predict disease progression trajectories empowers clinicians and researchers with actionable insights to optimize treatment plans and evaluate novel therapies more efficiently. In geographic atrophy, for example, objective biomarkers derived from OCTCube-EF could accelerate the development of disease-modifying drugs by providing reliable surrogate endpoints, thus addressing a critical unmet need in retinal therapeutics.

As the technology matures, real-world integration of OCTCube-M into clinical workflows will require careful consideration of usability, interoperability, and regulatory compliance. Its modular design allows adaptability across different health systems and imaging platforms, but challenges remain in standardizing input data formats and ensuring patient privacy during large-scale model deployment. Collaborative efforts among clinicians, engineers, and regulatory agencies will be essential in overcoming these hurdles and translating technological promise into routine practice.

In conclusion, the introduction of OCTCube-M marks a monumental leap in retinal imaging analytics through its innovative 3D multi-modal learning paradigm. By fully harnessing the rich structural details of OCT alongside complementary imaging modalities, it achieves a holistic view of retinal pathology that eclipses previous uni-modal approaches. This advance is poised to revolutionize how retinal diseases are diagnosed, monitored, and ultimately treated, heralding a new era of precision ophthalmology informed by sophisticated AI-driven insights.

Looking forward, the principles and frameworks established here will undoubtedly inspire future developments in multi-modal medical AI, driving innovation in complex disease understanding and management. OCTCube-M exemplifies the cutting edge of AI’s synergy with medical imaging, where deep learning models do not merely analyze pixels but serve as integral partners in unraveling human biology and improving patient outcomes in visionary new ways.

Subject of Research:
Three-dimensional multi-modal foundation models for integrated analysis of retinal imaging data, including optical coherence tomography, en face imaging, and infrared retinal imaging for diagnosis and prognosis of retinal diseases.

Article Title:
A three-dimensional multi-modal foundation model for optical coherence tomography.

Article References:
Liu, Z., Xu, H., Woicik, A. et al. A three-dimensional multi-modal foundation model for optical coherence tomography. Nat. Biomed. Eng. (2026). https://doi.org/10.1038/s41551-026-01662-2

Image Credits:
AI Generated

DOI:
https://doi.org/10.1038/s41551-026-01662-2

Tags: 3D multi-modal foundation model, advanced ophthalmic diagnostic tools, cross-sectional retina imaging, early detection of retinal disorders, fundus autofluorescence imaging, infrared retinal imaging integration, multi-modal retinal imaging techniques, optical coherence tomography imaging, retinal disease diagnosis, retinal health assessment technology, retinal morphology imaging, volumetric OCT data analysis