This study evaluates the diagnostic performance of a multimodal artificial intelligence (AI) system (AIMD.1) using de-identified medical images and semi-synthetic patient simulations. The study combines retrospective analysis of existing publicly available image datasets with prospective data collection from licensed clinicians who complete diagnostic evaluation tasks. In the One-Shot Vision Differential Evaluation (OSVDE) stage, clinicians review individual de-identified medical images and generate a ranked list of potential diagnoses based solely on visual features. In the Multi-Step Conversational Non-Inferiority Evaluation (MSCNE) stage, clinicians complete diagnostic assessments using semi-synthetic patient simulations derived from de-identified medical images. Clinician performance will be compared with the AI system on the same diagnostic tasks. Human participants consist solely of licensed clinicians who provide diagnostic responses. Medical images and simulated cases are study materials and are not considered study participants. No identifiable patient data are used, and the AI system is evaluated in an offline research environment and is not used for clinical decision-making or patient care.
AIMD.1 is a multimodal artificial intelligence (AI) diagnostic system developed by Nolla Health, designed to assist clinical reasoning through analysis of medical images and structured conversational diagnostic interactions. This study is a benchmark performance evaluation of AIMD.1, conducted entirely in an offline research environment prior to any prospective validation involving real patients; the system is not used to guide real-world clinical care or patient management during the study. The global healthcare system faces a significant workforce shortage, with projections suggesting a deficit of up to 11 million practitioners by 2030. AI systems for medical diagnosis have shown promise in addressing this gap in controlled research settings, but rigorous benchmark validation against clinician-level performance is needed before clinical deployment. This study addresses that need by evaluating AIMD.1 against both established AI benchmark systems and directly against licensed clinicians completing the same diagnostic tasks. The study employs two complementary evaluation stages designed to assess distinct aspects of diagnostic capability. Stage 1 - One-Shot Vision Differential Evaluation (OSVDE): The AI system and clinician participants independently review individual de-identified medical images and generate ranked top-5 differential diagnoses based solely on visual features.
The image corpus comprises approximately 11,500-15,000 de-identified images spanning at least 12 medical specialties (Dermatology, Internal Medicine, Otolaryngology, Gynecology, Orthopedics, Pediatrics, Geriatrics, Emergency Medicine, Ophthalmology, Endocrinology, Family Medicine, and others) and 48 disease clusters. Image sources include approximately 14,000 images retrieved from standard search engines (Google and Bing) and open-access repositories such as the PMC Open Access Dataset, filtered for Creative Commons and public-domain licensing, as well as approximately 1,000 de-identified clinical images provided by Nolla Health under terms of service permitting de-identified use for research purposes; all images are de-identified per HIPAA Safe Harbor standards. All source images undergo random combinations of affine and non-affine transformations (blurring, sharpening, contrast adjustment, color adjustment, pixel shifting, rotation, stretching, and Gaussian noise, among others) to produce images that are substantially distinct from the originals while preserving clinically relevant visual features. Each transformed image is verified by at least one licensed dermatologist or primary care clinician and relabeled as necessary, with ambiguous or non-clinically-relevant images removed from the corpus. This preprocessing pipeline also provides additional de-identification through cropping or masking of potentially identifiable regions, including facial features. Images are located using disease-name keywords defined in the study's disease ontology and downloaded in standard formats (JPEG, PNG) with associated metadata, including ground-truth diagnostic labels and disease categories. Where available, additional metadata such as Fitzpatrick skin type (I-VI) and patient age range (pediatric, adult, geriatric) are recorded to enable subgroup analyses of diagnostic performance across demographic categories.
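The random transformation step described above can be illustrated with a minimal NumPy sketch. This is not the study's actual pipeline; the function names, parameter ranges, and grayscale stand-in image are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def box_blur(img, k=3):
    """Naive box blur: average each pixel over a k x k reflect-padded neighborhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def transform(img):
    """Apply a random combination of the protocol's example transforms
    (blur, contrast adjustment, pixel shifting, rotation, Gaussian noise).
    All thresholds and magnitudes here are hypothetical placeholders."""
    out = img.astype(float)
    if rng.random() < 0.5:
        out = box_blur(out)
    if rng.random() < 0.5:  # contrast adjustment around the image mean
        out = (out - out.mean()) * rng.uniform(0.8, 1.2) + out.mean()
    out = np.roll(out, rng.integers(-3, 4), axis=int(rng.integers(0, 2)))  # pixel shift
    out = np.rot90(out, k=int(rng.integers(0, 4)))                         # rotation
    out = out + rng.normal(0.0, 2.0, out.shape)                            # Gaussian noise
    return np.clip(out, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, (64, 64), dtype=np.uint8)  # stand-in grayscale image
aug = transform(img)
```

In the real pipeline each transformed image would then go to the clinician verification and relabeling step described above.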
Stage 2 - Multi-Step Conversational Non-Inferiority Evaluation (MSCNE): The AI system and clinicians complete diagnostic tasks using semi-synthetic patient simulations grounded in de-identified medical images. These simulations deliver structured clinical information through a conversational interface across multiple interaction steps, allowing assessment of multi-step diagnostic reasoning that more closely mirrors real-world clinical encounters than single-image evaluation alone. Approximately 380-500 simulated cases are evaluated. Each clinician completes a subset (30-50%) of the simulation cases, and their performance serves as the human benchmark for a formal non-inferiority comparison with the AI system. Approximately 10-30 licensed clinicians will participate in the study. Clinicians must hold an active license in at least one of the target medical specialties and must be 18 years of age or older. Clinicians will be recruited via professional networks, institutional contacts, and relevant medical associations; participation is voluntary. Clinicians will complete diagnostic evaluation sessions remotely using a computer or tablet with a reliable internet connection. Sessions are expected to last approximately 60-90 minutes in aggregate. Clinicians will provide differential diagnoses for subsets of the image and simulation cases; for OSVDE, each clinician reviews approximately 10-30% of the image dataset. Human participants consist solely of clinicians providing diagnostic responses; the image datasets and synthetic cases serve as study materials and are not considered participants. Clinicians are compensated $1.00 per case for the OSVDE one-shot visual evaluation and $10.00 per case for the MSCNE multi-step conversational evaluation. Compensation is for time and effort and is not contingent on diagnostic accuracy. Performance is compared using paired statistical designs in which both the AI system and clinicians evaluate overlapping case sets.
The AI system is additionally benchmarked against established AI diagnostic systems on the same datasets. Clinician participants will receive a Clinician Information Sheet describing the study purpose, procedures, voluntary nature of participation, data handling practices, and contact information for the research team and IRB prior to participation. Clinicians will indicate their acknowledgment before beginning the evaluation. All images used in the study are de-identified and originate from publicly available sources or datasets that meet de-identification standards. No electronic health record (EHR) data is accessed at any point during the study. No stored or processed image remains substantially the same as any source image due to the transformation pipeline, reinforcing compliance with de-identification safe harbor standards. Additional preprocessing steps ensure removal of any potentially identifiable information before inclusion in the research dataset. Images are assigned sequential study identifiers (e.g., IMG_00001) with no linkage to original sources. No code key or crosswalk exists that could enable re-identification. Data is stored on HIPAA-compliant workstations or cloud services (GCP or AWS) with AES-256 encryption at rest, multi-factor authentication, role-based access controls, and all access logged and audited. Data transfers use TLS 1.3 or stronger encryption in transit. The primary outcome measure is Top-1 diagnostic accuracy, defined as the proportion of cases in which the AI system's primary diagnosis matches the reference diagnosis. Secondary outcomes include Top-5 diagnostic accuracy, expected calibration error (ECE), area under the ROC curve (AUC) per disease cluster, per-class sensitivity and specificity across disease categories, and time-to-diagnosis measured in conversational turns for the MSCNE simulated cases.
Statistical analysis employs bootstrap resampling (1,000 iterations) for confidence interval estimation, McNemar's test for paired accuracy comparisons, and non-inferiority testing with a pre-specified margin of δ = 5%. With 11,500-15,000 images (approximately 100-500 samples in each of the 48 disease clusters) and an assumed true accuracy of 70%, the design achieves a 95% confidence interval width of approximately ±0.8%, providing sufficient precision to detect meaningful differences from benchmark performance. No interim analyses are planned; all analyses are conducted after complete data collection. Results will be reported in accordance with TRIPOD guidelines for prediction model studies. Clinician responses are recorded using anonymous study identifiers (e.g., CLIN_001, CLIN_002) with no link to the clinician's name, institution, or other identifying information. Only aggregate performance results (e.g., group accuracy rates, mean time-to-diagnosis) will be reported. No individual clinician results will be published or shared outside the research team. Demographic data collected from clinicians is limited to specialty, years of experience (in ranges), and practice setting (academic vs. community), recorded in a manner that prevents identification of individual clinicians. The study duration is expected to be approximately six months: dataset cleaning, quality verification, and preprocessing (Month 1); OSVDE evaluation and analysis (Months 2-4); MSCNE evaluation and analysis (Months 2-5); and final analysis, reporting, and manuscript preparation (Month 6). The entire study will be executed in 2026. Publication will include only aggregate and summary-level data; no individual person-level data will be published or deposited in external repositories. This protocol ID 1026 has been verified as Exempt according to 45CFR46.104(d) on 03/10/2026 by Solutions IRB (855) 226-4472 (www.solutionsirb.com).
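The analysis plan above (bootstrap confidence interval, non-inferiority check against δ = 5%, and McNemar's test on discordant pairs) can be sketched as follows. The simulated per-case correctness data and the 72%/70% accuracies are hypothetical placeholders, not study results.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical paired per-case Top-1 correctness for AI and clinicians on
# the same cases; the real analysis would use observed OSVDE/MSCNE results.
n = 500
ai = rng.random(n) < 0.72
clin = rng.random(n) < 0.70

# Bootstrap 95% CI for the paired accuracy difference (1,000 iterations, per protocol).
diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)           # resample cases with replacement
    diffs.append(ai[idx].mean() - clin[idx].mean())
lo, hi = np.percentile(diffs, [2.5, 97.5])

# Non-inferiority at margin delta = 0.05: the AI is non-inferior if the lower
# CI bound for (AI accuracy - clinician accuracy) exceeds -delta.
delta = 0.05
non_inferior = lo > -delta

# McNemar's test statistic from the discordant pairs
# (chi-square with 1 df, continuity-corrected).
b = int(np.sum(ai & ~clin))   # AI correct, clinician wrong
c = int(np.sum(~ai & clin))   # clinician correct, AI wrong
chi2 = (abs(b - c) - 1) ** 2 / (b + c) if (b + c) > 0 else 0.0
```

McNemar's test uses only the discordant pairs because concordant cases (both correct or both wrong) carry no information about which rater is more accurate.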
Study Type
OBSERVATIONAL
Enrollment
30
AIMD.1 (also known as NollaMD agent) is a multimodal artificial intelligence (AI) diagnostic system designed to generate differential diagnoses based on analysis of medical images and structured clinical information. In this study, the system is evaluated using de-identified medical images and semi-synthetic patient simulations under controlled research conditions. The AI system generates ranked diagnostic outputs and associated confidence scores, which are compared with reference diagnoses and clinician performance metrics. The system is evaluated in an offline research environment. AI outputs are not used for clinical decision-making, patient management, or real-world medical care.
Nolla Health (Magic Health Inc.)
New York, New York, United States
RECRUITING
Top-1 Diagnostic Accuracy
Proportion of evaluated cases in which the primary diagnosis generated by the AI Diagnostic System matches the reference (ground truth) diagnosis. Accuracy will be calculated across de-identified medical image cases and semi-synthetic patient simulation cases and compared with clinician performance.
Time frame: At completion of diagnostic evaluations (up to 6 months)
Top-5 Diagnostic Accuracy
Proportion of evaluated cases in which the correct reference diagnosis appears within the top five ranked diagnoses generated by the AI Diagnostic System.
Time frame: At completion of diagnostic evaluations (up to 6 months)
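Top-1 and Top-5 accuracy as defined in these outcome measures reduce to a single top-k helper over ranked differential lists. A minimal sketch follows; the diagnoses and case data are hypothetical examples, not study data.

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of cases whose reference diagnosis appears in the top-k ranked list."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

# Hypothetical ranked differentials for three cases.
preds = [
    ["psoriasis", "eczema", "tinea", "lichen planus", "dermatitis"],
    ["melanoma", "nevus", "seborrheic keratosis", "BCC", "lentigo"],
    ["cellulitis", "erysipelas", "abscess", "DVT", "stasis dermatitis"],
]
truths = ["psoriasis", "nevus", "DVT"]

top1 = top_k_accuracy(preds, truths, 1)  # only case 1 is a first-rank hit: 1/3
top5 = top_k_accuracy(preds, truths, 5)  # all truths appear in the top 5: 3/3
```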
Diagnostic Accuracy of Clinician Participants
Proportion of evaluated cases in which the clinician participant's primary diagnosis matches the reference diagnosis, calculated across assigned image and simulation cases.
Time frame: At completion of diagnostic evaluations (up to 6 months)
Non-Inferiority of AI Diagnostic Accuracy Compared to Clinicians
Difference in Top-1 diagnostic accuracy between the AI Diagnostic System and clinician participants. Non-inferiority will be assessed using a predefined margin of 5%.
Time frame: At completion of diagnostic evaluations (up to 6 months)
Calibration of AI Diagnostic Confidence
Calibration performance of AI-generated diagnostic confidence scores assessed using Expected Calibration Error (ECE).
Time frame: At completion of diagnostic evaluations (up to 6 months)
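ECE as used here bins cases by predicted confidence and averages the gap between each bin's mean confidence and its observed accuracy, weighted by bin size. A minimal sketch (the 10-bin count and toy data are illustrative assumptions, not the study's exact procedure):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean |observed accuracy - mean confidence|
    over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Toy example: within each bin, confidence equals observed accuracy, so ECE is 0.
ece = expected_calibration_error([1.0, 1.0, 0.5, 0.5], [1, 1, 1, 0])
```

A well-calibrated system's 70%-confidence diagnoses should be correct about 70% of the time; ECE summarizes the deviation from that ideal.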
Area Under the Receiver Operating Characteristic Curve (AUC) and Precision-Recall Curve (PRC)
Area under the receiver operating characteristic curve and the precision-recall curve for AI diagnostic classification across disease categories.
Time frame: At completion of diagnostic evaluations (up to 6 months)
Time-to-Diagnosis in Conversational Simulations
Number of conversational turns required by the AI system and clinician participants to reach a final diagnosis in semi-synthetic patient simulation cases.
Time frame: At completion of simulation evaluations (up to 6 months)