Introduction

To date, the US Food and Drug Administration (FDA) has approved 950 medical devices driven by artificial intelligence and machine learning (AI/ML) for potential use in clinical settings1. Most recently, the FDA has launched the Medical Device Development Tools (MDDT) program, which aims to “facilitate device development, timely evaluation of medical devices and promote innovation”, with the Apple Watch being the first device approved through this regulatory process2,3. As AI/ML studies begin to translate to clinical environments, it is crucial that end users can evaluate the applicability of devices to their unique clinical settings and assess sources of bias and risk.

In the context of AI/ML health systems, algorithmic bias can be defined as an algorithm amplifying existing inequities and thereby producing poorer healthcare outcomes4. Defined sub-categories of algorithmic bias are listed in Box 15,6.

Despite rising awareness of algorithmic bias and its potential implications for the generalizability of AI/ML models7, there is a paucity of standardized data reporting by regulatory bodies, including the FDA, that provides reliable and consistent information on the development, testing, and training of algorithms for clinical use. This limits accurate analysis and evaluation of algorithmic performance, particularly for under-represented research groups such as ethnic minorities, children, maternal health patients, patients with rare diseases, and those from lower socioeconomic strata. Deploying devices that cannot be transparently evaluated by end users may increase health disparity, a concern that is particularly relevant in the context of emerging clinical trials and real-world deployment8. To date, there has been limited review of these published data.

Here, we investigate AI-as-a-medical-device FDA approvals from 1995 to 2023 to examine the contents, consistency, and transparency of FDA reporting on market-approved devices, with a focus on bias. Our study, conducted as a scoping review, focuses on the limited published FDA data and associated papers.

Results

Distribution of device approval across clinical specialties

A total of 692 summaries of safety and effectiveness data (SSEDs) of FDA-approved AI-enabled medical devices/software were analyzed. Annual FDA approvals for AI-enabled medical devices increased steadily, from a mean of 7 per year between 1995 and 2015 to 139 approvals in 2022 (Fig. 1). The regulatory class of each device included in the study was determined and categorized according to the FDA classification system. Only 2 (0.3%) of the devices belonged to regulatory Class III (devices posing the highest risk), while the vast majority (99.7%) belonged to Class II (devices whose safety and underlying technology are well understood and which are therefore considered to be lower risk)9.
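As a rough illustration of how these annual counts can be tallied, the sketch below counts approvals per year from a CSV export of the FDA's public AI/ML-enabled device list; the file name and column name are hypothetical placeholders, not the FDA's actual export schema.

```python
import pandas as pd

# Minimal sketch of the annual tally behind Fig. 1, assuming the FDA's
# public AI/ML-enabled device list has been exported to CSV. The file
# name and the "decision_date" column are hypothetical placeholders.
df = pd.read_csv("fda_aiml_devices.csv", parse_dates=["decision_date"])

# Count approvals per calendar year.
annual = df["decision_date"].dt.year.value_counts().sort_index()
print(annual)

# Mean annual approvals over 1995-2015, for comparison with later years.
print(f"Mean approvals/year 1995-2015: {annual.loc[1995:2015].mean():.1f}")
```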

Fig. 1: Trends in FDA licensing of AI/ML-enabled medical devices.

The blue bars show the number of Food and Drug Administration (FDA) approvals for medical devices from 1995 to 2023. The red line plot shows the number of FDA approvals for AI/ML-enabled medical devices that are licensed for children. The total number of FDA approvals for AI/ML-enabled medical devices has increased steeply since 2016. FDA approvals for children-licensed devices have increased steadily since 2018.

Table 1 shows the distribution of 408 approved devices across organ systems. The top three organ systems represented amongst approved medical devices are the circulatory (20.8%), nervous (13.6%), and reproductive (7.2%) systems. The least represented are the urinary (1.2%) and endocrine (0.7%) systems (Table 1). Each device in the FDA database is classified under a particular medical specialty (Fig. 2). The FDA classification shows that the most represented medical specialty is Radiology (532 approvals; 76.9%), with the fewest approvals in Immunology, Orthopedics, Dental Health, and Obstetrics and Gynecology (Fig. 2 and Table 2). A total of 284 (41.0%) approved devices could not be assigned to an organ system, either because (1) the clinical indication was not specific to one system or because (2) the function of the device cuts across multiple organ systems (e.g., whole-body imaging systems/software). As such, there are some differences between the organ-system and medical-specialty categories. For instance, only 70 (10.1%) of the devices are classified by the FDA under the cardiovascular specialty despite 144 (20.8%) approvals specific to the circulatory system (Tables 1 and 2).

Table 1 Distribution of FDA-approved AI-enabled medical devices by organ system
Fig. 2: FDA-assigned medical specialty.

Distribution of medical specialties represented in 692 FDA approvals between 1995 and 2023.

Table 2 Primary medical specialty associated with FDA approval

Reporting data on statistical parameters and post-market surveillance

The indication for use of the device was reported in most (678; 98.0%) SSEDs (Fig. 3a), and 487 (70.4%) SSEDs contained a pre-approval performance study. However, 435 (62.9%) provided no data on the sample size of the subjects. Although 319 (46.1%) provided comprehensive, detailed results of performance studies including statistical analysis, only 13 (1.9%) included a link to a scientific publication with further information on the safety and efficacy of the device (Fig. 4). Only 219 (31.6%) SSEDs provided data on the underlying machine-learning technique, and only 62 (9.0%) contained a prospective study for post-market surveillance. Fourteen (2.0%) SSEDs addressed reporting of potential adverse effects of medical devices on users.

Fig. 3: Reported information on intended subjects.

Number of FDA summary documents providing relevant sociodemographic information about the intended subjects of medical devices. a Only 14 (2.0%) summary documents lacked clearly stated indications for the use of the medical device. b The majority (667; 96.4%) of the documents did not provide any information about the race/ethnicity of intended subjects and how this may impact the application of the device. c The majority (686; 99.1%) of the documents did not provide any information about the socioeconomic context of intended subjects.

Fig. 4: Accessibility to publications supporting safety, efficacy, and transparency.

Only a few summary documents provide a means of accessing a published scientific publication validating the use of the medical device.

Race, ethnicity, and socioeconomic diversity

Patient demographics in algorithmic testing data were specified in only 153 (22.1%) SSEDs, with 539 (77.9%) not providing any demographic data (Fig. 3b). Only 25 (3.6%) provided information on the race and/or ethnicity of tested or intended users. Socioeconomic data on tested or intended users were provided for only 6 (0.9%) devices (Fig. 3c).

Age diversity

There were 134 (19.4%) SSEDs with available information on the age of the intended subjects. Upon examining age diversity in approved devices, the first FDA approval for a device licensed for children was in 2015. Between 2015 and 2022, annual FDA approvals for the pediatric age group increased steadily from 1 to 24. Despite this rise, the proportion of pediatric-specific approvals relative to total approvals (for adults and pediatrics combined) has remained low, fluctuating between 0.0% and 20.0% (Fig. 1 and Table 3). Although only 4 (0.6%) devices were developed exclusively for children, we found 65 more devices approved for use in both adult and pediatric populations, bringing the total number of approvals for the pediatric population to 69 (10.0%). Testing and validation of devices in both children and adults was reported in only 134 (19.4%) SSEDs (Fig. 5a, b). The 69 devices licensed for children fall under just 5 medical specialties, following a pattern similar to that observed for the entire population, with lead representation in Radiology (72.5%), Cardiovascular health (14.4%), and Neurology (10.1%) (Fig. 5c). There were only three (0.4%) approved devices that focused exclusively on geriatric health.

Table 3 Annual trends in FDA approvals for AI-enabled medical devices and software for children (2015–2023)
Fig. 5: FDA approvals for pediatrics-licensed medical devices.

a The proportion of total FDA approvals licensed for children is 10.0%. b The proportion of total FDA approvals tested and validated in both children and adults is 19.4%. c Radiology has the highest representation among FDA-approved devices for children.

Sex diversity

When examining sex reporting transparency, there were a total of 39 (5.6%) approvals exclusively for women’s health, 36 of them focusing on the detection of breast pathology. The remaining three were designed to aid cervical cytology, determine the number and sizes of ovarian follicles, and perform fetal/obstetric ultrasound. Of the 10 (1.5%) devices exclusively for men’s health, eight were indicated in diagnostic and/or therapeutic procedures involving the prostate, while the remaining two were for seminal fluid analysis.

Discussion

Our study highlights a lack of consistency and data transparency in published FDA AI/ML approval documents, which may exacerbate health disparities. In a similar study examining 130 FDA-approved AI medical devices between January 2015 and December 2020, 97% reported only retrospective evaluations; none of the high-risk devices were evaluated by prospective studies; 72% did not publicly report whether the algorithm was tested on more than one site; and 45% did not report basic descriptive data such as sample size10,11. A lack of consistent reporting prevents objective analysis of the fairness, validity, generalizability, and applicability of devices for end users. As our results describe, only 37% of device approval documents contained information on sample size. As the clinical utility of algorithmic data is limited by data quantity and quality12, a lack of transparency in sample size reporting significantly limits accurate assessment of the validity of performance studies and of device effectiveness13.

Only 3.6% of devices provided race or ethnicity data. Recent literature strongly emphasizes the risks of increasing racial health disparity through the propagation of algorithmic bias14,15,16. A lack of race and ethnicity reporting in publicly available regulatory documents risks further exacerbating this important health issue17,18. The FDA has recognized the potential for bias in AI/ML-based medical devices and in January 2021 initiated an action plan (the “Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan”) to address these concerns19,20. However, despite these efforts, our study highlights reporting inconsistencies that may continue to propagate racial health disparities21. In light of these results, there is a pressing need for transparent and standardized regulatory frameworks that explicitly consider racial diversity in the evaluation and reporting of AI/ML medical devices22. Other strategies to mitigate racial bias may include adopting adversarial training frameworks and implementing post-authorization monitoring to ensure AI/ML devices perform equitably across all patient demographics23,24.

While AI/ML presents potential opportunities to reduce socioeconomic disparity in health, a lack of representation of target users across varied economic strata risks propagating health disparity between higher- and lower-income groups25. As in other clinical research domains, a lack of representation of lower socioeconomic groups, including those in remote and rural areas, risks neglecting those most likely to benefit from improved access to healthcare26,27. Our data show that only 0.9% of approved devices contained specific data detailing the socioeconomic strata of users in testing and/or algorithmic training datasets. This paucity makes it difficult to predict the potential clinical and financial impacts of approved medical devices on economic population subsets. Furthermore, a lack of socioeconomic data prevents accurate and robust cost-effectiveness analyses that may significantly impact the availability and impact of medical algorithms or devices28,29. Studies have underscored disparities rooted in socioeconomic factors, impacting the performance of AI/ML technologies4,30,31. Initiatives promoting diversity in data collection and consideration of model performance across socioeconomic groups are paramount and must be incorporated in the assessment of market approval for emerging technologies32.

With only 19.4% of devices providing information on the age of intended device users, our study suggests that the FDA’s evaluation and approval process for medical AI devices lacks comprehensive data on age diversity. Recent literature across specialties demonstrates differential performance of algorithms trained on adult versus pediatric data33,34. As an example, a study exploring echocardiogram image analysis suggested that adult images could not be appropriately generalized to pediatric patients and vice versa35. A lack of transparent age reporting therefore risks propagating age-related algorithmic bias, with potential clinical, ethical, and societal implications for the target population33,36. Mitigating age bias requires a concerted effort to ensure that training and testing datasets appropriately match intended users. Further, with only 0.6% of devices approved specifically for the pediatric age group, our findings identify equity gaps in the representation of children in AI/ML market-approved devices37,38. Examples of devices developed for pediatric populations include the MEDO ARIA software, which assists in the image-based assessment of developmental dysplasia of the hip (DDH) in infants aged 0 to 12 months39; the EarliPoint System, recommended for developmental disability centers to aid in diagnosing and assessing Autism Spectrum Disorder (ASD) in children aged 16 to 30 months; and the Cognoa ASD Diagnosis Aid, for diagnosing ASD in children aged 18 to 72 months40,41.

With our findings showing that only 0.4% of approved devices cater specifically to geriatric health needs, particular attention must be paid to the older adult population. Despite having the highest proportion of healthcare utilization, geriatric patients are traditionally underrepresented in clinical research42. A recent WHO ethical guidance document outlines the potential societal, clinical, and ethical implications of ageism in medical research, and describes the lack of geriatric representation as a health hazard in light of aging populations43. Initiatives such as the National Institutes of Health (NIH) Inclusion Across the Lifespan policy aim to promote the participation of older adults in research studies, which may help make the impacts of algorithmic development more equitable for this population, given its unique ethical and clinical considerations44,45. As with children, we propose that regulatory bodies require market approval documents to state clear intentions to test and train on a geriatric population and to ensure that appropriate validation methods are in place for the appropriate generalization of model outputs to specific geriatric health needs46,47,48,49. Examples of medical devices representing the geriatric population include NeuroRPM, designed to quantify movement disorder symptoms in adults aged 46 to 85 with Parkinson’s disease50. NeuroRPM’s suitability for clinic and home environments facilitates remote visits for patients with limited access to in-person care50. Another device, icobrain, automates the labeling, visualization, and volumetric quantification of brain structures in potential dementia patients51. For osteoarthritis, the Knee OsteoArthritis Labeling Assistant (KOALA) software measures joint space width and radiographic features on knee X-rays, aiding in risk stratification and highlighting the importance of preventative screening in geriatrics52.

Our study also examined variations in the representation of different medical specialties among approved medical devices. The specialties most commonly represented include Radiology, Cardiology, and Neurology1. Promoting clinical equity requires a more balanced representation of specialties and disease systems in digital innovation. Whilst we appreciate that AI/ML research is limited by data availability and quality, industry, academia, and clinicians must advocate for equality of innovation amongst specialties, so that medical device development and testing include a broad range of conditions and patient populations that may potentially benefit53. As the FDA is a US-based regulator, our review does not examine the representation of specialties or conditions outside the US, in particular in Low- and Middle-Income Countries (LMICs), which bear over 80% of the global burden of disease54,55,56. Many countries do not have the regulatory capacity to release approval documentation, and thus future studies must incorporate international data availability, collaboration, and cohesion57. Regulatory bodies both within and outside the USA must attempt to align technological development with key priorities in national and global disease burden to promote global equity58.

Our results showed that transparency in study enrollment, study design methodology, statistical data, and model performance data was markedly inconsistent amongst approved devices. While 70.4% of SSEDs provided some detail on performance studies before market approval, only 46.1% provided detailed results of those studies, and 62.9% provided no information on sample size. Transparency is crucial in addressing the challenges of interpretability and explainability in AI/ML systems, and our findings suggest that such evaluation cannot currently be conducted comprehensively across approved FDA devices59. Models that are transparent in their decision-making process (interpretability), or that can be elucidated by secondary models (explainability), are essential for validating the clinical relevance of outcomes and for safe incorporation into clinical settings; enhanced transparency must therefore be required in future approvals22,60. Further ethical considerations encompass a range of issues, including patient privacy, consent, fairness, accountability, and algorithmic transparency61. Including ethics methods in both study protocols and future regulatory documents may minimize privacy concerns arising from potential misuse of data and increase end-user confidence62,63.

Only 142 (20.5%) of the reviewed devices provided statements on potential risks to end users. Further, only 13 (1.9%) approval documents included a corresponding published scientific validation study providing evidence of safety and effectiveness. Underreporting of safety data in approved devices limits the ability of end users to assess the generalizability, effectiveness, cost-effectiveness, and medico-legal complexities that may arise from device incorporation64. It is therefore paramount that regulatory bodies such as the FDA advocate for mandatory release of safety data and consideration of potential adverse events. One example of an approved device reporting adverse effects is the Brainomix 360 e-ASPECTS, a computer-aided diagnosis (CADx) software device used to assist the clinician in the characterization of brain tissue abnormalities using CT image data65. Its safety report highlights some of the potential risks of incorrect scoring by the algorithm, potential misuse of the device to analyze images from an unintended patient population, and device failure.

Box 2 details some recommendations that may be adopted by the FDA and similar regulatory bodies internationally to reduce the risk of bias and health disparity in AI/ML.

While the FDA’s Artificial Intelligence/Machine Learning (AI/ML) Action Plan outlines steps to advance the development and oversight of AI/ML-based medical devices20, including initiatives to improve transparency, post-market surveillance, and real-world performance monitoring66, our study highlights that there remain several clinically relevant inconsistencies in market approval data that may exacerbate algorithmic biases and health disparity.

Poor data transparency in the approval process limits some of the conclusions that can be reliably drawn in this study and limits quality assessment of post-market performance and real-world effectiveness. The paucity of sociodemographic data provided in the SSEDs raises the question of whether applicants were failing to track sociodemographics or simply failing to report them. The FDA SSED template67 does stipulate disclosure of risks and outcomes that may be impacted by sex, gender, age, race, and ethnicity. Thus, based on what is available and accessible, we can only justifiably assume that the paucity of subgroup analysis data results from applicants’ failure to track sociodemographics rather than from an overall failure of the application process to capture the relevant information. However, it is clear that devices are approved without all of this information being available, suggesting also a potential failure to enforce these reporting requirements before approval. Considering that most companies do not publish their post-market outcome data (only 1.9% have published data available) and are currently not mandated to do so, our findings are limited by what is accessible. This further re-emphasizes the argument for rigorous and more transparent regulation by bodies such as the FDA to protect end consumers and enhance post-market evaluation and the assessment of real-world effectiveness.

The authors also acknowledge that as this review focuses only on devices market-approved by the FDA in the United States, the results may not be universally generalizable. However, the authors believe that the results and concepts highlighted in this scoping review are globally relevant. They hope that this paper can form the basis of further studies focusing on devices in varied settings, and that it will motivate greater global data transparency to promote further health equity in emerging technologies. The authors also note that in recent months, additional AI/ML market-approved devices have been added to the 510(k) database that are not included in this evaluation. While this is a limitation, the authors believe that the rapidly accruing number of devices makes the findings of this paper all the more relevant, demanding prompt regulatory review and action.

The ramifications of inadequate demographic, socioeconomic, and statistical information in the majority of 510(k) submissions to the FDA for AI/ML medical devices approved for clinical use are multifaceted and extend across societal, health, legal, and ethical dimensions12,33,68. Addressing these informational gaps is imperative to ensure the responsible and equitable integration of AI/ML technologies into clinical settings and the appropriate evaluation of demographic metrics in clinical trials. Additional focus must be given to under-represented groups who are most vulnerable to health disparities as a consequence of algorithmic bias33,69.

Methods

Based on intended use, indications for use, and associated risks, the FDA broadly classifies devices as Class I (low-risk), Class II (medium-risk), and Class III (high-risk). Class I and II devices are often cleared through the 510(k) pathway, in which applicants submit a premarket notification to demonstrate the safety and effectiveness and/or substantial equivalence of their proposed device to a predicate device. Class III (high-risk) devices are defined as those that support or sustain human life, are of substantial importance in preventing impairment of health, or pose a potential unreasonable risk of illness or injury69. Evaluation of this class of devices follows the FDA’s most rigorous device approval pathway, known as premarket approval (PMA)70.

The Federal Food, Drug, and Cosmetic Act, subparagraph 520(h)(1)(A), requires that a document known as a Summary of Safety and Effectiveness Data (SSED) be made publicly available following FDA approval. SSEDs are authored by applicants using a publicly accessible template provided by the FDA67. The document is intended to provide a balanced summary of the evidence for the approval or denial of the application for FDA approval. To be approved, the probable benefits of using a device must outweigh the probable risks. The studies highlighted in the SSED should provide reasonable evidence of safety and effectiveness67.

Thus, we conducted a scoping review of AI-as-a-medical-device products approved by the FDA between 1995 and 2023, using FDA SSEDs. This scoping review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR) guidelines71. A completed PRISMA-ScR checklist is included in Supplementary Table 1. A protocol was not registered.

We included all SSEDs of FDA-approved AI/ML-enabled medical devices between 1995 and 2023 made publicly available via https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices1. Each SSED was reviewed by an expert in computer science, medicine, or academic clinical research, who identified, extracted, and entered relevant variables of interest (Supplementary Table 2). Data were then entered into a Microsoft Excel spreadsheet, and counts and proportions of each variable were generated using Microsoft Excel. The spreadsheet and analysis worksheet have been made publicly available via https://zenodo.org/records/13626179.
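The descriptive analysis reduces to counts and proportions over the extracted variables. A minimal sketch is shown below; the extraction spreadsheet is public at the Zenodo link above, but the file name and column names used here are hypothetical placeholders rather than the actual sheet schema.

```python
import pandas as pd

# Minimal sketch of the counts-and-proportions analysis. The extraction
# spreadsheet is public (https://zenodo.org/records/13626179), but the
# file name and column names below are hypothetical placeholders.
df = pd.read_excel("ssed_extraction.xlsx")  # requires openpyxl

total = len(df)  # 692 SSEDs in this study

def report(column: str) -> None:
    """Print the count and percentage of SSEDs reporting a given variable."""
    n = int((df[column] == "Yes").sum())
    print(f"{column}: {n} ({100 * n / total:.1f}%)")

for var in [
    "indication_reported",      # 678 (98.0%) in our data
    "race_ethnicity_reported",  # 25 (3.6%)
    "socioeconomic_reported",   # 6 (0.9%)
    "age_reported",             # 134 (19.4%)
]:
    report(var)
```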

Variables of interest were determined per the Consolidated Standards of Reporting Trials - Artificial Intelligence (CONSORT-AI) extension checklist, a guideline developed by international stakeholders to promote transparency and completeness in reporting AI clinical trials72. Equivocal or unclear information identified in each SSED was resolved by consensus.
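To illustrate the kind of record this extraction yields, the sketch below defines one hypothetical per-SSED schema loosely informed by CONSORT-AI items; the field names are illustrative placeholders and do not reproduce the exact extraction codebook (see Supplementary Table 2).

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative record for a single SSED, loosely informed by CONSORT-AI
# items. Field names are placeholders, not the actual extraction codebook.
@dataclass
class SSEDRecord:
    device_name: str
    submission_number: str            # e.g., a 510(k) "K" number
    year_approved: int
    fda_specialty: str                # FDA-assigned medical specialty
    regulatory_class: str             # "I", "II", or "III"
    indication_reported: bool
    performance_study_included: bool
    sample_size: Optional[int]        # None when the SSED omits it
    race_ethnicity_reported: bool
    socioeconomic_reported: bool
    age_reported: bool
    ml_technique_reported: bool
    prospective_postmarket_study: bool
    adverse_effects_addressed: bool
```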

Primary outcome measures included the frequency of race/ethnicity reporting, the frequency of age reporting, and the availability of sociodemographic data on the algorithmic testing population provided in each approval document. Secondary outcomes evaluated the representation of various medical specialties, organ systems, and specific patient populations, such as pediatric and geriatric patients, in approved devices.