2023 Jan 4;6(1):e2248685. doi: 10.1001/jamanetworkopen.2022.48685

Development of a Machine Learning Model for Sonographic Assessment of Gestational Age

Chace Lee ¹,✉, Angelica Willis ¹, Christina Chen ¹, Marcin Sieniek ¹, Amber Watters ², Bethany Stetson ², Akib Uddin ¹, Jonny Wong ¹, Rory Pilgrim ¹, Katherine Chou ¹, Daniel Tse ¹, Shravya Shetty ¹, Ryan G Gomes ¹

¹Google Health, Palo Alto, California

²Department of Obstetrics and Gynecology, Northwestern University Feinberg School of Medicine, Chicago, Illinois

Accepted for Publication: November 10, 2022.

Published: January 4, 2023. doi:10.1001/jamanetworkopen.2022.48685

Corresponding Authors: Chace Lee, MS (chacelee@google.com), and Ryan Gomes, PhD (ryangomes@google.com), Google Health, 3400 Hillview Ave, Palo Alto, CA 94304.

Author Contributions: Mr Lee and Dr Gomes had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Mr Lee and Ms Willis contributed equally to this work. Ms Shetty and Dr Gomes jointly supervised this work.

Concept and design: Lee, Willis, Uddin, Pilgrim, Chou, Shetty, Gomes.

Acquisition, analysis, or interpretation of data: All authors.

Drafting of the manuscript: Lee, Willis, Chen, Uddin, Wong, Chou, Shetty.

Critical revision of the manuscript for important intellectual content: Lee, Willis, Chen, Sieniek, Watters, Stetson, Pilgrim, Tse, Gomes.

Statistical analysis: Lee, Willis, Gomes.

Obtained funding: Chou, Tse.

Administrative, technical, or material support: Lee, Willis, Chen, Watters, Stetson, Uddin, Wong, Pilgrim, Tse, Shetty, Gomes.

Supervision: Watters, Pilgrim, Chou, Tse, Shetty, Gomes.

Conflict of Interest Disclosures: Mr Lee reported owning stock in Google, Inc, as part of the standard employee compensation plan and having a patent for Google issued (20220354466). Ms Willis reported owning stock in Google, Inc, as part of the standard employee compensation package. Dr Sieniek reported receiving personal fees from Google, Inc, and owning stock of Alphabet during the conduct of the study. Dr Pilgrim reported owning stock in Google, Inc, as part of the standard employee compensation plan. Dr Tse reported receiving personal fees from Google, Inc, and having a patent for Google, Inc, issued. Dr Gomes reported owning stock in Google, Inc, as part of the standard employee compensation plan. No other disclosures were reported.

Funding/Support: This study was partially funded by Google, LLC, and by the Bill and Melinda Gates Foundation (grants OPP1191684 and INV003266).

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: The conclusions and opinions expressed in this article are those of the authors and do not necessarily reflect those of the Bill and Melinda Gates Foundation.

Data Sharing Statement: See Supplement 2.

Additional Contributions: Yun Liu, PhD (Google Health), provided helpful feedback on the manuscript and was not compensated for this work.

PMCID: PMC9857195 PMID: 36598790

Key Points

Question

Can an artificial intelligence (AI) model predict gestational age with higher accuracy than standard fetal biometry–based estimates by leveraging standard plane ultrasonography images and fly-to ultrasonography videos?

Findings

In this diagnostic study with a test set of 404 participants, all image, video, and ensemble AI models were statistically superior to standard fetal biometry–based gestational age estimates derived from images captured by expert sonographers. The ensemble model had the lowest mean absolute error, with a mean difference of −1.51 days.

Meaning

These findings suggest that AI models have the potential to empower trained ultrasonography operators to estimate gestational age with higher accuracy.

This diagnostic study examines the use of artificial intelligence models to estimate gestational age with high accuracy and reliability, leveraging standard biometry images and fly-to ultrasonography videos.

Abstract

Importance

Fetal ultrasonography is essential for confirmation of gestational age (GA), and accurate GA assessment is important for providing appropriate care throughout pregnancy and for identifying complications, including fetal growth disorders. Derivation of GA from manual fetal biometry measurements (ie, head, abdomen, and femur) is operator dependent and time-consuming.

Objective

To develop artificial intelligence (AI) models to estimate GA with higher accuracy and reliability, leveraging standard biometry images and fly-to ultrasonography videos.

Design, Setting, and Participants

To improve GA estimates, this diagnostic study used AI to interpret standard plane ultrasonography images and fly-to ultrasonography videos, which are 5- to 10-second videos that can be automatically recorded as part of the standard of care before the still image is captured. Three AI models were developed and validated: (1) an image model using standard plane images, (2) a video model using fly-to videos, and (3) an ensemble model (combining both image and video models). The models were trained and evaluated on data from the Fetal Age Machine Learning Initiative (FAMLI) cohort, which included participants from 2 study sites at Chapel Hill, North Carolina (US), and Lusaka, Zambia. Participants were eligible to be part of this study if they received routine antenatal care at 1 of these sites, were aged 18 years or older, had a viable intrauterine singleton pregnancy, and could provide written consent. They were not eligible if they had known uterine or fetal abnormality, or had any other conditions that would make participation unsafe or complicate interpretation. Data analysis was performed from January to July 2022.

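Main Outcomes and Measures

The primary outcome was the mean difference in absolute error between the GA model estimate and the clinical standard estimate, with the ground truth GA extrapolated from the GA estimated at the initial examination.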

Results

Of the total cohort of 3842 participants, data were calculated for a test set of 404 participants with a mean (SD) age of 28.8 (5.6) years at enrollment. All models were statistically superior to standard fetal biometry–based GA estimates derived from images captured by expert sonographers. The ensemble model had the lowest mean absolute error compared with the clinical standard fetal biometry (mean [SD] difference, −1.51 [3.96] days; 95% CI, −1.90 to −1.10 days). All 3 models outperformed standard biometry by a more substantial margin on fetuses that were predicted to be small for their GA.

Conclusions and Relevance

These findings suggest that AI models have the potential to empower trained operators to estimate GA with higher accuracy.

Introduction

Fetal ultrasonography is the cornerstone of prenatal imaging and provides crucial information to guide maternal-fetal care, such as estimated gestational age (GA) and evaluation for fetal growth disorders. Currently, the clinical standard for estimating GA and diagnosing fetal growth disorders is determined through manual acquisition of fetal biometric measurements, such as biparietal diameter, head circumference, abdominal circumference (AC), femur length, or crown-rump length. Numerous regression formulas for GA estimation exist on the basis of different combinations of fetal biometric measurements. The Hadlock formula is one of the most popular formulas and is included with most ultrasonography equipment packages. Previous studies have suggested that although fetal biometric measurements were generally reproducible across operators, there was increased variance later in pregnancy, which is when many key clinical decisions are made.^1,2

The accuracy and efficiency of biometric measurements depend on the skill and experience of the sonographer. Factors like fetal movement and fetal positioning can make it difficult to accurately position the ultrasonography probe to acquire biometry measurements.^1,2 There has been extensive research on using artificial intelligence (AI) systems to assist in estimating GA, typically through automatic estimation of biometric parameters that are then used in the Hadlock formula.^3,4,5,6,7 Recent studies^8,9 show state-of-the-art accuracy for fetal biometry estimation, using a classification model to identify standard ultrasonography planes and a segmentation model to estimate fetal biometry from those planes. Those works aim to match sonographer performance on biometry measurement. In contrast, our work focuses on GA estimation without requiring measurement acquisition or use of regression formulas, which gives the model the opportunity to improve on the GA estimation accuracy of sonographer assessment. We have recently shown that GA model estimation using ultrasonography videos of predefined sweeps was noninferior to standard fetal biometry estimates.¹⁰

In this study, we further extend the use of ultrasonography videos by developing 3 end-to-end AI models: (1) an image model using fetal ultrasonography images captured by sonographers during biometry measurements; (2) a video model using fly-to videos, which are defined as 5 to 10 seconds of video immediately before image capture; and (3) an ensemble model using both images and fly-to videos. All data were collected prospectively during standard biometry measurements. All 3 models directly estimate GA, without requiring measurement acquisition or use of regression formulas. To our knowledge, this is the first attempt at using AI on ultrasonography videos captured for standard-of-care Hadlock procedures to predict GA directly for all trimesters without estimating biometry measurements from standard plane images.

Methods

This diagnostic study follows the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline. All study participants provided written informed consent, and the research was approved by the University of North Carolina (UNC) institutional review board and the biomedical research ethics committee at the University of Zambia.

Algorithm Development

We developed 3 deep learning neural network models to predict GA: image, video, and an ensemble approach combining both. The image model generated GA prediction, measured in days, directly from each standard plane fetal biometry image, with image pixels of fixed dimension as input. The video model generated GA prediction directly from fly-to video sequences, with fixed-length sequences of image pixel values as input. The GA model also provided an estimate of its confidence in the estimate for a given video sequence or image. No intermediate fetal biometric measurements were required during training or generated during inference.

Our models were developed and evaluated using data sets prospectively collected as part of the Fetal Age Machine Learning Initiative (FAMLI),¹¹ which collected ultrasonography data from study sites at Chapel Hill, North Carolina (US), and Lusaka, Zambia. The goal of this prospectively collected data set was to accelerate the development of technology to estimate GA. We trained our models on studies from all trimesters with images and videos captured by trained sonographers using standard ultrasonography devices (SonoSite Turbo-M in Zambia; GE Voluson 8 in North Carolina; GE Logiq C3 in Zambia). We excluded images and videos captured using a low-cost portable ultrasonography device (Butterfly iQ; Butterfly Network) and novice studies, in an effort to enable applicability to device types used in standard of care. The data set consisted of ultrasonography examinations performed by multiple sonographers at each site, and the operators were trained according to normal standards in both countries. The built-in AI was turned off for the study, and the examination was completely operator dependent. Study participants were assigned at random to 1 of 3 data set splits: train, tune, or test. We used the following proportions: 60% train, 20% tune, and 20% test. The tuning set was used for tuning model hyperparameters.
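The participant-level random split described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name `split_participants`, the seed, and the participant identifiers are hypothetical. The point it shows is that the split is made over participants rather than visits, so all visits from one participant fall in the same split.

```python
import random

def split_participants(participant_ids, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Randomly assign each unique participant to train, tune, or test.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    ids = sorted(set(participant_ids))      # de-duplicate repeat visits
    random.Random(seed).shuffle(ids)        # seeded shuffle for reproducibility
    n_train = round(fractions[0] * len(ids))
    n_tune = round(fractions[1] * len(ids))
    assignment = {}
    for i, pid in enumerate(ids):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_tune:
            assignment[pid] = "tune"
        else:
            assignment[pid] = "test"        # remainder goes to the test split
    return assignment

# Ten made-up participants, each listed twice to mimic repeat visits.
visit_owner_ids = [f"P{i}" for i in range(10)] * 2
assignment = split_participants(visit_owner_ids)
```

Splitting by participant avoids leaking images from one pregnancy across the train and test sets.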

The image model was trained on all crown-rump length standard plane images from first trimester studies (mean [SD] GA, 78.6 [14.0] days; range, 44-97 days) and head circumference, AC, and femur length standard plane images from second trimester (mean [SD] GA, 156.0 [27.4] days; range, 98-195 days) and third trimester (mean [SD] GA, 229.0 [18.1] days; range, 196-258 days) studies. The video model was trained on data that could be acquired in current standard-of-care procedures. Specifically, this included sonographer-acquired blind sweeps¹² (up to 15 sweeps per patient), as well as sonographer-acquired fly-to videos that capture the 5 to 10 seconds before the sonographer acquires each standard fetal biometry image. In the FAMLI study, GE machines (Voluson 8 in North Carolina and Logiq C3 in Zambia) were configured to export 5 to 10 seconds of cine capture before the sonographer froze or saved an image.

Cases consisted of multiple fly-to videos and standard biometry images, and our models generated predictions independently for each video sequence or image within the case. For the GA video model, each fly-to video was divided into multiple video sequences, and we then aggregated the predictions to generate a single case-level estimate for GA (described in greater detail later in this article).

Model Architecture

We developed 2 deep learning neural network models to predict GA from standard biometry images (eFigure 1 in Supplement 1) and fly-to videos. The GA regression models used the best obstetric estimate of GA associated with the case as the training label for all video clips within the case. We defined GA prediction as a regression problem in which both image and video models produce an estimate of GA, measured in days.

Our image model was a single instance model trained on all standard plane biometry images for 4 standard anatomy types: crown-rump, fetal head, abdomen, and femur. The model architecture is depicted in eFigure 2 in Supplement 1. An independent GA and variance score was predicted for each image. Our video model was assembled from an inflated 3-dimensional (I3D) convolutional model¹³ and a convolutional recurrent model proposed for blindsweep GA prediction,¹⁰ which used a MobileNetV2¹⁴ feature extractor. The model architecture is depicted in eFigure 3 in Supplement 1. The I3D network and image model used a mean-variance regression loss function,^15,16 which provided an estimate of expected variance by an additional softplus model output.
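To make the mean-variance regression idea concrete, the sketch below uses one common form of such a loss, the Gaussian negative log-likelihood, with the variance obtained from a raw model output through a softplus. The exact loss used in the study is given in the cited references and the eAppendix, so this form is an assumption, and the names `mean_variance_loss` and `raw_var` are hypothetical.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real value to a positive one."""
    return np.logaddexp(0.0, x)

def mean_variance_loss(y_true, mu, raw_var, eps=1e-6):
    """Gaussian negative log-likelihood (up to a constant), averaged over examples.

    y_true:  ground truth GA in days
    mu:      predicted GA in days
    raw_var: unconstrained model output, turned into a variance by softplus
    """
    var = softplus(np.asarray(raw_var, dtype=float)) + eps
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(mu, dtype=float)) ** 2
    return float(np.mean(0.5 * np.log(var) + 0.5 * sq_err / var))

def inverse_softplus(v):
    """Raw output that softplus maps to the variance v (for illustration only)."""
    return np.log(np.expm1(v))

# A 4-day error: the loss is lowest when the predicted variance is near 4**2 = 16.
loss_matched = mean_variance_loss([200.0], [204.0], [inverse_softplus(16.0)])
loss_overconfident = mean_variance_loss([200.0], [204.0], [inverse_softplus(4.0)])
loss_underconfident = mean_variance_loss([200.0], [204.0], [inverse_softplus(64.0)])
```

Under this form, a model is penalized both for claiming too little variance on a poor prediction and for claiming too much variance on a good one, which is what allows the variance output to act as a confidence estimate.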

Both video and image models predicted variance as a confidence estimate, and the GA predictions were aggregated (ensembled) using the inverse weight of their predicted variance score to produce a final GA. From our observations, clips with high confidence showed a clear anatomy view useful for GA prediction, whereas low-confidence examples showed less informative anatomic views. The highest and lowest confidence clip examples from the fly-to video for each type of standard biometry are shown in eFigure 4 in Supplement 1. Details of model architecture, model training, and data preprocessing including pixel spacing scaling are described in the eAppendix in Supplement 1.
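The inverse-variance aggregation described above can be illustrated with a short sketch. The function name `inverse_variance_aggregate` and the numeric values are hypothetical, and the study's exact aggregation procedure is described in the eAppendix, so this shows only the general weighting scheme.

```python
import numpy as np

def inverse_variance_aggregate(ga_days, variances):
    """Combine per-clip or per-image GA predictions into one estimate.

    Each prediction is weighted by 1 / predicted variance, so confident
    predictions (small variance) contribute more to the final GA.
    """
    ga = np.asarray(ga_days, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return float(np.sum(weights * ga) / np.sum(weights))

# Two made-up predictions: a confident one (variance 4) and a less
# confident one (variance 16). The result is pulled toward 200 days.
case_ga = inverse_variance_aggregate([200.0, 210.0], [4.0, 16.0])
```

With weights 0.25 and 0.0625, the combined estimate in this example is 202 days rather than the unweighted mean of 205 days.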

Ensemble

To improve model accuracy, we explored different ensemble configurations, ultimately selecting an ensemble of the still image model with the video models described already. Case GA predictions from the still image model, I3D model, and convolutional recurrent model were averaged after the inverse variance weighting procedure to give the final case prediction.

Statistical Analysis

AI GA Estimation Using Ultrasonography Images and Fly-To Videos

The primary analysis outcome for GA was the mean difference in absolute error between the GA model estimate and the clinical standard estimate, with the ground truth GA extrapolated from the initial GA estimated at the initial examination. The ground truth GA at each subsequent visit was calculated as GA at initial examination plus number of days since baseline visit. Statistical estimates and comparisons were computed after randomly selecting 1 study visit per patient for each analysis group, to avoid combining correlated measurements from the same patient.
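The ground truth extrapolation and the one-visit-per-patient sampling described above can be sketched as follows. The function names, the record layout (a list of dictionaries with a `patient_id` key), the seed, and the dates are all hypothetical and serve only to illustrate the two steps.

```python
import random
from datetime import date

def ground_truth_ga_days(ga_at_baseline_days, baseline_date, visit_date):
    """GA at a later visit = GA at the initial examination + days elapsed."""
    return ga_at_baseline_days + (visit_date - baseline_date).days

def one_visit_per_patient(visits, seed=0):
    """Randomly keep a single visit per patient to avoid correlated samples."""
    rng = random.Random(seed)
    by_patient = {}
    for visit in visits:
        by_patient.setdefault(visit["patient_id"], []).append(visit)
    return [rng.choice(group) for _, group in sorted(by_patient.items())]

# A made-up patient dated at 84 days on January 1 and seen again on March 12.
later_ga = ground_truth_ga_days(84, date(2021, 1, 1), date(2021, 3, 12))

visits = [
    {"patient_id": "A", "visit": 1},
    {"patient_id": "A", "visit": 2},
    {"patient_id": "B", "visit": 1},
]
sampled = one_visit_per_patient(visits)
```

In this example, 70 days elapse between the two dates, so the extrapolated ground truth GA at the second visit is 154 days.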

When conducting pairwise statistical comparisons between model absolute errors and clinical standard absolute errors, we established the a priori superiority comparison: computing whether the upper limit of the 2-sided 95% CI for the difference in mean absolute error (MAE; model MAE minus standard biometry MAE) is strictly below 0; in other words, we used the confidence interval of the difference for statistical inference. Data analysis was performed using open-source Python (version 3.7, Python Software Foundation) tools NumPy (version 1.21.5), SciPy (version 1.2.1), and pandas (version 1.1.5). Machine learning models were developed using the TensorFlow deep learning software package version 1 (Google). We defined the primary outcome (statistical superiority) before evaluating on the test set to avoid multiple comparisons.
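The superiority criterion can be illustrated with a short sketch. The text does not state how the 95% CI was computed, so the normal approximation used here (z = 1.96) is an assumption, and the function names are hypothetical.

```python
import math
import statistics

def ci_from_summary(mean_diff, sd_diff, n, z=1.96):
    """Normal-approximation 2-sided 95% CI for the mean of n paired differences."""
    half_width = z * sd_diff / math.sqrt(n)
    return (mean_diff - half_width, mean_diff + half_width)

def mae_difference_ci(model_abs_err, standard_abs_err, z=1.96):
    """Mean paired difference in absolute error (model minus standard) and its CI."""
    diffs = [m - s for m, s in zip(model_abs_err, standard_abs_err)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff, ci_from_summary(mean_diff, sd_diff, len(diffs), z)

def is_superior(ci):
    """A priori criterion from the text: upper CI limit strictly below 0."""
    return ci[1] < 0

# Summary values reported for the ensemble model: mean difference -1.51 days,
# SD 3.96 days, 404 participants.
low, high = ci_from_summary(mean_diff=-1.51, sd_diff=3.96, n=404)
ensemble_superior = is_superior((low, high))
```

Under this approximation the interval is roughly −1.90 to −1.12 days, which is close to the reported 95% CI of −1.90 to −1.10 days and lies entirely below 0.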

In addition, secondary subgroup analyses were performed, with subgroups including pregnancy trimester (first, second, or third), country (US and Zambia), and device manufacturers (SonoSite Turbo-M in Zambia, GE Voluson 8 in North Carolina, and GE Logiq C3 in Zambia). For each subgroup analysis, estimates and comparisons were computed after randomly selecting 1 study visit per patient for patients eligible for the subgroups, to avoid combining correlated measurements from the same patient. Because these are exploratory analyses, we did not adjust the significance threshold to account for multiple comparisons. We first conducted the exploratory analyses on the tuning set and confirmed that the findings held there before evaluating them again on the test set.

GA Estimation Compared With Alternative Formulas

The formulas derived in Hadlock et al² are the standard of care in the US and many other countries around the world; however, these formulas, derived from a limited population of 361 middle-class White women in Texas, might not generalize as well as alternative formulas developed with broader, population-based data. For this reason, we compared the predictions on the FAMLI population with 2 additional formulas, Intergrowth-21st and National Institute of Child Health and Human Development (NICHD), that have shown promise as potential alternatives to Hadlock.^17,18

Suspected Fetal Growth Restricted and Large for GA Cases

Biometry-based methods of determining GA are predisposed to underestimation and overestimation of fetal growth restricted (FGR) and large for GA (LGA) fetuses, respectively. To understand how the AI model performed in these situations, we conducted an analysis of accuracy achieved on fetuses that were smaller or larger than expected for their GA according to fetal AC measured for the given population and ground truth gestational week. We first conducted the exploratory analyses on the tuning set and confirmed that the findings held there before evaluating them again on the test set. Details of identification of suspected FGR and LGA cases are described in the eAppendix in Supplement 1. Data analysis was performed from January to July 2022.
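As a rough illustration of size-for-GA labeling, the sketch below assigns a category by comparing a fetus's AC with percentiles of AC values from the same gestational week. The actual identification procedure is described in the eAppendix and is not reproduced here. The function name `size_category`, the use of `numpy.percentile` with its default interpolation, and the reference values are assumptions; the 3rd, 10th, and 90th percentile thresholds are those reported in this article.

```python
import numpy as np

def size_category(ac_mm, reference_ac_mm):
    """Label a fetus by AC relative to same-gestational-week reference values.

    reference_ac_mm: AC measurements (hypothetical here) from fetuses at the
    same ground truth gestational week in the given population.
    Returns the most specific label; severe FGR cases are also FGR cases.
    """
    p3, p10, p90 = np.percentile(reference_ac_mm, [3, 10, 90])
    if ac_mm < p3:
        return "severe FGR"
    if ac_mm < p10:
        return "FGR"
    if ac_mm > p90:
        return "LGA"
    return "reference"

# Made-up reference values 1..100, chosen only so the percentiles are easy to read.
reference = np.arange(1, 101)
labels = [size_category(ac, reference) for ac in (3, 8, 50, 95)]
```

Because the thresholds depend on the reference population, labels produced this way indicate suspected rather than confirmed FGR or LGA status.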

Results

Clinical Characteristics

Our models were developed and evaluated using data sets prospectively collected as part of the FAMLI study.¹¹ Our evaluation was performed on a test set consisting of patients independent of those used for AI development. The primary test set consisted of 407 women with standard-of-care ultrasonography scans performed by expert sonographers at UNC Healthcare, Chapel Hill, North Carolina, and at community clinics in Lusaka, Zambia. Complete sets of ultrasonography fetal biometry images and fly-to videos data collected with the SonoSite Turbo-M in Zambia (268 studies) or GE Voluson 8 in North Carolina (104 studies) or GE Logiq C3 in Zambia (104 studies) ultrasonography machine were available for 404 of these 407 participants, corresponding to 677 study visits, with mean (SD) participant age of 28.8 (5.6) years at enrollment.

The disposition of test set study participants used in the following analysis is summarized in STARD diagrams (eFigure 5 in Supplement 1). The characteristics of study participants who are included in the test set analyses are shown in eTable 1 and eFigure 6 in Supplement 1. Among study visits conducted by sonographers, 63 women (9.3%) had at least 1 visit during the first trimester (mean [SD] GA, 78.6 [14.0] days; range, 44.0-97.0 days), 235 women (34.7%) had at least 1 visit during the second trimester (mean [SD] GA, 156.0 [27.4] days; range, 98.0-195.0 days), and 379 women (56.0%) had 1 or more visits in the third trimester (mean [SD] GA, 229.0 [18.1] days; range, 196.0-258.0 days). The same inclusion and exclusion criteria were used in both the US and Zambia. However, participants were recruited from patients receiving routine antenatal care, which differs between countries and clinics. Although there may be differences between centers, the intention of this cohort was to be representative of the patients typically seen at each center. There is a small difference in the mean (SD) maternal age in Zambia (28.2 [5.7] years) vs the US (29.6 [5.4] years). Additionally, there are substantially higher rates of HIV in the Zambia group (62 patients [27.8%]) vs the US group (0 patients).

AI GA Estimation Using Ultrasonography Images and Fly-To Videos

GA analysis results are summarized in Table 1. The overall MAE for the image GA model was lower compared with the MAE for the standard fetal biometry estimates (mean [SD] difference, −1.13 [4.18] days; 95% CI, −1.50 to −0.70 days). The upper limit of the 95% CI for the difference in MAE values was negative, indicating statistical superiority of the model.

Table 1. GA Estimation Overall Performance^a.

Estimation method^b	Standard fetal biometry estimates	Ensemble model	Video model	Image model
Mean error (SD), d	−1.44 (6.82)	−0.45 (4.81)	−0.54 (4.84)	−0.29 (5.32)
MAE (SD), d	5.11 (4.73)	3.60 (3.23)	3.63 (3.24)	3.97 (3.54)
MAE difference vs standard fetal biometry, mean (SD) difference [95% CI], d	1 [Reference]	−1.51 (3.96) [−1.90 to −1.10]	−1.48 (4.05) [−1.90 to −1.10]	−1.13 (4.18) [−1.50 to −0.70]

Abbreviations: GA, gestational age; MAE, mean absolute error.

^a There were 404 patients, with a mean (SD) ground truth GA of 192.9 (53.3) days.

^b Values are the MAE and mean error between the GA estimated using the artificial intelligence models and ground truth, and the MAE and mean error between the GA estimated using the standard fetal biometry ultrasonography procedure and ground truth. One visit by each participant eligible for each subgroup was selected at random. The video model is an ensemble of the inflated 3-dimensional convolutional network and long short-term memory video models. The ensemble model combines predictions from the 2 video models and the image model.

The overall MAEs of the video model and ensemble models were significantly lower compared with the standard fetal biometry; the ensemble model had the lowest MAE (mean [SD] difference, −1.51 [3.96] days; 95% CI, −1.90 to −1.10 days), followed by the video model (mean [SD] difference, −1.48 [4.05] days; 95% CI, −1.90 to −1.10 days) (Figure 1 and eFigure 7 in Supplement 1). For both models, the upper limit of the 95% CI for the difference in MAE values was negative, indicating statistical superiority.

Figure 1. A, Model and standard fetal biometry estimates of mean absolute error (MAE) vs ground truth GA (4-week GA windows). B, Absolute error distributions for the model and standard fetal biometry estimates, which widen as gestational age increases. C, Error distributions for ensemble model and standard fetal biometry procedure. The error distribution of standard fetal biometry procedure has longer tails on both sides. D, Paired errors for ensemble model and standard fetal biometry estimates in the same study visit. The errors of the 2 methods exhibit correlation, but the worst-case errors for the ensemble model have a lower magnitude than the standard fetal biometry method.

Subgroup analyses for trimester, countries, and devices are provided in eTable 2 and eTable 3 in Supplement 1. The results show that our models generalized well across countries, devices, and second and third trimesters, with lower MAE compared with the standard fetal biometry estimates. The upper limit of the 95% CI for the difference in MAE values was less than 0.1 day, except for the first trimester, where the smaller sample size broadened the 95% CIs.

GA Estimation Compared With Alternative Formulas

GA analysis results are summarized in Table 2. The ensemble model had a lower MAE compared with NICHD (mean [SD] difference, −1.23 [4.04] days; 95% CI, −1.60 to −0.80 days) and Intergrowth-21st (mean [SD] difference, −2.69 [5.54] days; 95% CI, −3.30 to −2.10 days). The upper limits of the 95% CIs for the difference in MAE values were negative, indicating statistical superiority.

Table 2. Comparison Against Alternative Biometry Regression Formulas^a.

Estimation method^b	Hadlock	Intergrowth-21st	NICHD	Ensemble model
Mean error (SD), d	1.61 (7.04)	4.22 (7.31)	0.23 (6.69)	−0.47 (4.88)
MAE (SD), d	5.32 (4.87)	6.35 (5.36)	4.89 (4.57)	3.65 (3.27)
MAE difference vs ensemble model, mean (SD) difference [95% CI], d	−1.68 (4.10) [−2.10 to −1.30]	−2.69 (5.54) [−3.30 to −2.10]	−1.23 (4.04) [−1.60 to −0.80]	1 [Reference]

Abbreviations: MAE, mean absolute error; NICHD, National Institute of Child Health and Human Development.

^a There were 379 patients overall (second and third trimester) with a mean (SD) ground truth gestational age of 202.8 (41.7) days.

^b We compared the MAE and mean error of our video plus image ensemble model against that of alternative fetal biometry–based regression formulas Intergrowth-21st and NICHD, in addition to Hadlock. Error difference and 95% CI denote the ensemble model MAE less the MAE for each fetal biometry–based procedure. We performed the comparison for second and third trimester cases, as the NICHD formula and the Intergrowth-21st formula are not applicable to first trimester cases.

Analysis of per-country performance is summarized in eTable 4 in Supplement 1, which shows that the accuracy of NICHD-based GA estimates is close to that of Hadlock in the US (mean [SD] NICHD MAE, 4.79 [4.16] days; Hadlock MAE, 4.90 [4.32] days), while NICHD outperforms Hadlock significantly in the Zambia population (mean [SD] NICHD MAE, 4.96 [4.84] days; Hadlock MAE, 5.62 [5.20] days). The Intergrowth-21st formula performed significantly worse than Hadlock and NICHD across both populations, consistent with other studies.^19,20 Our ensemble model estimate was compared against NICHD, and the result showed that the ensemble model had a lower MAE for both the US (mean [SD] MAE, 3.58 [2.79] days; 95% CI for the MAE difference, −1.90 to −0.70 days) and Zambia (mean [SD] MAE, 3.70 [3.57] days; 95% CI for the MAE difference, −1.80 to −0.60 days), demonstrating robustness and statistical superiority in both country subgroups.

Performance of Model on Suspected FGR and LGA Cases

The analysis of our model showed that it outperformed Hadlock-based estimations of GA by a wider margin on the suspected FGR subgroups than on cases in the reference (suspected non-FGR and non-LGA) patient set (Table 3). On suspected FGR cases, the ensemble model improved on Hadlock by a mean (SD) difference of −3.46 (5.69) days (95% CI, −5.00 to −1.90 days), compared with −1.08 (3.34) days (95% CI, −1.40 to −0.70 days) on the subgroup of fetuses suspected to be reference size (Figure 2). For suspected LGA cases, the 95% CI for the difference included 0 (Table 3). The AI model’s largest performance improvement over the Hadlock GA formula was observed in suspected severe FGR cases (ensemble model, mean [SD] difference, −4.45 [6.96] days; 95% CI, −7.30 to −1.60 days), defined by a third percentile AC threshold, although this subgroup had the smallest sample size (26 patients). We observed a similar phenomenon of more substantial performance gains for these traditionally challenging subgroups with still image models (eTable 5 in Supplement 1).

Table 3. Comparison for Fetuses That Are FGR or LGA According to AC^a.

Subgroup and estimated FGR or LGA status	Ensemble model MAE, mean (SD), d	Ensemble model ME, mean (SD), d	Hadlock MAE, mean (SD), d	Hadlock ME, mean (SD), d	MAE difference vs ensemble model, mean (SD) difference [95% CI], d	Unique patients, No.
Overall	3.53 (3.20)	−0.53 (4.74)	4.80 (4.38)	−1.32 (6.37)	−1.27 (3.70) [−1.60 to −0.90]	379
FGR (AC <10th percentile)	5.12 (5.72)	−4.41 (6.30)	8.58 (7.46)	−7.88 (8.21)	−3.46 (5.69) [−5.00 to −1.90]	57
Severe FGR (AC <3rd percentile)	6.13 (7.99)	−5.46 (8.48)	10.58 (9.65)	−10.04 (10.23)	−4.45 (6.96) [−7.30 to −1.60]	26
LGA (AC >90th percentile)	4.11 (2.86)	−0.85 (4.20)	4.96 (4.01)	2.60 (5.86)	−0.85 (3.62) [−2.00 to 0.30]	55
Reference size for GA (AC >10th to <90th percentile)	3.28 (2.54)	−0.52 (4.12)	4.36 (3.45)	−1.19 (5.44)	−1.08 (3.34) [−1.40 to −0.70]	327
FGR or LGA (AC <10th or >90th percentile)	4.61 (4.63)	−1.08 (6.46)	6.83 (6.34)	−2.74 (8.93)	−2.22 (5.22) [−3.20 to −1.20]	108

Abbreviations: AC, abdominal circumference; FGR, fetal growth restricted; GA, gestational age; LGA, large for gestational age; MAE, mean absolute error; ME, mean error.

^a We compared MAE and ME of our video plus image ensemble model against that of the standard of care Hadlock procedure for FGR and LGA fetuses. Small fetuses are defined by having an AC below the 10th percentile and are considered very small if the AC is below the third percentile. Large fetuses are defined by having an AC above the 90th percentile for their gestational age. Reference size for gestational age is defined as not suspected of being FGR or LGA, whereas FGR or LGA represents cases in which either FGR or LGA is suspected. Two sets of AC percentiles were derived from the Fetal Age Machine Learning Initiative data set based on both US and Zambia populations.

Figure 2. Video plus image ensemble model and standard fetal biometry estimates of mean absolute error (MAE) vs ground truth GA (4-week GA windows) are shown for FGR fetuses (A) and severe FGR fetuses (B). Small fetuses are defined by having an AC below the 10th percentile and are considered very small if the AC is below the 3rd percentile. P values are calculated for each 4-week GA window for the 1-sided test with null hypothesis that the median of MAE differences (model MAE minus standard fetal biometry estimates MAE) is positive. Error bars denote 95% CIs.

Discussion

In this diagnostic study, we demonstrated that our AI models provided GA estimates with improved precision. We showed that our AI systems could use both images and fly-to videos, which are already collected as part of the standard fetal biometry measurements. Previous studies²¹ have demonstrated the possibility of automated detection of the standard examination plane from the acquired 3D volume data (voxel); however, 3D volume data are not widely used in the standard of care because of the extra hardware requirements. Our 3 models (image, video, and ensemble) each provided statistically superior GA estimates compared with the clinical standard fetal biometry. These models make use of the entire image or video without needing the operator to accurately place calipers for precise measurements. In our exploratory analyses, our models performed well across different trimesters, devices, and populations from 2 different countries. GA estimation is known to be less accurate as pregnancy progresses since the correlation between GA and physical size of the fetus is less pronounced.²² We found that in the third trimester, our model’s accuracy advantage relative to the clinical standard fetal biometry increased. This is particularly important because accurate GA estimation in the third trimester is essential for managing complications and making appropriate clinical decisions regarding timing of delivery.

Fetal growth restriction is a major complication in pregnancy. Worldwide, low-birth-weight neonates account for 60% of neonatal deaths; the most common contributors to low birth weight are prematurity and fetal growth restriction.²³ In our exploratory analyses, our models had a significant increase in relative performance over fetal biometry when estimating the GA of fetuses who were suspected to be FGR. These subgroups are particularly challenging for fetal biometry estimation as the formulas rely on fetal size measurements. More accurately identifying fetuses that are FGR will help guide critical clinical care decisions, such as antenatal medication administration, antenatal surveillance intervals, and delivery timing and level-of-care needs.²⁴,²⁵

Limitations

One limitation of this study is that several of our subgroups had small sample sizes. Additional ultrasonography examinations in the first trimester and in FGR cases will be needed to confirm our findings. In addition, although this study included patients from 2 countries, it is important to validate the models in a more diverse population to confirm generalizability, because fetal growth patterns differ across populations.²⁶ Body mass index and other relevant factors may also be useful to collect and analyze in future studies. Also of note, our model has not been tested on multifetal gestations or fetuses with abnormal anatomy. The ultrasonography examinations were performed by expert sonographers with experience in fetal ultrasonography. Additional studies with a variety of sonographers will be helpful in evaluating the generalizability of this study. Furthermore, although we show that our models achieve statistical significance, a prospective study is needed to evaluate clinical impact.

Conclusions

In conclusion, this diagnostic study shows that our image model, video model, and ensemble model provide statistically superior GA estimation compared with the clinical standard fetal biometry. Our models had a significant increase in relative performance over fetal biometry in the third trimester and when evaluating fetuses who were FGR. Since our models are built on data collected during routine fetal ultrasonography examinations, they have the potential of being incorporated seamlessly into the routine clinical workflow. Sonographers are in high demand and often have workplace or overuse injuries due to current scanning requirements. Additional studies are needed to investigate whether an AI adjunct can reduce scanning time, assist sonographers, and minimize workplace injury. Our AI models have the potential to empower trained operators to estimate GA with higher accuracy.

Supplement 1.

eFigure 1. Fetal Biometry Ultrasound Standard Plane Images for Fetal Biometry Measurement

eFigure 2. Image Model Architecture

eFigure 3. Video Model Architecture

eFigure 4. Fly-to Video Example Images, the Displayed Frames Are the Central Frame in a Video Sequence

eAppendix. Model Architecture, Model Training, Data Preprocessing, Inference and Confidence Estimate, and Suspected Fetal Growth Restricted (FGR) and Large for Gestational Age (LGA) Cases

eFigure 5. STARD Diagrams: Standard Procedure Performed by Trained Sonographers (FAMLI Study)

eTable 1. Characteristics of Study Participants

eFigure 6. Gestational Age Mean Absolute Error vs Patient Characteristics

eFigure 7. Model and Standard Fetal Biometry Estimate Absolute Error Cumulative Distribution Plots

eTable 2. Gestational Age Estimation Subgroup Analysis Split by Trimester

eTable 3. Subgroup Analysis Split by Country and Manufacturer Device

eTable 4. Comparison Against Alternative Biometry Regression Formulae Split by Trimester and Country

eTable 5. Still Image Model Comparison for Fetuses That Are Small (SGA) or Large (LGA) for Their Gestational Age Based on Abdominal Circumference (AC)

eReferences

Supplement 2.

Data Sharing Statement

References

1. Hadlock FP, Harrist RB, Fearneyhough TC, Deter RL, Park SK, Rossavik IK. Use of femur length/abdominal circumference ratio in detecting the macrosomic fetus. Radiology. 1985;154(2):503-505. doi: 10.1148/radiology.154.2.3880915
2. Hadlock FP, Harrist RB, Sharman RS, Deter RL, Park SK. Estimation of fetal weight with the use of head, body, and femur measurements: a prospective study. Am J Obstet Gynecol. 1985;151(3):333-337. doi: 10.1016/0002-9378(85)90298-4
3. Ravishankar H, Prabhu SM, Vaidya V, Singhal N. Hybrid approach for automatic segmentation of fetal abdomen from ultrasound images using deep learning. 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI). June 16, 2016. doi: 10.1109/ISBI.2016.7493382
4. Kim B, Kim KC, Park Y, Kwon JY, Jang J, Seo JK. Machine-learning-based automatic identification of fetal abdominal circumference from ultrasound images. Physiol Meas. 2018;39(10):105007. doi: 10.1088/1361-6579/aae255
5. Kim HP, Lee SM, Kwon JY, Park Y, Kim KC, Seo JK. Automatic evaluation of fetal head biometry from ultrasound images using machine learning. Physiol Meas. 2019;40(6):065009. doi: 10.1088/1361-6579/ab21ac
6. Płotka S, Klasa A, Lisowska A, et al. Deep learning fetal ultrasound video model match human observers in biometric measurements. Phys Med Biol. 2022;67(4):045013. doi: 10.1088/1361-6560/ac4d85
7. Rasheed K, Junejo F, Malik A, Saqib M. Automated fetal head classification and segmentation using ultrasound video. IEEE Access. 2021;9:160249-160267. doi: 10.1109/ACCESS.2021.3131518
8. Bano S, Dromey B, Vasconcelos F, et al. AutoFB: automating fetal biometry estimation from standard ultrasound planes. In: de Bruijne M, Cattin PC, Cotin S, et al, eds. Medical Image Computing and Computer Assisted Intervention–MICCAI 2021. Springer International Publishing; 2021:228-238. doi: 10.1007/978-3-030-87234-2_22
9. Płotka S, Włodarczyk T, Klasa A, Lipa M, Sitek A, Trzciński T. FetalNet: multi-task deep learning framework for fetal ultrasound biometric measurements. In: Mantoro T, Lee M, Ayu MA, Wong KW, Hidayanto AN, eds. Neural Information Processing. Springer International Publishing; 2021:257-265. doi: 10.1007/978-3-030-92310-5_30
10. Gomes RG, Vwalika B, Lee C, et al. A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment. Commun Med (Lond). 2022;2:128. doi: 10.1038/s43856-022-00194-5
11. Fetal Age Machine Learning Initiative (FAMLI). UNC global women’s health. Published 2018. Accessed 2018. https://researchforme.unc.edu/index.php/en/study-details?rcid=251
12. Pokaprakarn T, Prieto JC, Price JT, et al. AI estimation of gestational age from blind ultrasound sweeps in low-resource settings. NEJM Evidence. 2022;1(5):EVIDoa2100058. doi: 10.1056/EVIDoa2100058
13. Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). November 9, 2017. doi: 10.1109/CVPR.2017.502
14. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. MobileNetV2: inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. December 16, 2018. doi: 10.1109/CVPR.2018.00474
15. Nix DA, Weigend AS. Estimating the mean and variance of the target probability distribution. Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94); 1994. doi: 10.1109/ICNN.1994.374138
16. Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv. Preprint posted online November 4, 2017. doi: 10.48550/arXiv.1612.01474
17. Cawyer CR, Anderson SB, Szychowski JM, Skupski DW, Owen J. Estimating gestational age from ultrasound: external validation of the NICHD formula with comparison to the Hadlock regression. Am J Perinatol. 2019;36(10):985-989. doi: 10.1055/s-0039-1681055
18. Deb S, Mohammed MS, Dhingra U, et al; WHO Alliance for Maternal and Newborn Health Improvement Late Pregnancy Dating Study Group. Performance of late pregnancy biometry for gestational age dating in low-income and middle-income countries: a prospective, multicountry, population-based cohort study from the WHO Alliance for Maternal and Newborn Health Improvement (AMANHI) Study Group. Lancet Glob Health. 2020;8(4):e545-e554. doi: 10.1016/S2214-109X(20)30034-6
19. Duncan JR, Odibo L, Hoover EA, Odibo AO. Prediction of large-for-gestational-age neonates by different growth standards. J Ultrasound Med. 2021;40(5):963-970. doi: 10.1002/jum.15470
20. Buck Louis GM, Grewal J, Albert PS, et al. Racial/ethnic standards for fetal growth: the NICHD Fetal Growth Studies. Am J Obstet Gynecol. 2015;213(4):449.e1. doi: 10.1016/j.ajog.2015.08.032
21. Skelton E, Matthew J, Li Y, et al. Towards automated extraction of 2D standard fetal head planes from 3D ultrasound acquisitions: a clinical evaluation and quality assessment comparison. Radiography (Lond). 2021;27(2):519-526. doi: 10.1016/j.radi.2020.11.006
22. Yang F, Leung KY, Lee YP, Chan HY, Tang MHY. Fetal biometry by an inexperienced operator using two- and three-dimensional ultrasound. Ultrasound Obstet Gynecol. 2010;35(5):566-571. doi: 10.1002/uog.7600
23. Lawn JE, Cousens S, Zupan J; Lancet Neonatal Survival Steering Team. 4 Million neonatal deaths: when? where? why? Lancet. 2005;365(9462):891-900. doi: 10.1016/S0140-6736(05)71048-5
24. Morken NH, Klungsøyr K, Skjaerven R. Perinatal mortality by gestational week and size at birth in singleton pregnancies at and beyond term: a nationwide population-based cohort study. BMC Pregnancy Childbirth. 2014;14:172. doi: 10.1186/1471-2393-14-172
25. Källén K. Increased risk of perinatal/neonatal death in infants who were smaller than expected at ultrasound fetometry in early pregnancy. Ultrasound Obstet Gynecol. 2004;24(1):30-34. doi: 10.1002/uog.1082
26. Workalemahu T, Grantz KL, Grewal J, Zhang C, Louis GMB, Tekola-Ayele F. Genetic and environmental influences on fetal growth vary during sensitive periods in pregnancy. Sci Rep. 2018;8(1):7274. doi: 10.1038/s41598-018-25706-z
