Abstract
Objectives
To develop and evaluate automated segmentation models for the liver and hepatic tumors on 18F-fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT) using SwinUNETR and residual UNET architectures, and to assess their accuracy in complex clinical cases.
Methods
In this single-center retrospective study, 100 patients (48 males, 52 females; mean age 61±14 years) with 18F-FDG-avid hepatic lesions from various primary malignancies were included. Liver segmentation was performed on non-contrast CT images using a pair of SwinUNETR and residual UNET models, and tumor segmentation was performed on masked PET images using a separately trained pair of SwinUNETR and residual UNET models. Model performance was evaluated using the Dice similarity coefficient (DSC), volumetric bias, and Bland-Altman analysis for metabolic tumor volume (MTV) and total lesion glycolysis (TLG).
Results
For liver segmentation, SwinUNETR achieved a median DSC of 97.59% (range: 95.41-98.93%) with a median volumetric bias of -0.94% (LoA: -3.76% to +0.50%), while residual UNET achieved a median DSC of 97.85% (range: 94.81-98.80%) with a median volumetric bias of -0.34% (LoA: -2.63% to +1.16%). For tumor segmentation, SwinUNETR achieved a median DSC of 92.62% (range: 80.75–97.46%), an MTV bias of -8.60% (LoA: -31.62% to +1.21%), and a TLG bias of -6.40% (LoA: -25.58% to +0.76%). Residual UNET achieved a median DSC of 93.07% (range: 80.74–98.18%), MTV bias of -4.33% (LoA: -24.36% to +10.12%), and TLG bias of -11.10% (LoA: -30.8% to +4.52%). Most MTV and TLG measurements were within ±10% of reference values.
Conclusion
Both SwinUNETR and Residual UNET achieved excellent liver segmentation accuracy and clinically acceptable tumor segmentation performance on 18F-FDG PET/CT, with SwinUNETR showing slightly better performance in liver volumetric measurements. These open-source models could be integrated into clinical workflows to automate segmentation tasks, facilitate treatment planning for liver-directed therapies, and support reproducible quantitative imaging analyses.
Introduction
Primary liver malignancies, particularly hepatocellular carcinoma (HCC), represent a major global health burden, ranking as one of the leading causes of cancer-related mortality worldwide (1). In addition to primary tumors, the liver is a frequent site of metastatic spread from various malignancies, including colorectal, breast, and pancreatic cancers (2, 3, 4). Early detection and accurate characterization of hepatic lesions are essential, as the prognosis of patients with liver involvement depends heavily on timely diagnosis and appropriate therapeutic intervention. Proper treatment planning—whether through surgical resection, transplantation, systemic therapy, or locoregional approaches—can significantly improve survival outcomes in both primary and secondary hepatic malignancies.
Accurate delineation of the liver and its tumors plays a pivotal role in several advanced treatment strategies. For therapies such as selective internal radiation therapy (SIRT, also known as radioembolization) and stereotactic body radiotherapy, precise volumetric and spatial characterization of tumor burden is required to optimize dosimetry, minimize healthy tissue damage, and maximize therapeutic efficacy (5-7). Furthermore, quantitative imaging biomarkers that have been shown to be reliable prognostic factors after radioembolization, such as metabolic tumor volume (MTV) and total lesion glycolysis (TLG) on 18F-fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT), rely on precise segmentation to ensure reproducibility across clinical and research settings (8, 9).
Over the past decade, deep learning–based methods have revolutionized medical image segmentation, with convolutional neural networks (CNNs) and, more recently, transformer-based architectures delivering state-of-the-art performance (10-12). Tools such as TotalSegmentator have demonstrated the potential of generalized pre-trained models to achieve high accuracy in multi-organ segmentation tasks (13). In liver imaging, these approaches have significantly reduced the need for labor-intensive manual contouring, thus accelerating clinical workflows and enabling large-scale quantitative studies.
The SwinUNETR architecture, a transformer-based model incorporating hierarchical shifted-window self-attention and UNet-style encoder–decoder design, has shown strong performance in complex 3D segmentation tasks (10, 11). By leveraging global contextual information while preserving fine anatomical details, SwinUNETR has the potential to outperform conventional CNN-based architectures in challenging segmentation scenarios. In clinical reality, diseased livers often present with anatomical distortions caused by ascites, postoperative changes, large tumor burdens, or extensive metastatic infiltration. Such conditions may degrade the performance of general-purpose segmentation models, underscoring the need for disease-specific model training tailored to these complex cases.
Previous studies on liver segmentation using neural networks have generally employed fully convolutional architectures such as Residual UNET and have been performed on contrast-enhanced CT images. For example, in a recent study, Yashaswini et al. (14) evaluated the performance of Residual UNET models for liver and tumor segmentation on CT imaging and reported a Dice score of 91.44% for liver segmentation. Additionally, several other studies have investigated liver and tumor segmentation using CNNs (15, 16). However, the utility and potential superiority of SwinUNETR for liver and tumor segmentation, compared to Residual UNET models, have not yet been explored. Furthermore, although there are multiple studies on tumor segmentation in PET imaging, research combining PET and CT imaging for segmentation remains rare.
In this study, we aimed to develop and evaluate automated segmentation models, using both SwinUNETR-V2 and residual UNET architectures, to segment the liver and hepatic tumors from 18F-FDG PET/CT images. Our goal was to assess their accuracy in the context of challenging clinical cases and to explore the feasibility of disease-specific segmentation models that can maintain robust performance in anatomically complex livers. The developed models are also intended for use in conjunction with the previously developed radioembolization dosimetry module for 3D Slicer (17).
Materials and Methods
Patients and Study Design
This single-center, retrospective study included patients with 18F-FDG-avid hepatic lesions from various malignancies who underwent 18F-FDG PET/CT imaging from January 2025 to July 2025. Written informed consent was obtained from all patients before imaging. Exclusion criteria were: (1) significant respiratory artefacts; (2) artefacts secondary to patient motion; and (3) artefacts secondary to metallic objects or prostheses on CT imaging. The study was approved by the Ethics Committee of Harran University (approval no: HRÜ-25.11.02, date: 16.06.2025), with additional approval from the institutional review board. The developed segmentation models and training scripts (18, 19), images of four patients for testing (20), and the SlicerAether segmentation module for 3D Slicer (18) are available in public repositories.
18F-FDG PET/CT Protocol and Preprocessing of the Data
Imaging was performed using a Siemens Biograph Horizon™ 4R system. Patients fasted for at least 6 hours before imaging, and blood glucose levels were checked prior to the scan. Those with a blood glucose level above 200 mg/dL did not undergo scanning. Images were acquired from the vertex to the proximal femur with the patient in the supine position. Whole-body 18F-FDG PET/CT imaging was performed approximately 1 h after an intravenous injection of 18F-FDG at 3.7 MBq/kg. For PET/CT imaging, PET images were acquired for 90 seconds per bed position and were reconstructed using attenuation correction measured from non-contrast CT images. For reconstruction of PET images, the TrueX+TOF (UltraHD-PET) algorithm was used with 4 iterations and 10 subsets, a 5-mm post-processing Gaussian filter, and a 180 × 180 matrix. The resulting voxel size was 4.11392 × 4.11392 × 1.50 mm. All PET images were converted to standardized uptake values (SUVs), normalized to body weight, and resampled to an isotropic voxel size of 2 × 2 × 2 mm prior to training. No further normalization other than conversion to SUV values was used.
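As a rough illustration of the body-weight SUV conversion mentioned above, the following Python sketch applies the standard definition: tissue activity concentration divided by the decay-corrected injected activity per gram of body weight. The function name, its arguments, and the example values are illustrative assumptions and are not taken from the study's scripts.

```python
F18_HALF_LIFE_MIN = 109.77  # physical half-life of fluorine-18 in minutes


def suv_body_weight(activity_bq_per_ml, weight_kg, injected_bq, minutes_since_injection):
    """Body-weight-normalized SUV (g/mL) for one voxel or region."""
    # Decay-correct the injected activity to the acquisition time.
    decayed_bq = injected_bq * 2.0 ** (-minutes_since_injection / F18_HALF_LIFE_MIN)
    # SUV = concentration / (remaining activity per gram of body weight).
    return activity_bq_per_ml * (weight_kg * 1000.0) / decayed_bq


# Hypothetical 70 kg patient injected with 3.7 MBq/kg and scanned 60 min later.
example_suv = suv_body_weight(5000.0, 70.0, 3.7e6 * 70.0, 60.0)
```

In practice the activity concentration, injected dose, and timing would be read from the DICOM headers rather than passed by hand.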
Non-contrast-enhanced CT images were acquired at 130 kV with a variable tube current modulated according to patient weight using CareDose4D (Siemens Healthineers) and reconstructed with a 512 × 512 matrix. The resulting voxel size was 1.367 × 1.367 × 1.50 mm. As with the PET images, the CT images were resampled to an isotropic voxel size of 2 × 2 × 2 mm before training. For CT images, voxel intensities were normalized using a linear scaling transformation in which values between -135 Hounsfield units (HU) and +215 HU were mapped to the range 0.0-10.0; values outside this range were clipped to the nearest boundary.
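The CT intensity windowing described above can be sketched in a few lines of NumPy. This is a minimal illustration using the HU window and output range stated in the text; the function name and defaults are assumptions, and the authors' own preprocessing code may differ.

```python
import numpy as np


def normalize_ct(hu, hu_min=-135.0, hu_max=215.0, out_max=10.0):
    """Clip HU values to [hu_min, hu_max] and scale linearly to [0, out_max]."""
    clipped = np.clip(np.asarray(hu, dtype=np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min) * out_max


# Air, lower bound, soft tissue, upper bound, and dense bone/metal.
scaled = normalize_ct([-1000, -135, 40, 215, 3000])
```

MONAI's `ScaleIntensityRange` transform offers an equivalent mapping with clipping; whether it was used here is not stated in the text.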
For preprocessing, PET and CT volumes were cropped to a bounding box encompassing the upper abdomen to remove empty voxels and reduce computational load. The liver was manually segmented on CT images for all patients, and the resulting liver masks were used to zero out voxels outside the liver in the PET volumes. A spherical reference volume of interest was placed in the non-tumoral liver parenchyma, and a threshold equal to 1.5 times the liver reference SUVmean was used for manual tumor segmentation. Afterwards, the CT images, masked PET images, and liver and tumor segmentation masks were saved for further processing.
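The masking and thresholding steps above might look roughly like the following sketch. The function names and the toy arrays are hypothetical, and the use of a strict "greater than" comparison at the threshold is an assumption, since the text only states that the threshold equals 1.5 times the liver reference SUVmean.

```python
import numpy as np


def mask_pet_to_liver(suv, liver_mask):
    """Zero out SUV voxels that fall outside the liver mask."""
    return np.where(np.asarray(liver_mask, bool), suv, 0.0)


def threshold_tumor(masked_suv, reference_suvmean, factor=1.5):
    """Label voxels whose SUV exceeds factor x liver reference SUVmean."""
    return masked_suv > factor * reference_suvmean


# Toy 2D example standing in for a 3D volume.
suv = np.array([[1.0, 2.0, 6.0], [8.0, 2.5, 9.0]])
liver = np.array([[0, 1, 1], [1, 1, 0]])
masked = mask_pet_to_liver(suv, liver)
tumor = threshold_tumor(masked, reference_suvmean=2.0)
```

Because voxels outside the liver are set to zero before thresholding, high uptake in neighboring organs (the 9.0 voxel in the toy example) cannot be labeled as tumor.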
Training volumes were split into overlapping 96 × 96 × 96-voxel patches, yielding 680 training pairs. No patching was applied during testing; instead, a sliding-window inference with the same patch size was used. Preprocessing was performed using 3D Slicer (version 5.9) and custom Python scripts (21, 22).
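To make the overlapping-patch scheme concrete, the sketch below enumerates patch corner coordinates for a volume of arbitrary shape. The stride of 48 voxels (50% overlap) and the example volume shape are assumptions for illustration only; the text does not state the overlap that produced the 680 training pairs.

```python
from itertools import product


def patch_starts(length, patch=96, stride=48):
    """Start indices of patches along one axis, covering the full extent."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # final patch sits flush with the edge
    return starts


def patch_corners(shape, patch=96, stride=48):
    """Corner coordinates of all overlapping 3D patches for a volume shape."""
    return list(product(*(patch_starts(n, patch, stride) for n in shape)))


# Hypothetical cropped upper-abdomen volume of 200 x 160 x 96 voxels.
corners = patch_corners((200, 160, 96))
```

Sliding-window inference walks a similar grid at test time and blends the overlapping predictions; MONAI provides `sliding_window_inference` for this purpose.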
Model Architecture, Loss Function, and Training Parameters
A volumetric segmentation model based on the SwinUNETR architecture, originally proposed by Hatamizadeh et al. (10) and later extended by He et al. (11) as SwinUNETR-V2, was implemented. SwinUNETR-V2 integrates the Swin Transformer with a UNET-style encoder–decoder and residual convolutional blocks at the start of each Swin stage, enabling high representational capacity for 3D medical images (10, 23). In this study, one SwinUNETR-V2 model was trained to segment the liver in CT images, and another model with identical parameters was trained to segment tumors in masked PET images. Both models used a feature size of 24, transformer depths of (2, 2, 2, 2), attention heads of (3, 6, 12, 24), a dropout path rate of 0.0, input volumes of 96 × 96 × 96 voxels, and gradient checkpointing to reduce memory usage. Each model had approximately 18.3 million trainable parameters and was trained with a batch size of 1.
We also utilized UNET-structured models with residual blocks for comparison with SwinUNETR models (12). The 3D residual UNET model was configured with an input batch size of 4 and a total of 76.8M trainable parameters. This network employed five resolution levels with channel sizes of 64, 128, 256, 512, and 1024; two residual units per level; strides of (2, 2, 2, 2) for down- and up-sampling; and 3 × 3 × 3 convolution kernels.
The implementation was based on the PyTorch and MONAI frameworks and executed on a graphics processing unit (GPU)-enabled system, allowing efficient handling of 3D volumetric data (20, 21, 22). The GPU and central processing unit models used for training were an NVIDIA GeForce RTX 4060 with 8 GB of VRAM and an Intel Core i3-9100F (3.60 GHz). The Dice similarity coefficient (DSC) was calculated as follows (24, 25, 26):
Dice score = 2 × |X ∩ Y| / (|X| + |Y|)
Here, X denotes the set of voxels in the predicted segmentation; Y denotes the set of voxels in the ground-truth segmentation; and |X ∩ Y| denotes the number of overlapping voxels. Dice loss was defined as follows:
Dice loss = 1 – Dice score.
In addition, cross-entropy loss values were calculated, and a hybrid loss function was used for training:
Training Loss Function = 0.5 × Dice Loss + 0.5 × Cross Entropy Loss
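The Dice score and the equally weighted hybrid loss can be illustrated with a small NumPy sketch. This is not the training code: it assumes a binary (foreground/background) task, uses a soft Dice term on predicted probabilities, and returns a Dice score of 1 when both masks are empty, all of which are illustrative choices.

```python
import numpy as np


def dice_score(pred, ref):
    """Dice = 2|X ∩ Y| / (|X| + |Y|) for two binary masks."""
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    denom = pred.sum() + ref.sum()
    # Convention (assumption): two empty masks count as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, ref).sum() / denom


def hybrid_loss(prob, ref, eps=1e-7):
    """0.5 x soft Dice loss + 0.5 x binary cross-entropy on foreground probabilities."""
    prob = np.clip(np.asarray(prob, float), eps, 1.0 - eps)
    ref = np.asarray(ref, float)
    soft_dice = 2.0 * (prob * ref).sum() / (prob.sum() + ref.sum() + eps)
    cross_entropy = -np.mean(ref * np.log(prob) + (1.0 - ref) * np.log(1.0 - prob))
    return 0.5 * (1.0 - soft_dice) + 0.5 * cross_entropy
```

In a MONAI-based pipeline, a comparable combination is available as `DiceCELoss` with `lambda_dice=0.5` and `lambda_ce=0.5`; the text does not say whether that class was used.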
Testing of the Models and Performance Evaluation Metrics
A total of four models were developed for the segmentation of the liver and liver lesions. For liver segmentation, the reference liver volume, the predicted liver volume, and their intersection were computed. Model performance was assessed using the DSC, where a value of 1 indicates perfect overlap between the predicted and reference segmentation, and a value of 0 indicates no overlap.
For tumor segmentation on PET images, the DSC was also used as the primary evaluation metric. In addition, MTV and TLG were calculated for both the reference and model-predicted segmentations. TLG was defined as:
TLG = MTV × SUVmean
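As a worked example of this definition, the sketch below computes MTV, SUVmean, and TLG from a tumor mask on an SUV volume resampled to 2 × 2 × 2 mm voxels, as described in the preprocessing section. The function name and the synthetic arrays are illustrative assumptions.

```python
import numpy as np


def mtv_and_tlg(suv, tumor_mask, voxel_size_mm=(2.0, 2.0, 2.0)):
    """Return (MTV in mL, SUVmean, TLG) for a tumor mask on an SUV volume."""
    mask = np.asarray(tumor_mask, bool)
    voxel_ml = float(np.prod(voxel_size_mm)) / 1000.0  # mm^3 -> mL
    mtv = float(mask.sum()) * voxel_ml
    suv_mean = float(np.asarray(suv, float)[mask].mean()) if mask.any() else 0.0
    return mtv, suv_mean, mtv * suv_mean


# Synthetic lesion: 1000 voxels of 0.008 mL each with a uniform SUV of 5.
suv = np.full((10, 10, 10), 5.0)
mask = np.ones((10, 10, 10), dtype=bool)
mtv, suv_mean, tlg = mtv_and_tlg(suv, mask)
```

For this synthetic lesion the MTV is 8 mL and the TLG is 8 × 5 = 40 g.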
Statistical Analysis
Descriptive statistics were reported as counts and percentages for categorical variables, and as mean ± standard deviation and median (range) for continuous variables. A p-value less than 0.05 was considered statistically significant for all analyses. Dice scores obtained from the SwinUNETR and residual UNET models were compared using the Wilcoxon signed-rank test.
For tumor segmentation, predicted and reference MTV and TLG values were compared using Bland-Altman plots. The bias, along with 95% confidence intervals (CIs) and limits of agreement (LoA), was calculated for both models. All statistical analyses were performed using RStudio (version 2025.05.1), IBM SPSS Statistics (version 27), and BA-plotteR (27, 28).
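A percentage-based agreement summary of this kind could be computed along the following lines. This sketch assumes that bias is the median of (predicted - reference) / reference × 100 and that the LoA are the 2.5th and 97.5th percentiles of those differences (a nonparametric variant); the exact definitions used by the statistical software in this study are not given in the text.

```python
import numpy as np


def percent_bias_summary(predicted, reference):
    """Median percentage difference and nonparametric 95% limits of agreement."""
    predicted = np.asarray(predicted, float)
    reference = np.asarray(reference, float)
    pct_diff = (predicted - reference) / reference * 100.0
    lower, upper = np.percentile(pct_diff, [2.5, 97.5])
    return float(np.median(pct_diff)), float(lower), float(upper)


# Hypothetical predicted vs. reference volumes in mL.
bias, loa_low, loa_high = percent_bias_summary([90.0, 95.0, 100.0], [100.0, 100.0, 100.0])
```

A negative bias under this definition means the model underestimates the reference value on average.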
Results
Patients and General Characteristics
A total of 110 patients were initially considered for inclusion. Six patients were excluded due to respiratory artifacts, and four were excluded due to metallic artifacts in the upper abdominal CT images. Consequently, 100 patients (48 males, 52 females) with various malignancies were included in the study. The mean age was 61±14 years. The most common primary malignancies were breast cancer (28%), colorectal carcinoma (23%), lung cancer (13%), gastric cancer (8%), and pancreatic cancer (6%). The remaining patients had HCC, lymphoma, ovarian cancer, esophageal cancer, gallbladder cancer, cervical cancer, soft tissue sarcoma, tumors of unknown origin, or thyroid cancer. More than half of the patients (55%) had more than five FDG-avid liver lesions, 22% had 2-5 FDG-avid lesions, and 23% had a single FDG-avid lesion. The liver reference SUVmean was 2.17±0.48 g/mL, and the mean tumor SUVmax was 10.52±7.50 g/mL.
Patients were randomly assigned to a training set (n=85) and a test set (n=15). In the test set, nine patients were female and six were male. Primary malignancies in this group included breast cancer (n=6), colorectal carcinoma (n=3), lung cancer (n=2), and lymphoma (n=2). The remaining patients had pancreatic cancer, tumors of unknown origin, or esophageal cancer. The mean age of the test group was 59±15 years, and the mean reference liver SUVmean was 2.25±0.32 g/mL.
Segmentation of Liver on CT Images
In the test group, the median reference liver volume was 1679 mL (range: 887.6-2536.3 mL). The SwinUNETR model achieved a median Dice score of 97.59% (range: 95.41%-98.93%). The median liver volume estimated by SwinUNETR was 1672.2 mL (range: 872.9-2414.4 mL). Bland-Altman analysis demonstrated a median bias of -0.94% (95% CI: -1.05 to -0.64), with lower and upper LoA of -3.76% and +0.50%, respectively (Figure 1). These results indicate that SwinUNETR slightly underestimated the liver volume but maintained high segmentation accuracy.
The Residual UNET model achieved a median Dice score of 97.85% (range: 94.81-98.80%). The median liver volume estimated by Residual UNET was 1693.17 mL (range: 891.24-2361.7 mL). Bland-Altman analysis revealed a median bias of -0.34% (95% CI: -0.58 to -0.17), with lower and upper LoA of -2.63% and +1.16%, respectively. When comparing the Dice scores of the two models, SwinUNETR had higher scores in 13 patients (87%) and lower scores in 2 patients (13%) (p=0.036; Figure 2). The DSC values for each patient and the differences between SwinUNETR and Residual UNET models in liver segmentation are given in Table 1.
Segmentation of Tumors on Masked PET Images
In the test group, the median number of liver tumors was 6 (range: 1-39), and the median reference MTV was 58.71 mL (range: 2.20-374.20 mL). The median SUVmax, SUVmean, and TLG values in the reference segmentations were 9.98 g/mL (range: 5.46-18.65 g/mL), 4.76 g/mL (range: 3.20-9.21 g/mL), and 337.92 g (range: 8.76-3447.90 g), respectively.
The SwinUNETR model achieved a median Dice score of 92.62% (range: 80.75%-97.46%). The median MTV and TLG estimated by SwinUNETR were 50.84 mL (range: 1.62-343.36 mL) and 287.11 g (range: 6.86-3334.18 g), respectively. In the Bland-Altman analysis, the SwinUNETR model demonstrated a median bias of -8.60% (95% CI: -16.8 to -2.15) for MTV, with lower and upper LoA of -31.62% and +1.21%, respectively. Similarly, the SwinUNETR model had a median bias of -6.40% (95% CI: -10.08 to -2.13) for TLG, with lower and upper LoA of -25.58% and +0.76%, respectively.
The Residual UNET model achieved a median Dice score of 93.07% (range: 80.74-98.18%). The median MTV and TLG estimated by Residual UNET were 56.22 mL (range: 1.70-400.15 mL) and 269.20 g (range: 6.40-3015.43 g), respectively. In the Bland-Altman analysis, the Residual UNET model demonstrated a median bias of -4.33% (95% CI: -10.62 to -1.59) for MTV, with lower and upper LoA of -24.36% and +10.12%, respectively. Similarly, the Residual UNET model showed a median bias of -11.10% (95% CI: -16.87 to -6.22) for TLG, with lower and upper LoA of -30.8% and +4.52%, respectively. When Dice scores were compared, SwinUNETR outperformed Residual UNET in 8 patients (53%) and scored lower in 7 patients (47%) (p=0.570). Examples of patient segmentation results are shown in Figures 3 and 4. The DSC values for each patient and the differences between SwinUNETR and Residual UNET models in tumor segmentation are given in Table 2.
Discussion
In this study, both the SwinUNETR and residual UNET models achieved excellent performance in liver segmentation on CT images, with median Dice scores exceeding 97% and narrow LoA. Although the SwinUNETR model slightly outperformed the residual UNET in terms of Dice score, the difference was modest, and both approaches demonstrated highly reliable volumetric agreement with reference segmentations. While both models may produce errors in patients with liver disease such as hepatosteatosis (Figure 5) or ascites (Figure 2), these results indicate that transformer-based and residual convolutional architectures are viable options for accurate hepatic segmentation in clinical and research settings.
For tumor segmentation on masked PET images, both models also demonstrated high performance, although their accuracy was lower than for liver segmentation. This is not unexpected, as tumor segmentation in FDG PET is inherently more challenging. Factors such as image noise, heterogeneous tracer uptake, and the presence of physiological uptake in adjacent structures can introduce false-positive voxels. Furthermore, variations in SUV thresholding methods can lead to differences in measured MTV and TLG, even for the same lesion. Despite these challenges, the majority of the predicted MTV and TLG values in our study were within ±10% of reference measurements, a level of agreement that is likely sufficient for many clinical applications, including treatment planning and response assessment. From a practical standpoint, these models could be integrated into clinical workflows to automate time-consuming segmentation tasks, assist in treatment planning for radiotherapy or radioembolization, and provide reproducible volumetric measurements for research studies. Given their open-source availability, they can also serve as a foundation for further development, including fine-tuning for specific scanner protocols or disease subtypes.
Our results compare favorably with the literature. Previous studies have reported Dice scores for liver segmentation in the range of 94-97% using deep learning methods (29, 30, 31), placing both of our models at the higher end of this reported range. In tumor segmentation using deep learning methods, Leung et al. (32) developed models using 18F-FDG PET/CT and Gallium-68 prostate-specific membrane antigen PET/CT and showed that median DSCs of up to 0.83 can be achieved for patients with lung cancer, melanoma, lymphoma, and prostate cancer. The Dice scores achieved by both of our models (median >92%) indicate a high degree of accuracy, particularly given the heterogeneity of the test cohort. However, because we masked the PET images to the liver as the first step of a two-pass algorithm, a direct comparison with these results was not feasible. Our approach differs by being specifically optimized for hepatic tumor segmentation on PET, potentially enhancing performance in cases of complex intrahepatic disease. In this context, Luo et al. (33) investigated the role of deep learning models in the detection and diagnosis of focal lesions in 18F-FDG PET/CT images and achieved a Dice coefficient of 0.740. In addition, their models demonstrated high performance in differentiating benign from malignant liver nodules.
Study Limitations
Our study has several limitations. First, it was conducted at a single center; external validation on datasets from other institutions would be necessary to confirm generalizability. Second, although our models demonstrated high accuracy, tumor segmentation performance was still influenced by PET noise and by the thresholding approach used to generate ground truth. Third, we were unable to compare our results directly with TotalSegmentator because our ground-truth labels include the intrahepatic segments of the inferior vena cava and the portal vein, which TotalSegmentator delineates as separate structures. Finally, although our test set contained a range of primary and metastatic lesions, sample sizes for certain tumor subtypes were relatively small, which may limit the generalizability of our findings across all disease presentations.
Conclusion
Both the SwinUNETR and residual UNET models achieved excellent accuracy for liver segmentation and high performance for hepatic tumor segmentation on 18F-FDG PET/CT, with most volumetric measurements falling within clinically acceptable limits. While SwinUNETR demonstrated slightly superior performance, both architectures showed potential for integration into clinical workflows and research pipelines. Given their open-source availability and adaptability, these models could support automated, reproducible segmentation in treatment planning and quantitative imaging.


