Large-Scale Chest X-Ray Pathology Detection

CheXVision.

Comparing a custom SE-ResNet trained from scratch against a DenseNet-121 transfer-learning pipeline on 112,120 NIH chest X-rays — under identical evaluation conditions.

Team BIG D(ATA) — AIN Program

Dataset: NIH Chest X-ray14 · 112,120 images

Tasks: 14-label pathology · Binary Normal/Abnormal

Code: github.com/arudaev/chexvision

The Problem

Radiologists read billions of chest X-rays a year.
The bottleneck is attention, not imaging.

Chest radiography is the most common medical imaging procedure worldwide. Getting reliable, low-cost triage to every patient — especially in under-served regions — is a true ML-for-healthcare problem.

Scale

2B+

Chest X-rays acquired globally each year.

Coverage gap

2/3

Of the world's population lacks timely access to radiology.

The ask

14

Pathologies to be detected simultaneously from a single frontal view.


NIH Chest X-ray14

A genuine big-data benchmark — 112,120 frontal X-rays, 14 weak labels.

Images

112,120

Frontal-view chest X-rays · 1,024² source, resized to 320².

Shards

36

Parquet shards on HF Hub · ~7.97 GB total.

Labels

14

Image-level, noisy: extracted from reports by NLP.

Split

Train

78,468

70%

Validation

11,210

10%

Test

22,442

20%

14 Pathology Classes

Atelectasis · Cardiomegaly · Effusion · Infiltration · Mass · Nodule · Pneumonia · Pneumothorax · Consolidation · Edema · Emphysema · Fibrosis · Pleural Thickening · Hernia

Two derived tasks

Multi-label — an image may carry zero, one, or several conditions simultaneously.
Binary — "No Finding" ↦ Normal, any label ↦ Abnormal (fast triage head).
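A minimal sketch of how both targets could be derived from a single label string. It assumes the raw NIH metadata convention, where findings are joined with `|` and healthy images carry the literal `No Finding`. The helper name `encode_labels` and the spelling `Pleural_Thickening` are illustrative assumptions, not the project's confirmed code.

```python
from typing import List, Tuple

# Class order is an assumption; any fixed order works as long as it is reused.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def encode_labels(finding_labels: str) -> Tuple[List[int], int]:
    """Return (14-dim multi-hot vector, binary abnormal flag)."""
    findings = {f.strip() for f in finding_labels.split("|") if f.strip()}
    findings.discard("No Finding")  # "No Finding" maps to the all-zero vector
    multi_hot = [1 if name in findings else 0 for name in PATHOLOGIES]
    abnormal = 1 if any(multi_hot) else 0  # any pathology => Abnormal
    return multi_hot, abnormal
```

Deriving the binary flag from the multi-hot vector keeps the two tasks consistent by construction: an image can never be "Normal" while carrying a pathology label.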


Dual-Task Architecture

One forward pass, two clinically-aligned predictions.

A shared feature extractor feeds two heads — a fast triage decision and a detailed pathology profile. Trained jointly with a combined loss so both objectives shape the representation.
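The shared-backbone idea can be sketched in PyTorch as follows. The tiny convolutional backbone is a hypothetical stand-in for either real architecture, and the equal weighting of the two loss terms (`triage_weight=1.0`) is an assumption, since the slides do not state how the losses are combined.

```python
import torch
from torch import nn


class DualHeadNet(nn.Module):
    """Shared feature extractor feeding a 14-label head and a binary triage head."""

    def __init__(self, in_channels: int = 3, feature_dim: int = 512,
                 num_pathologies: int = 14):
        super().__init__()
        # Hypothetical stand-in backbone; the real models are far deeper.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, feature_dim),
            nn.ReLU(inplace=True),
        )
        self.pathology_head = nn.Linear(feature_dim, num_pathologies)
        self.triage_head = nn.Linear(feature_dim, 1)

    def forward(self, x):
        features = self.backbone(x)  # computed once, used by both heads
        return self.pathology_head(features), self.triage_head(features).squeeze(1)


def combined_loss(path_logits, triage_logits, path_targets, triage_targets,
                  triage_weight: float = 1.0):
    """Sum of the two BCE terms; the weighting is an assumed hyperparameter."""
    bce = nn.functional.binary_cross_entropy_with_logits
    return bce(path_logits, path_targets) + triage_weight * bce(triage_logits, triage_targets)
```

Because both terms backpropagate through the same `features` tensor, gradients from the triage task and the pathology task both update the backbone.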


— PART 01

Approach.

Two architectures, one data pipeline, one evaluation protocol. A custom SE-ResNet trained from scratch vs. DenseNet-121 transferred from ImageNet — a fair, head-to-head comparison under identical conditions.


Model 1 · Built from scratch

Custom SE-ResNet · depth [3, 4, 6, 3] · Kaiming init · no pretraining.

Parameters · 23M

Residual stages — 4 stages, ResNet-50 topology.
Squeeze-Excitation — per-block channel attention recalibrates features.
Dual heads share a 512-dim global-pooled representation.
Purpose — strong baseline trained without outside data.
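For readers unfamiliar with Squeeze-Excitation, a minimal PyTorch sketch of the channel-attention step is shown here. The reduction ratio of 16 is the default from the original SE paper and is an assumption for this project, as is the class name `SEBlock`.

```python
import torch
from torch import nn


class SEBlock(nn.Module):
    """Squeeze-Excitation: rescale each channel by a learned gate in (0, 1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: one scalar per channel
        self.fc = nn.Sequential(             # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # broadcast the per-channel gate over H and W
```

In an SE-ResNet the block is typically applied to the residual branch just before it is added to the skip connection.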


Model 2 · Transfer Learning (CheXNet)

DenseNet-121 · ImageNet-pretrained · dense connectivity, 7.9M parameters.

Parameters · 7.9M

Dense connectivity — every layer sees all prior feature maps.
ImageNet init — strong visual priors before medical fine-tuning.
Feature layer 1024 → 512, ReLU, Dropout 0.3.
Follows CheXNet (Rajpurkar et al., 2017) — our gold-standard reference.


Phased Fine-Tuning Schedule

Freeze the backbone first. Unfreeze once the heads have caught up.

Standard transfer-learning hygiene: randomly initialised heads trained at a high learning rate push large gradients into the pretrained backbone and degrade its weights. We split training into two phases.

Phase 1 · Epochs 1–5

Backbone frozen. Only the new feature layer + two heads train, at lr = 1e-3. Heads get calibrated on medical features without corrupting ImageNet weights.

Phase 2 · Epochs 6–60

End-to-end fine-tuning at lr = 1e-4. Every layer now adapts to chest X-ray statistics; cosine schedule keeps convergence stable.
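The two phases can be summarised as a small schedule function. This is a sketch under stated assumptions: the cosine decay is taken to start at the unfreeze point and fall toward zero by epoch 60, and the name `phase_config` is hypothetical. The slides give only the phase boundaries and the two learning rates.

```python
import math


def phase_config(epoch, freeze_epochs=5, total_epochs=60,
                 head_lr=1e-3, finetune_lr=1e-4):
    """Return (backbone_trainable, learning_rate) for a 1-indexed epoch."""
    if not 1 <= epoch <= total_epochs:
        raise ValueError("epoch out of range")
    if epoch <= freeze_epochs:
        # Phase 1: backbone frozen, only the feature layer and heads train.
        return False, head_lr
    # Phase 2: everything trains; assumed cosine decay from finetune_lr.
    progress = (epoch - freeze_epochs - 1) / (total_epochs - freeze_epochs)
    lr = 0.5 * finetune_lr * (1.0 + math.cos(math.pi * progress))
    return True, lr
```

In a training loop, the boolean would drive `requires_grad` on the backbone parameters and the rate would be written into the optimizer's parameter groups at the start of each epoch.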


End-to-end training pipeline

From HF dataset snapshot to a tagged checkpoint on HF Hub.

Step 01 · Data

snapshot_download pins a Parquet revision; 78,468 / 11,210 / 22,442 split.

Step 02 · Augmentation

CLAHE (LAB space), HFlip, Rotate ±15°, RandomAffine, ColorJitter, GaussianBlur, RandomErasing.

Step 03 · Forward pass

Mixed precision torch.cuda.amp · fp16 — ~2× throughput on T4.

Step 04 · Loss & optimisation

Weighted BCE with pos_weight per class · AdamW · CosineAnnealingLR · grad clip 1.0 · grad accum ×4 → effective batch 96.

Step 05 · Checkpoint + Hub

Best epoch by val macro AUC-ROC; model, config, history & card pushed to HF Hub.
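Steps 03 and 04 can be sketched as one training epoch. Gradient accumulation ×4 with an effective batch of 96 implies a micro-batch of 24. The function names, the negatives-over-positives formula for `pos_weight`, and the choice to drop a trailing incomplete accumulation group are assumptions made for illustration.

```python
import contextlib

import torch
from torch import nn


def pos_weight_from_counts(pos_counts, num_samples):
    """Per-class pos_weight = negatives / positives (assumed weighting rule)."""
    pos = torch.as_tensor(pos_counts, dtype=torch.float32).clamp(min=1.0)
    return (num_samples - pos) / pos


def train_one_epoch(model, loader, optimizer, criterion, scaler, device,
                    accum_steps=4, max_norm=1.0):
    """Run one epoch and return the number of optimizer updates performed."""
    model.train()
    use_amp = device.type == "cuda"
    updates = 0
    optimizer.zero_grad(set_to_none=True)
    for step, (images, targets) in enumerate(loader, start=1):
        images, targets = images.to(device), targets.to(device)
        # fp16 autocast only on GPU; plain fp32 elsewhere.
        ctx = (torch.autocast(device_type="cuda", dtype=torch.float16)
               if use_amp else contextlib.nullcontext())
        with ctx:
            # Divide so the accumulated gradient matches one large-batch mean.
            loss = criterion(model(images), targets) / accum_steps
        scaler.scale(loss).backward()
        if step % accum_steps == 0:
            scaler.unscale_(optimizer)  # clip in true gradient units
            nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            updates += 1
    return updates  # gradients of a trailing partial group are discarded
```

A caller would pair this with `nn.BCEWithLogitsLoss(pos_weight=...)`, `torch.optim.AdamW`, and a `torch.cuda.amp.GradScaler` that is enabled only when a GPU is present. Stepping `CosineAnnealingLR` once per epoch is a further assumption.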


The "Big Data" side of the project

A cloud-only workflow: 8 GB of shards, no local GPU, reproducible runs.

Source code

GitHub

CI on every push — ruff + pytest (51 tests) + mypy.

Dataset

HF Hub

36 Parquet shards · 320² · pinned revision for reproducibility.

Training compute

Kaggle Kernels

Free T4 / P100 GPU · 30 h/week · kernels bundle the entire source tree.

Model artifacts · Demo

HF Hub + Spaces

Auto-upload of checkpoints; Streamlit demo deploys on push to main.

Why this matters for "Big Data"

Streaming metadata — EDA runs locally without downloading the 8 GB dataset.
Sharded Parquet — readers parallelise across shards; disk is never the bottleneck.
Mixed precision + grad accumulation — effective batch 96 on a single 16 GB T4.

Reproducibility guardrails

Pinned dataset revision — 44443e6… — same bytes, every run.
Kernel payload — dispatch.py embeds the current source tree base64-encoded.
Tagged model cards — config + metrics + history.json alongside every checkpoint.
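The pinned-revision guardrail could look like the sketch below. `snapshot_download` is the real `huggingface_hub` function, but the wrapper name, the `local_dir` default, the Parquet-only filter, and the rule of rejecting anything that is not a full 40-character commit hash are assumptions. The repository id and the full hash are deliberately left as arguments because the slides do not give them in full.

```python
import re


def is_commit_hash(revision: str) -> bool:
    """True only for a full 40-character hexadecimal Git commit hash."""
    return re.fullmatch(r"[0-9a-f]{40}", revision) is not None


def fetch_pinned_dataset(repo_id: str, revision: str, local_dir: str = "data") -> str:
    """Download Parquet shards at an exact commit; refuse branches and tags."""
    if not is_commit_hash(revision):
        raise ValueError("revision must be a full commit hash, not a branch or tag")
    # Imported lazily so the check above works without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        local_dir=local_dir,
        allow_patterns=["*.parquet"],
    )
```

Rejecting branch names such as `main` is what makes the run reproducible: a branch can move, a commit hash cannot.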


— PART 02

Results.

Measured on the held-out validation split. Primary metric: macro-averaged AUC-ROC across all 14 pathologies. We also report binary AUC / F1 for the triage head.
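To make the primary metric concrete, here is a dependency-free sketch of macro-averaged AUC-ROC. It uses the pairwise (Mann-Whitney) definition of AUC, which is quadratic in the number of samples and meant for illustration only; in practice a library routine such as scikit-learn's `roc_auc_score` with `average="macro"` would be the usual choice. Skipping classes that lack positives or negatives is an assumption.

```python
def binary_auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_rows, label_rows):
    """Unweighted mean of per-class AUCs over the classes where AUC is defined."""
    num_classes = len(label_rows[0])
    aucs = []
    for c in range(num_classes):
        auc = binary_auc([row[c] for row in score_rows],
                         [row[c] for row in label_rows])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)
```

Macro averaging gives a rare class like Hernia the same weight as a common one like Infiltration, which is why it is a demanding summary for an imbalanced 14-label problem.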


Headline — best validation metrics

DenseNet-121 matches the CheXNet benchmark. The custom CNN trails by ~4.5 pp.

DenseNet-121 · Transfer

Best @ epoch 18

Macro AUC-ROC

0.8459

Binary AUC-ROC

0.7867

Binary F1

0.6736

Early-stopped — validation loss began diverging after epoch 20 while AUC plateaued.

Custom SE-ResNet · From scratch

Best @ epoch 60

Macro AUC-ROC

0.8008

Binary AUC-ROC

0.7571

Binary F1

0.6474

Still climbing slowly at epoch 60: the gap is about compute budget, not ceiling.

Takeaway. Transfer learning wins on every metric, with roughly 3× fewer parameters (7.9M vs. ~23M) and about 3× fewer epochs to its best checkpoint (18 vs. 60+). ImageNet priors matter, even for medical imagery.


Validation macro AUC-ROC over training

DenseNet saturates fast at epoch 6. The scratch CNN grinds upward for 60 epochs.

[Chart legend: DenseNet-121 · Transfer; Custom SE-ResNet · Scratch; best checkpoint; CheXNet benchmark (0.841)]


DenseNet-121 · per-pathology AUC at best epoch

Strong on structural findings. Weakest on the noisiest labels.

Top performers

Edema · 0.9255, Hernia · 0.9242, Emphysema · 0.9107 — visually distinctive, well-separated classes.

Trailing

Infiltration · 0.7133 — the noisiest label in the NIH schema; radiologist agreement is low to begin with.
Pneumonia · 0.7397 — broad definition, overlaps with Consolidation & Infiltration.

Clinical reading

Classes with clear anatomical signatures (cardiomegaly, emphysema) are the easy wins. Diffuse infiltrative patterns are the hard ones, which mirrors radiologists' own lower agreement on those labels.


Benchmark positioning

We match the seminal CheXNet result, at a fraction of the compute.

Context. CheXNet (Rajpurkar et al., 2017) established DenseNet-121 as the standard for this dataset. Our pipeline reproduces and slightly exceeds it on our validation split (0.8459 vs. 0.841 macro AUC-ROC), using a single free Kaggle GPU, 18 epochs, and 7.9M parameters. The custom scratch model is +5.3 pp over the 2017 NIH paper that published the dataset.


What the comparison taught us

Transfer wins — but the scratch model told us why.

	Custom SE-ResNet (scratch)	DenseNet-121 (transfer)
Parameters	~23 M	7.9 M
Epochs to best	60+ (still climbing)	18
Macro AUC-ROC	0.8008	0.8459 (+4.5 pp)
Binary AUC / F1	0.7571 / 0.6474	0.7867 / 0.6736
When it fails	Rare, diffuse, subtle classes — never sees enough examples with random init.	Same failure modes, but with a higher floor — pretrained edges + textures transfer cleanly.
Our read	A strong "sanity baseline" and ablation for the ImageNet-prior question.	Production-grade. We'd ship this if the project had a clinical partner.


What to trust, what not to trust

Honest about the ceiling.

Label quality

NIH labels are image-level, extracted by NLP from reports, not lesion-level.
"Infiltration" & "Pneumonia" overlap even between radiologists. Our lowest AUCs are where ground truth itself is noisy.

Evaluation scope

AUCs reported are on the validation split. Test set is held for the final report.
No external-hospital evaluation. Distribution shift from equipment / preprocessing will degrade performance.

Deployment caveats

Not clinical. This is a research / educational system. Predictions do not substitute for a radiologist.
The binary triage head is the more defensible output — it fails gracefully and can be calibrated to a sensitivity target.
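Calibrating the triage head to a sensitivity target amounts to choosing a decision threshold on validation scores. A minimal sketch follows, assuming an image is flagged abnormal when its score is at or above the threshold. The 0.95 default target and the function names are hypothetical.

```python
import math


def threshold_for_sensitivity(scores, labels, target=0.95):
    """Highest threshold that still flags at least `target` of abnormal cases."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one abnormal example")
    # Number of abnormal cases that must be caught to reach the target.
    needed = max(1, math.ceil(target * len(positives)))
    return positives[needed - 1]


def sensitivity(scores, labels, threshold):
    """Fraction of abnormal cases with score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)
```

Picking the highest threshold that meets the target keeps false alarms as low as the sensitivity constraint allows. The threshold should be chosen on validation data and then frozen before any test-set evaluation.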

Compute envelope

Single free T4 GPU on Kaggle, 30 h/week cap — no hyperparameter sweeps, no cross-validation.
Scratch model did not converge within budget; more epochs may narrow the gap but not close it.


Where this goes next

Three directions, ranked by expected lift.

01 · Test-Time Augmentation + Ensemble

Average 4 views (identity, h-flip, ±7° rotate) per model, then average both models. If the two architectures fail on different examples, the ensemble should improve macro AUC with zero retraining.

02 · External validation

Evaluate on CheXpert or MIMIC-CXR without fine-tuning. Quantify the true distribution-shift penalty; this is the question any clinical reviewer will ask first.

03 · Grad-CAM & interpretability audit

Generate Grad-CAM heatmaps for every class and visually audit whether the model attends to clinically meaningful regions — or to spurious markers (annotations, equipment).
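Direction 01 (test-time augmentation plus ensembling) reduces to two nested averages. In the sketch below, models and views are plain callables, so it runs without any imaging library. The toy list-of-rows image format and all function names are assumptions, and the ±7° rotations would be supplied as two further view callables from an image library.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized probability vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def identity(image):
    """Unmodified view."""
    return image


def hflip(image):
    """Horizontal flip of an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def tta_ensemble(models, views, image):
    """Average each model over all views, then average across models."""
    per_model = []
    for model in models:
        per_view = [model(view(image)) for view in views]
        per_model.append(mean_vectors(per_view))
    return mean_vectors(per_model)
```

Averaging probabilities rather than hard decisions preserves ranking information, which is what AUC-ROC measures.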


— THANK YOU

Questions?

CheXVision · BIG D(ATA) Team · Deep Learning & Big Data, AIN program

Code

github.com/arudaev/chexvision

Demo

hf.co/spaces/arudaev/chexvision-demo

Models

hf.co/arudaev/chexvision-{densenet, scratch}

