CheXVision · Final Presentation Deep Learning & Big Data · 2026
Large-Scale Chest X-Ray Pathology Detection
CheXVision.
Comparing a custom SE-ResNet trained from scratch against a DenseNet-121 transfer-learning pipeline on 112,120 NIH chest X-rays — under identical evaluation conditions.
Program AIN — Deep Learning & Big Data
Dataset NIH Chest X-ray14 · 112,120 images
Tasks 14-label pathology · Binary Normal/Abnormal
Code github.com/arudaev/chexvision
CheXVision · 01 · Motivation
The Problem

Radiologists read billions of chest X-rays a year.
The bottleneck is attention, not imaging.

Chest radiography is the most common medical imaging procedure worldwide. Getting reliable, low-cost triage to every patient — especially in under-served regions — is a true ML-for-healthcare problem.

Scale
2B+
Chest X-rays acquired globally each year.
Coverage gap
2/3
Of the world's population lacks timely access to radiology.
The ask
14
Pathologies to be detected simultaneously from a single frontal view.
01 / 18 · Motivation
CheXVision · 02 · Dataset
NIH Chest X-ray14

A genuine big-data benchmark — 112,120 frontal X-rays, 14 weak labels.

Images
112,120
Frontal-view chest X-rays · 1,024² source, resized to 320².
Shards
36
Parquet shards on HF Hub · ~7.97 GB total.
Labels
14
Image-level, noisy: text-mined from radiology reports by NLP.
Split
Train
78,468
70%
Validation
11,210
10%
Test
22,442
20%
14 Pathology Classes
Atelectasis · Cardiomegaly · Effusion · Infiltration · Mass · Nodule · Pneumonia · Pneumothorax · Consolidation · Edema · Emphysema · Fibrosis · Pleural Thickening · Hernia
Two derived tasks
  • Multi-label — an image may carry zero, one, or several conditions simultaneously.
  • Binary — "No Finding" ↦ Normal, any label ↦ Abnormal (fast triage head).
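As a minimal sketch of how the two targets could be derived from one label string: the class order follows the list above, while the `|` separator and the function name `encode_labels` are assumptions for illustration, not taken from the project code.

```python
# The 14 pathology names, in the order listed on this slide.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural Thickening", "Hernia",
]


def encode_labels(finding_string):
    """Turn a '|'-separated finding string into (multi_hot, binary) targets.

    Assumption: findings are joined with '|' and "No Finding" marks a
    normal image. Any name outside PATHOLOGIES is simply ignored.
    """
    findings = {f.strip() for f in finding_string.split("|") if f.strip()}
    multi_hot = [1 if name in findings else 0 for name in PATHOLOGIES]
    binary = 1 if any(multi_hot) else 0  # 0 = Normal, 1 = Abnormal
    return multi_hot, binary
```

For example, `encode_labels("No Finding")` yields an all-zero vector with binary target 0, while `encode_labels("Edema|Hernia")` sets two positions and binary target 1.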
02 / 18 · NIH Chest X-ray14
CheXVision · 03 · Framing
Dual-Task Architecture

One forward pass, two clinically-aligned predictions.

A shared feature extractor feeds two heads — a fast triage decision and a detailed pathology profile. Trained jointly with a combined loss so both objectives shape the representation.
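One way to read "combined loss" is a weighted sum of the two heads' binary cross-entropy terms. The sketch below is plain Python on probabilities; the weighting factor `binary_weight=0.5` and the function names are assumptions, since the slides do not state how the two terms are balanced.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability/target pair."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def combined_loss(multi_probs, multi_targets, binary_prob, binary_target,
                  binary_weight=0.5):
    """Mean multi-label BCE plus a weighted binary (triage) BCE.

    binary_weight is a hypothetical knob, not a value from the project.
    """
    multi = sum(bce(p, t) for p, t in zip(multi_probs, multi_targets))
    multi /= len(multi_probs)
    return multi + binary_weight * bce(binary_prob, binary_target)
```

Because both terms are computed from the same shared features, gradients from either head flow back into the one backbone.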

03 / 18 · Shared backbone · Two heads
CheXVision · Section 1 · Approach
— PART 01
Approach.
Two architectures, one data pipeline, one evaluation protocol. A custom SE-ResNet trained from scratch vs. DenseNet-121 transferred from ImageNet — a fair, head-to-head comparison under identical conditions.
04 / 18 · Approach
CheXVision · 04 · Model 1
Model 1 · Built from scratch

Custom SE-ResNet · depth [3, 4, 6, 3] · Kaiming init · no pretraining.

Parameters · 23M
  • Residual stages: 4 stages with [3, 4, 6, 3] blocks, the stage layout shared by ResNet-34 and ResNet-50.
  • Squeeze-Excitation — per-block channel attention recalibrates features.
  • Dual heads share a 512-dim global-pooled representation.
  • Purpose — strong baseline trained without outside data.
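The Squeeze-Excitation step named above can be sketched in plain Python: squeeze each channel to its mean, pass the means through a small two-layer gate, and rescale the channel by the resulting value. The function name, the weight layout, and the omission of biases are assumptions for illustration; the project's own block is presumably a framework module.

```python
import math


def se_recalibrate(feature_maps, w_reduce, w_expand):
    """Apply squeeze-excitation to a list of channels.

    feature_maps: C channels, each a flat list of activations.
    w_reduce:     R x C weights of the bottleneck layer (biases omitted).
    w_expand:     C x R weights of the gating layer (biases omitted).
    """
    # Squeeze: global average pool per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: reduce -> ReLU -> expand -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]
```

With all-zero weights every gate is 0.5, so each channel is simply halved; learned weights let the network boost some channels and suppress others.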
05 / 18 · Custom residual CNN
CheXVision · 05 · Model 2
Model 2 · Transfer Learning (CheXNet)

DenseNet-121 · ImageNet-pretrained · dense connectivity, 7.9M parameters.

Parameters · 7.9M
  • Dense connectivity — every layer sees all prior feature maps.
  • ImageNet init — strong visual priors before medical fine-tuning.
  • Feature layer 1024 → 512, ReLU, Dropout 0.3.
  • Follows CheXNet (Rajpurkar et al., 2017) — our gold-standard reference.
06 / 18 · DenseNet-121 transfer
CheXVision · 06 · Fine-Tuning
Phased Fine-Tuning Schedule

Freeze the backbone first. Unfreeze once the heads have caught up.

Standard transfer-learning hygiene: a high learning rate for fresh heads destroys the pretrained backbone. We split training in two.

Phase 1 · Epochs 1–5
Backbone frozen. Only the new feature layer + two heads train, at lr = 1e-3. Heads get calibrated on medical features without corrupting ImageNet weights.
Phase 2 · Epochs 6–60
End-to-end fine-tuning at lr = 1e-4. Every layer now adapts to chest X-ray statistics; cosine schedule keeps convergence stable.
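The two phases can be written as a small schedule function. This is a sketch under stated assumptions: the epoch boundaries and base learning rates come from the slide, while the cosine floor `MIN_LR = 0.0` and the choice to start the cosine decay at epoch 6 are guesses, not confirmed details.

```python
import math

PHASE1_EPOCHS = 5    # slide: epochs 1-5, backbone frozen
TOTAL_EPOCHS = 60    # slide: phase 2 runs through epoch 60
PHASE1_LR = 1e-3     # slide: heads-only learning rate
PHASE2_LR = 1e-4     # slide: end-to-end learning rate
MIN_LR = 0.0         # assumption: cosine floor is not given on the slide


def schedule(epoch):
    """Return (backbone_trainable, learning_rate) for a 1-indexed epoch."""
    if not 1 <= epoch <= TOTAL_EPOCHS:
        raise ValueError("epoch out of range")
    if epoch <= PHASE1_EPOCHS:
        return False, PHASE1_LR
    steps = TOTAL_EPOCHS - PHASE1_EPOCHS   # length of phase 2
    t = epoch - PHASE1_EPOCHS - 1          # 0 at the first unfrozen epoch
    lr = MIN_LR + 0.5 * (PHASE2_LR - MIN_LR) * (1 + math.cos(math.pi * t / steps))
    return True, lr
```

In a training loop, the boolean would toggle `requires_grad` on the backbone parameters and the rate would be written into the optimizer's parameter groups.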
07 / 18 · Freeze → Unfreeze
CheXVision · 07 · Pipeline
End-to-end training pipeline

From HF dataset snapshot to a tagged checkpoint on HF Hub.

Step 01 · Data
snapshot_download pins a Parquet revision; 78,468 / 11,210 / 22,442 split.
Step 02 · Augmentation
CLAHE (LAB space), HFlip, Rotate ±15°, RandomAffine, ColorJitter, GaussianBlur, RandomErasing.
Step 03 · Forward pass
Mixed precision torch.cuda.amp · fp16 — ~2× throughput on T4.
Step 04 · Loss & optimisation
Weighted BCE with pos_weight per class · AdamW · CosineAnnealingLR · grad clip 1.0 · grad accum ×4 → effective batch 96.
Step 05 · Checkpoint + Hub
Best epoch by val macro AUC-ROC; model, config, history & card pushed to HF Hub.
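Two numbers in Step 04 are easy to make concrete. The sketch below computes a per-class `pos_weight` as the negative-to-positive ratio, which is a common convention and an assumption here, since the slide does not say how the weights are derived. The micro-batch of 24 is likewise inferred from 96 / 4 rather than stated.

```python
def pos_weights(label_rows):
    """Per-class weight = (#negatives / #positives) over multi-hot rows.

    Assumption: this ratio is the weighting rule; rare classes get a
    larger weight on their positive examples.
    """
    n = len(label_rows)
    num_classes = len(label_rows[0])
    weights = []
    for c in range(num_classes):
        positives = sum(row[c] for row in label_rows)
        negatives = n - positives
        weights.append(negatives / max(positives, 1))  # guard empty classes
    return weights


MICRO_BATCH = 24     # inferred: 96 / 4, not stated on the slide
ACCUM_STEPS = 4      # slide: grad accum x4
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS
```

With gradient accumulation, the optimizer steps once every `ACCUM_STEPS` micro-batches, so memory use follows the micro-batch while the update reflects the effective batch.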
08 / 18 · Snapshot → Augment → Train → Checkpoint → Hub
CheXVision · 08 · Infrastructure
The "Big Data" side of the project

A cloud-only workflow: 8 GB of shards, no local GPU, reproducible runs.

Source code
GitHub
CI on every push — ruff + pytest (51 tests) + mypy.
Dataset
HF Hub
36 Parquet shards · 320² · pinned revision for reproducibility.
Training compute
Kaggle Kernels
Free T4 / P100 GPU · 30 h/week · kernels bundle the entire source tree.
Model artifacts · Demo
HF Hub + Spaces
Auto-upload of checkpoints; Streamlit demo deploys on push to main.
Why this matters for "Big Data"
  • Streaming metadata — EDA runs locally without downloading the 8 GB dataset.
  • Sharded Parquet — readers parallelise across shards; disk is never the bottleneck.
  • Mixed precision + grad accumulation — effective batch 96 on a single 16 GB T4.
Reproducibility guardrails
  • Pinned dataset revision (44443e6…): same bytes, every run.
  • Kernel payload: dispatch.py embeds the current source tree base64-encoded.
  • Tagged model cards — config + metrics + history.json alongside every checkpoint.
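The kernel-payload idea can be illustrated with the standard library alone: pack a directory into an in-memory tar.gz, base64-encode it into a string that fits inside a script, and unpack it on the other side. The function names `pack_tree` and `unpack_tree` are hypothetical, and the archive format is an assumption; the actual dispatch.py may do this differently.

```python
import base64
import io
import tarfile


def pack_tree(root):
    """Return the directory at `root` as a base64-encoded tar.gz string."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        archive.add(root, arcname=".")
    return base64.b64encode(buffer.getvalue()).decode("ascii")


def unpack_tree(payload, destination):
    """Decode a string produced by pack_tree and extract it."""
    data = base64.b64decode(payload)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        archive.extractall(destination)
```

Embedding the payload in the kernel script means the remote run executes exactly the source tree that was dispatched, with no separate clone step.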
09 / 18 · GitHub · HF Hub · Kaggle · Streamlit
CheXVision · Section 2 · Results
— PART 02
Results.
Measured on the held-out validation split. Primary metric: macro-averaged AUC-ROC across all 14 pathologies. We also report binary AUC / F1 for the triage head.
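For readers unfamiliar with the primary metric, here is a dependency-free sketch of macro-averaged AUC-ROC: per-class AUC is the fraction of (positive, negative) pairs ranked correctly, with ties counted as half, and the macro score is the unweighted mean over classes. The pairwise loop is quadratic and meant only for illustration; skipping classes that lack positives or negatives is an assumption about how edge cases are handled.

```python
def auc_roc(scores, targets):
    """AUC as the probability a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    if not pos or not neg:
        return None  # undefined when a class has only one label value
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_rows, target_rows):
    """Unweighted mean of per-class AUCs over the classes where it is defined."""
    per_class = []
    for c in range(len(target_rows[0])):
        auc = auc_roc([r[c] for r in score_rows], [r[c] for r in target_rows])
        if auc is not None:
            per_class.append(auc)
    return sum(per_class) / len(per_class)
```

Macro averaging gives a rare class such as Hernia the same weight as a common one such as Infiltration, which is why it is a demanding headline number on an imbalanced dataset.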
10 / 18 · Results
CheXVision · 09 · Headline
Headline — best validation metrics

DenseNet-121 matches the CheXNet benchmark. The custom CNN trails by ~3.2 pp.

DenseNet-121 · Transfer
Best @ epoch 18
Macro AUC-ROC
0.8459
Binary AUC-ROC
0.7867
Binary F1
0.6736
Early-stopped — validation loss began diverging after epoch 20 while AUC plateaued.
Custom SE-ResNet · From scratch
Best @ epoch 41
Macro AUC-ROC
0.8141
Binary AUC-ROC
0.7739
Binary F1
0.6587
Both models were trained with the identical recipe (CLAHE preprocessing + label smoothing), so the comparison is fully controlled.
Takeaway. Transfer learning wins on every metric, with 3× fewer parameters and 2.3× fewer epochs to its best checkpoint (18 vs. 41). ImageNet priors matter, even for medical imagery.
11 / 18 · Best validation metrics
CheXVision · 10 · Training curves
Validation macro AUC-ROC over training

DenseNet saturates fast, peaking at epoch 18. The scratch CNN converges at epoch 41.

DenseNet-121 · Transfer
Custom SE-ResNet · Scratch
Best checkpoint
CheXNet benchmark (0.841)
12 / 18 · Validation macro AUC-ROC
CheXVision · 11 · Per-class
DenseNet-121 · per-pathology AUC at best epoch

Strong on structural findings. Weakest on the noisiest labels.

Top performers
  • Edema · 0.9255, Hernia · 0.9242, Emphysema · 0.9107 — visually distinctive, well-separated classes.
Trailing
  • Infiltration · 0.7133 — the noisiest label in the NIH schema; radiologist agreement is low to begin with.
  • Pneumonia · 0.7397 — broad definition, overlaps with Consolidation & Infiltration.
Clinical reading

Classes with clear anatomical signatures (cardiomegaly, emphysema) are the easy wins. Diffuse infiltrative patterns are the hard ones — matches radiologists' own reliability.

13 / 18 · 14-class AUC-ROC
CheXVision · 12 · Benchmark
Benchmark positioning

We match the seminal CheXNet result, at a fraction of the compute.

Bar chart, macro AUC-ROC, axis 0.70 to 0.90, CheXNet benchmark line at 0.841:
  • Wang et al. 2017 (original NIH paper) · 0.748
  • CheXVision · SE-ResNet (scratch) · 0.8141
  • CheXNet (Rajpurkar et al., 2017) · 0.841
  • CheXVision · DenseNet-121 (ours) · 0.8459
Context. CheXNet (Rajpurkar et al., 2017) established DenseNet-121 as the standard for this dataset. Our pipeline reproduces & slightly exceeds it — on a single free Kaggle GPU, 18 epochs, 7.9M params. The custom scratch model is +6.6 pp over the 2017 NIH paper that published the dataset.
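The percentage-point gaps quoted in this deck follow directly from the reported scores. The short check below recomputes them; the dictionary keys and the helper name `gap_pp` are illustrative, and the numbers are the ones shown on the slides.

```python
# Macro AUC-ROC values as reported on the benchmark slide.
MACRO_AUC = {
    "wang_2017": 0.748,
    "chexnet": 0.841,
    "se_resnet_scratch": 0.8141,
    "densenet121_transfer": 0.8459,
}


def gap_pp(model, reference):
    """Difference in macro AUC-ROC, in percentage points."""
    return round((MACRO_AUC[model] - MACRO_AUC[reference]) * 100, 2)


transfer_vs_scratch = gap_pp("densenet121_transfer", "se_resnet_scratch")
scratch_vs_wang = gap_pp("se_resnet_scratch", "wang_2017")
transfer_vs_chexnet = gap_pp("densenet121_transfer", "chexnet")
```

These come out at about 3.18, 6.61 and 0.49 pp respectively, matching the rounded "+3.2 pp" and "+6.6 pp" figures. Note that the reference numbers were measured on their authors' own splits, so the gaps are indicative rather than strictly comparable.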
14 / 18 · Macro AUC-ROC comparison
CheXVision · 13 · Lesson
What the comparison taught us

Transfer wins — but the scratch model told us why.

| | Custom SE-ResNet (scratch) | DenseNet-121 (transfer) |
| Parameters | ~23 M | 7.9 M |
| Epochs to best | 41 | 18 |
| Macro AUC-ROC | 0.8141 | 0.8459 (+3.2 pp) |
| Binary AUC / F1 | 0.7739 / 0.6587 | 0.7867 / 0.6736 |
| When it fails | Rare, diffuse, subtle classes: never sees enough examples with random init. | Same failure modes, but with a higher floor: pretrained edges + textures transfer cleanly. |
| Our read | A strong "sanity baseline" and ablation for the ImageNet-prior question. | Our strongest model, research-grade. Informative, but not clinically actionable: held-out test and external validation come first. |
15 / 18 · Head-to-head comparison
CheXVision · 14 · Limitations
What to trust, what not to trust

Honest about the ceiling.

Label quality
  • NIH labels are image-level, extracted by NLP from reports, not lesion-level.
  • "Infiltration" & "Pneumonia" overlap even between radiologists. Our lowest AUCs are where ground truth itself is noisy.
Evaluation scope
  • AUCs reported are on the validation split. Test set is held for the final report.
  • No external-hospital evaluation. Distribution shift from equipment / preprocessing will degrade performance.
Deployment caveats
  • Not clinical. This is a research / educational system. Predictions do not substitute a radiologist.
  • The binary triage head is the more defensible output — it fails gracefully and can be calibrated to a sensitivity target.
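Calibrating the triage head to a sensitivity target amounts to choosing a decision threshold on held-out scores. The sketch below picks the highest threshold that still catches the required share of abnormal cases; the function name and the default 0.95 target are assumptions for illustration, not project settings.

```python
import math


def threshold_for_sensitivity(scores, targets, target_sensitivity=0.95):
    """Highest threshold t such that predicting Abnormal when score >= t
    reaches at least `target_sensitivity` on the abnormal examples."""
    positives = sorted((s for s, t in zip(scores, targets) if t == 1),
                       reverse=True)
    if not positives:
        raise ValueError("need at least one abnormal example")
    # Number of abnormal cases that must score at or above the threshold.
    k = max(1, math.ceil(target_sensitivity * len(positives)))
    return positives[k - 1]
```

The threshold should be chosen on validation data and then frozen; specificity at that operating point is whatever the score distribution allows, which is the trade-off a triage deployment has to accept.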
Compute envelope
  • Single free T4 GPU on Kaggle, 30 h/week cap — no hyperparameter sweeps, no cross-validation.
  • Both models trained once each, with no repeat runs, so the 3.2 pp gap carries no confidence interval.
16 / 18 · Limitations
CheXVision · 15 · Future work
Where this goes next

Three directions, ranked by expected lift.

01 · Test-Time Augmentation + Ensemble
Average 4 views (identity, h-flip, ±7° rotate) per model, then average both models. If the two architectures fail on different examples, the ensemble should improve macro AUC with zero retraining.
02 · External validation
Evaluate on CheXpert or MIMIC-CXR without fine-tuning. Quantify the true distribution-shift penalty; this is the question any clinical reviewer will ask first.
03 · Grad-CAM & interpretability audit
Generate Grad-CAM heatmaps for every class and visually audit whether the model attends to clinically meaningful regions — or to spurious markers (annotations, equipment).
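The first direction above, test-time augmentation plus ensembling, reduces to averaging predictions over every (model, view) pair. In the sketch below models and views are plain callables, and only the identity and horizontal-flip views are implemented; the ±7° rotations would need an image library and are left out. All names are hypothetical.

```python
def identity(image):
    """Return the image unchanged."""
    return image


def hflip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def tta_ensemble(models, views, image):
    """Average per-class probabilities over every (model, view) pair.

    models: callables mapping an image to a list of class probabilities.
    views:  callables mapping an image to a transformed image.
    """
    predictions = [model(view(image)) for model in models for view in views]
    num_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / len(predictions)
            for c in range(num_classes)]
```

Because every model sees the same number of views, this flat mean equals averaging the views per model first and then averaging the models, which is the order described on the slide.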
17 / 18 · Future work
CheXVision · Q & A
— THANK YOU
Questions?
CheXVision · Deep Learning & Big Data, AIN program
Code
github.com/arudaev/chexvision
Demo
hf.co/spaces/arudaev/chexvision-demo
Models
hf.co/arudaev/chexvision-{densenet, scratch}
18 / 18 · Thank you