CheXVision · Final Presentation · Deep Learning & Big Data · 2026
Large-Scale Chest X-Ray Pathology Detection
CheXVision.
Comparing a custom SE-ResNet trained from scratch against a DenseNet-121 transfer-learning pipeline on 112,120 NIH chest X-rays — under identical evaluation conditions.
Team BIG D(ATA) — AIN Program
Dataset NIH Chest X-ray14 · 112,120 images
Tasks 14-label pathology · Binary Normal/Abnormal
Code github.com/arudaev/chexvision
CheXVision · 01 · Motivation
The Problem

Radiologists read billions of chest X-rays a year.
The bottleneck is attention, not imaging.

Chest radiography is the most common medical imaging procedure worldwide. Getting reliable, low-cost triage to every patient — especially in under-served regions — is a true ML-for-healthcare problem.

Scale
2B+
Chest X-rays acquired globally each year.
Coverage gap
2/3
Of the world's population lacks timely access to radiology.
The ask
14
Pathologies to be detected simultaneously from a single frontal view.
01 / 19 · Motivation
CheXVision · 02 · Dataset
NIH Chest X-ray14

A genuine big-data benchmark — 112,120 frontal X-rays, 14 weak labels.

Images
112,120
Frontal-view chest X-rays · 1,024² source, resized to 320².
Shards
36
Parquet shards on HF Hub · ~7.97 GB total.
Labels
14
Patient-level, noisy — extracted from reports by NLP.
Split
  • Train · 78,468 (70%)
  • Validation · 11,210 (10%)
  • Test · 22,442 (20%)
14 Pathology Classes
Atelectasis · Cardiomegaly · Effusion · Infiltration · Mass · Nodule · Pneumonia · Pneumothorax · Consolidation · Edema · Emphysema · Fibrosis · Pleural Thickening · Hernia
Two derived tasks
  • Multi-label — an image may carry zero, one, or several conditions simultaneously.
  • Binary — "No Finding" ↦ Normal, any label ↦ Abnormal (fast triage head).
02 / 19 · NIH Chest X-ray14
CheXVision · 03 · Framing
Dual-Task Architecture

One forward pass, two clinically-aligned predictions.

A shared feature extractor feeds two heads — a fast triage decision and a detailed pathology profile. Trained jointly with a combined loss so both objectives shape the representation.

Input (3 × 320 × 320 RGB) → Shared backbone (SE-ResNet or DenseNet-121): conv layers → global avg pool → 512-dim feature vector → Multi-label head (Linear 512 → 14, sigmoid; weighted BCE, 14 pathologies) + Binary head (Linear 512 → 1, sigmoid; BCE, Normal vs. Abnormal) → Combined loss: 1.0 · ℒ_ml + 0.5 · ℒ_bin
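A minimal PyTorch sketch of this dual-task wiring, assuming a backbone module that already ends in global average pooling; class and function names here are illustrative, not the project's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadModel(nn.Module):
    """Shared feature extractor feeding a 14-way multi-label head and a binary triage head."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 512, n_classes: int = 14):
        super().__init__()
        self.backbone = backbone              # conv stack -> global avg pool -> feat_dim vector
        self.multilabel_head = nn.Linear(feat_dim, n_classes)
        self.binary_head = nn.Linear(feat_dim, 1)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)              # (B, feat_dim)
        return self.multilabel_head(feats), self.binary_head(feats)   # raw logits

def combined_loss(ml_logits, bin_logits, ml_targets, bin_targets):
    # 1.0 * L_ml + 0.5 * L_bin; the sigmoid is folded into BCE-with-logits for stability
    l_ml = F.binary_cross_entropy_with_logits(ml_logits, ml_targets)
    l_bin = F.binary_cross_entropy_with_logits(bin_logits, bin_targets)
    return 1.0 * l_ml + 0.5 * l_bin
```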
03 / 19 · Shared backbone · Two heads
CheXVision · Section 1 · Approach
— PART 01
Approach.
Two architectures, one data pipeline, one evaluation protocol. A custom SE-ResNet trained from scratch vs. DenseNet-121 transferred from ImageNet — a fair, head-to-head comparison under identical conditions.
04 / 19 · Approach
CheXVision · 04 · Model 1
Model 1 · Built from scratch

Custom SE-ResNet · depth [3, 4, 6, 3] · Kaiming init · no pretraining.

Input (3 × 320 × 320) → Stem: 7×7 conv, BN, ReLU (3 → 64 ch), MaxPool ÷2 → Stage 1: 3× SE-ResBlock, 64 ch → Stage 2 (↓½): 4× SE-ResBlock, 128 ch → Stage 3 (↓½): 6× SE-ResBlock, 256 ch → Stage 4 (↓½): 3× SE-ResBlock, 512 ch → Global avg pool → Dropout(0.5) → 512-dim → Multi-label head (Linear 512 → 14, sigmoid; 14 pathologies) + Binary head (Linear 512 → 1, sigmoid; Normal/Abnormal)
Parameters · 23M
  • Residual stages — 4 stages, ResNet-50 topology.
  • Squeeze-Excitation — per-block channel attention recalibrates features (sketched after this list).
  • Dual heads share a 512-dim global-pooled representation.
  • Purpose — strong baseline trained without outside data.
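A sketch of the per-block SE module named above; the reduction ratio of 16 is the conventional default from the SE paper and an assumption here, not a confirmed project hyperparameter.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: squeeze to one value per channel,
    excite through a bottleneck MLP, gate each channel in (0, 1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates                      # recalibrate each channel
```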
04 / 19 · Custom residual CNN
CheXVision · 05 · Model 2
Model 2 · Transfer Learning (CheXNet)

DenseNet-121 · ImageNet-pretrained · dense connectivity, 7.9M parameters.

Input (3 × 320 × 320) → DenseNet-121 backbone (ImageNet-pretrained, dense connectivity, 7.9M parameters) → Adaptive avg pool → 1024-dim features → Feature layer: Linear 1024 → 512, ReLU, Dropout(0.3) → Multi-label head (Linear 512 → 14, sigmoid; 14 pathologies) + Binary head (Linear 512 → 1, sigmoid; Normal/Abnormal)
Parameters · 7.9M
  • Dense connectivity — every layer sees all prior feature maps.
  • ImageNet init — strong visual priors before medical fine-tuning.
  • Feature layer 1024 → 512, ReLU, Dropout 0.3 (sketched after this list).
  • Follows CheXNet (Rajpurkar et al., 2017) — our gold-standard reference.
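A hedged torchvision sketch of this transfer setup (the weights enum assumes torchvision ≥ 0.13; head names are illustrative).

```python
import torch
import torch.nn as nn
from torchvision import models

class DenseNetDualHead(nn.Module):
    def __init__(self, n_classes: int = 14):
        super().__init__()
        # ImageNet-pretrained DenseNet-121; drop its 1000-way ImageNet classifier
        self.backbone = models.densenet121(
            weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        self.backbone.classifier = nn.Identity()   # forward now yields 1024-dim pooled features
        self.feature = nn.Sequential(              # feature layer from the slide
            nn.Linear(1024, 512), nn.ReLU(inplace=True), nn.Dropout(0.3))
        self.multilabel_head = nn.Linear(512, n_classes)
        self.binary_head = nn.Linear(512, 1)

    def forward(self, x: torch.Tensor):
        feats = self.feature(self.backbone(x))     # (B, 512)
        return self.multilabel_head(feats), self.binary_head(feats)
```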
05 / 19 · DenseNet-121 transfer
CheXVision · 06 · Fine-Tuning
Phased Fine-Tuning Schedule

Freeze the backbone first. Unfreeze once the heads have caught up.

Standard transfer-learning hygiene: a high learning rate for fresh heads destroys the pretrained backbone. We split training in two.

Phase 1 (epochs 1–5): backbone frozen, train heads only, lr = 0.001 → epoch 6: unfreeze_backbone() → Phase 2 (epochs 6–60): end-to-end fine-tuning, all layers trainable, lr = 0.0001
Phase 1 · Epochs 1–5
Backbone frozen. Only the new feature layer + two heads train, at lr = 1e-3. Heads get calibrated on medical features without corrupting ImageNet weights.
Phase 2 · Epochs 6–60
End-to-end fine-tuning at lr = 1e-4. Every layer now adapts to chest X-ray statistics; cosine schedule keeps convergence stable.
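A sketch of the two-phase schedule, assuming the dual-head model above; train() stands in for the usual epoch loop and is hypothetical, as is set_backbone_trainable (the project exposes unfreeze_backbone()).

```python
import torch

def set_backbone_trainable(model, trainable: bool) -> None:
    # Stand-in for the project's unfreeze_backbone(): freeze or thaw every backbone weight
    for p in model.backbone.parameters():
        p.requires_grad = trainable

# Phase 1 (epochs 1-5): heads only, lr = 1e-3
set_backbone_trainable(model, False)
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
train(model, opt, epochs=range(1, 6))              # hypothetical epoch loop

# Phase 2 (epochs 6-60): end-to-end, lr = 1e-4, cosine schedule
set_backbone_trainable(model, True)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=55)
train(model, opt, scheduler=sched, epochs=range(6, 61))
```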
06 / 19 · Freeze → Unfreeze
CheXVision · 07 · Pipeline
End-to-end training pipeline

From HF dataset snapshot to a tagged checkpoint on HF Hub.

arudaev/chest-xray-14-320 (112,120 images · 36 shards · ~7.97 GB) → snapshot_download → data/images · data/labels.csv (train 78,468 · val 11,210 · test 22,442) → Augmentation pipeline: CLAHE · HFlip · Rotate ±15° · RandomAffine · ColorJitter · GaussianBlur · RandomErasing → Model forward pass under torch.cuda.amp.autocast (fp16): multilabel_logits B×14 (weighted BCE with per-class pos_weight) + binary_logits B×1 (BCE, Normal vs. Abnormal) → Combined loss: 1.0 × multilabel + 0.5 × binary → Backward: grad clip 1.0, grad accum ×4 (effective batch 96), AdamW, CosineAnnealingLR, early stop patience = 15 on best val AUC-ROC → Best checkpoint → HF Hub (model · config · history.json)
Step 01 · Data
snapshot_download pins a Parquet revision; 78,468 / 11,210 / 22,442 split.
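The pinning step itself is a one-liner against the real huggingface_hub API; the commit hash is truncated on the infrastructure slide, so a placeholder stands in here.

```python
from huggingface_hub import snapshot_download

# Pin one dataset revision so every run trains on identical bytes
local_dir = snapshot_download(
    repo_id="arudaev/chest-xray-14-320",
    repo_type="dataset",
    revision="<pinned-commit-sha>",   # the deck shows it truncated: 44443e6...
)
```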
Step 02 · Augmentation
CLAHE (LAB space), HFlip, Rotate ±15°, RandomAffine, ColorJitter, GaussianBlur, RandomErasing.
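One plausible composition of that pipeline: CLAHE is not in torchvision, so an OpenCV helper covers the LAB-space step; magnitudes other than the rotate ±15° from the slide are assumptions.

```python
import cv2
import numpy as np
import torchvision.transforms as T

def clahe_lab(img: np.ndarray, clip: float = 2.0, grid: int = 8) -> np.ndarray:
    """CLAHE on the L channel in LAB space; uint8 RGB in, uint8 RGB out."""
    l, a, b = cv2.split(cv2.cvtColor(img, cv2.COLOR_RGB2LAB))
    l = cv2.createCLAHE(clipLimit=clip, tileGridSize=(grid, grid)).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2RGB)

train_tf = T.Compose([
    T.Lambda(lambda img: clahe_lab(np.asarray(img.convert("RGB")))),  # PIL -> ndarray
    T.ToPILImage(),
    T.RandomHorizontalFlip(),
    T.RandomRotation(15),                                  # rotate +/- 15 degrees
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.GaussianBlur(kernel_size=3),
    T.ToTensor(),
    T.RandomErasing(p=0.25),                               # operates on the tensor
])
```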
Step 03 · Forward pass
Mixed precision torch.cuda.amp · fp16 — ~2× throughput on T4.
Step 04 · Loss & optimisation
Weighted BCE with pos_weight per class · AdamW · CosineAnnealingLR · grad clip 1.0 · grad accum ×4 → effective batch 96.
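A sketch of how those pieces could fit in one training step; model, optimizer, loader and the N×14 label matrix labels are assumed to exist, and the per-step batch of 24 follows from the slide's ×4 accumulation to an effective 96.

```python
import torch

# pos_weight_k = negatives_k / positives_k upweights rare pathologies per class
pos_weight = ((labels.shape[0] - labels.sum(0)) / labels.sum(0)).cuda()   # labels: N x 14
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

scaler = torch.cuda.amp.GradScaler()
accum = 4                                       # batch 24 x 4 -> effective batch 96

for step, (x, y_ml, y_bin) in enumerate(loader):
    with torch.cuda.amp.autocast():             # fp16 forward + loss
        ml_logits, bin_logits = model(x.cuda())
        loss = criterion(ml_logits, y_ml.cuda()) + 0.5 * \
            torch.nn.functional.binary_cross_entropy_with_logits(bin_logits, y_bin.cuda())
    scaler.scale(loss / accum).backward()       # accumulate scaled gradients
    if (step + 1) % accum == 0:
        scaler.unscale_(optimizer)              # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```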
Step 05 · Checkpoint + Hub
Best epoch by val macro AUC-ROC; model, config, history & card pushed to HF Hub.
07 / 19 · Snapshot → Augment → Train → Checkpoint → Hub
CheXVision · 08 · Infrastructure
The "Big Data" side of the project

A cloud-only workflow: 8 GB of shards, no local GPU, reproducible runs.

Source code
GitHub
CI on every push — ruff + pytest (51 tests) + mypy.
Dataset
HF Hub
36 Parquet shards · 320² · pinned revision for reproducibility.
Training compute
Kaggle Kernels
Free T4 / P100 GPU · 30 h/week · kernels bundle the entire source tree.
Model artifacts · Demo
HF Hub + Spaces
Auto-upload of checkpoints; Streamlit demo deploys on push to main.
Why this matters for "Big Data"
  • Streaming metadata — EDA runs locally without downloading the 8 GB dataset (see the sketch after this list).
  • Sharded Parquet — readers parallelise across shards; disk is never the bottleneck.
  • Mixed precision + grad accumulation — effective batch 96 on a single 16 GB T4.
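One way to do the streaming EDA flagged above, via the datasets library's streaming mode; the column name is a guess, not the dataset's confirmed schema.

```python
from collections import Counter
from datasets import load_dataset

# Iterate shards lazily: rows are fetched as needed, never the full 8 GB up front
ds = load_dataset("arudaev/chest-xray-14-320", split="train", streaming=True)

counts = Counter()
for i, row in enumerate(ds):
    counts.update(row["labels"])    # hypothetical label column
    if i >= 10_000:                 # a sample is enough for quick label stats
        break
print(counts.most_common(14))
```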
Reproducibility guardrails
  • Pinned dataset revision 44443e6… — same bytes, every run.
  • Kernel payload — dispatch.py embeds the current source tree, base64-encoded.
  • Tagged model cards — config + metrics + history.json alongside every checkpoint.
08 / 19 · GitHub · HF Hub · Kaggle · Streamlit
CheXVision · Section 2 · Results
— PART 02
Results.
Measured on the held-out validation split. Primary metric: macro-averaged AUC-ROC across all 14 pathologies. We also report binary AUC / F1 for the triage head.
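For concreteness, the primary metric computed end to end; the random arrays are synthetic stand-ins for the validation labels and sigmoid scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 14))                   # stand-in label matrix
y_score = np.clip(y_true * 0.6 + rng.random((1000, 14)) * 0.5, 0, 1)

macro_auc = roc_auc_score(y_true, y_score, average="macro")    # primary metric
per_class = [roc_auc_score(y_true[:, k], y_score[:, k]) for k in range(14)]
assert np.isclose(macro_auc, np.mean(per_class))               # macro = mean of per-class AUCs
```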
09 / 19 · Results
CheXVision · 09 · Headline
Headline — best validation metrics

DenseNet-121 matches the CheXNet benchmark. The custom CNN trails by ~4.5 pp.

DenseNet-121 · Transfer
Best @ epoch 18
Macro AUC-ROC
0.8459
Binary AUC-ROC
0.7867
Binary F1
0.6736
Early-stopped — validation loss began diverging after epoch 20 while AUC plateaued.
Custom SE-ResNet · From scratch
Best @ epoch 60
Macro AUC-ROC
0.8008
Binary AUC-ROC
0.7571
Binary F1
0.6474
Still climbing slowly at epoch 75 — gap is about compute budget, not ceiling.
Takeaway. Transfer learning wins on every metric, with roughly 3× fewer parameters and 3× faster wall-clock convergence (18 vs. 60+ epochs). ImageNet priors matter, even for medical imagery.
09 / 19 · Best validation metrics
CheXVision · 10 · Training curves
Validation macro AUC-ROC over training

DenseNet saturates fast at epoch 6. The scratch CNN grinds upward for 60 epochs.

[Chart: validation macro AUC-ROC per epoch · DenseNet-121 (transfer) vs. custom SE-ResNet (scratch) · best checkpoints marked · CheXNet benchmark line at 0.841]
10 / 19 · Validation macro AUC-ROC
CheXVision · 11 · Per-class
DenseNet-121 · per-pathology AUC at best epoch

Strong on structural findings. Weakest on the noisiest labels.

Top performers
  • Edema · 0.9255, Hernia · 0.9242, Emphysema · 0.9107 — visually distinctive, well-separated classes.
Trailing
  • Infiltration · 0.7133 — the noisiest label in the NIH schema; radiologist agreement is low to begin with.
  • Pneumonia · 0.7397 — broad definition, overlaps with Consolidation & Infiltration.
Clinical reading

Classes with clear anatomical signatures (cardiomegaly, emphysema) are the easy wins. Diffuse infiltrative patterns are the hard ones — matches radiologists' own reliability.

11 / 19 · 14-class AUC-ROC
CheXVision · 12 · Benchmark
Benchmark positioning

We match the seminal CheXNet result, at a fraction of the compute.

Macro AUC-ROC:
  • Wang et al. 2017 (original NIH paper) · 0.748
  • CheXVision · SE-ResNet (scratch) · 0.8008
  • CheXNet (Rajpurkar et al., 2017) · 0.841
  • CheXVision · DenseNet-121 (ours) · 0.8459
Context. CheXNet (Rajpurkar et al., 2017) established DenseNet-121 as the standard for this dataset. Our pipeline reproduces & slightly exceeds it — on a single free Kaggle GPU, 18 epochs, 7.9M params. The custom scratch model is +5.3 pp over the 2017 NIH paper that published the dataset.
12 / 19 · Macro AUC-ROC comparison
CheXVision · 13 · Lesson
What the comparison taught us

Transfer wins — but the scratch model told us why.

Custom SE-ResNet (scratch) vs. DenseNet-121 (transfer)
  • Parameters · ~23 M vs. 7.9 M
  • Epochs to best · 60+ (still climbing) vs. 18
  • Macro AUC-ROC · 0.8008 vs. 0.8459 (+4.5 pp)
  • Binary AUC / F1 · 0.7571 / 0.6474 vs. 0.7867 / 0.6736
  • When it fails · Scratch: rare, diffuse, subtle classes — never sees enough examples with random init. Transfer: same failure modes, but with a higher floor — pretrained edges + textures transfer cleanly.
  • Our read · Scratch: a strong "sanity baseline" and ablation for the ImageNet-prior question. Transfer: production-grade — we'd ship this if the project had a clinical partner.
13 / 19 · Head-to-head comparison
CheXVision · 14 · Limitations
What to trust, what not to trust

Honest about the ceiling.

Label quality
  • NIH labels are patient-level, extracted by NLP from reports — not lesion-level.
  • "Infiltration" & "Pneumonia" overlap even between radiologists. Our lowest AUCs are where ground truth itself is noisy.
Evaluation scope
  • AUCs reported are on the validation split. Test set is held for the final report.
  • No external-hospital evaluation. Distribution shift from equipment / preprocessing will degrade performance.
Deployment caveats
  • Not clinical. This is a research / educational system; predictions do not substitute for a radiologist.
  • The binary triage head is the more defensible output — it fails gracefully and can be calibrated to a sensitivity target (see the sketch below).
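A sketch of that calibration: pick the highest validation threshold whose sensitivity reaches the target (function name is illustrative).

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_sensitivity(y_true, y_prob, target_tpr: float = 0.95) -> float:
    """Lowest-FPR threshold whose TPR (sensitivity) reaches the target;
    Abnormal is the positive class."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    idx = int(np.argmax(tpr >= target_tpr))   # first index meeting the target
    return float(thresholds[idx])
```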
Compute envelope
  • Single free T4 GPU on Kaggle, 30 h/week cap — no hyperparameter sweeps, no cross-validation.
  • Scratch model did not converge within budget; more epochs may narrow the gap but not close it.
14 / 19 · Limitations
CheXVision · 15 · Future work
Where this goes next

Three directions, ranked by expected lift.

01 · Test-Time Augmentation + Ensemble
Average 4 views (identity, h-flip, ±7° rotate) per model, then average both models. The two architectures fail on different examples — the ensemble consistently improves macro AUC with zero retraining (sketched after this list).
02 · External validation
Evaluate on CheXpert or MIMIC-CXR without fine-tuning. Quantify the true distribution-shift penalty; this is the question any clinical reviewer will ask first.
03 · Grad-CAM & interpretability audit
Generate Grad-CAM heatmaps for every class and visually audit whether the model attends to clinically meaningful regions — or to spurious markers (annotations, equipment).
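A sketch of the 4-view TTA plus two-model averaging from item 01; densenet, scratch and batch are assumed to exist, and indexing [0] selects the multi-label head of the dual-head models above.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def tta_probs(model, x: torch.Tensor) -> torch.Tensor:
    """Average sigmoid outputs over 4 views: identity, h-flip, +/- 7 degree rotations."""
    views = [x, TF.hflip(x), TF.rotate(x, 7.0), TF.rotate(x, -7.0)]
    probs = [torch.sigmoid(model(v)[0]) for v in views]   # [0] = multi-label logits
    return torch.stack(probs).mean(0)

# Two-model ensemble: average the TTA-averaged probabilities
ensemble = 0.5 * (tta_probs(densenet, batch) + tta_probs(scratch, batch))
```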
15 / 19 · Future work
CheXVision · Q & A
— THANK YOU
Questions?
CheXVision · BIG D(ATA) Team · Deep Learning & Big Data, AIN program
Code
github.com/arudaev/chexvision
Demo
hf.co/spaces/arudaev/chexvision-demo
Models
hf.co/arudaev/chexvision-{densenet, scratch}
19 / 19 · Thank you