PSPNet
Introduction
Abstract
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
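The region-based context aggregation described in the abstract can be illustrated with a minimal pure-Python sketch. It assumes a single-channel feature map stored as a list of rows, and it uses the pyramid bin sizes (1, 2, 3, 6) reported in the PSPNet paper. It is not the MMSegmentation implementation: the 1x1 convolutions applied after each pooling level are omitted, nearest-neighbour upsampling stands in for bilinear interpolation, and all function names are hypothetical.

```python
import math


def adaptive_avg_pool2d(feature, bins):
    """Average-pool a 2D grid (list of rows) down to bins x bins cells."""
    height, width = len(feature), len(feature[0])
    pooled = []
    for i in range(bins):
        # Region boundaries follow the usual adaptive-pooling convention.
        r0 = (i * height) // bins
        r1 = math.ceil((i + 1) * height / bins)
        row = []
        for j in range(bins):
            c0 = (j * width) // bins
            c1 = math.ceil((j + 1) * width / bins)
            cells = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, height, width):
    """Resize a bins x bins grid back to height x width (nearest neighbour)."""
    bins = len(grid)
    return [[grid[(r * bins) // height][(c * bins) // width] for c in range(width)]
            for r in range(height)]


def pyramid_pooling(feature, bin_sizes=(1, 2, 3, 6)):
    """Return the input map plus one upsampled context map per pyramid level."""
    height, width = len(feature), len(feature[0])
    channels = [feature]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool2d(feature, bins)
        channels.append(upsample_nearest(pooled, height, width))
    return channels  # the list stands in for channel-wise concatenation
```

The coarsest level (one bin) gives every pixel the global average, while finer levels summarise progressively smaller sub-regions; stacking them with the original map is what supplies the global prior to the per-pixel classifier.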
Citation
@inproceedings{zhao2017pspnet,
title={Pyramid Scene Parsing Network},
author={Zhao, Hengshuang and Shi, Jianping and Qi, Xiaojuan and Wang, Xiaogang and Jia, Jiaya},
booktitle={CVPR},
year={2017}
}
@article{wightman2021resnet,
title={Resnet strikes back: An improved training procedure in timm},
author={Wightman, Ross and Touvron, Hugo and J{\'e}gou, Herv{\'e}},
journal={arXiv preprint arXiv:2110.00476},
year={2021}
}
Results and models
Cityscapes
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-50-D8 | 512x1024 | 40000 | 6.1 | 4.07 | 77.85 | 79.18 | config | model | log |
PSPNet | R-101-D8 | 512x1024 | 40000 | 9.6 | 2.68 | 78.34 | 79.74 | config | model | log |
PSPNet | R-50-D8 | 769x769 | 40000 | 6.9 | 1.76 | 78.26 | 79.88 | config | model | log |
PSPNet | R-101-D8 | 769x769 | 40000 | 10.9 | 1.15 | 79.08 | 80.28 | config | model | log |
PSPNet | R-18-D8 | 512x1024 | 80000 | 1.7 | 15.71 | 74.87 | 76.04 | config | model | log |
PSPNet | R-50-D8 | 512x1024 | 80000 | - | - | 78.55 | 79.79 | config | model | log |
PSPNet | R-50b-D8 rsb | 512x1024 | 80000 | 6.2 | 3.82 | 78.47 | 79.45 | config | model | log |
PSPNet | R-101-D8 | 512x1024 | 80000 | - | - | 79.76 | 81.01 | config | model | log |
PSPNet (FP16) | R-101-D8 | 512x1024 | 80000 | 5.34 | 8.77 | 79.46 | - | config | model | log |
PSPNet | R-18-D8 | 769x769 | 80000 | 1.9 | 6.20 | 75.90 | 77.86 | config | model | log |
PSPNet | R-50-D8 | 769x769 | 80000 | - | - | 79.59 | 80.69 | config | model | log |
PSPNet | R-101-D8 | 769x769 | 80000 | - | - | 79.77 | 81.06 | config | model | log |
PSPNet | R-18b-D8 | 512x1024 | 80000 | 1.5 | 16.28 | 74.23 | 75.79 | config | model | log |
PSPNet | R-50b-D8 | 512x1024 | 80000 | 6.0 | 4.30 | 78.22 | 79.46 | config | model | log |
PSPNet | R-101b-D8 | 512x1024 | 80000 | 9.5 | 2.76 | 79.69 | 80.79 | config | model | log |
PSPNet | R-18b-D8 | 769x769 | 80000 | 1.7 | 6.41 | 74.92 | 76.90 | config | model | log |
PSPNet | R-50b-D8 | 769x769 | 80000 | 6.8 | 1.88 | 78.50 | 79.96 | config | model | log |
PSPNet | R-101b-D8 | 769x769 | 80000 | 10.8 | 1.17 | 78.87 | 80.04 | config | model | log |
PSPNet | R-50-D32 | 512x1024 | 80000 | 3.0 | 15.21 | 73.88 | 76.85 | config | model | log |
PSPNet | R-50b-D32 rsb | 512x1024 | 80000 | 3.1 | 16.08 | 74.09 | 77.18 | config | model | log |
PSPNet | R-50b-D32 | 512x1024 | 80000 | 2.9 | 15.41 | 72.61 | 75.51 | config | model | log |
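The mIoU columns in these tables report mean intersection-over-union as a percentage. As a minimal sketch of that metric, the function below takes flattened lists of predicted and ground-truth class indices. The name `mean_iou`, the `ignore_index=255` default, and the choice to skip classes absent from both inputs are assumptions made for illustration, not a description of the evaluation code used to produce the numbers above.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU in percent over classes that occur in pred or target."""
    intersect = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels are excluded from the score
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            intersect[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersect[c]
        if union > 0:  # skip classes absent from both pred and target
            ious.append(intersect[c] / union)
    if not ious:
        raise ValueError("no labeled pixels to evaluate")
    return 100.0 * sum(ious) / len(ious)


# Tiny usage example: four pixels, the last one unlabeled.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 255], num_classes=3)
```

In the usage example each of the two occurring classes has one correct pixel out of a union of two, so both per-class IoUs are 0.5.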
ADE20K
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-50-D8 | 512x512 | 80000 | 8.5 | 23.53 | 41.13 | 41.94 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 80000 | 12 | 15.30 | 43.57 | 44.35 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 160000 | - | - | 42.48 | 43.44 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 160000 | - | - | 44.39 | 45.35 | config | model | log |
Pascal VOC 2012 + Aug
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-50-D8 | 512x512 | 20000 | 6.1 | 23.59 | 76.78 | 77.61 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 20000 | 9.6 | 15.02 | 78.47 | 79.25 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 40000 | - | - | 77.29 | 78.48 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 40000 | - | - | 78.52 | 79.57 | config | model | log |
Pascal Context
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-101-D8 | 480x480 | 40000 | 8.8 | 9.68 | 46.60 | 47.78 | config | model | log |
PSPNet | R-101-D8 | 480x480 | 80000 | - | - | 46.03 | 47.15 | config | model | log |
Pascal Context 59
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-101-D8 | 480x480 | 40000 | - | - | 52.02 | 53.54 | config | model | log |
PSPNet | R-101-D8 | 480x480 | 80000 | - | - | 52.47 | 53.99 | config | model | log |
Dark Zurich and Nighttime Driving
We report evaluation results on these two datasets using the models above, which were trained on the Cityscapes training set.
Method | Backbone | Training Dataset | Test Dataset | mIoU | config | evaluation checkpoint | log |
---|---|---|---|---|---|---|---|
PSPNet | R-50-D8 | Cityscapes Training set | Dark Zurich | 10.91 | config | model | log |
PSPNet | R-50-D8 | Cityscapes Training set | Nighttime Driving | 23.02 | config | model | log |
PSPNet | R-50-D8 | Cityscapes Training set | Cityscapes Validation set | 77.85 | config | model | log |
PSPNet | R-101-D8 | Cityscapes Training set | Dark Zurich | 10.16 | config | model | log |
PSPNet | R-101-D8 | Cityscapes Training set | Nighttime Driving | 20.25 | config | model | log |
PSPNet | R-101-D8 | Cityscapes Training set | Cityscapes Validation set | 78.34 | config | model | log |
PSPNet | R-101b-D8 | Cityscapes Training set | Dark Zurich | 15.54 | config | model | log |
PSPNet | R-101b-D8 | Cityscapes Training set | Nighttime Driving | 22.25 | config | model | log |
PSPNet | R-101b-D8 | Cityscapes Training set | Cityscapes Validation set | 79.69 | config | model | log |
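The size of the day-to-night domain gap in the table above can be made explicit by subtracting each night-time score from the Cityscapes validation score of the same model. The short sketch below does only that arithmetic; the mIoU values are copied from the table, while the dictionary layout and the helper name `miou_drop` are hypothetical.

```python
# mIoU values copied from the table above (models trained on Cityscapes).
MIOU = {
    "R-50-D8": {"Cityscapes val": 77.85, "Dark Zurich": 10.91, "Nighttime Driving": 23.02},
    "R-101-D8": {"Cityscapes val": 78.34, "Dark Zurich": 10.16, "Nighttime Driving": 20.25},
    "R-101b-D8": {"Cityscapes val": 79.69, "Dark Zurich": 15.54, "Nighttime Driving": 22.25},
}


def miou_drop(scores, source="Cityscapes val"):
    """Absolute mIoU lost when moving from the source set to each other set."""
    return {name: round(scores[source] - value, 2)
            for name, value in scores.items() if name != source}


for backbone, scores in MIOU.items():
    print(backbone, miou_drop(scores))
```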
COCO-Stuff 10k
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-50-D8 | 512x512 | 20000 | 9.6 | 20.5 | 35.69 | 36.62 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 20000 | 13.2 | 11.1 | 37.26 | 38.52 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 40000 | - | - | 36.33 | 37.24 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 40000 | - | - | 37.76 | 38.86 | config | model | log |
COCO-Stuff 164k
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-50-D8 | 512x512 | 80000 | 9.6 | 20.5 | 38.80 | 39.19 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 80000 | 13.2 | 11.1 | 40.34 | 40.79 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 160000 | - | - | 39.64 | 39.97 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 160000 | - | - | 41.28 | 41.66 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 320000 | - | - | 40.53 | 40.75 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 320000 | - | - | 41.95 | 42.42 | config | model | log |
LoveDA
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-18-D8 | 512x512 | 80000 | 1.45 | 26.87 | 48.62 | 47.57 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 80000 | 6.14 | 6.60 | 50.46 | 50.19 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 80000 | 9.61 | 4.58 | 51.86 | 51.34 | config | model | log |
Potsdam
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-18-D8 | 512x512 | 80000 | 1.50 | 85.12 | 77.09 | 78.30 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 80000 | 6.14 | 30.21 | 78.12 | 78.98 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 80000 | 9.61 | 19.40 | 78.62 | 79.47 | config | model | log |
Vaihingen
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-18-D8 | 512x512 | 80000 | 1.45 | 85.06 | 71.46 | 73.36 | config | model | log |
PSPNet | R-50-D8 | 512x512 | 80000 | 6.14 | 30.29 | 72.36 | 73.75 | config | model | log |
PSPNet | R-101-D8 | 512x512 | 80000 | 9.61 | 19.97 | 72.61 | 74.18 | config | model | log |
iSAID
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download (model) | download (log) |
---|---|---|---|---|---|---|---|---|---|---|
PSPNet | R-18-D8 | 896x896 | 80000 | 4.52 | 26.91 | 60.22 | 61.25 | config | model | log |
PSPNet | R-50-D8 | 896x896 | 80000 | 16.58 | 8.88 | 65.36 | 66.48 | config | model | log |
Note:
- `FP16` means Mixed Precision (FP16) is adopted in training.
- `896x896` is the Crop Size of the iSAID dataset, which follows the implementation of PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation.
- `rsb` is short for 'Resnet strikes back'.
- The `b` in `R-50b` means ResNetV1b, which is a standard ResNet backbone. In MMSegmentation, the default backbone is ResNetV1c, which usually performs better in semantic segmentation tasks.