Reproducing Fast.ai/DIUx imagenet18 with a Titan RTX server

 2019-01-31 07:22:23.0

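Last year, Fast.ai won the first ImageNet training cost challenge as part of the DAWN benchmark. Their customized ResNet50 took 3.27 hours to reach 93% top-5 accuracy on an AWS p3.16xlarge (8 x V100 GPUs). This year, Fast.ai teamed up with DIUx and cut the training time to 18 minutes using 16 p3.16xlarge machines. This was the fastest solution at the time of the announcement (September 2018).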

In this blog post, we reproduce the latest Fast.ai/DIUx ImageNet result on a single server with 8 Turing GPUs (Titan RTX). It takes 2.36 hours to reach 93% top-5 accuracy.


Epoch  Training Time (hours)  Top-1 Acc (%)  Top-5 Acc (%)
1      0.0539                 7.2800         19.1979
2      0.0921                 18.3619        39.6699
3      0.1306                 26.1700        50.2779
4      0.1691                 29.9260        54.5460
5      0.2078                 32.3339        58.0260
6      0.2465                 27.2560        50.7680
7      0.2852                 30.0799        54.3160
8      0.3240                 39.1959        65.4260
9      0.3627                 42.8860        69.2040
10     0.4014                 45.1940        70.9100
11     0.4402                 49.3839        74.9639
12     0.4788                 54.9459        79.5660
13     0.5174                 58.5820        81.8980
14     0.6433                 57.3959        81.5960
15     0.7569                 53.0480        77.7799
16     0.8703                 58.9599        82.7979
17     0.9845                 60.4039        83.8259
18     1.0982                 62.3779        84.8720
19     1.2124                 64.9540        86.6080
20     1.3258                 65.9520        87.2919
21     1.4390                 68.3700        88.7060
22     1.5529                 71.4420        90.4820
23     1.6673                 72.0479        90.6679
24     1.7826                 72.8679        91.1559
25     1.8957                 73.5739        91.4960
26     2.1455                 75.8519        92.9879
27     2.3657                 75.9179        93.0179
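The training-time column is cumulative, so the cost of each epoch is easier to see after differencing it. The following is a minimal sketch that does this and averages the result per image size. The epoch-to-resolution mapping is taken from the --phases argument of the training command further down (phases start at 0-indexed epochs 0, 13, and 25); the function names are illustrative and not part of the repo.

```python
# Cumulative training time (hours) after each epoch, copied from the table above.
cumulative_hours = [
    0.0539, 0.0921, 0.1306, 0.1691, 0.2078, 0.2465, 0.2852, 0.3240, 0.3627,
    0.4014, 0.4402, 0.4788, 0.5174, 0.6433, 0.7569, 0.8703, 0.9845, 1.0982,
    1.2124, 1.3258, 1.4390, 1.5529, 1.6673, 1.7826, 1.8957, 2.1455, 2.3657,
]

# 1-indexed epochs (as numbered in the table) trained at each image size.
# Assumption: derived from the --phases schedule, whose epochs are 0-indexed.
epochs_by_size = {128: range(1, 14), 224: range(14, 26), 288: range(26, 28)}


def epoch_durations(cumulative):
    """Difference the cumulative times to get the duration of each epoch."""
    previous = [0.0] + cumulative[:-1]
    return [now - before for now, before in zip(cumulative, previous)]


def mean_minutes_per_epoch(cumulative, epochs):
    """Average duration, in minutes, of the given 1-indexed epochs."""
    durations = epoch_durations(cumulative)
    selected = [durations[e - 1] for e in epochs]
    return 60.0 * sum(selected) / len(selected)


for size, epochs in epochs_by_size.items():
    minutes = mean_minutes_per_epoch(cumulative_hours, epochs)
    print(f"{size} px: {minutes:.1f} min/epoch over {len(epochs)} epochs")
```

With the numbers in the table this gives roughly 2.4, 6.9, and 14.1 minutes per epoch at 128, 224, and 288 pixels. The reported times presumably include whatever overhead the run incurred per epoch (startup, evaluation, phase switches), so these are wall-clock averages rather than pure training throughput.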

You can jump to the code and the instructions from here.

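Progressive training: another interesting technique used by the Fast.ai/DIUx team is progressive training with images at multiple resolutions. Training starts with low-resolution (128 x 128) input images and a larger batch size to reach a reasonable accuracy quickly; the resolution is then increased (first to 224 x 224, then to 288 x 288) for the more expensive fine-tuning. Overall, this reduces the number of epochs needed to reach the target test accuracy. Note that this is only possible because the fully connected layer is replaced with a global pooling layer, so a network trained on low-resolution images can be applied to higher-resolution images without modification. In addition, the batch size and learning rate are carefully scheduled for each resolution to reach the desired performance as quickly as possible.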
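To illustrate why global pooling makes the network independent of input resolution, here is a minimal pure-Python sketch (it is not the repo's PyTorch code). It assumes the convolutional trunk downsamples by a total stride of 32, as a standard ResNet50 does; the helper names and the small channel count are hypothetical and chosen only for illustration.

```python
def fake_feature_map(image_size, channels=8, stride=32):
    """Build a dummy channels x H x W feature map for a square input image.

    Assumption: the convolutional layers downsample by a total stride of 32,
    so a 224 px input yields a 7 x 7 spatial map.
    """
    side = image_size // stride
    return [[[float(c + 1)] * side for _ in range(side)] for c in range(channels)]


def flattened_length(feature_map):
    """Input size a fully connected layer would need if the map were flattened."""
    return sum(len(row) for channel in feature_map for row in channel)


def global_avg_pool(feature_map):
    """Average each channel over all spatial positions: one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


for size in (128, 224, 288):
    fmap = fake_feature_map(size)
    print(
        f"{size} px input: spatial side {len(fmap[0])}, "
        f"flattened length {flattened_length(fmap)}, "
        f"pooled length {len(global_avg_pool(fmap))}"
    )
```

The flattened length changes with the input size, so a fully connected layer applied directly to the flattened map would need different weights at each resolution. The pooled vector always has one entry per channel, which is why the same classifier weights can be reused as the resolution grows.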



You can reproduce the results with this repo.

First, clone the repo and set up a Python 3 virtual environment:

git clone https://github.com/lambdal/imagenet18.git

cd imagenet18

virtualenv -p python3 env

source env/bin/activate

pip install -r requirements_local.txt

Then download the data to your local machine (be aware that the tar files are about 200 GB in total):

wget https://s3.amazonaws.com/yaroslavvb/imagenet-data-sorted.tar

wget https://s3.amazonaws.com/yaroslavvb/imagenet-sz.tar

tar -xvf imagenet-data-sorted.tar -C /mnt/data/data

tar -xvf imagenet-sz.tar -C /mnt/data/data

Finally, run the following command to reproduce the results on an 8-GPU server. Set "nproc_per_node" to match the number of GPUs on your machine.

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 \
training/train_imagenet_nv.py /mnt/data/data/imagenet \
--fp16 --logdir ./ncluster/runs/lambda-blade --distributed --init-bn0 --no-bn-wd \
--phases "[{'ep': 0, 'sz': 128, 'bs': 512, 'trndir': '-sz/160'}, {'ep': (0, 7), 'lr': (1.0, 2.0)}, {'ep': (7, 13), 'lr': (2.0, 0.25)}, {'ep': 13, 'sz': 224, 'bs': 224, 'trndir': '-sz/320', 'min_scale': 0.087}, {'ep': (13, 22), 'lr': (0.4375, 0.043750000000000004)}, {'ep': (22, 25), 'lr': (0.043750000000000004, 0.004375)}, {'ep': 25, 'sz': 288, 'bs': 128, 'min_scale': 0.5, 'rect_val': True}, {'ep': (25, 28), 'lr': (0.0025, 0.00025)}]" 
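The --phases argument is a Python literal that mixes two kinds of entries: data phases (image size 'sz' and batch size 'bs' starting at a given epoch) and learning-rate phases (an epoch range with a start and end 'lr'). The sketch below parses the string and looks up the settings at a given epoch. It assumes the learning rate is interpolated linearly between the two values of each 'lr' tuple; the actual scheduler in training/train_imagenet_nv.py may differ in detail, and the helper names here are hypothetical.

```python
import ast

# The --phases string from the command above, split across lines for readability.
PHASES = (
    "[{'ep': 0, 'sz': 128, 'bs': 512, 'trndir': '-sz/160'}, "
    "{'ep': (0, 7), 'lr': (1.0, 2.0)}, "
    "{'ep': (7, 13), 'lr': (2.0, 0.25)}, "
    "{'ep': 13, 'sz': 224, 'bs': 224, 'trndir': '-sz/320', 'min_scale': 0.087}, "
    "{'ep': (13, 22), 'lr': (0.4375, 0.043750000000000004)}, "
    "{'ep': (22, 25), 'lr': (0.043750000000000004, 0.004375)}, "
    "{'ep': 25, 'sz': 288, 'bs': 128, 'min_scale': 0.5, 'rect_val': True}, "
    "{'ep': (25, 28), 'lr': (0.0025, 0.00025)}]"
)

phases = ast.literal_eval(PHASES)
data_phases = [p for p in phases if "sz" in p]  # image size / batch size changes
lr_phases = [p for p in phases if "lr" in p]    # learning-rate ramps


def data_config_at(epoch):
    """Return the most recent data phase that has started by this epoch."""
    started = [p for p in data_phases if p["ep"] <= epoch]
    return started[-1]


def lr_at(epoch):
    """Learning rate at a (possibly fractional) 0-indexed epoch.

    Assumption: linear interpolation between the two 'lr' values of the phase
    whose epoch range contains the given epoch.
    """
    for p in lr_phases:
        start, end = p["ep"]
        if start <= epoch < end:
            lr_start, lr_end = p["lr"]
            fraction = (epoch - start) / (end - start)
            return lr_start + fraction * (lr_end - lr_start)
    raise ValueError(f"epoch {epoch} is outside the schedule")


for epoch in (0, 3.5, 7, 13, 25):
    cfg = data_config_at(epoch)
    print(f"epoch {epoch}: sz={cfg['sz']} bs={cfg['bs']} lr={lr_at(epoch)}")
```

Read this way, the schedule warms the learning rate up from 1.0 to 2.0 over the first 7 epochs at 128 pixels, decays it to 0.25 by epoch 13, and then restarts from a lower value each time the image size increases and the batch size shrinks.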

To print out the statistics, locate the events.out file in the "logdir" folder and run this command:

python dawn/prepare_dawn_tsv.py \
--events_path=<logdir>/<events.out>
