Nvidia Volta - Architecture Highlights

2018-02-19 15:43:20.0

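Nvidia CEO Jensen Huang's GTC keynote wrapped up in the early hours of the morning; the stock jumped and the media went wild. At the same time, http://devblogs.nvidia.com published an article, “Inside Volta: The World’s Most Advanced Data Center GPU” (on the Parallel Forall blog), which describes Nvidia's latest Volta architecture in some detail. The article is well worth reading, and I recommend following the link to the original. You could also wait a while, since plenty of translations will probably appear. I won't translate it here; instead I want to quickly share the points I consider most important at the architecture level, for your reference.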

1. Key Features

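As usual, let's first look at Volta's key features, which are presumably also what Nvidia is proudest of. I have trimmed the list somewhat. The first feature covers the most hardware architecture changes, and I discuss it in detail later.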

New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning Volta features a major new redesign of the SM processor architecture that is at the center of the GPU....With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations..... Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
Second-Generation NVLink™ ... supports up to 6 NVLink links at 25 GB/s for a total of 300 GB/s. ...
HBM2 Memory: Faster, Higher Efficiency Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth. ...
Volta Multi-Process Service Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. ...
Enhanced Unified Memory and Address Translation Services GV100 Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. ...
Cooperative Groups and New Cooperative Launch APIs ....allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions. ...
Maximum Performance and Maximum Efficiency Modes ...
Volta Optimized Software ...Volta-optimized versions of GPU accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications....

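Judging from these features, Nvidia is putting more and more emphasis on Deep Learning support, and these hardware and software features give it a real advantage in datacenter training. The figure below compares GV100 (Volta) with earlier architectures. The more interesting points include the new Tensor Cores with their corresponding performance figures, the enormous 815 mm² die area, and the advanced 12nm FFN process.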

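Nvidia's big weapon uses a 12nm process, yet the die area still reaches 815 mm². Can the yield be good enough? On second thought, Nvidia may not care much: even at sky-high prices, plenty of buyers will fight over the chips.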

2. Tensor Cores

Of course, the most eye-catching part of the Volta architecture is the new Tensor Cores. The figure below illustrates what they do.

Judging from the formula and the diagram, one Tensor Core in Volta performs a multiply-accumulate on 4x4 matrices, D = A x B + C.
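
As a rough illustration of this operation, the Python sketch below emulates D = A x B + C on 4x4 matrices with FP16 inputs and FP32 accumulation. The helper names (`to_fp16`, `to_fp32`, `tensor_core_fma`) are made up for this example, and the accumulation order and rounding points are assumptions; the article does not describe how the hardware actually orders or rounds the sums.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tensor_core_fma(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows.

    A and B are rounded to FP16 on input; every partial sum is kept in FP32.
    The summation order over k is an assumption, not documented behavior.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d
```

The triple loop performs 4 x 4 x 4 = 64 multiply-adds, which lines up with the "64 floating point FMA" operations per clock quoted from the Nvidia article.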

“Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. ”
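
The figures in this quote can be checked with a few lines of arithmetic. The sketch below is only a sanity check. The constant and function names are invented for the example, and the Pascal figure of 64 FP32 cores per GP100 SM is an assumption taken from outside the quoted passage.

```python
FMA_PER_TENSOR_CORE_PER_CLOCK = 64  # from the quote: 64 FMA operations per clock
TENSOR_CORES_PER_SM = 8             # from the quote
FLOPS_PER_FMA = 2                   # one multiply plus one add

# Assumption (not in the quote): a Pascal GP100 SM has 64 FP32 cores,
# each completing one FMA per clock.
PASCAL_FP32_CORES_PER_SM = 64

def volta_tensor_flops_per_sm_per_clock() -> int:
    """Floating point operations per clock from the Tensor Cores of one Volta SM."""
    return FMA_PER_TENSOR_CORE_PER_CLOCK * TENSOR_CORES_PER_SM * FLOPS_PER_FMA

def pascal_fp32_flops_per_sm_per_clock() -> int:
    """Floating point operations per clock from standard FP32 FMAs on one GP100 SM."""
    return PASCAL_FP32_CORES_PER_SM * FLOPS_PER_FMA

def per_sm_speedup() -> float:
    """Ratio of the two per-SM figures above."""
    return volta_tensor_flops_per_sm_per_clock() / pascal_fp32_flops_per_sm_per_clock()
```

Under these assumptions the per-SM figures come out to 1024 and 128 operations per clock, which gives the 8X ratio in the quote.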

As for the hardware architecture, the article gives a simplified diagram.

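Based on our earlier analysis of the Google TPU (Systolic Arrays - Reborn Thanks to the Google TPU), the Tensor Core differs from the PE (cell) of the Google TPU's systolic array mainly in two ways: 1. The Tensor Core is much coarser-grained, whereas each TPU cell performs only a single scalar multiply-accumulate; 2. The Tensor Core has higher precision, whereas a TPU cell supports only 8-bit and 16-bit fixed-point operations. After all, the TPU targets only inference in the data center. Here again is my speculative sketch of a TPU cell.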

From the perspective of parallel execution, multiple Tensor Cores can be used by one “warp” at the same time:

“During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores.”
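
One way to picture how a 16x16x16 warp-level operation relates to the 4x4 Tensor Core operation is plain blocked matrix multiplication. The Python sketch below only counts tile operations. The function names are hypothetical, and the loop order is an assumption: the article does not say how the hardware actually distributes tiles across Tensor Cores.

```python
def tile_fma(a, b, acc, i0, j0, k0, t=4):
    """Accumulate one t x t x t tile product into acc, in place."""
    for i in range(i0, i0 + t):
        for j in range(j0, j0 + t):
            for k in range(k0, k0 + t):
                acc[i][j] += a[i][k] * b[k][j]

def warp_matmul(a, b, c, n=16, t=4):
    """Compute D = A x B + C for n x n matrices from t x t x t tile operations.

    Returns (D, number_of_tile_operations).
    """
    d = [row[:] for row in c]  # start from C so the result is A x B + C
    ops = 0
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                tile_fma(a, b, d, i0, j0, k0, t)
                ops += 1
    return d, ops
```

With n = 16 and t = 4 the decomposition needs 4 x 4 x 4 = 64 tile operations, each the size of the single Tensor Core operation described earlier.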

The figure below shows where the Tensor Cores sit within an SM.

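Tensor Cores really are designed specifically for Deep Learning, but can such a coarse computation granularity match real applications well? I trust Nvidia has its reasons for this choice.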

3. Independent Thread Scheduling

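Volta's other major architectural change is so-called independent thread scheduling. In fact, I think this is a bigger architectural move than adding Tensor Cores. Before Volta, the defining trait of Nvidia's SIMT (Single Instruction, Multiple Threads) architecture was that the 32 threads in a “warp” shared one PC (Program Counter) and one stack. In Volta, each thread has its own PC and stack, as the figure below shows.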

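If you are not familiar with SIMT, Nvidia's CUDA introductions are a good reference. Put simply, in Nvidia's SIMT many threads run in parallel on one processor, but they all execute the same piece of code while processing different data.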

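Because multiple threads shared a PC and stack in pre-Volta architectures, thread scheduling was fairly coarse-grained. Consider the case below. When a branch occurs, threads with threadIdx < 4 execute A and B; the others execute X and Y. In Pascal, the 32 threads of a “warp” share one PC, combined with an “active mask” that specifies which threads are active at any given time. This means divergent branches leave some threads inactive. The different parts of the “warp” execute sequentially until they reconverge, at which point the mask is restored and all threads run together again.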
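
The serialized behavior described here can be mimicked with a toy model. The Python sketch below is not how the hardware is implemented; it only records which threads are active for each statement when a warp shares one PC. The function names are hypothetical, and the statement Z, standing for whatever code follows the branch, is an assumption added so that reconvergence is visible.

```python
WARP_SIZE = 32

def run_pre_volta(warp_size=WARP_SIZE):
    """Return the issue trace as a list of (statement, active_thread_ids)."""
    all_threads = list(range(warp_size))
    taken = [t for t in all_threads if t < 4]       # threadIdx < 4
    not_taken = [t for t in all_threads if t >= 4]
    trace = []
    # One shared PC: the warp walks the 'if' side under a mask first ...
    for stmt in ("A", "B"):
        trace.append((stmt, taken))
    # ... then the 'else' side under the complementary mask.
    for stmt in ("X", "Y"):
        trace.append((stmt, not_taken))
    # Reconvergence: the mask is restored and all threads run Z together.
    trace.append(("Z", all_threads))
    return trace

def lane_utilization(trace, warp_size=WARP_SIZE):
    """Fraction of thread slots doing useful work over the whole trace."""
    busy = sum(len(active) for _, active in trace)
    return busy / (len(trace) * warp_size)
```

In this model the statements are issued in the order A, B, X, Y, Z, and only the final statement keeps all 32 threads busy.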

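In Volta the situation changes. As the figure below shows, other scheduling orders become possible, and statements from the if and else branches can now be interleaved in time. Execution is still SIMT, though: in any given clock cycle, all active threads in a warp execute the same instruction, which preserves the execution efficiency of earlier architectures.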
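
To make the interleaving concrete, here is a toy scheduler in Python in which every thread carries its own PC. The policy it uses (always issue the group of threads that has made the least progress) is purely hypothetical; the article does not describe Nvidia's actual scheduling policy. The function name and the trailing statement Z are likewise assumptions for illustration.

```python
def run_independent(warp_size=32):
    """Toy scheduler: every thread keeps its own PC into its own statement list.

    Returns the issue trace as a list of (statement, active_thread_ids).
    """
    paths = {t: (["A", "B", "Z"] if t < 4 else ["X", "Y", "Z"])
             for t in range(warp_size)}
    pc = {t: 0 for t in range(warp_size)}  # one program counter per thread
    trace = []
    while any(pc[t] < len(paths[t]) for t in range(warp_size)):
        # Group unfinished threads by the statement they would execute next.
        groups = {}
        for t in range(warp_size):
            if pc[t] < len(paths[t]):
                groups.setdefault(paths[t][pc[t]], []).append(t)
        # Hypothetical policy: issue the group that has made the least progress.
        stmt = min(groups, key=lambda s: pc[groups[s][0]])
        # Still SIMT: one statement per step, shared by all threads in the group.
        trace.append((stmt, groups[stmt]))
        for t in groups[stmt]:
            pc[t] += 1
    return trace
```

With this policy the trace alternates between the two branches (A, X, B, Y) before all threads execute Z, while each step still issues a single statement for its group of active threads.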

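As this shows, the point of independent thread scheduling is finer scheduling granularity, at the cost of considerable extra hardware (every thread needs its own PC and stack).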

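Giving every thread its own PC and stack is not cheap. Whether Independent Thread Scheduling can reach its full potential depends critically on the software tools. Nvidia has invested heavily in software tools over the years, so it is presumably confident.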

4. Other Improvements

Besides the two improvements above, which I consider the most important, the Volta architecture has several other changes worth noting.

First, it can execute FP32 and INT32 instructions simultaneously:

“Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal.”
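
A back-of-the-envelope model gives a feel for what these two changes could mean. The Python sketch below is deliberately idealized and is not a performance model of either GPU. Both function names are hypothetical, and it assumes a chain of strictly dependent FMAs and a mix of fully independent FP32 and INT32 instructions with no other bottlenecks.

```python
def dependent_chain_cycles(n_fma: int, issue_latency: int) -> int:
    """Cycles to issue n back-to-back FMAs, each waiting on the previous result."""
    return n_fma * issue_latency

def issue_slots(fp32_ops: int, int32_ops: int, separate_pipes: bool) -> int:
    """Idealized issue slots for a mix of independent FP32 and INT32 instructions.

    With separate pipes the two streams overlap completely (an assumption);
    with a shared pipe they are issued one after another.
    """
    if separate_pipes:
        return max(fp32_ops, int32_ops)
    return fp32_ops + int32_ops
```

For example, a chain of 100 dependent FMAs takes 400 cycles at the quoted four-cycle latency and 600 cycles at six, and a mix of 100 FP32 and 40 INT32 instructions needs 100 slots with separate pipes versus 140 with a shared one in this simplified view.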

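Second, the L1 Data Cache and Shared Memory are enhanced. Cache and shared memory can be used more flexibly, making full use of shared memory's performance advantage (which, of course, programmers must manage themselves).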

Third, the article describes “STARVATION-FREE ALGORITHMS”. I don't have a strong take on this; if you are interested, see the original article.

That's all for now. Comments and discussion are welcome.

Source: Zhihu