# Accelerating Al with



Asher Fredman, Solution Architect

#### TALK AGENDA

- What and Why NVIDIA
- GPUs Vs. CPUs the power parallel computing
- Introduce the CUDA programming model
- GPU architecture and how to utilize GPU capabilities
- GPU acceleration in DL
  - Matrix Multiplications
  - Tensor Cores and AMP
  - Inference Optimizations and Sparsity



#### NVIDIA From Computer Graphics to GPU Computing



#### THE BIG BANG IN AI



#### DEEP LEARNING REVOLUTIONIZING COMPUTING

Image Classification, Object Detection, Localization, Action Recognition



#### Pedestrian Detection, Lane Detection, Traffic Sign Recognition



Speech Recognition, Speech Translation, Natural Language Processing



Breast Cancer Cell Mitosis Detection, Volumetric Brain Image Segmentation





# GPU and CPU

#### POWERING ALL INDUSTRIES

With a single innovation...

**CPU** Optimized for Serial Tasks

#### GPU Accelerator Optimized for

Optimized for Parallel Tasks



#### SMALL CHANGES, BIG SPEED-UP

**Application Code** 



A SUPERCHARGED COMPUTING MODEL To power the next advances in technology...





### NVIDIA END TO END AI PLATFORM



#### BUILD AI FASTER. DEPLOY ANYWHERE WITH NGC ngc.nvidia.com



#### **CONTINUOUS PERFORMANCE IMPROVEMENT**

Developers' Software Optimizations Deliver Better Performance on the Same Hardware



512 Batch Size for TF & PyT, 256 Batch size for MxNet | ResNet-50 Training v1.5| 16x V100 | DGX-2



### **CUDA C/C++ BASICS**



### What is CUDA?



A general-purpose parallel computing platform and programming model.

- General purpose one ring to rule them all
- Parallel computing via minimal extensions to familiar environments
- GPU abstractions to optimize code using HW capabilities

#### **3 WAYS TO ACCELERATE APPLICATIONS**



### Introduction to CUDA C/C++



#### What will you learn in this section?

- Start with vector addition
- Write and launch CUDA C/C++ kernels
- Manage GPU memory

## **Heterogeneous Computing**



#### Terminology:

- Host The CPU and its memory (host memory)
- Device The GPU and its memory (device memory)



### **Simple Processing Flow**





### **Simple Processing Flow**





### **Simple Processing Flow**





- Copy input data from CPU memory to GP memory
- 2. Load GPU program and execute, caching data on chip for performance
- 3. Copy results from GPU memory to CPU memory



## Parallel SAXPY

- GPU computing is about massive parallelism!
- We need an interesting example...
- SAXPY stands for "Single-Precision A·X Plus Y".





 $z = \alpha x + y$ 

x, y, z: vector  $\alpha$ : scalar

## **CUDA KERNEL EXECUTION**



### Grid Block Thread



### **CUDA code**

saxyp\_serial(N, 2.0, d\_x, d\_y);

```
void saxpy_serial(int n, float a, float *x, float *y)
{
   for (int i = 0; i < n; ++i)
     y[i] = a*x[i] + y[i];
}</pre>
```

```
_global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
Parallel C Code</pre>
```

saxyp\_parallel<<<n\_blocks,n\_threads>>>(N, 2.0, d\_x, d\_y);
N = n\_blocks x n\_threads

# Indexing Arrays with Blocks and Threads



- No longer as simple as using blockIdx.x and threadIdx.x
  - Consider indexing an array with one element per thread (8 threads/block)



### Why Bother with Blocks of Threads?



#### Blocks seem unnecessary

- They add a level of complexity
- What do we gain?

#### • Unlike parallel blocks, threads have mechanisms to:

- Communicate
- Synchronize

#### See stencil computations for an example





- Consider applying a 1D stencil to a 1D array of elements
  - Each output element is the sum of input elements within a radius
- If radius is 3, then each output element is the sum of 7 input elements:







- Consider applying a 1D stencil to a 1D array of elements
  - Each output element is the sum of input elements within a radius
- If radius is 3, then each output element is the sum of 7 input elements:









- Launching parallel kernels on device \_\_global\_\_
  - Launch N copies of add() with add<<<N/M,M>>>>(...);
  - Use blockIdx.x to access block index
  - Use threadIdx.x to access thread index within block

### Handling Arbitrary Vector Sizes



- Typical problems are not friendly multiples of blockDim.x
- Avoid accessing beyond the end of the arrays:

```
global___void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];</pre>
```

Update the kernel launch: add<<< (N + M-1) / M,M>>> (d\_a, d\_b, d\_c, N);

}

### **GPU Architectures**

### and CUDA Optimization



# 20-Series Architecture (Fermi)



512 Scalar Processor (SP) cores execute parallel thread instructions



**16 Streaming Multiprocessors (SMs)** each contains **32 scalar processors** 32 fp32 / int32 ops / clock, 16 fp64 ops / clock 4 Special Function Units (SFUs) Shared register file (128KB) 48 KB / 16 KB Shared memory 16KB / 48 KB L1 data cache

6 GB of DRAM



### Pascal/Volta cc6.0/7.0

- 64 SP units ("cores")
- 32 DP units
- LD/ST units
- FP16 @ 2x SP rate
- cc7.0: TensorCore
- 4 warp schedulers
- Each warp scheduler is dualissue capable
- P100: 50 SM's, 16GB
- V100: 80 SM's, 16/32GB

| М                             |           | _          |            |                                                        |                                |                               |                               |           |           |           |           |                  | •              |
|-------------------------------|-----------|------------|------------|--------------------------------------------------------|--------------------------------|-------------------------------|-------------------------------|-----------|-----------|-----------|-----------|------------------|----------------|
|                               |           |            |            |                                                        |                                | L1 Instruc                    | tion Cache                    |           |           |           |           |                  |                |
|                               | -         | LO Ir      | tstruc     | tion C                                                 | ache                           |                               |                               | -         | LON       | istruc    | tion C    | ache             |                |
|                               | -         | hread/clk) |            | L0 Instruction Cache<br>Warp Scheduler (32 thread/clk) |                                |                               |                               |           |           |           |           |                  |                |
|                               | spatcl    | h Unit     | (32 th     | read/clk)                                              |                                | Dispatch Unit (32 thread/clk) |                               |           |           |           |           |                  |                |
|                               | Reg       | ister      | File (1    | 16,384                                                 | 4 x 32-bit)                    |                               |                               | Reg       | ister     | File ('   | 16,384    | 4 x 32-bit)      |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   | TENSOR<br>CORE                 | TENSOR<br>CORE                | FP64                          | INT       | INT       | FP32      | FP32      | TENSOR<br>CORE   | TENSOR<br>CORE |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
|                               |           |            |            |                                                        |                                | <b></b>                       |                               | 1000      |           |           |           |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| LD/ LD/<br>ST ST              | LD/<br>ST | LD/<br>ST  | LD/<br>ST  | LD/<br>ST                                              | LD/ LD/<br>ST ST               | SFU                           | LD/ LD/<br>ST ST              | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/ LD/<br>ST ST | SFU            |
|                               | _         | L0 Ir      | struc      | tion C                                                 | ache                           |                               |                               |           | L0 lr     | nstruc    | tion C    | ache             |                |
|                               |           |            | hread/clk) | _                                                      | Warp Scheduler (32 thread/clk) |                               |                               |           |           |           |           |                  |                |
| Dispatch Unit (32 thread/clk) |           |            |            |                                                        |                                |                               | Dispatch Unit (32 thread/clk) |           |           |           |           |                  |                |
|                               | Reg       | ister      | File (1    | 16,384                                                 | 4 x 32-bit)                    |                               |                               | Reg       | jister    | File ('   | 16,384    | 4 x 32-bit)      |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                | 10 01 00 111<br>12 23 25 111  | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   | TENSOR<br>CORE                 | TENSOR<br>CORE                | FP64                          | INT       | INT       | FP32      | FP32      | TENSOR           | TENSOR<br>CORE |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      | CORE             |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| FP64                          | INT       | INT        | FP32       | FP32                                                   |                                |                               | FP64                          | INT       | INT       | FP32      | FP32      |                  |                |
| LD/ LD/<br>ST ST              | LD/<br>ST | LD/<br>ST  | LD/<br>ST  | LD/<br>ST                                              | LD/ LD/<br>ST ST               | SFU                           | LD/ LD/<br>ST ST              | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/ LD/<br>ST ST | SFU            |
| Concernent In concernent      |           |            |            |                                                        | 1286                           | 3 L1 Data Car                 | he / Shared M                 | emory     | 1         |           |           |                  |                |
|                               |           |            |            |                                                        |                                |                               |                               |           |           |           |           |                  |                |



# **Thread Hierarchy and Execution Model**



#### Software





Hardware

Threads are executed by scalar processors

Thread blocks are executed on multiprocessors

Thread Block



Multiprocessor

Thread blocks do not migrate

Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)





A kernel is launched as a grid of thread blocks

# Warps





A thread block consists of 32-thread warps

A warp is executed physically in parallel (SIMD) on a multiprocessor

## **Execution Model Ampere**





# Memory Hierarchy



## Memory model







# Memory hierarchy in GPUs



# Launch Configuration





# Hiding Latency - Launch Configuration

### Key to understanding:

- Instructions are issued in order
- A thread stalls when one of the operands isn't ready:
  - Memory read by itself doesn't stall execution
- Latency is hidden by switching threads
  - GMEM latency: ~400 cycles
- How many threads/threadblocks to launch?
- Conclusion:
  - Need enough threads to hide latency

# **GPU Latency Hiding**



- In CUDA C source code:
- int idx = threadIdx.x+blockDim.x\*blockIdx.x;
- o c[idx] = a[idx] \* b[idx];
- In machine code:
- I0: LD R0, a[idx];
- I1: LD R1, b[idx];
- I2: MPY R2,R0,R1













## Launch Configuration: Summary



- Need enough total threads to keep GPU busy
  - Typically, you'd like 512+ threads per SM (aim for 2048 maximum "occupancy")
    - More if processing one fp32 element per thread
  - Of course, exceptions exist
- Threadblock configuration
  - Threads per block should be a multiple of warp size (32)
  - SM can concurrently execute up to 16 thread blocks
    - Really small thread blocks prevent achieving good occupancy
    - Really large thread blocks are less flexible
    - Generally, use 128-256 threads/block, but use whatever is best for the application
- For more details:
  - Vasily Volkov's GTC2010 talk "Better Performance at Lower Occupancy" (http://www.nvidia.com/content/gtc-2010/pdfs/2238\_gtc2010.pdf)

## EFFICIENT GEMM IMPLEMENTATIONS

### **GENERAL MATRIX PRODUCT**

**Basic definition** 

General matrix product

 $C = \alpha \operatorname{op}(A) * \operatorname{op}(B) + \beta C$ 

C is M-by-N, op(A) is M-by-K, op(B) is K-by-N

Compute independent dot products

```
// Independent dot products
for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            C[i][j] += A[i][k] * B[k][j];</pre>
```

Inefficient due to large working sets to hold parts of A and B



### **GENERAL MATRIX PRODUCT**

#### Accumulated outer products

General matrix product

 $C = \alpha \operatorname{op}(A) * \operatorname{op}(B) + \beta C$ 

C is M-by-N, op(A) is M-by-K, op(B) is K-by-N

Compute independent dot products



#### 

Load elements of **A** and **B** exactly once

#### CUTLASS

CUDA TEMPLATE LIBRARY FOR DENSE LINEAR ALGEBRA AT ALL LEVELS AND SCALE

### **COMPLETE GEMM HIERARCHY**

Data reuse at each level of the memory hierarchy



## ACCELERATING TRAINING AND INFERENCING

### DEEP LEARNING APPLICATION DEVELOPMENT



DEEP LEARNING TRAINING WITH NVIDIA GPUS

# AMP AUTOMATIC MIXED PRECISION

### THE IMPORTANCE OF FP32



### NEW TF32 TENSOR CORES ON A100

#### 20X Higher FLOPS for AI, Zero Code Change

NVIDIA V100 FP32

NVIDIA A100 Tensor Core TF32 with Sparsity



Works like FP32 for AI with Range of FP32 and Precision of FP16

**DNN Sparsity Matrix** 

### AMP

#### Utilizing tensor cores with 3 lines of code



### AMP

#### **Automatic Mixed Precision**



# ENABLING AUTOMATIC MIXED PRECISION

#### Add Just A Few Lines of Code

PyTorch

• Two steps: initialization and wrapping backpropagation

```
from apex import amp
model = ...
optimizer = SomeOptimizer(model.parameters(), ...)
# ...
model, optimizer = amp.initialize(model, optimizer, opt_level="01")
# ...
for train_loop():
   loss = loss_fn(model(x), y)
   with amp.scale_loss(loss, optimizer) as scaled_loss:
      scaled_loss.backward()
   # Can manipulate the .grads if you'd like
   optimizer.step()
```

### NVIDIA DLPROF



### MULTIPLY-ADD OPERATIONS PER CLOCK PER SM

|                     | CUDA Cores |      |      |      | Tensor Cores |      |      |      |      |       |
|---------------------|------------|------|------|------|--------------|------|------|------|------|-------|
| NVIDIA Architecture | FP64       | FP32 | FP16 | INT8 | FP64         | TF32 | FP16 | INT8 | INT4 | INT1  |
| Volta               | 32         | 64   | 128  | 256  |              |      | 512  |      |      |       |
| Turing              | 2          | 64   | 128  | 256  |              |      | 512  | 1024 | 2048 | 8192  |
| Ampere (A100)       | 32         | 64   | 256  | 256  | 64           | 512  | 1024 | 2048 | 4096 | 16384 |
| Ampere, sparse      |            |      |      |      |              | 1024 | 2048 | 4096 | 8192 |       |

### **TF32 NUMERICAL REPRESENTATIONS**





DEEP LEARNING INFERENCE WITH NVIDIA GPUS

#### HUANG'S LAW



69 📀 NVIDIA.



## TENSORRT

Optimizations



## **KERNEL FUSION**

- Improve GPU utilization less kernel launch overhead, better memory usage and bandwidth
- Vertical fusion = Combine sequential kernel calls
- Horizontal fusion = Combine same kernels that have common input but different weights



#### **Un-Optimized Network**





## **KERNEL AUTO-TUNING**

- There are multiple low-level algorithms/implementations for common operations
- TensorRT selects the optimal kernels based on your parameters e.g. batch size, filter-size, input data size
- TensorRT selects the optimal kernel based on your target platform

# **CONVOLUTION ALGORITHMS**

128x128x128x128 convolution, FP32, NCHW, Quadro GV100

| CUDNN_CONVOLUTION_FWD_ALGO                       | 3 x 3       |           | 11 x 11     |           |
|--------------------------------------------------|-------------|-----------|-------------|-----------|
| CODAR_CONVOLUTION_FWD_ALGO                       | Performance | Workspace | Performance | Workspace |
| CUDNN_CONVOLUTION_FWD_ALGO_GEMM                  | 0.76 ms     | 72 MB     | 8.47 ms     | 968 MB    |
| CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM         | 0.62 ms     | None      | 6.82 ms     | None      |
| CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM | 0.47 ms     | 0.01 MB   | 6.58 ms     | 0.01 MB   |
| CUDNN_CONVOLUTION_FWD_ALGO_FFT                   | 45.3 ms     | 8322 MB   | 44.7 ms     | 8328 MB   |
| CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING            | 3.69 ms     | 70 MB     | 5.13 ms     | 70 MB     |
| CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD              | 0.26 ms     | 1.56 MB   | Unsupported |           |
| CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED     | 2.73 ms     | 578 MB    | Unsupported |           |



### SPARSE NEURAL NETWORKS

Synapse density over time



#### Synapse Density Over Time FIGURE 3

Source: Adapted from Corel, JL. The postnatal development of the human cerebral cortex. Cambridge, MA: Harvard University Press; 1975.

## PRUNING

#### The idea

The opportunity:

- Reduced memory bandwidth
- Reduced memory footprint
- Acceleration (especially in presence of hardware acceleration)







Tambe, T., Yang, E. Y., Wan, Z., Deng, Y., Reddi, V. J., Rush, A., ... & Wei, G. Y. (2019). AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference. arXiv preprint arXiv:1909.13271.

### CHERRY PICKING IN SPARSE MATRICES

- Memory operations are issued per warp (32 threads)
  - Just like all other instructions
- If only a single byte is needed -
  - 32 bytes will be issued, and only 1 will be used.





## **GOALS FOR A TRAINING RECIPE**

Maintains accuracy

Is applicable across various tasks, network architectures, and optimizers

Does not require hyper-parameter searches

## STRUCTURED SPARSITY

### **SPARSITY IN AMPERE**

#### At Most 2 Non-zeros in Every Contiguous Group of 4 Values







### 2:4 COMPRESSED MATRIX FORMAT

#### At most 2 non-zeros in every contiguous group of 4 values



Metadata: 2b per non-zero element

16b data => 12.5% overhead

8b data => 25% overhead

### FINE-GRAINED STRUCTURED SPARSITY IN AMPERE



### SPARSITY IN AMPERE GPUS

#### Fine-grained structured sparsity for Tensor Cores

- 50% fine-grained sparsity
- 2:4 pattern: 2 values out of each contiguous block of 4 must be 0

#### Addresses the 3 challenges:

- Accuracy: maintains accuracy of the original, unpruned network
  - Medium sparsity level (50%), fine-grained
- Training: a recipe shown to work across tasks and networks
- Speedup:
  - Specialized Tensor Core support for sparse math
  - Structured: lends itself to efficient memory utilization

#### 2:4 structured-sparse matrix



## NLP - LANGUAGE MODELING

#### Transformer-XL, BERT

|                |            |        |            | Accuracy    |             |  |  |  |
|----------------|------------|--------|------------|-------------|-------------|--|--|--|
| Network        | Task       | Metric | Dense FP16 | Sparse FP16 | Sparse INT8 |  |  |  |
| Transformer-XL | enwik8     | BPC    | 1.06       | 1.06 -      | -           |  |  |  |
| BERT-Base      | SQuAD v1.1 | F1     | 87.6       | 88.1 0.5    | 88.1 0.5    |  |  |  |
| BERT-Large     | SQuAD v1.1 | F1     | 91.1       | 91.5 0.4    | 91.5 0.4    |  |  |  |

## GENERATE A STRUCTURED SPARSE NETWORK

APEX's Automatic SParsity: ASP

```
import torch
from apex.contrib.sparsity import ASP
device = torch.device('cuda')
```

Init mask buffers, tell optimizer to mask weights and gradients, compute sparse masks: Universal Fine Tuning

model = TheModelClass(\*args, \*\*kwargs) # Define model structure model.load\_state\_dict(torch.load(`dense\_model.pth'))

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9) # Define optimizer

```
ASP.prune_trained_model(model, optimizer)
```

```
x, y = DataLoader(...) #load data samples and labels to train the model
for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

torch.save(model.state\_dict(), 'pruned\_model.pth') # checkpoint has weights and masks

## Summary

- NVIDIA is an accelerated computing platform
- Optimizing the entire stack from HW to applications
- \* "CUDA Everywhere" One Ring to Rule Them All!
- Hardware <-> Software Interactions for Optimal Performance

## THANK YOU!