What is the difference between CUDA cores and Tensor Cores?

I am completely new to HPC terminology, but I just saw that AWS released a new type of EC2 instance powered by the new Nvidia Tesla V100, which has two kinds of "cores": CUDA cores (5,120) and Tensor Cores (640). What is the difference between the two?


Currently, only the Tesla V100 and Titan V have Tensor Cores. Both GPUs have 5120 CUDA cores, where each core can perform up to one single-precision multiply-accumulate operation (e.g. in fp32: x += y * z) per GPU clock (e.g. the Tesla V100 PCIe frequency is 1.38 GHz).
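To put those figures together, here is a rough back-of-the-envelope calculation in Python. It assumes every CUDA core issues one multiply-accumulate on every clock and counts each multiply-accumulate as two floating-point operations; the variable names are only illustrative.

```python
# Peak fp32 throughput implied by the figures quoted above (Tesla V100 PCIe).
# Assumption: every CUDA core issues one multiply-accumulate per clock.
cuda_cores = 5120
clock_hz = 1.38e9        # 1.38 GHz
flops_per_fma = 2        # one multiply plus one add

peak_fp32_tflops = cuda_cores * flops_per_fma * clock_hz / 1e12
print(f"Peak fp32 throughput: {peak_fp32_tflops:.1f} TFLOPS")
```

This is a theoretical upper bound; real kernels rarely keep every core busy on every clock.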

Each Tensor Core performs operations on small 4x4 matrices. Each Tensor Core can perform one matrix multiply-accumulate operation per GPU clock: it multiplies two 4x4 fp16 matrices and adds the 4x4 fp32 product to an accumulator (which is also a 4x4 fp32 matrix).
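As a rough illustration of what one such operation computes, the sketch below emulates D = A x B + C for 4x4 matrices in plain Python. It does not model fp16/fp32 rounding (ordinary Python floats are used), and the function name mma_4x4 is made up for this example. It also counts the scalar multiply-accumulates involved and derives the peak rate implied by the numbers above.

```python
def mma_4x4(a, b, c):
    """Return (d, fma_count) where d = a @ b + c for 4x4 lists of lists."""
    n = 4
    d = [[c[i][j] for j in range(n)] for i in range(n)]  # start from the accumulator
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]  # one scalar multiply-accumulate
                fma_count += 1
    return d, fma_count

a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d, fmas = mma_4x4(a, identity, c)
print(fmas)  # 64 scalar multiply-accumulates per 4x4 operation

# Peak rate implied by the figures above, assuming one 4x4 operation
# per Tensor Core per clock and two floating-point operations per FMA.
tensor_cores = 640
clock_hz = 1.38e9
peak_tensor_tflops = tensor_cores * fmas * 2 * clock_hz / 1e12
print(f"Peak Tensor Core throughput: {peak_tensor_tflops:.1f} TFLOPS")
```

Under these assumptions a single Tensor Core does the work of 64 scalar multiply-accumulates per clock, which is why 640 of them can outrun 5120 CUDA cores on matrix workloads.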

It is called mixed precision because the input matrices are fp16 but the multiplication result and the accumulator are fp32 matrices.
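To see why the fp32 accumulator matters, here is a small sketch that emulates rounding with Python's struct module (format "e" is IEEE half precision, "f" is single precision). It is an emulation under simplifying assumptions, not how the hardware is implemented: it sums 4096 products of fp16 inputs, as might happen when many small matrix operations are chained through the same accumulator, once with an fp32 accumulator and once with an fp16 one.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(xs, ys, round_acc):
    """Dot product whose running sum is rounded by round_acc after each step."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = round_acc(acc + x * y)
    return acc

# fp16 inputs: 4096 ones, so the exact answer is 4096.
xs = [to_fp16(1.0)] * 4096
ys = [to_fp16(1.0)] * 4096

fp32_acc = dot(xs, ys, to_fp32)  # mixed precision: fp16 inputs, fp32 accumulator
fp16_acc = dot(xs, ys, to_fp16)  # everything kept in fp16

print(fp32_acc)  # 4096.0
print(fp16_acc)  # 2048.0, because 2048 + 1 is not representable in fp16
```

The fp16 accumulator stalls at 2048 because fp16 has only 11 significant bits, while the fp32 accumulator reaches the exact result.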

A more accurate name would probably be just "4x4 matrix cores", but the NVIDIA marketing team decided to use "tensor cores".

GPUs have always been good for machine learning. GPU cores were originally designed for physics and graphics computation, which involves matrix operations. General computing tasks do not require many matrix operations, so CPUs are much slower at these. Physics and graphics are also far easier to parallelise than general computing tasks, leading to the high core count.

Because machine learning (neural nets) is so matrix-heavy, GPUs were a great fit. Tensor Cores are simply more heavily specialised for the types of computation involved in machine learning software (such as TensorFlow).

Nvidia has written a detailed blog post that goes into far more detail on how Tensor Cores work and the performance improvements over CUDA cores.

Tensor Cores need far less computation power than CUDA cores for the same work, at the expense of precision, but that loss of precision has little effect on the final output.

This is why, for machine learning models, Tensor Cores are more effective at reducing cost without changing the output much.

Google itself uses its Tensor Processing Units (TPUs) for Google Translate.
