Resource Accounting

Floating-point formats

  1. float32 / fp32 (full precision): 8 exponent bits of dynamic range, 23 mantissa bits.
  2. float16 / fp16 (half precision): only 5 exponent bits of dynamic range.
  3. bf16 (Google Brain float): 8 exponent bits, i.e. the same dynamic range as fp32, but less precision (7 mantissa bits).
  4. fp8 (Nvidia): comes in E4M3 and E5M2 variants; see the comparison sketch below.
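
A minimal PyTorch sketch comparing these formats via torch.finfo (the fp8 dtypes only exist in recent PyTorch builds, so they are omitted here):

import torch

# Compare dynamic range (max representable value) vs. precision (eps).
# bf16 matches fp32's range but is much coarser; fp16 is finer than bf16
# but overflows past ~65k.
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):>14}  max={info.max:.2e}  eps={info.eps:.2e}")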

Tensors in PyTorch

A tensor is a pointer to allocated storage plus metadata (shape and strides). The underlying data is laid out linearly (contiguously) in memory, regardless of how many dimensions the tensor has.
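
For instance (a small PyTorch sketch), the strides expose how a 2-D tensor maps onto its linear storage:

import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(x.data_ptr())       # address of the underlying storage
print(x.stride())         # (3, 1): step 3 elements for the next row, 1 for the next column
print(x.is_contiguous())  # True: the rows sit one after another in memory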

These operations don't copy storage; they return views:

import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
x[0]           # indexing, slicing
x.T            # transpose
x.view(2, 3)   # view

# The following raises an error because x.T's memory layout is non-contiguous.
x.T.view(2, 3)
# Calling .contiguous() first copies the data into a contiguous layout:
x.T.contiguous().view(2, 3)

Be careful when mutating tensors: all views of the same storage change, too.
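
For example, continuing with x from the block above:

y = x[0]        # a view into x's storage, no copy
y[0] = 100      # mutating the view...
print(x[0, 0])  # ...also changes x: tensor(100)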

Element-wise operations create new storage.
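
A quick check, again reusing x from above:

z = x + 1                            # element-wise add allocates a new tensor
print(z.data_ptr() == x.data_ptr())  # False: z has its own storage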

Tensor Operation Cost

FLOP: floating-point operation, e.g. a + b or a * b.

Two notations:

  1. FLOPs: the number of floating-point operations.
  2. FLOPS or FLOP/s: floating-point operations per second, i.e. the speed of computation.

Taking the advertised peak throughput of a few GPUs as examples (the 3060 Ti figure is for FP32; the A100 and H100 figures are the vendors' headline tensor-core numbers):

3060 Ti: 16.2 TFLOP/s

A100: 312 TFLOP/s [~20 x 3060 Ti]

H100: 1979 TFLOP/s [~6 x A100]

To train GPT-3 (2020), you need \(3.14 \times 10^{23}\) FLOPs.

If I wanted to train GPT-3 at home in one month, I would need roughly 8,000 3060 Tis.
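
A back-of-envelope check, assuming 100% utilization of the 3060 Ti's peak (a real run would need more GPUs):

total_flops = 3.14e23                 # GPT-3 training compute
flops_per_gpu = 16.2e12               # 3060 Ti peak FP32 FLOP/s
seconds_per_month = 30 * 24 * 3600
gpus_needed = total_flops / (flops_per_gpu * seconds_per_month)
print(round(gpus_needed))             # ~7500, i.e. on the order of 8000 GPUs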

GPT-4 (2023) is speculated to have taken \(2 \times 10^{25}\) FLOPs, roughly 100x GPT-3.

FLOPs

Matrix multiplication (this dominates the total FLOPs): multiplying an \(m \times k\) matrix by a \(k \times n\) matrix takes \(2 \times m \times n \times k\) FLOPs, since each of the \(m \times n\) outputs needs \(k\) multiplies and \(k\) adds.

Element-wise operations on an \(m \times n\) matrix require \(O(m \times n)\) FLOPs.
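
A concrete comparison for illustrative shapes (an \(m \times k\) by \(k \times n\) product):

m, k, n = 1024, 2048, 512
matmul_flops = 2 * m * n * k            # one multiply + one add per inner-product term
elementwise_flops = m * n               # e.g. adding a bias or applying a ReLU to the output
print(matmul_flops, elementwise_flops)  # 2147483648 vs 524288: the matmul dominates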

  1. Forward pass: \(2 \times N_{tokens} \times N_{parameters}\) FLOPs for a transformer.
  2. Backward pass: \(4 \times N_{tokens} \times N_{parameters}\) FLOPs for a transformer.
  3. Total: \(6 \times N_{tokens} \times N_{parameters}\) FLOPs, as checked in the sketch below.
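
Plugging in GPT-3's commonly cited numbers (roughly 175B parameters and 300B training tokens) recovers the training-compute figure quoted earlier; this is a sanity check, not an exact accounting:

n_params = 175e9                           # GPT-3 parameters
n_tokens = 300e9                           # GPT-3 training tokens
forward_flops = 2 * n_tokens * n_params    # ~1.05e23
backward_flops = 4 * n_tokens * n_params   # ~2.1e23
total_flops = 6 * n_tokens * n_params
print(f"{total_flops:.2e}")                # 3.15e+23, close to the 3.14e23 above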

Model FLOPs utilization (MFU)

MFU ignores communication and other overhead.

\[MFU = \frac{\text{actual FLOP/s}}{\text{promised FLOP/s}}\]
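
A minimal sketch of computing MFU, estimating the actual FLOP/s from the \(6 \times N_{tokens} \times N_{parameters}\) rule above; the function and the example numbers are hypothetical:

def mfu(tokens_per_second, n_params, promised_flops_per_second):
    """Model FLOPs utilization: achieved model FLOP/s over the hardware peak."""
    actual_flops_per_second = 6 * tokens_per_second * n_params
    return actual_flops_per_second / promised_flops_per_second

# e.g. a 175e9-parameter model training at 150 tokens/s on one A100
# (312e12 FLOP/s peak) gives an MFU of about 0.5.
print(mfu(tokens_per_second=150, n_params=175e9, promised_flops_per_second=312e12))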