Resource Accounting
Floating Point Formats
- float32 - fp32 (full precision): 8 exponent bits (dynamic range), 23 mantissa bits
- float16 - fp16 (half precision): 5 exponent bits (dynamic range), 10 mantissa bits
- bf16 (Google Brain): 8 exponent bits (same dynamic range as fp32), 7 mantissa bits
- fp8 (Nvidia): E4M3 and E5M2 variants, supported on H100
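As a quick check, torch.finfo exposes the range/precision trade-off of these formats (a minimal sketch; fp8 dtypes are omitted since their availability depends on the PyTorch version):

import torch

# Compare dynamic range (max) and precision (eps = smallest step above 1).
# More exponent bits -> larger max; more mantissa bits -> smaller eps.
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.2e} eps={info.eps:.2e}")

# torch.float32   max=3.40e+38 eps=1.19e-07
# torch.float16   max=6.55e+04 eps=9.77e-04
# torch.bfloat16  max=3.39e+38 eps=7.81e-03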
Tensors in PyTorch
A tensor is a pointer to allocated storage plus metadata (shape, strides); the multi-dimensional data is laid out linearly (contiguously) in memory.
These operations don't copy storage:
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
x[0]           # indexing / slicing returns a view
x.T            # transpose returns a view
x.view(3, 2)   # view: same storage reinterpreted with a new shape
# The following fails because x.T is non-contiguous:
x.T.view(2, 3)  # raises RuntimeError (use .reshape() or .contiguous() instead)
Be careful when mutating tensors: all views change, too.
Element-wise operations create new storage.
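A minimal sketch of both points, using data_ptr() to check whether two tensors share storage (variable names are just for illustration):

import torch

x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
v = x.T                               # a view: same storage, different strides
x[0, 0] = 100.
print(v[0, 0])                        # tensor(100.) -- the view sees the mutation

y = x + 1                             # element-wise op: allocates new storage
print(x.data_ptr() == v.data_ptr())   # True  (view shares storage)
print(x.data_ptr() == y.data_ptr())   # False (new storage)

# A non-contiguous view can be copied into contiguous memory if needed:
z = x.T.contiguous().view(2, 3)       # works, at the cost of a copy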
Tensor Operation Cost
FLOP: a floating-point operation, e.g. a + b or a * b.
Two notations:
- FLOPs: a count of floating-point operations (amount of compute).
- FLOPS or FLOP/s: floating-point operations per second (speed of compute).
Some example peak (promised) throughputs. Note that these figures mix precisions: the 3060 Ti number is FP32, while the A100 and H100 numbers are tensor-core BF16 figures (the H100 one with sparsity):
- 3060 Ti: 16.2 TFLOP/s
- A100: 312 TFLOP/s [~20x a 3060 Ti]
- H100: 1979 TFLOP/s [~6x an A100]
Training GPT-3 (2020) takes roughly \(3.14 \times 10^{23}\) FLOPs. At 100% utilization of the promised FLOP/s, that works out to:
- 3060 Ti: 615 Years
- A100: 32 years
- H100: 5 years
If I wanted to train GPT-3 at home in one month, I would need roughly 8,000 3060 Tis.
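The arithmetic above is just total FLOPs divided by promised FLOP/s; a quick sketch (assuming 100% utilization):

SECONDS_PER_YEAR = 365 * 24 * 3600
gpt3_flops = 3.14e23

# Promised peak throughput in FLOP/s for each GPU (from the list above)
peaks = {"3060 Ti": 16.2e12, "A100": 312e12, "H100": 1979e12}

for name, flop_per_s in peaks.items():
    years = gpt3_flops / flop_per_s / SECONDS_PER_YEAR
    print(f"{name}: {years:.0f} years on a single GPU")   # 615 / 32 / 5

# GPUs needed to finish in one month on 3060 Tis:
one_month = SECONDS_PER_YEAR / 12
print(gpt3_flops / (16.2e12 * one_month))   # ~7400 at 100% utilization, i.e. roughly 8,000 in practice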
GPT-4 (2023) is speculated to have taken about \(2 \times 10^{25}\) FLOPs, roughly 60x GPT-3.
FLOPs
Matrix multiplication dominates the FLOPs: multiplying an \(m \times k\) matrix by a \(k \times n\) matrix takes \(2 \times m \times n \times k\) FLOPs (one multiply and one add per output element per inner-dimension step).
Element-wise operations on an \(m \times n\) matrix take \(O(m \times n)\) FLOPs.
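For concreteness, here are the two formulas evaluated for 1024-sized square matrices (pure counting, no timing):

m = n = k = 1024

matmul_flops = 2 * m * n * k   # one multiply and one add per (output element, inner step)
add_flops = m * n              # e.g. adding two m x n matrices element-wise

print(f"{matmul_flops:.1e}")   # 2.1e+09
print(f"{add_flops:.1e}")      # 1.0e+06 -- three orders of magnitude fewer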
- forward pass: \(2 \times N_{tokens} \times N_{parameters}\) FLOPs for a transformer.
- backward pass: \(4 \times N_{tokens} \times N_{parameters}\) FLOPs.
- Total: \(6 \times N_{tokens} \times N_{parameters}\) FLOPs (see the sanity check below).
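Sanity check of the total: GPT-3 has roughly 175B parameters and was trained on roughly 300B tokens, so

\[6 \times N_{tokens} \times N_{parameters} = 6 \times (300 \times 10^{9}) \times (175 \times 10^{9}) \approx 3.15 \times 10^{23} \ \text{FLOPs},\]

which matches the \(3.14 \times 10^{23}\) figure quoted above.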
Model FLOPs utilization (MFU)
The numerator counts only the useful model FLOPs actually completed per second; communication and other overhead contribute no FLOPs, so they simply lower MFU.
\[\text{MFU} = \frac{\text{actual FLOP/s}}{\text{promised FLOP/s}}\]
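A rough way to estimate MFU for a single large matmul (a sketch; the promised peak of 312e12 FLOP/s assumes an A100's BF16 tensor-core figure, so adjust it for your hardware):

import time
import torch

m = n = k = 8192
a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")

for _ in range(3):                 # warm-up
    a @ b
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.time() - start

actual = iters * 2 * m * n * k / elapsed   # achieved FLOP/s
promised = 312e12                          # assumed A100 BF16 peak
print(f"MFU = {actual / promised:.1%}")

In real training runs, the actual FLOP/s is usually taken as \(6 \times N_{tokens} \times N_{parameters}\) divided by wall-clock time, rather than timing individual matmuls.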