Handling Large Datasets
Tokenizer Training
Iterate over the Python file handle to read the data gradually:
# buffering defaults to -1, which usually works because Python picks
# an efficient block size for the hardware; here 8192 bytes is set explicitly.
with open("file.txt", "r", buffering=8192) as f:
    for line in f:
        # The file object is an iterator: lines are read on demand from an
        # internal buffer, so the whole file is never loaded at once.
        print(line)
For the encoder, use an iterator and yield results step by step, as in the sketch below.
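A minimal sketch of this streaming pattern, assuming a tokenizer object whose encode(text) method returns a list of token IDs (the tokenizer interface here is an assumption, not any specific library's API):

def encode_iterable(tokenizer, lines):
    # Yield token IDs one line at a time instead of encoding the whole
    # corpus up front, so memory use stays bounded by a single line.
    for line in lines:
        # tokenizer.encode is assumed to return a list of token IDs.
        for token_id in tokenizer.encode(line):
            yield token_id

# Usage: stream a large file through the tokenizer lazily.
# with open("file.txt", "r") as f:
#     for token_id in encode_iterable(tokenizer, f):
#         ...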
Transformer Training
Load the data (encoded token IDs) into memory gradually, not all at once (the LLaMA training data is 2.8 TB).
Use numpy.memmap to lazily load only the accessed parts into memory.
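A minimal sketch, assuming the token IDs were saved to disk as a flat uint16 binary file (the filename tokens.bin, the dtype, and the batch shapes are assumptions):

import numpy as np

# Open the token file without reading it; pages are pulled from disk
# only when the corresponding slices are actually accessed.
tokens = np.memmap("tokens.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size, context_length):
    # Sample random starting positions, then copy just those small
    # windows into ordinary in-memory arrays for one training step.
    starts = np.random.randint(0, len(tokens) - context_length - 1, size=batch_size)
    x = np.stack([np.asarray(tokens[s : s + context_length]) for s in starts])
    y = np.stack([np.asarray(tokens[s + 1 : s + 1 + context_length]) for s in starts])
    return x, y

Only batch_size * context_length token IDs are resident in memory at a time, no matter how large the file on disk is.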