Handling Large Datasets
Tokenizer Training
Iterate over the Python file handle to read the data gradually:
# buffering defaults to -1, which usually works because Python picks
# an efficient block size for the hardware; here 8192 bytes is set explicitly.
with open("file.txt", "r", buffering=8192) as f:
    for line in f:
        # The file object is an iterator: lines are read on demand from an
        # internal buffer, so the whole file is never loaded at once.
        print(line)
For the encoder, use an iterator and yield results step by step, as in the sketch below.
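A minimal sketch of this streaming pattern, assuming a tokenizer object whose encode(text) method returns a list of token IDs (the tokenizer interface here is an assumption, not any specific library's API):

def encode_iterable(tokenizer, lines):
    # Yield token IDs one line at a time instead of encoding the whole
    # corpus up front, so memory use stays bounded by a single line.
    for line in lines:
        # tokenizer.encode is assumed to return a list of token IDs.
        for token_id in tokenizer.encode(line):
            yield token_id

# Usage: stream a large file through the tokenizer lazily.
# with open("file.txt", "r") as f:
#     for token_id in encode_iterable(tokenizer, f):
#         ...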
Transformer Training
Load the data (encoded token IDs) into memory gradually, not all at once (the LLaMA training data is 2.8 TB).
Use numpy.memmap to lazily load only the accessed parts into memory.
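A minimal sketch, assuming the token IDs were saved to disk as a flat uint16 binary file (the filename tokens.bin, the dtype, and the batch shapes are assumptions):

import numpy as np

# Open the token file without reading it; pages are pulled from disk
# only when the corresponding slices are actually accessed.
tokens = np.memmap("tokens.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size, context_length):
    # Sample random starting positions, then copy just those small
    # windows into ordinary in-memory arrays for one training step.
    starts = np.random.randint(0, len(tokens) - context_length - 1, size=batch_size)
    x = np.stack([np.asarray(tokens[s : s + context_length]) for s in starts])
    y = np.stack([np.asarray(tokens[s + 1 : s + 1 + context_length]) for s in starts])
    return x, y

Only batch_size * context_length token IDs are resident in memory at a time, no matter how large the file on disk is.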