They state in their report that they filter evaluation data out of their training data; see p. 3, "Filtering":
"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive
outputs."
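The report doesn't say how those contamination analyses actually work, but the usual approach in pre-training papers is n-gram overlap: index every n-gram in the evaluation sets, then drop (or flag) any training document that shares one. A minimal sketch of that idea; the n-gram size, helper names, and toy data below are my assumptions, not anything from the report:

```python
import re

NGRAM_SIZE = 13  # assumed; published decontamination setups typically use ~8-13 token windows

def ngrams(text: str, n: int = NGRAM_SIZE):
    """Yield word-level n-grams from lightly normalized text."""
    tokens = re.findall(r"\w+", text.lower())
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])

def build_eval_index(eval_examples):
    """Collect every n-gram appearing in any evaluation example."""
    index = set()
    for example in eval_examples:
        index.update(ngrams(example))
    return index

def is_contaminated(document: str, eval_index: set) -> bool:
    """Flag a training document that shares any n-gram with the eval set."""
    return any(gram in eval_index for gram in ngrams(document))

# Toy usage: filter a (hypothetical) pre-training corpus against one eval question.
eval_index = build_eval_index(
    ["Which planet in the solar system is known as the red planet? Mars."]
)
corpus = [
    "An unrelated web page about gardening and composting in raised beds.",
    "Quiz dump: which planet in the solar system is known as the red planet? Mars.",
]
clean_corpus = [doc for doc in corpus if not is_contaminated(doc, eval_index)]
print(len(clean_corpus))  # 1 -- the verbatim quiz dump gets dropped
```

The catch, and part of why contamination keeps slipping through anyway, is that exact n-gram matching misses paraphrases, translations, and reformatted versions of the same question.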
Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.