EDIT (I work at OpenAI and wrote the statement about the variance of the gradient being linear): Here's a more precise statement: the variance is exponential in the "difficulty" of the exploration problem. The harder the exploration, the worse the gradient. So while it is correct that things become easy if you assume exploration is easy, the more accurate way to interpret our result is that the combination of self-play and our shaped reward made the gradient variance manageable at the scale of compute we used.
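To make the "exponential in difficulty" claim concrete, here is a toy sketch (my own illustration, not the actual self-play setup): a chain task of length T where the reward is 1 only if all T actions are correct, with a REINFORCE-style gradient through a single shared logit. The relative variance of the gradient estimate, Var(g)/E[g]^2, blows up roughly like p^(-T) as the chain gets longer, i.e. exponentially in the depth of exploration required. The function name and parameters are hypothetical.

```python
import numpy as np

def relative_grad_variance(T, p=0.5, n=200_000, seed=0):
    """Monte-Carlo estimate of Var(g) / E[g]^2 for a REINFORCE gradient
    on a toy chain task: reward 1 only if all T actions are correct.
    (Illustrative sketch only, not the production training setup.)"""
    rng = np.random.default_rng(seed)
    correct = rng.random((n, T)) < p              # each step correct w.p. p
    reward = correct.all(axis=1).astype(float)    # sparse terminal reward
    score = (correct - p).sum(axis=1)             # d log pi / d(shared logit)
    g = reward * score                            # per-sample gradient estimate
    return g.var() / g.mean() ** 2                # relative (normalized) variance
```

Running this for T=2 versus T=10 shows the relative variance growing by orders of magnitude, which is exactly why sparse-reward exploration makes policy gradients noisy and why reward shaping (which effectively shortens the chain) helps.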