Position paper accepted to ICML arguing that benchmarking is limited and that additional types of experimentation are needed.