Efficient parallelization of non-uniform fast multipole algorithms