performance - Python matrix multiplication with numpy.dot()
While getting acquainted with CUDA in Python (the numba library), I implemented the following matrix multiplication methods:
- plain numpy.dot()
- Strassen algorithm with numpy.dot()
- blocks method on the GPU (a minimal sketch of this one appears below)
- Strassen algorithm on the GPU
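For context, here is a minimal sketch of what the blocks method might look like with numba's CUDA support, using the standard shared-memory tiling pattern. The tile width TPB and the kernel name are my own choices for illustration, not necessarily what the benchmarked code used:

import numpy as np
from numba import cuda, float64

TPB = 16  # tile width / threads per block along each axis (a hypothetical choice)

@cuda.jit
def matmul_blocks(A, B, C):
    # One shared-memory tile of A and one of B per thread block
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float64)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float64)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    acc = 0.0
    ntiles = (A.shape[1] + TPB - 1) // TPB
    for t in range(ntiles):
        # Cooperatively load the current tiles, zero-padding at the edges
        if x < A.shape[0] and t * TPB + ty < A.shape[1]:
            sA[tx, ty] = A[x, t * TPB + ty]
        else:
            sA[tx, ty] = 0.0
        if t * TPB + tx < B.shape[0] and y < B.shape[1]:
            sB[tx, ty] = B[t * TPB + tx, y]
        else:
            sB[tx, ty] = 0.0
        cuda.syncthreads()
        for k in range(TPB):
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()

    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = acc

# Launch: one thread per output element
n = 1024
A = np.random.random((n, n))
B = np.random.random((n, n))
C = np.zeros((n, n))
bpg = ((n + TPB - 1) // TPB, (n + TPB - 1) // TPB)
matmul_blocks[bpg, (TPB, TPB)](A, B, C)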
I tested them on two types of data:

numpy.random.randint(0, 5, (n, n)) # int32 elements
numpy.random.random((n, n)) # float64 elements
For int32 I obtained the expected result: the GPU algorithms performed better than CPU numpy. However, on the float64 type, numpy.dot() outperformed the GPU methods.
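For reference, a minimal CPU-side timing sketch of this contrast (my own harness using time.perf_counter; the GPU kernels are omitted here):

import time
import numpy as np

def bench(fn, a, b, repeats=3):
    # Best-of-N wall-clock time
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(a, b)
        best = min(best, time.perf_counter() - t0)
    return best

n = 1024
a_int = np.random.randint(0, 5, (n, n))
a_flt = np.random.random((n, n))
print("int dot:  ", bench(np.dot, a_int, a_int))
print("float dot:", bench(np.dot, a_flt, a_flt))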
So the question is: why is numpy.dot() so fast on float64 arrays, and does numpy use the GPU?
A typical installation of numpy is dynamically linked against a BLAS library, which provides routines for matrix-matrix and matrix-vector multiplication. For example, when you use np.dot() on a pair of float64 arrays, numpy calls the BLAS dgemm routine in the background. Although these library functions run on the CPU rather than the GPU, they are multithreaded and finely tuned for performance. A good BLAS implementation, such as MKL or OpenBLAS, is hard to beat in terms of performance, even on the GPU*.
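You can check which BLAS your numpy was built against (the exact output format varies across numpy versions and installs):

import numpy as np
np.show_config()  # lists BLAS/LAPACK build info, e.g. MKL or OpenBLAS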
However, BLAS supports only floating point types. If you call np.dot() on integer arrays, numpy will fall back on using a simple internal C++ implementation, which is single-threaded and much slower than a BLAS dot on two floating point arrays.
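One practical consequence: casting integer matrices to float64 before the dot routes the work through the multithreaded dgemm. A sketch (exact for small values like these, since every product and partial sum fits in a float64 mantissa):

import numpy as np

a = np.random.randint(0, 5, (1024, 1024))
b = np.random.randint(0, 5, (1024, 1024))

c_slow = a.dot(b)  # internal single-threaded integer loop
c_fast = a.astype(np.float64).dot(b.astype(np.float64)).astype(a.dtype)  # BLAS dgemm

assert (c_slow == c_fast).all()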
Without knowing more about how you conducted your benchmarks, I would bet that a plain call to numpy.dot would also comfortably beat your other 3 methods for float32, complex64 and complex128 arrays, the other 3 types supported by BLAS.
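A quick way to check this claim on your own machine (a sketch; timings will vary by hardware and BLAS build):

import time
import numpy as np

n = 1024
for dtype in (np.float32, np.float64, np.complex64, np.complex128):
    a = np.ones((n, n), dtype=dtype)
    t0 = time.perf_counter()
    a.dot(a)  # dispatched to the BLAS gemm routine for this dtype
    print(dtype.__name__, time.perf_counter() - t0)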
* One possible way to beat standard BLAS is to use cuBLAS, a BLAS implementation that runs on an NVIDIA GPU. The scikit-cuda library seems to provide Python bindings for it, although I've never used it myself.
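For what it's worth, the scikit-cuda documentation suggests usage along these lines (untested by me; module and function names as given in their docs):

import numpy as np
import pycuda.autoinit               # initializes a CUDA context
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

linalg.init()
a = np.random.random((1024, 1024))
b = np.random.random((1024, 1024))
a_gpu = gpuarray.to_gpu(a)           # copy host arrays to the GPU
b_gpu = gpuarray.to_gpu(b)
c_gpu = linalg.dot(a_gpu, b_gpu)     # cuBLAS gemm under the hood
c = c_gpu.get()                      # copy the result back to the host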