performance - Python matrix multiplication with numpy.dot() -


While getting acquainted with CUDA in Python (via the numba library), I implemented the following matrix multiplication methods:

  • plain numpy.dot()
  • Strassen algorithm with numpy.dot()
  • block method on the GPU
  • Strassen algorithm on the GPU

I tested them on two types of data:

  • numpy.random.randint(0, 5, (n, n)) # int32 elements
  • numpy.random.random((n, n)) # float64 elements
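For reference, a minimal benchmark of the numpy.dot() baseline on these two data types might look like the following (the matrix size and timing loop are illustrative, not the exact setup from the question):

```python
import time
import numpy as np

n = 512

# The two input types from the question:
a_int = np.random.randint(0, 5, (n, n))   # integer elements
b_int = np.random.randint(0, 5, (n, n))
a_f64 = np.random.random((n, n))          # float64 elements
b_f64 = np.random.random((n, n))

def bench(a, b, repeats=3):
    """Time numpy.dot on a pair of matrices, returning the best run."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.dot(a, b)
        best = min(best, time.perf_counter() - t0)
    return best

print("int     dot:", bench(a_int, b_int))
print("float64 dot:", bench(a_f64, b_f64))
```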

For int32 I obtained the expected result: the GPU algorithms performed better than CPU numpy. [benchmark plot: int32 timings]

However, on the float64 type, numpy.dot() outperformed the GPU methods. [benchmark plot: float64 timings]

So my question is: why is numpy.dot() so fast for float64 arrays, and does numpy use the GPU?

A typical installation of numpy is dynamically linked against a BLAS library, which provides routines for matrix-matrix and matrix-vector multiplication. For example, when you use np.dot() on a pair of float64 arrays, numpy calls the BLAS dgemm routine in the background. Although these library functions run on the CPU rather than the GPU, they are multithreaded and finely tuned for performance. A good BLAS implementation, such as MKL or OpenBLAS, is hard to beat in terms of performance, even on the GPU*.
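You can check which BLAS library your numpy build is linked against:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against
# (e.g. MKL or OpenBLAS). For float64 inputs, np.dot dispatches
# to the dgemm routine of whatever library is shown here.
np.show_config()
```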

However, BLAS only supports floating point types. If you call np.dot() on integer arrays, numpy falls back on a simple internal implementation, which is single-threaded and much slower than a BLAS dot on two floating point arrays.
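You can observe this fallback directly: casting integer inputs to float64 before calling np.dot is typically much faster than multiplying them as integers, because the float64 path goes through BLAS (exact ratios depend on your BLAS build and machine):

```python
import time
import numpy as np

n = 1000
a = np.random.randint(0, 5, (n, n))   # integer matrices: no BLAS path
b = np.random.randint(0, 5, (n, n))

t0 = time.perf_counter()
c_int = np.dot(a, b)                  # slow internal integer loop
t_int = time.perf_counter() - t0

t0 = time.perf_counter()
c_f64 = np.dot(a.astype(np.float64), b.astype(np.float64))  # BLAS dgemm
t_f64 = time.perf_counter() - t0

print(f"int dot    : {t_int:.4f} s")
print(f"float64 dot: {t_f64:.4f} s")

# The entries are small integers, so the float64 result is exact
# and the two products agree:
assert np.array_equal(c_int, c_f64.astype(c_int.dtype))
```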

Without knowing more about how you conducted your benchmarks, I would bet that a plain call to numpy.dot would also comfortably beat the other 3 methods for float32, complex64 and complex128 arrays, the other 3 types supported by BLAS.


* One possible way to beat the standard BLAS would be to use cuBLAS, a BLAS implementation that runs on an NVIDIA GPU. The scikit-cuda library seems to provide Python bindings for it, although I've never used it myself.

