I wrote the following Cython code that calculates pairwise distances à la scipy.spatial.distance.cdist.

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
cimport numpy as np
cimport cython
import numpy as np

from libc.math cimport sqrt

cdef inline double dist(double[:] a, double[:] b) nogil:
    return sqrt((a[0] - b[0])**2 +
                (a[1] - b[1])**2 +
                (a[2] - b[2])**2)

cpdef double [:, :] dists_inline(double[:, :] a, double[:, :] b):
    cdef unsigned int i, j
    cdef double[:, :] d = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            d[i, j] = dist(a[i], b[j])
    return np.asarray(d)

cpdef double [:, :] dists(double[:, :] a, double[:, :] b):
    cdef unsigned int i, j
    cdef double[:, :] d = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            d[i, j] = sqrt((a[i, 0] - b[j, 0])**2 +
                           (a[i, 1] - b[j, 1])**2 +
                           (a[i, 2] - b[j, 2])**2)
    return np.asarray(d)
With the following test data:

import numpy as np
from scipy.spatial.distance import cdist
neigh_k = np.random.random((14, 3))*2 - 1
neigh_k = np.vstack([[0, 0, 0], neigh_k])
k_points = np.random.random((100000, 3))*10 - 5
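For reference, the intended result can be sanity-checked against a plain NumPy broadcasting implementation of the same pairwise distance (a standalone sketch, independent of the Cython code above):

```python
import numpy as np

def dists_numpy(a, b):
    """Pairwise Euclidean distances via broadcasting (reference only)."""
    diff = a[:, None, :] - b[None, :, :]      # shape (len(a), len(b), 3)
    return np.sqrt((diff ** 2).sum(axis=-1))

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [0.0, 3.0, 4.0]])
d = dists_numpy(a, b)
```

This is slower than the Cython kernels but convenient for verifying their output on small inputs.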

It manages to achieve the same performance as scipy.spatial.distance.cdist:

%timeit cdist(neigh_k, k_points)
3.94 ms ± 458 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit dists(neigh_k, k_points)
3.74 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, when using the inline version of the function above, the performance drops by a factor of 10.

%timeit dists_inline(neigh_k, k_points)
41.3 ms ± 75.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As far as I understand, dists and dists_inline should generate almost the exact same binary.

(Similar question, where simply using memoryviews was the solution: Cython inline function with numpy array as parameter)

It works correctly (at full speed) if you instead pass the row indices and do the indexing inside the inlined function:

cdef inline double dist2(double[:, :] a, double[:, :] b, int i, int j) nogil:
    return sqrt((a[i, 0] - b[j, 0])**2 +
                (a[i, 1] - b[j, 1])**2 +
                (a[i, 2] - b[j, 2])**2)
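For completeness, the corresponding outer loop (my assumed sketch of how dist2 would be driven; the name dists_inline2 is not in the original code) looks like:

```
cpdef double[:, :] dists_inline2(double[:, :] a, double[:, :] b):
    cdef unsigned int i, j
    cdef double[:, :] d = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            d[i, j] = dist2(a, b, i, j)   # pass indices, no per-call slicing
    return np.asarray(d)
```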

In the original case, however, a memoryview slice is being constructed for every call. The annotated Cython output shows the following for the call to dist:

__pyx_t_13.data = __pyx_v_a.data;
__pyx_t_13.memview = __pyx_v_a.memview;
__PYX_INC_MEMVIEW(&__pyx_t_13, 0);
{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_a.strides[0];
    __pyx_t_13.data += __pyx_tmp_idx * __pyx_tmp_stride;
}
__pyx_t_13.shape[0] = __pyx_v_a.shape[1];
__pyx_t_13.strides[0] = __pyx_v_a.strides[1];
__pyx_t_13.suboffsets[0] = -1;

__pyx_t_14.data = __pyx_v_b.data;
__pyx_t_14.memview = __pyx_v_b.memview;
__PYX_INC_MEMVIEW(&__pyx_t_14, 0);
{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_j;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_b.strides[0];
    __pyx_t_14.data += __pyx_tmp_idx * __pyx_tmp_stride;
}
__pyx_t_14.shape[0] = __pyx_v_b.shape[1];
__pyx_t_14.strides[0] = __pyx_v_b.strides[1];
__pyx_t_14.suboffsets[0] = -1;

__pyx_t_15 = __pyx_v_i;
__pyx_t_16 = __pyx_v_j;
*((double *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_d.data + __pyx_t_15 * __pyx_v_d.strides[0]) ) + __pyx_t_16 * __pyx_v_d.strides[1]) )) = __pyx_f_46_cython_magic_2e95e60ce94e9b349e663b4db389bdfa_dist(__pyx_t_13, __pyx_t_14);
__PYX_XDEC_MEMVIEW(&__pyx_t_13, 1);
__pyx_t_13.memview = NULL;
__pyx_t_13.data = NULL;
__PYX_XDEC_MEMVIEW(&__pyx_t_14, 1);
__pyx_t_14.memview = NULL;
__pyx_t_14.data = NULL;

whereas in the second case only the call itself occurs:

__pyx_t_13 = __pyx_v_i;
__pyx_t_14 = __pyx_v_j;
*((double *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_d.data + __pyx_t_13 * __pyx_v_d.strides[0]) ) + __pyx_t_14 * __pyx_v_d.strides[1]) )) = __pyx_f_46_cython_magic_2e95e60ce94e9b349e663b4db389bdfa_dist2(__pyx_v_a, __pyx_v_b, __pyx_v_i, __pyx_v_j);

The performance does not improve even if I explicitly declare C-contiguousness.

I would expect it to essentially compile to the following if C-contiguousness is declared:

cdef inline double dist3(double* a, double* b) nogil:
    return sqrt((a[0] - b[0])**2 +
                (a[1] - b[1])**2 +
                (a[2] - b[2])**2)

cpdef double [:, ::1] dists_inline3(double[:, ::1] a, double[:, ::1] b):
    cdef unsigned int i, j
    cdef double[:, ::1] d = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            d[i, j] = dist3(&a[i,0], &b[j,0])
    return np.asarray(d)

This works and has the expected performance.

1 Answer

The problem seems to come from the reference counting of the memoryviews. More specifically, the macros __PYX_INC_MEMVIEW and __PYX_XDEC_MEMVIEW can be particularly expensive. Based on this post, we can see that disabling the GIL can provide a substantial speed-up due to the way the implementation works, especially if the function is properly inlined (which is not guaranteed). It turns out that dist is nogil but the two callers are not, so Cython may generate less efficient code because of the missing nogil keyword.
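The per-call overhead has a rough analogue at the plain NumPy level: indexing a row with a[i] allocates a fresh view object (with its own reference count) on every iteration, whereas a[i, 0] reads the scalar directly. A minimal sketch of the two access patterns (timings are machine-dependent, so only correctness is shown):

```python
import numpy as np

a = np.random.random((1000, 3))

def sum_with_slices(a):
    # a[i] builds a new view object every iteration (allocation + refcount)
    s = 0.0
    for i in range(a.shape[0]):
        row = a[i]
        s += row[0] + row[1] + row[2]
    return s

def sum_with_indices(a):
    # direct scalar indexing, no intermediate view objects
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i, 0] + a[i, 1] + a[i, 2]
    return s
```

Both functions compute the same value; timing them with %timeit shows the slicing variant paying a per-iteration object cost, analogous to the memoryview construction in the generated C above.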

Besides, the faster implementation is still not efficient because the memory layout is not SIMD-friendly: the compiler cannot easily generate SIMD code that fully uses your hardware, so I expect it to emit a slow scalar implementation. To fix this, you can transpose your input so it is stored as x x x x ... y y y y ... z z z z in memory instead of x y z x y z ... x y z. With the former layout, the compiler can read four x values with one instruction (assuming your machine supports AVX, which most modern x86-64 processors do), do the same for y and z, and perform the subtractions, squares, and square roots four items at a time. The resulting code can be up to four times faster. Some machines can operate on even wider SIMD registers (e.g. those supporting AVX-512).
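The suggested transpose can be illustrated in NumPy (a toy sketch of the layout only, not the Cython kernel): each coordinate component becomes one contiguous run in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 3))   # AoS layout: x y z x y z ...
b = rng.random((7, 3))

# AoS pairwise distances (reference)
d_aos = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

# SoA layout: transpose so memory holds x x x ... y y y ... z z z
at = np.ascontiguousarray(a.T)   # shape (3, 5), each component row contiguous
bt = np.ascontiguousarray(b.T)
d_soa = np.sqrt((at[0][:, None] - bt[0][None, :]) ** 2
                + (at[1][:, None] - bt[1][None, :]) ** 2
                + (at[2][:, None] - bt[2][None, :]) ** 2)
```

Both layouts produce identical results; the claim in the answer concerns only how easily a compiler can vectorize the inner loop over each contiguous component run.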

  • Thanks for the answer, but I think neither part is correct. 1) Even with various nogil encapsulations of the outer function or loop, this does not change the calling signature of the inlined function as far as I can tell. 2) Changing the memory layout does not change the performance, neither for the slow inlined case, nor for the fast case.
    – mueslo
    Commented Apr 27, 2023 at 9:37
