python - using ctypes and SSE/AVX SOMETIMES segfaults

Question

I'm trying to optimize a piece of Python code using AVX. I'm using ctypes to call the C++ function. Sometimes the function segfaults and sometimes it doesn't. I think it may have something to do with alignment. Maybe someone can help me with this; I'm kind of stuck here.

Python-Code:

from ctypes import *
import numpy as np
#path_cnt
path_cnt = 16
c_path_cnt = c_int(path_cnt)

#ndarray1
ndarray1      = np.ones(path_cnt,dtype=np.float32,order='C')
ndarray1.setflags(align=1,write=1)
c_ndarray1     = ndarray1.ctypes.data_as(POINTER(c_float))

#ndarray2
ndarray2    = np.ones(path_cnt,dtype=np.float32,order='C')
ndarray2.setflags(align=1,write=1)
c_ndarray2  = ndarray2.ctypes.data_as(POINTER(c_float))

#call function
finance = cdll.LoadLibrary(".../libfin.so")
finance.foobar.argtypes = [c_void_p, c_void_p,c_int]
finance.foobar(c_ndarray1,c_ndarray2,c_path_cnt)
x=0
while x < path_cnt:   
    print c_ndarray1[x]
    x+=1

C++ Code

#include <immintrin.h>

extern "C"{
int foobar(float * ndarray1,float * ndarray2,int path_cnt)
 {
     for(int i=0;i<path_cnt;i=i+8)
     {
         __m256 arr1 = _mm256_load_ps(&ndarray1[i]);
         __m256 arr2 = _mm256_load_ps(&ndarray2[i]);
         __m256 add  = _mm256_add_ps(arr1,arr2);
         _mm256_store_ps(&ndarray1[i],add);
     }
     return 0;
 }
}

And now the odd output behavior: making the same call in the terminal twice gives different results!

tobias@tobias-Lenovo-U310:~/workspace/finance$ python finance.py 
Segmentation fault (core dumped)
tobias@tobias-Lenovo-U310:~/workspace/finance$ python finance.py 
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0

Thanks in advance!

score 6 · Accepted Answer

There are aligned and unaligned load instructions. The aligned ones fault if the address violates the alignment requirement (32 bytes for a 256-bit AVX load), but they can be faster. The unaligned ones accept any address and do loads/shifts internally to get the data you want. You are using the aligned version, _mm256_load_ps, and can just switch to the unaligned version, _mm256_loadu_ps, without any intermediate allocation. The same applies to the store: _mm256_store_ps requires an aligned address, and _mm256_storeu_ps is its unaligned counterpart.
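
If you would rather keep the aligned loads, the caller has to hand over 32-byte-aligned memory, which a plain np.ones call is not guaranteed to provide (hence the intermittent fault). Here is a minimal sketch of one way to do that from the Python side, assuming only the standard ctypes module. aligned_float_array is a hypothetical helper, not part of ctypes or NumPy: it over-allocates a byte buffer and places the float array at the first 32-byte boundary inside it.

```python
import ctypes

def aligned_float_array(n, alignment=32):
    """Return a ctypes float array of length n whose address is a
    multiple of `alignment` (hypothetical helper, for illustration)."""
    nbytes = n * ctypes.sizeof(ctypes.c_float)
    # Over-allocate so an aligned start address always fits inside the buffer.
    raw = (ctypes.c_char * (nbytes + alignment))()
    # Number of bytes to skip to reach the next aligned address.
    offset = (-ctypes.addressof(raw)) % alignment
    # View the aligned part of the buffer as floats (shares memory with raw).
    return (ctypes.c_float * n).from_buffer(raw, offset)

arr = aligned_float_array(16)
for i in range(len(arr)):
    arr[i] = 1.0
# The address handed to the C++ function is now a multiple of 32.
assert ctypes.addressof(arr) % 32 == 0
```

An array built this way can be passed to the C++ function directly, and np.frombuffer(arr, dtype=np.float32) should give a NumPy view of the same memory without copying.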

A good vectorizing compiler will include a lead-in loop to reach an aligned address, then a body to work on aligned data, then a final loop to clean up any stragglers.
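
The lead-in / body / clean-up split is just arithmetic on the start address. Here is a small sketch of how the three trip counts could be computed, assuming 4-byte floats, 32-byte alignment, and a start address that is itself a multiple of the element size. split_for_alignment is a hypothetical name used only for illustration; it is not what any particular compiler emits.

```python
def split_for_alignment(address, count, itemsize=4, alignment=32):
    """Split `count` elements starting at `address` into
    (head, body, tail) element counts.

    head: scalar iterations needed to reach an aligned address
    body: elements handled by full-width aligned vector iterations
    tail: leftover elements handled by a scalar clean-up loop
    Assumes `address` is a multiple of `itemsize`.
    """
    lanes = alignment // itemsize              # 8 floats per 256-bit vector
    head = min(count, ((-address) % alignment) // itemsize)
    body = (count - head) // lanes * lanes     # round down to whole vectors
    tail = count - head - body
    return head, body, tail

# Already aligned: everything goes through the vector body.
print(split_for_alignment(0, 16))    # (0, 16, 0)
# Start address 16 bytes past a boundary: 4 scalar, 8 vector, 4 scalar.
print(split_for_alignment(16, 16))   # (4, 8, 4)
```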

score 1 · Accepted Answer

All right, I think I found a solution. It's not very elegant, but at least it works! There should be a better way; does anyone have any suggestions?

#include <immintrin.h>

extern "C"{
// assumes path_cnt is a multiple of 8
int foobar(float * ndarray1,float * ndarray2,int path_cnt)
 {
     // 32-byte-aligned scratch buffers for the aligned AVX loads/stores
     float * test1 = (float*)_mm_malloc(path_cnt*sizeof(float),32);
     float * test2 = (float*)_mm_malloc(path_cnt*sizeof(float),32);
     //copy to aligned memory(this part is kinda stupid)
     for(int i=0;i<path_cnt;i++)
     {
         test1[i] = ndarray1[i];
         test2[i] = ndarray2[i];
     }
     for(int i=0;i<path_cnt;i=i+8)
     {
         __m256 arr1 = _mm256_load_ps(&test1[i]);
         __m256 arr2 = _mm256_load_ps(&test2[i]);
         __m256 add  = _mm256_add_ps(arr1,arr2);
         _mm256_store_ps(&test1[i],add);
     }
     //and copy everything back!
     for(int i=0;i<path_cnt;i++)
     {
         ndarray1[i] = test1[i];
     }
     // release the scratch buffers
     _mm_free(test1);
     _mm_free(test2);
     return 0;
 }
}
