c++ - Optimise byte operations CUDA -


i relatively new cuda , trying write kernel calculates sum of absolute differences between query vector , large database of vectors. elements of both must 8 bit unsigned ints. i've based kernel off nvidias sample parallel reduction kernel, i've read thread.

i getting 5gb/s not better fast cpu , not come close theoretical bandwidth of ddr5 gt640 of 80gb/s.

my data set consists of 1024 bytes query vector, 100,000 x 1024 bytes database

i have 100,000 blocks of 128 threads, if each block accesses same 1024 byte query_vector, going cause worse performance? since every block accessing same memory location.

blocksize , shared memory both set 128 , 128*sizeof(int), 128 #define'd threads_per_block

template<uint blocksize> __global__ void reduction_sum_abs( byte* query_vector, byte* db_vector, uint32_t* result ) {     extern __shared__ uint sum[];      uint db_linear_index = (blockidx.y*griddim.x) + blockidx.x ;      uint = threadidx.x;       sum[threadidx.x] = 0;       int* p_q_int = reinterpret_cast<int*>(query_vector);      int* p_db_int = reinterpret_cast<int*>(db_vector);       while( < vector_size/4 ) {          /* memory transaction */         int q_int = p_q_int[i];          int db_int = p_db_int[db_linear_index*vector_size/4 + i];           uchar4 a0 = *reinterpret_cast<uchar4*>(&q_int);          uchar4 b0 = *reinterpret_cast<uchar4*>(&db_int);           /* sum of absolute difference */          sum[threadidx.x] += abs( (int)a0.x - b0.x );          sum[threadidx.x] += abs( (int)a0.y - b0.y );          sum[threadidx.x] += abs( (int)a0.z - b0.z );          sum[threadidx.x] += abs( (int)a0.w - b0.w );           += threads_per_block;       }      __syncthreads();       if ( blocksize >= 128 ) {         if ( threadidx.x < 64 ) {              sum[threadidx.x] += sum[threadidx.x + 64];          }     }      /* reduce final warp */     if ( threadidx.x < 32 ) {                 if ( blocksize >= 64 ) { sum[threadidx.x] += sum[threadidx.x + 32]; } __syncthreads();           if ( blocksize >= 32 ) { sum[threadidx.x] += sum[threadidx.x + 16]; } __syncthreads();           if ( blocksize >= 16 ) { sum[threadidx.x] += sum[threadidx.x + 8 ]; } __syncthreads();           if ( blocksize >= 8  ) { sum[threadidx.x] += sum[threadidx.x + 4 ]; } __syncthreads();           if ( blocksize >= 4  ) { sum[threadidx.x] += sum[threadidx.x + 2 ]; } __syncthreads();           if ( blocksize >= 2  ) { sum[threadidx.x] += sum[threadidx.x + 1 ]; } __syncthreads();       }       /* copy sum global */     if ( threadidx.x == 0 ) {         result[db_linear_index] = sum[0];      } } 

i can 4x bandwith increase if run kernel 4 lines of code actual absolute difference calculation commented out, results in wrong answer, believe @ least significant portion of time spent there.

is possible creating bank conflicts way accessing bytes? if can avoid conflicts?

is usage of reinterpret_cast correct?

is there better method doing 8 bit unsigned calculations?

what other (i assume many, i'm complete novice) optimisations can make?

thanks

edit:

my machine specs follows:

windows xp 2002 sp3

intel 6600 2.40ghz

2gb ram

gt640 gddr5 1gb

visual c++ 2010 express

it's practice questions these provide complete code can compile , run, without having add or change anything. speaking, expects this. since question performance, should include in complete code actual timing measurement methodology.

fixing errors:

there @ least 2 errors in code, 1 of @jez has pointed out already. after "partial reduction" step:

if ( blocksize >= 128 ) {     if ( threadidx.x < 64 ) {          sum[threadidx.x] += sum[threadidx.x + 64];      } } 

we need __syncthreads(); before proceeding remainder. above change, able kernel produce repeatable results matched naive host implementation. also, since have conditional code not evaluate same across threadblock:

if ( threadidx.x < 32 ) {   

it illegal have __syncthreads() statement inside conditional code block:

  if ( blocksize >= 64 ) { sum[threadidx.x] += sum[threadidx.x + 32]; } __syncthreads();  

(and likewise subsequent lines same thing). it's recommended fix that. there few ways can solve this, 1 of switch using volatile typed pointer refer shared data. since operating within warp, volatile qualifier forces compiler want:

volatile uint *vsum = sum; if ( threadidx.x < 32 ) {             if ( blocksize >= 64 ) vsum[threadidx.x] += vsum[threadidx.x + 32];     if ( blocksize >= 32 ) vsum[threadidx.x] += vsum[threadidx.x + 16];      if ( blocksize >= 16 ) vsum[threadidx.x] += vsum[threadidx.x + 8 ];     if ( blocksize >= 8  ) vsum[threadidx.x] += vsum[threadidx.x + 4 ];     if ( blocksize >= 4  ) vsum[threadidx.x] += vsum[threadidx.x + 2 ];      if ( blocksize >= 2  ) vsum[threadidx.x] += vsum[threadidx.x + 1 ]; } 

the cuda parallel reduction sample code , associated pdf may review you.

timing/perf analysis:

i happen have gt 640, cc3.5 device. when run bandwidthtest on it, observe 32gb/s device-to-device transfers. number representative of reasonable approximate upper bound on achievable bandwidth when device kernels accessing device memory. also, when add cudaevent based timing , build sample code around have shown, simulated data, observe throughput of around 16gb/s, not 5gb/s. actual measurement technique useful info here (in fact, complete code needed analyze differences between timing of kernel, , timing).

the question remains, then, can improved? (assuming ~32gb/s approximate upper bound).

your questions:

is possible creating bank conflicts way accessing bytes? if can avoid conflicts?

since kernel loads bytes 32-bit quantity (uchar4), , each thread loading adjacent, consecutive 32-bit quantity, don't believe there bank-conflict access problems kernel.

is usage of reinterpret_cast correct?

yes, appears correct (my sample code below, above mentioned fixes, validates results produced kernel match naive host function implementation.)

is there better method doing 8 bit unsigned calculations?

there is, , in case, pointed out @njuffa, simd intrinsics can handle this, turns out, single instruction (__vsadu4(), see sample code below).

what other (i assume many, i'm complete novice) optimisations can make?

  1. use cc3.0 warp-shuffle reduction method proposed @michalhosala

  2. make use of simd instrinsic __vsadu4() simplify , improve handling of byte quantities proposed @njuffa.

  3. reorganize database vector data in column-major storage. allows dispense ordinary parallel reduction method (even mentioned in item 1) , switch straight for-loop read kernel, 1 thread computing entire vector comparison. allows our kernel reach approximately memory bandwidth of device in case (cc3.5 gt640).

here code , sample run, showing 3 implementations: original implementation (plus above named "fixes" produce correct results), opt1 kernel modifies yours include items 1 , 2 list above, , opt2 kernel replaces yours approach utilizing 2 , 3 list above. according measurement, kernel achieves 16gb/s, half of bandwidth of gt640, opt1 kernel runs @ 24gb/s (the increase coming in approximately equal parts items 1 , 2 above), , opt2 kernel, data-reorganization, runs @ approximately full bandwidth (36gb/s).

$ cat t574.cu #include <stdio.h> #include <stdlib.h> #define threads_per_block 128 #define vector_size 1024 #define num_db_vec 100000  typedef unsigned char byte; typedef unsigned int uint; typedef unsigned int uint32_t;   template<uint blocksize> __global__ void reduction_sum_abs( byte* query_vector, byte* db_vector, uint32_t* result ) {     extern __shared__ uint sum[];     uint db_linear_index = (blockidx.y*griddim.x) + blockidx.x ;     uint = threadidx.x;      sum[threadidx.x] = 0;      int* p_q_int = reinterpret_cast<int*>(query_vector);     int* p_db_int = reinterpret_cast<int*>(db_vector);      while( < vector_size/4 ) {          /* memory transaction */         int q_int = p_q_int[i];         int db_int = p_db_int[db_linear_index*vector_size/4 + i];          uchar4 a0 = *reinterpret_cast<uchar4*>(&q_int);         uchar4 b0 = *reinterpret_cast<uchar4*>(&db_int);          /* sum of absolute difference */         sum[threadidx.x] += abs( (int)a0.x - b0.x );         sum[threadidx.x] += abs( (int)a0.y - b0.y );         sum[threadidx.x] += abs( (int)a0.z - b0.z );         sum[threadidx.x] += abs( (int)a0.w - b0.w );          += threads_per_block;      }      __syncthreads();      if ( blocksize >= 128 ) {         if ( threadidx.x < 64 ) {             sum[threadidx.x] += sum[threadidx.x + 64];         }     }     __syncthreads(); // **     /* reduce final warp */     if ( threadidx.x < 32 ) {         if ( blocksize >= 64 ) { sum[threadidx.x] += sum[threadidx.x + 32]; } __syncthreads();          if ( blocksize >= 32 ) { sum[threadidx.x] += sum[threadidx.x + 16]; } __syncthreads();          if ( blocksize >= 16 ) { sum[threadidx.x] += sum[threadidx.x + 8 ]; } __syncthreads();          if ( blocksize >= 8  ) { sum[threadidx.x] += sum[threadidx.x + 4 ]; } __syncthreads();          if ( blocksize >= 4  ) { sum[threadidx.x] += sum[threadidx.x + 2 ]; } __syncthreads();          if ( blocksize >= 2  ) { sum[threadidx.x] += sum[threadidx.x + 1 ]; } __syncthreads();      }       /* copy sum global */     if ( threadidx.x == 0 ) {         result[db_linear_index] = sum[0];     } }  __global__ void reduction_sum_abs_opt1( byte* query_vector, byte* db_vector, uint32_t* result ) {   __shared__ uint sum[threads_per_block];   uint db_linear_index = (blockidx.y*griddim.x) + blockidx.x ;   uint = threadidx.x;    sum[threadidx.x] = 0;    uint* p_q_int = reinterpret_cast<uint*>(query_vector);   uint* p_db_int = reinterpret_cast<uint*>(db_vector);    while( < vector_size/4 ) {      /* memory transaction */     uint q_int = p_q_int[i];     uint db_int = p_db_int[db_linear_index*vector_size/4 + i];     sum[threadidx.x] += __vsadu4(q_int, db_int);      += threads_per_block;      }   __syncthreads();   // reduction assumes threads_per_block = 128   if (threadidx.x < 64) sum[threadidx.x] += sum[threadidx.x+64];   __syncthreads();    if ( threadidx.x < 32 ) {     unsigned localsum = sum[threadidx.x] + sum[threadidx.x + 32];     (int = 16; >= 1; /= 2)       localsum = localsum + __shfl_xor(localsum, i);     if (threadidx.x == 0) result[db_linear_index] = localsum;     } }  __global__ void reduction_sum_abs_opt2( byte* query_vector, uint* db_vector_cm, uint32_t* result) {   __shared__ uint qv[vector_size/4];   if (threadidx.x < vector_size/4) qv[threadidx.x] = *(reinterpret_cast<uint *>(query_vector) + threadidx.x);   __syncthreads();   int idx = threadidx.x + blockdim.x*blockidx.x;   while (idx < num_db_vec){     uint sum = 0;     (int = 0; < vector_size/4; i++)       sum += __vsadu4(qv[i], db_vector_cm[(i*num_db_vec)+idx]);     result[idx] = sum;     idx += griddim.x*blockdim.x;} }  unsigned long compute_host_result(byte *qvec, byte *db_vec){    unsigned long temp = 0;   (int =0; < num_db_vec; i++)     (int j = 0; j < vector_size; j++)       temp += (unsigned long) abs((int)qvec[j] - (int)db_vec[(i*vector_size)+j]);   return temp; }  int main(){    float et;   cudaevent_t start, stop;   byte *h_qvec, *d_qvec, *h_db_vec, *d_db_vec;   uint32_t *h_res, *d_res;   h_qvec =   (byte *)malloc(vector_size*sizeof(byte));   h_db_vec = (byte *)malloc(vector_size*num_db_vec*sizeof(byte));   h_res = (uint32_t *)malloc(num_db_vec*sizeof(uint32_t));   (int = 0; < vector_size; i++){     h_qvec[i] = rand()%256;     (int j = 0; j < num_db_vec; j++) h_db_vec[(j*vector_size)+i] = rand()%256;}   cudamalloc(&d_qvec, vector_size*sizeof(byte));   cudamalloc(&d_db_vec, vector_size*num_db_vec*sizeof(byte));   cudamalloc(&d_res, num_db_vec*sizeof(uint32_t));   cudamemcpy(d_qvec, h_qvec, vector_size*sizeof(byte), cudamemcpyhosttodevice);   cudamemcpy(d_db_vec, h_db_vec, vector_size*num_db_vec*sizeof(byte), cudamemcpyhosttodevice);   cudaeventcreate(&start); cudaeventcreate(&stop);  // initial run    cudamemset(d_res, 0, num_db_vec*sizeof(uint32_t));   cudaeventrecord(start);   reduction_sum_abs<threads_per_block><<<num_db_vec, threads_per_block, threads_per_block*sizeof(int)>>>(d_qvec, d_db_vec, d_res);   cudaeventrecord(stop);   cudadevicesynchronize();   cudaeventsynchronize(stop);   cudaeventelapsedtime(&et, start, stop);   cudamemcpy(h_res, d_res, num_db_vec*sizeof(uint32_t), cudamemcpydevicetohost);   unsigned long h_result = 0;   (int = 0; < num_db_vec; i++) h_result += h_res[i];   printf("1: et: %.2fms, bw: %.2fgb/s\n", et, (num_db_vec*vector_size)/(et*1000000));   if (h_result == compute_host_result(h_qvec, h_db_vec)) printf("success!\n");   else printf("1: mismatch!\n");  // optimized kernel 1   cudamemset(d_res, 0, num_db_vec*sizeof(uint32_t));   cudaeventrecord(start);   reduction_sum_abs_opt1<<<num_db_vec, threads_per_block>>>(d_qvec, d_db_vec, d_res);   cudaeventrecord(stop);   cudadevicesynchronize();   cudaeventsynchronize(stop);   cudaeventelapsedtime(&et, start, stop);   cudamemcpy(h_res, d_res, num_db_vec*sizeof(uint32_t), cudamemcpydevicetohost);   h_result = 0;   (int = 0; < num_db_vec; i++) h_result += h_res[i];   printf("2: et: %.2fms, bw: %.2fgb/s\n", et, (num_db_vec*vector_size)/(et*1000000));   if(h_result == compute_host_result(h_qvec, h_db_vec)) printf("success!\n");   else printf("2: mismatch!\n");  // convert db_vec column-major storage optimized kernel 2    uint *h_db_vec_cm, *d_db_vec_cm;   h_db_vec_cm = (uint *)malloc(num_db_vec*(vector_size/4)*sizeof(uint));   cudamalloc(&d_db_vec_cm, num_db_vec*(vector_size/4)*sizeof(uint));   (int = 0; < num_db_vec; i++)     (int j = 0; j < vector_size/4; j++)       h_db_vec_cm[(j*num_db_vec)+i] = *(reinterpret_cast<uint *>(h_db_vec + (i*vector_size))+j);   cudamemcpy(d_db_vec_cm, h_db_vec_cm, num_db_vec*(vector_size/4)*sizeof(uint), cudamemcpyhosttodevice);   cudamemset(d_res, 0, num_db_vec*sizeof(uint32_t));   cudaeventrecord(start);   reduction_sum_abs_opt2<<<64, 512>>>(d_qvec, d_db_vec_cm, d_res);   cudaeventrecord(stop);   cudadevicesynchronize();   cudaeventsynchronize(stop);   cudaeventelapsedtime(&et, start, stop);   cudamemcpy(h_res, d_res, num_db_vec*sizeof(uint32_t), cudamemcpydevicetohost);   h_result = 0;   (int = 0; < num_db_vec; i++) h_result += h_res[i];   printf("3: et: %.2fms, bw: %.2fgb/s\n", et, (num_db_vec*vector_size)/(et*1000000));   if(h_result == compute_host_result(h_qvec, h_db_vec)) printf("success!\n");   else printf("3: mismatch!\n");    return 0; }  $ nvcc -o3 -arch=sm_35 -o t574 t574.cu $ ./run35 t574 1: et: 6.34ms, bw: 16.14gb/s success! 2: et: 4.16ms, bw: 24.61gb/s success! 3: et: 2.83ms, bw: 36.19gb/s success! $ 

a few notes:

  1. the above code, in particular kernel, must compiled cc3.0 or higher, way have test case set up. because creating 100,000 blocks in single 1d grid, cannot run as-is on cc2.0 device, example.
  2. there may additional slight tuning possible, if run on different devices, opt2 kernel, modifying grid , block parameters. have these set @ 64 , 512, , these values should not critical (although block should vector_size/4 threads or greater), since algorithm uses grid-striding loop cover entire vector set. gt640 has 2 sm's, in case 64 threadblocks enough keep device busy (probably 32 ok). might want modify these max performance on larger devices.

Comments

Popular posts from this blog

php - Submit Form Data without Reloading page -

linux - Rails running on virtual machine in Windows -