c++ - Optimise byte operations CUDA
I am relatively new to CUDA and am trying to write a kernel which calculates the sum of absolute differences between a query vector and a large database of vectors. The elements of both must be 8-bit unsigned ints. I've based my kernel on NVIDIA's sample parallel reduction kernel, and I've also read this thread.
I am only getting about 5GB/s, which is not much better than a fast CPU and does not come close to the theoretical bandwidth of my GDDR5 GT640 of about 80GB/s.
My data set consists of a 1024-byte query vector and a database of 100,000 x 1024 bytes.
I have 100,000 blocks of 128 threads. If each block accesses the same 1024-byte query_vector, is that going to cause worse performance, since every block is accessing the same memory location?
The block size and shared memory size are set to 128 and 128*sizeof(int) respectively, where 128 is #define'd as THREADS_PER_BLOCK.
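For reference, the kernel is launched roughly like this (a simplified sketch based on the numbers above; the pointer names d_query_vector, d_db_vector and d_result are just placeholders):

// Assumed launch: 100,000 blocks in total, arranged as a 2D grid because the
// kernel derives db_linear_index from blockIdx.y and blockIdx.x, with
// 128 threads per block and 128*sizeof(int) bytes of dynamic shared memory.
dim3 grid( 1000, 100 );   // 1000 x 100 = 100,000 blocks (illustrative split)
reduction_sum_abs<THREADS_PER_BLOCK><<< grid, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(int) >>>( d_query_vector, d_db_vector, d_result );

The kernel: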
template<uint blockSize> __global__ void reduction_sum_abs( byte* query_vector, byte* db_vector, uint32_t* result )
{
    extern __shared__ uint sum[];
    uint db_linear_index = (blockIdx.y*gridDim.x) + blockIdx.x ;
    uint i = threadIdx.x;

    sum[threadIdx.x] = 0;
    int* p_q_int = reinterpret_cast<int*>(query_vector);
    int* p_db_int = reinterpret_cast<int*>(db_vector);

    while( i < VECTOR_SIZE/4 ) {

        /* memory transaction */
        int q_int  = p_q_int[i];
        int db_int = p_db_int[db_linear_index*VECTOR_SIZE/4 + i];

        uchar4 a0 = *reinterpret_cast<uchar4*>(&q_int);
        uchar4 b0 = *reinterpret_cast<uchar4*>(&db_int);

        /* sum of absolute difference */
        sum[threadIdx.x] += abs( (int)a0.x - b0.x );
        sum[threadIdx.x] += abs( (int)a0.y - b0.y );
        sum[threadIdx.x] += abs( (int)a0.z - b0.z );
        sum[threadIdx.x] += abs( (int)a0.w - b0.w );

        i += THREADS_PER_BLOCK;
    }

    __syncthreads();

    if ( blockSize >= 128 ) { if ( threadIdx.x < 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 64]; } }

    /* reduce final warp */
    if ( threadIdx.x < 32 )
    {
        if ( blockSize >= 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 32]; } __syncthreads();
        if ( blockSize >= 32 ) { sum[threadIdx.x] += sum[threadIdx.x + 16]; } __syncthreads();
        if ( blockSize >= 16 ) { sum[threadIdx.x] += sum[threadIdx.x + 8 ]; } __syncthreads();
        if ( blockSize >= 8  ) { sum[threadIdx.x] += sum[threadIdx.x + 4 ]; } __syncthreads();
        if ( blockSize >= 4  ) { sum[threadIdx.x] += sum[threadIdx.x + 2 ]; } __syncthreads();
        if ( blockSize >= 2  ) { sum[threadIdx.x] += sum[threadIdx.x + 1 ]; } __syncthreads();
    }

    /* copy sum to global */
    if ( threadIdx.x == 0 )
    {
        result[db_linear_index] = sum[0];
    }
}
I can get a 4x bandwidth increase if I run the kernel with the 4 lines of code that do the actual absolute-difference calculation commented out. That obviously gives the wrong answer, but I believe at least a significant portion of the time is spent there.
Is it possible I am creating bank conflicts by the way I am accessing the bytes? If so, can I avoid the conflicts?
Is my usage of reinterpret_cast correct?
Is there a better method for doing 8-bit unsigned calculations?
What other optimisations can I make (I assume there are many, as I'm a complete novice)?
Thanks
EDIT:
My machine specs are as follows:
Windows XP 2002 SP3
Intel 6600 2.40GHz
2GB RAM
GT640 GDDR5 1GB
Visual C++ 2010 Express
It's good practice with questions like these to provide complete code that others can compile and run without having to add or change anything; generally speaking, that is expected here. Since your question is about performance, you should also include in your complete code your actual timing measurement methodology.
Fixing the errors:
There are at least 2 errors in your code, one of which @Jez has pointed out already. After this "partial reduction" step:
if ( blockSize >= 128 ) { if ( threadIdx.x < 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 64]; } }
we need a __syncthreads();
before proceeding with the remainder. With the above change, I was able to get your kernel to produce repeatable results that matched my naive host implementation. Also, since you have conditional code like the following, which does not evaluate the same way across the threadblock:
if ( threadIdx.x < 32 ) {
it is illegal to have a __syncthreads() statement inside that conditional code block:
if ( blockSize >= 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 32]; } __syncthreads();
(and likewise for the subsequent lines that do the same thing). It's recommended that you fix that. There are a few ways we can solve this, one of which is to switch to using a volatile-typed pointer to refer to the shared data. Since we are now operating within a warp, the volatile qualifier forces the compiler to do what we want:
volatile uint *vsum = sum;
if ( threadIdx.x < 32 )
{
    if ( blockSize >= 64 ) vsum[threadIdx.x] += vsum[threadIdx.x + 32];
    if ( blockSize >= 32 ) vsum[threadIdx.x] += vsum[threadIdx.x + 16];
    if ( blockSize >= 16 ) vsum[threadIdx.x] += vsum[threadIdx.x + 8 ];
    if ( blockSize >= 8  ) vsum[threadIdx.x] += vsum[threadIdx.x + 4 ];
    if ( blockSize >= 4  ) vsum[threadIdx.x] += vsum[threadIdx.x + 2 ];
    if ( blockSize >= 2  ) vsum[threadIdx.x] += vsum[threadIdx.x + 1 ];
}
The CUDA parallel reduction sample code and the associated PDF may be a good review for you.
Timing/perf analysis:
I happen to have a GT 640, a cc3.5 device. When I run bandwidthTest on it, I observe about 32GB/s for device-to-device transfers. That number is representative of a reasonable approximate upper bound on the bandwidth achievable when device kernels are accessing device memory. Also, when I add cudaEvent-based timing and build a sample code around what you have shown, with simulated data, I observe a throughput of around 16GB/s, not 5GB/s. So your actual measurement technique would be useful information here (in fact, a complete code is probably needed to analyze the differences between my timing of your kernel and your timing of it).
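For completeness, the event-based timing I used is essentially the following (a minimal sketch; the same calls appear in the full code further down):

// Time the kernel with CUDA events and convert to an effective bandwidth.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduction_sum_abs<THREADS_PER_BLOCK><<<NUM_DB_VEC, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(int)>>>(d_qvec, d_db_vec, d_res);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                // wait until the kernel has finished

float et = 0.0f;
cudaEventElapsedTime(&et, start, stop);    // elapsed time in milliseconds
float gbs = (NUM_DB_VEC*(float)VECTOR_SIZE)/(et*1000000.0f);  // bytes / (ms*1e6) = GB/s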
The question remains, then: can it be improved? (Assuming ~32GB/s is the approximate upper bound.)
Your questions:
Is it possible I am creating bank conflicts by the way I am accessing the bytes? If so, can I avoid the conflicts?
Since your kernel actually loads the bytes as 32-bit quantities (uchar4), and each thread loads an adjacent, consecutive 32-bit quantity, I don't believe there are any bank-conflict access problems with your kernel.
Is my usage of reinterpret_cast correct?
Yes, it appears to be correct (my sample code below, with the above-mentioned fixes, validates that the results produced by your kernel match a naive host-function implementation).
Is there a better method for doing 8-bit unsigned calculations?
There is, and in this case, as pointed out by @njuffa, the SIMD intrinsics can handle it, as it turns out, with a single instruction (__vsadu4(), see the sample code below).
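To make the intrinsic concrete, here is a minimal sketch (my own illustration, with a hypothetical kernel name, not part of the code below): __vsadu4 treats each 32-bit word as four packed unsigned bytes and returns the sum of the four absolute byte differences, so it replaces the four abs() lines in your loop body.

// Minimal sketch: per-word sum of absolute differences via the SIMD intrinsic.
// Each 32-bit word holds four packed unsigned bytes.
__global__ void sad4_demo(const unsigned int *a, const unsigned int *b,
                          unsigned int *out, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        out[idx] = __vsadu4(a[idx], b[idx]);  // == |a0-b0| + |a1-b1| + |a2-b2| + |a3-b3|
}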
What other optimisations can I make (I assume many, as I'm a complete novice)?
1. Use the cc3.0 warp-shuffle reduction method proposed by @MichalHosala.
2. Make use of the SIMD intrinsic __vsadu4() to simplify and improve the handling of byte quantities, as proposed by @njuffa.
3. Reorganize the database vector data into column-major storage (see the sketch below). This allows us to dispense with the ordinary parallel reduction method (even the one mentioned in item 1) and switch to a straight for-loop-read kernel, with one thread computing an entire vector comparison. This allows our kernel to reach approximately the memory bandwidth of the device in this case (cc3.5 GT640).
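To make item 3 concrete, the reorganization is just a transpose over 32-bit words, so that threads with consecutive idx values read consecutive words from global memory (a sketch only; the full version appears in the code below):

// Host-side sketch: convert the row-major byte database into column-major
// 32-bit words; word j of vector i is stored at element (j*NUM_DB_VEC + i).
for (int i = 0; i < NUM_DB_VEC; i++)
  for (int j = 0; j < VECTOR_SIZE/4; j++)
    h_db_vec_cm[(j*NUM_DB_VEC)+i] = *(reinterpret_cast<uint *>(h_db_vec + (i*VECTOR_SIZE)) + j);

// Device-side consequence: in the opt2 kernel, thread idx reads
// db_vector_cm[(i*NUM_DB_VEC)+idx], so consecutive threads access
// consecutive 32-bit words and the loads coalesce.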
Here is a code and a sample run, showing 3 implementations: your original implementation (plus the above-named "fixes" so that it produces correct results), an opt1 kernel which modifies yours to include items 1 and 2 from the list above, and an opt2 kernel which replaces yours with an approach using items 2 and 3 from the list above. According to my measurement, your kernel achieves 16GB/s, about half of the bandwidth of the GT640; the opt1 kernel runs at about 24GB/s (the increase coming in approximately equal parts from items 1 and 2 above); and the opt2 kernel, with the data reorganization, runs at approximately full bandwidth (36GB/s).
$ cat t574.cu
#include <stdio.h>
#include <stdlib.h>

#define THREADS_PER_BLOCK 128
#define VECTOR_SIZE 1024
#define NUM_DB_VEC 100000

typedef unsigned char byte;
typedef unsigned int uint;
typedef unsigned int uint32_t;

template<uint blockSize> __global__ void reduction_sum_abs( byte* query_vector, byte* db_vector, uint32_t* result )
{
    extern __shared__ uint sum[];
    uint db_linear_index = (blockIdx.y*gridDim.x) + blockIdx.x ;
    uint i = threadIdx.x;

    sum[threadIdx.x] = 0;
    int* p_q_int = reinterpret_cast<int*>(query_vector);
    int* p_db_int = reinterpret_cast<int*>(db_vector);

    while( i < VECTOR_SIZE/4 ) {

        /* memory transaction */
        int q_int  = p_q_int[i];
        int db_int = p_db_int[db_linear_index*VECTOR_SIZE/4 + i];

        uchar4 a0 = *reinterpret_cast<uchar4*>(&q_int);
        uchar4 b0 = *reinterpret_cast<uchar4*>(&db_int);

        /* sum of absolute difference */
        sum[threadIdx.x] += abs( (int)a0.x - b0.x );
        sum[threadIdx.x] += abs( (int)a0.y - b0.y );
        sum[threadIdx.x] += abs( (int)a0.z - b0.z );
        sum[threadIdx.x] += abs( (int)a0.w - b0.w );

        i += THREADS_PER_BLOCK;
    }

    __syncthreads();

    if ( blockSize >= 128 ) { if ( threadIdx.x < 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 64]; } }
    __syncthreads(); // **

    /* reduce final warp */
    if ( threadIdx.x < 32 )
    {
        if ( blockSize >= 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 32]; } __syncthreads();
        if ( blockSize >= 32 ) { sum[threadIdx.x] += sum[threadIdx.x + 16]; } __syncthreads();
        if ( blockSize >= 16 ) { sum[threadIdx.x] += sum[threadIdx.x + 8 ]; } __syncthreads();
        if ( blockSize >= 8  ) { sum[threadIdx.x] += sum[threadIdx.x + 4 ]; } __syncthreads();
        if ( blockSize >= 4  ) { sum[threadIdx.x] += sum[threadIdx.x + 2 ]; } __syncthreads();
        if ( blockSize >= 2  ) { sum[threadIdx.x] += sum[threadIdx.x + 1 ]; } __syncthreads();
    }

    /* copy sum to global */
    if ( threadIdx.x == 0 )
    {
        result[db_linear_index] = sum[0];
    }
}

__global__ void reduction_sum_abs_opt1( byte* query_vector, byte* db_vector, uint32_t* result )
{
    __shared__ uint sum[THREADS_PER_BLOCK];
    uint db_linear_index = (blockIdx.y*gridDim.x) + blockIdx.x ;
    uint i = threadIdx.x;

    sum[threadIdx.x] = 0;
    uint* p_q_int = reinterpret_cast<uint*>(query_vector);
    uint* p_db_int = reinterpret_cast<uint*>(db_vector);

    while( i < VECTOR_SIZE/4 ) {

        /* memory transaction */
        uint q_int  = p_q_int[i];
        uint db_int = p_db_int[db_linear_index*VECTOR_SIZE/4 + i];
        sum[threadIdx.x] += __vsadu4(q_int, db_int);
        i += THREADS_PER_BLOCK;
    }
    __syncthreads();
    // this reduction assumes THREADS_PER_BLOCK = 128
    if (threadIdx.x < 64) sum[threadIdx.x] += sum[threadIdx.x+64];
    __syncthreads();
    if ( threadIdx.x < 32 ) {
        unsigned localSum = sum[threadIdx.x] + sum[threadIdx.x + 32];
        for (int i = 16; i >= 1; i /= 2)
            localSum = localSum + __shfl_xor(localSum, i);
        if (threadIdx.x == 0) result[db_linear_index] = localSum;
    }
}

__global__ void reduction_sum_abs_opt2( byte* query_vector, uint* db_vector_cm, uint32_t* result)
{
    __shared__ uint qv[VECTOR_SIZE/4];
    if (threadIdx.x < VECTOR_SIZE/4) qv[threadIdx.x] = *(reinterpret_cast<uint *>(query_vector) + threadIdx.x);
    __syncthreads();
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    while (idx < NUM_DB_VEC){
        uint sum = 0;
        for (int i = 0; i < VECTOR_SIZE/4; i++)
            sum += __vsadu4(qv[i], db_vector_cm[(i*NUM_DB_VEC)+idx]);
        result[idx] = sum;
        idx += gridDim.x*blockDim.x;}
}

unsigned long compute_host_result(byte *qvec, byte *db_vec){
    unsigned long temp = 0;
    for (int i = 0; i < NUM_DB_VEC; i++)
        for (int j = 0; j < VECTOR_SIZE; j++)
            temp += (unsigned long) abs((int)qvec[j] - (int)db_vec[(i*VECTOR_SIZE)+j]);
    return temp;
}

int main(){

    float et;
    cudaEvent_t start, stop;
    byte *h_qvec, *d_qvec, *h_db_vec, *d_db_vec;
    uint32_t *h_res, *d_res;
    h_qvec   = (byte *)malloc(VECTOR_SIZE*sizeof(byte));
    h_db_vec = (byte *)malloc(VECTOR_SIZE*NUM_DB_VEC*sizeof(byte));
    h_res = (uint32_t *)malloc(NUM_DB_VEC*sizeof(uint32_t));
    for (int i = 0; i < VECTOR_SIZE; i++){
        h_qvec[i] = rand()%256;
        for (int j = 0; j < NUM_DB_VEC; j++) h_db_vec[(j*VECTOR_SIZE)+i] = rand()%256;}
    cudaMalloc(&d_qvec, VECTOR_SIZE*sizeof(byte));
    cudaMalloc(&d_db_vec, VECTOR_SIZE*NUM_DB_VEC*sizeof(byte));
    cudaMalloc(&d_res, NUM_DB_VEC*sizeof(uint32_t));
    cudaMemcpy(d_qvec, h_qvec, VECTOR_SIZE*sizeof(byte), cudaMemcpyHostToDevice);
    cudaMemcpy(d_db_vec, h_db_vec, VECTOR_SIZE*NUM_DB_VEC*sizeof(byte), cudaMemcpyHostToDevice);
    cudaEventCreate(&start); cudaEventCreate(&stop);

    // initial run
    cudaMemset(d_res, 0, NUM_DB_VEC*sizeof(uint32_t));
    cudaEventRecord(start);
    reduction_sum_abs<THREADS_PER_BLOCK><<<NUM_DB_VEC, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(int)>>>(d_qvec, d_db_vec, d_res);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpy(h_res, d_res, NUM_DB_VEC*sizeof(uint32_t), cudaMemcpyDeviceToHost);
    unsigned long h_result = 0;
    for (int i = 0; i < NUM_DB_VEC; i++) h_result += h_res[i];
    printf("1: et: %.2fms, bw: %.2fGB/s\n", et, (NUM_DB_VEC*VECTOR_SIZE)/(et*1000000));
    if (h_result == compute_host_result(h_qvec, h_db_vec)) printf("Success!\n");
    else printf("1: mismatch!\n");

    // optimized kernel 1
    cudaMemset(d_res, 0, NUM_DB_VEC*sizeof(uint32_t));
    cudaEventRecord(start);
    reduction_sum_abs_opt1<<<NUM_DB_VEC, THREADS_PER_BLOCK>>>(d_qvec, d_db_vec, d_res);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpy(h_res, d_res, NUM_DB_VEC*sizeof(uint32_t), cudaMemcpyDeviceToHost);
    h_result = 0;
    for (int i = 0; i < NUM_DB_VEC; i++) h_result += h_res[i];
    printf("2: et: %.2fms, bw: %.2fGB/s\n", et, (NUM_DB_VEC*VECTOR_SIZE)/(et*1000000));
    if(h_result == compute_host_result(h_qvec, h_db_vec)) printf("Success!\n");
    else printf("2: mismatch!\n");

    // convert db_vec to column-major storage for optimized kernel 2
    uint *h_db_vec_cm, *d_db_vec_cm;
    h_db_vec_cm = (uint *)malloc(NUM_DB_VEC*(VECTOR_SIZE/4)*sizeof(uint));
    cudaMalloc(&d_db_vec_cm, NUM_DB_VEC*(VECTOR_SIZE/4)*sizeof(uint));
    for (int i = 0; i < NUM_DB_VEC; i++)
        for (int j = 0; j < VECTOR_SIZE/4; j++)
            h_db_vec_cm[(j*NUM_DB_VEC)+i] = *(reinterpret_cast<uint *>(h_db_vec + (i*VECTOR_SIZE))+j);
    cudaMemcpy(d_db_vec_cm, h_db_vec_cm, NUM_DB_VEC*(VECTOR_SIZE/4)*sizeof(uint), cudaMemcpyHostToDevice);
    cudaMemset(d_res, 0, NUM_DB_VEC*sizeof(uint32_t));
    cudaEventRecord(start);
    reduction_sum_abs_opt2<<<64, 512>>>(d_qvec, d_db_vec_cm, d_res);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpy(h_res, d_res, NUM_DB_VEC*sizeof(uint32_t), cudaMemcpyDeviceToHost);
    h_result = 0;
    for (int i = 0; i < NUM_DB_VEC; i++) h_result += h_res[i];
    printf("3: et: %.2fms, bw: %.2fGB/s\n", et, (NUM_DB_VEC*VECTOR_SIZE)/(et*1000000));
    if(h_result == compute_host_result(h_qvec, h_db_vec)) printf("Success!\n");
    else printf("3: mismatch!\n");
    return 0;
}
$ nvcc -O3 -arch=sm_35 -o t574 t574.cu
$ ./run35 t574
1: et: 6.34ms, bw: 16.14GB/s
Success!
2: et: 4.16ms, bw: 24.61GB/s
Success!
3: et: 2.83ms, bw: 36.19GB/s
Success!
$
A few notes:
- The above code, in particular your kernel, must be compiled for cc3.0 or higher, because of the way I have the test case set up. Since I am creating 100,000 blocks in a single 1D grid, it cannot be run as-is on a cc2.0 device, for example.
- There may be some additional slight tuning possible, especially if you run on a different device, for the opt2 kernel, by modifying the grid and block parameters. I have these set at 64 and 512, and these values should not be critical (although the block should have VECTOR_SIZE/4 threads or more), since the algorithm uses a grid-striding loop to cover the entire vector set. The GT640 has only 2 SMs, so in this case 64 threadblocks are enough to keep the device busy (probably 32 would be OK). You might want to modify these for maximum performance on larger devices.
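In case the term is unfamiliar, the grid-striding loop referred to above is simply the following pattern (a generic sketch of what the opt2 kernel already does):

// Generic grid-stride loop: the grid does not need to be sized to the data;
// each thread strides through the data set by the total number of threads.
int idx = threadIdx.x + blockDim.x * blockIdx.x;
while (idx < NUM_DB_VEC) {
    // ... process element idx ...
    idx += gridDim.x * blockDim.x;
}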