
Looking for efficient way to convert float (32 bit) aligned buffer to short (16 bit) aligned buffer


I wrote a C version and an AVX version of a routine that converts an aligned buffer of 1920*1280*3 elements from float to short.
The AVX implementation is 3 times slower than the C code.
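
The C version is not shown in this post. For comparison, here is a minimal sketch of what a scalar baseline could look like. The names float_to_short_sat and float2short_scalar are hypothetical, and the rounding and NaN handling are assumptions: this sketch rounds ties away from zero and saturates to the 16-bit range, whereas _mm256_cvtps_epi32 rounds ties to even under the default MXCSR setting, so results can differ on exact .5 values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert one float to a saturated 16-bit integer.
   Assumption: round to nearest, ties away from zero. */
static int16_t float_to_short_sat(float x)
{
    if (x != x)                 /* NaN: arbitrary choice for this sketch */
        return 0;
    if (x >= 32767.0f)          /* saturate high, like _mm_packs_epi32 */
        return INT16_MAX;
    if (x <= -32768.0f)         /* saturate low */
        return INT16_MIN;

    /* Adding 0.5 in double is exact for any in-range float,
       so the truncating cast below rounds correctly. */
    double d = (double)x;
    return (int16_t)(long)(d + (d >= 0.0 ? 0.5 : -0.5));
}

/* Hypothetical scalar baseline: convert n floats to n shorts. */
static void float2short_scalar(const float *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = float_to_short_sat(in[i]);
}
```

A simple loop like this is often auto-vectorized by an optimizing compiler, which is worth keeping in mind when comparing timings against hand-written intrinsics.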

Here is the AVX code for float2short:

for (int i = numOfElems; i; --i, pOut += 3, pIn1 += 24, pIn2 += 24, pIn3 += 24)
{
    // Load 8 aligned floats from each input pointer and convert them to 32-bit ints
    __m256i intVec1 = _mm256_cvtps_epi32(_mm256_load_ps(pIn1));
    __m256i intVec2 = _mm256_cvtps_epi32(_mm256_load_ps(pIn2));
    __m256i intVec3 = _mm256_cvtps_epi32(_mm256_load_ps(pIn3));

    // Split each 256-bit vector into its 128-bit halves and pack
    // the 8 ints into 8 signed 16-bit values with saturation
    __m128i intVec1L = _mm256_extractf128_si256(intVec1, 0);
    __m128i intVec1H = _mm256_extractf128_si256(intVec1, 1);
    pOut[0] = _mm_packs_epi32(intVec1L, intVec1H);

    __m128i intVec2L = _mm256_extractf128_si256(intVec2, 0);
    __m128i intVec2H = _mm256_extractf128_si256(intVec2, 1);
    pOut[1] = _mm_packs_epi32(intVec2L, intVec2H);

    __m128i intVec3L = _mm256_extractf128_si256(intVec3, 0);
    __m128i intVec3H = _mm256_extractf128_si256(intVec3, 1);
    pOut[2] = _mm_packs_epi32(intVec3L, intVec3H);
}

As you can see, the main loop is unrolled three times, which gives a factor-of-3 speedup (without the unrolling, the C code is 9 times faster than the AVX version!).

