2012-04-12
stristr and findiStr with SSE4.2
2012-04-04
Quick search algorithm and strstr
Quick Search algorithm(Qs) is a string search algorithm. Qs is a simplification of Boyer-Moore algorithm and faster than it.
cf. http://www-igm.univ-mlv.fr/~lecroq/string/node19.html
So I measured the speed of Qs on Core 2 Duo and found it is faster than strstr for a key string with 4~6 bytes or longer length.
But latest Intel CPU such as Sandy Bridge(Core i7-2600K) or Xeon 5650 has SSE4.2 (string and test processing instructions). So I compared strstr_sse42 with Qs then found strstr_sse42 is 2~4 times faster than Qs for a key string with at least 32-bytes length.
The test code is https://github.com/herumi/opti/blob/master/str_util_test.cpp.
My implementation of strstr is https://github.com/herumi/opti/blob/master/str_util.hpp. It requires Xbyak.
The main code is
/* strstr(a, key); input key use save_a, save_key, c */ movdqu(xm0, ptr [key]); // xm0 = *key L(".lp"); pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte] if (isSandyBridge) { lea(a, ptr [a + 16]); ja(".lp"); // if (CF == 0 and ZF = 0) goto .lp } else { jbe(".headCmp"); // if (CF == 1 or ZF == 1) goto .headCmp add(a, 16); jmp(".lp"); L(".headCmp"); } jnc(".notFound"); ...
It is slower to use lea with pcmpistri on Xeon 5650 but a little faster on Sandy Bridge, so I check the type of CPU and generate better code according to each CPU.
I make a slide for detail, so please see http://www.slideshare.net/herumi/x86opti3.