Complexity comparison of SIMD-based algorithm implementations

An answer to this question on Stack Overflow.

Question

I am just curious to know: what is the way to measure the complexity of an algorithm implemented with two different SIMD technologies, say SSE3 and CUDA? Normally we compare algorithm complexity with Big-O notation. Is there any such way to compare runtime improvement with SIMD?

If somebody asks how much an algorithm will improve if you run it on a GPU, can you measure it theoretically, without running the benchmark on both CPU and GPU?

Note: I understand what Big-O is. So, all I want to know is how SSE3 performs compared to CUDA, or to a CPU-based implementation of the same algorithm, without raw benchmarking.

Answer

Big-O generally speaks about how an algorithm scales as the number of items considered, N, grows.

So, for an O(N) algorithm, if you have 10 times the data, you expect roughly 10 times the run-time. For an O(N log₂ N) algorithm, 10 times the data costs somewhat more than 10 times the work, because the log factor grows too: going from N = 1,000 to N = 10,000, for example, multiplies N log₂ N by about 13.3.
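To make the scaling concrete, here is a small sketch (the function and variable names are my own, just for illustration) that computes the work ratio when the input grows by a given factor:

```python
import math

def work_ratio(n, factor, complexity):
    """Ratio of work needed for factor*n items vs. n items,
    for a given complexity function."""
    return complexity(factor * n) / complexity(n)

linear = lambda n: n                     # O(N)
n_log_n = lambda n: n * math.log2(n)     # O(N log N)

print(work_ratio(1_000, 10, linear))     # → 10.0
print(work_ratio(1_000, 10, n_log_n))    # → ~13.33
```

Note that the O(N log N) ratio depends on where you start: the larger N already is, the closer the ratio gets to 10.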

Roughly speaking, CUDA and GPUs parallelize operations across p cores. The total work done is then given by W = p·t, where p is the number of cores and t is the time each core spends. You could, for instance, sort N items using √N processors, each doing O(√N log N) work, so the whole sort finishes in O(√N log N) time. The total work is still O(N log N), and the algorithm is called "work-efficient" because its total work matches (or is less than) that of the best known serial algorithm. So Big-O analysis can tell you something about how algorithms run in parallel environments.

However, it is not the tool you are looking for. Big-O is for talking about how algorithms scale: it can't tell us how an algorithm will perform on different hardware.

For a theoretical approach to performance, you'll want to look into Amdahl's law and Gustafson's law, which provide simple metrics of how much improvement we can expect in a program when its resources are increased. They both boil down to acknowledging that the total increase in speed is a function of the portion of the program we can speed up and the amount we can speed it up by.
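Both laws fit in a couple of lines. Here is a sketch (parameter names are my own):

```python
def amdahl_speedup(parallel_fraction, p):
    """Amdahl's law: speedup from p cores when only `parallel_fraction`
    of the run-time can be parallelized (fixed problem size)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / p)

def gustafson_speedup(parallel_fraction, p):
    """Gustafson's law: scaled speedup when the problem size
    grows along with the number of cores."""
    return (1.0 - parallel_fraction) + parallel_fraction * p

# Even with 1024 cores, a 5% serial portion caps Amdahl speedup near 20x:
print(amdahl_speedup(0.95, 1024))     # → ~19.6
print(gustafson_speedup(0.95, 1024))  # → ~972.9
```

The contrast between the two outputs is the whole point: Amdahl assumes you keep the problem fixed and the serial fraction dominates; Gustafson assumes you grow the problem to fill the extra cores.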

Can you do better than that? Yes. There are more sophisticated models you can use to determine how your code might perform in different environments; looking into the roofline model might get you started.
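The core of the roofline model is a single min(): a kernel's attainable throughput is capped either by the machine's peak compute rate or by memory bandwidth times the kernel's arithmetic intensity (FLOPs per byte moved). A sketch, using made-up machine numbers rather than any real CPU or GPU spec:

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline model: attainable GFLOP/s is the lesser of the compute
    ceiling and the memory ceiling (bandwidth * FLOPs-per-byte)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# Illustrative numbers only (not a real machine's spec sheet):
peak, bw = 500.0, 50.0   # 500 GFLOP/s peak, 50 GB/s memory bandwidth
print(roofline_gflops(0.25, peak, bw))  # → 12.5 (memory-bound kernel)
print(roofline_gflops(64.0, peak, bw))  # → 500.0 (compute-bound kernel)
```

This is why the same algorithm can land in very different places on a CPU versus a GPU: the two ceilings differ, and so does where the kernel's arithmetic intensity falls relative to them.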

Beyond this, your question gets too specific to be answered well here. Running the benchmarks is likely to be your best option.