Conversation

karolherbst 🐧 🦀

CPU code optimization is always fun. These statements are all true:

- Doing more isn't slower than doing less
- Simpler code isn't faster than complex code
- Executing only one logical branch in an if-else isn't faster than executing both branches and picking the right result later
- Branching isn't slower than not branching at all

Premature optimization also includes the "obvious" things, because you'll be wrong if you don't benchmark them.

It always depends on the exact thing.

And if this doesn't make sense to you, you simply don't know enough about CPUs

@karolherbst I think the hardest part is just how difficult it is to actually measure things correctly. Throwing a loop and a timer around some code != a useful measurement much of the time. :'(

@dotstdy yeah.. I skip the micro-benchmarking stuff and just check why apps are slow, or whether they have pointless CPU overhead I can get rid of somehow.

@karolherbst yeah mostly obvious issues are more about pointless work amplification than they are about microarchitecture, which is either good news or bad news depending on how you feel I guess xd

@dotstdy well the thing is that most of the time those overheads get reduced by adding more code. It's really rare that you end up deleting a bunch of code to make things faster.

Though microarch stuff matters for your super hot paths and then it becomes a bit of a pain to deal with random microarch nonsense.

@karolherbst

- for all of the above, which is faster may depend on the data

@amonakov and also on where the data is

@karolherbst

- The thing that you measured to be faster will be the cause of a slowdown 2-3 CPU generations later.
@karolherbst

The point "Branching isn't slower than not branching at all" in particular confused my intuition, and actual measurement plus asm analysis proved me wrong.

The case study was my attempt to optimize a comparator function with multiple criteria: cascaded -1/0/1 comparison results. The hypothesis was that encoding the three-state results of the comparisons into one int value would be faster than a series of ifs. How wrong I was: the ifs always won, even in an artificial worst case (all ifs wrong), versus a fixed sequence of bit encoding returning one int as a "strcmp"-style result.

The analysis was based on llvm-mca (instruction-level analysis, a really great tool providing insights into microarch optimizations). It was fun to reinvent new bit tricks only to find them on https://graphics.stanford.edu/~seander/bithacks.html . I guess branch prediction works well even for hard-to-predict data like arrays being sorted.
@karolherbst

> Doing more isn't slower than doing less
> Simpler code isn't faster than complex code

I don't know if state-of-the-art tools are capable of that, but the points seem to correspond to human evaluation (counting lines of code, or CPU instructions as an abstraction) rather than to a model (even an imperfect one) that's closer to what actually happens.

The naive evaluation is that 1 line of code has a cost of 1; at a lower level, 1 asm instruction costs 1 unit. Then there's the memory access cost, not obvious from the code, which depends on caching, prefetch, temporal/spatial locality, and instruction ordering.

The best idea I have so far is a model based on physics, with cost assigned by local state (nearby instructions and their effects) via some sort of energy function, then global state (memory access, cache levels), and the dependencies or ordering constraints. This is too complex for a human to grasp, which is probably why we're more likely to be wrong than right.

A smart compiler can afford to spend CPU cycles finding a good solution to the problem: code -> instructions, reformulated in any way that gives the best results in a reasonable time.

@kdave @karolherbst it is a bit difficult for me to talk to complete strangers about this, but drawing conclusions about how branchy code will compare to branchless code from llvm-mca does not sound like a good idea

(and while we're at it, if you like llvm-mca, please let me point you to uica.uops.info, it has a great deal of research behind it (see the link on top, including a dissertation), and is more accurate)

@kdave @karolherbst the mental model you use for ballpark estimation of this kind of thing doesn't actually need to be particularly complex to be useful. mostly you're just worried about branches, memory operations, and instruction use. if you can structure things to not be a branchy mess, and to access memory efficiently, you're already well ahead of the pack. alas the compiler's ability to modify your code in this way is extremely limited, so it's also about knowing what makes the compiler happy.

@karolherbst optimizing control flow to target the branch prediction hardware is the common thread of some of this. If branches are predictable they can be nearly free. If you can’t make them predictable then consider some kind of branchless trick ..

@acsawdey @karolherbst My favourite counterexample of this came from someone on the iOS team. When they got the first samples of a new chip, they found that one of their hot code paths got a lot slower. After a lot of digging, they discovered that the old chip had been mispredicting a branch. This caused a pipeline flush, which hurt a little bit but, as a side effect, pulled some data into cache. The newer CPU had a better branch predictor and so didn’t do the mispredict. They then hit the instruction that needed the data and had a much larger stall while they waited for main memory.

Once they’d discovered this, they were able to add a prefetch hint at the place where they used to see the branch mispredict and fix the issue.

@david_chisnall @acsawdey memory latency masking on the CPU is really an art of its own. On the GPU it's easy, because you have 10000 other threads the hw can switch to, but in your single-threaded hot path on the CPU? Good luck finding good heuristics that aren't terrible across a bunch of different CPUs.
