Conversation

karolherbst 🐧 🦀

CPU code optimization is always fun. These statements are all true:

- Doing more isn't slower than doing less
- Simpler code isn't faster than complex code
- Executing only one logical branch in an if-else isn't faster than executing both branches and picking the right result later
- Branching isn't slower than not branching at all

Premature optimization also includes the "obvious" things, because you'll be wrong if you don't benchmark them.

It always depends on the exact thing.

And if this doesn't make sense to you, you simply don't know enough about CPUs

@karolherbst I think the hardest part is just how difficult it is to actually measure things correctly. Throwing a loop and a timer around some code != a useful measurement much of the time. :'(

@dotstdy yeah.. I skip the micro-benchmarking stuff and just check why apps are slow, or whether they have pointless CPU overhead I can get rid of somehow.

@karolherbst yeah mostly obvious issues are more about pointless work amplification than they are about microarchitecture, which is either good news or bad news depending on how you feel I guess xd

@dotstdy well the thing is that most of the time those overheads get reduced by adding more code. It's really rare that you end up deleting a bunch of code to make things faster.

Though microarch stuff matters for your super hot paths and then it becomes a bit of a pain to deal with random microarch nonsense.

@karolherbst

- for all of the above, which is faster may depend on the data

@amonakov and also on where the data is

@karolherbst

- The thing that you measured to be faster will be the cause of a slowdown 2-3 CPU generations later.
@karolherbst

The point "Branching isn't slower than not branching at all" in particular confused my intuition, and actual measurement plus asm analysis proved me wrong.

The case study was my attempt to optimize a comparator function with multiple criteria: cascaded -1/0/1 comparison results. The hypothesis was that encoding the three-state results of the comparisons into one int value would be faster than a series of ifs. How wrong I was: the ifs always won, even in an artificial worst case (all ifs wrong), versus a fixed sequence of bit encoding returning one int as a "strcmp"-style result.

The analysis was based on llvm-mca (instruction-level analysis, a really great tool providing insights into microarch optimizations). It was fun to reinvent new bit tricks only to find them on https://graphics.stanford.edu/~seander/bithacks.html . I guess branch prediction works well even for hard-to-predict data like arrays being sorted.
@karolherbst

> Doing more isn't slower than doing less
> Simpler code isn't faster than complex code

I don't know if state-of-the-art tools are capable of that, but the points seem to correspond to human evaluation (counting lines of code, or CPU instructions as an abstraction) rather than to a model (even an imperfect one) that's closer to what actually happens.

The naive evaluation is that 1 line of code has a cost of 1; at a lower level, 1 asm instruction costs 1 unit. Then there's the memory access cost, not obvious from the code, which depends on caching, prefetch, temporal/spatial locality, and instruction ordering.

The best idea I have so far is a model based on physics, with cost assigned by local state (nearby instructions and their effects) via some sort of energy function, then global state (memory access, cache levels), and the dependencies or ordering constraints. This is too complex for a human to grasp, which is probably why we're more likely to be wrong than right.

A smart compiler can afford to spend CPU cycles finding a good solution to the problem: code -> instructions, reformulated in any way that gives the best results in a reasonable time.

@kdave @karolherbst it is a bit difficult for me to talk to complete strangers about this, but drawing conclusions about how branchy code will compare to branchless code from llvm-mca does not sound like a good idea

(and while we're at it, if you like llvm-mca, please let me point you to uica.uops.info, it has a great deal of research behind it (see the link on top, including a dissertation), and is more accurate)

@kdave @karolherbst the mental model you use for ballpark estimation of this kind of thing doesn't actually need to be particularly complex to be useful. mostly you're just worried about branches, memory operations, and instruction use. if you can structure things to not be a branchy mess, and to access memory efficiently, you're already well ahead of the pack. alas the compiler's ability to modify your code in this way is extremely limited, so it's also about knowing what makes the compiler happy.

@karolherbst optimizing control flow to target the branch prediction hardware is the common thread of some of this. If branches are predictable they can be nearly free. If you can’t make them predictable then consider some kind of branchless trick ..

@acsawdey @karolherbst My favourite counterexample of this came from someone on the iOS team. When they got the first samples of a new chip, they found that one of their hot code paths got a lot slower. After a lot of digging, they discovered that the old chip had been mispredicting a branch. This caused a pipeline flush, which hurt a little bit but, as a side effect, pulled some data into cache. The newer CPU had a better branch predictor and so didn’t do the mispredict. They then hit the instruction that needed the data and had a much larger stall while they waited for main memory.

Once they’d discovered this, they were able to add a prefetch hint at the place where they used to see the branch mispredict and fix the issue.

@david_chisnall @acsawdey memory latency masking on the CPU is really an art of its own. On the GPU it's easy, because you have 10000 other threads the hw can switch to, but in your single-threaded hot path on the CPU? Good luck finding good heuristics that aren't terrible across a bunch of different CPUs.
