Conversation
Dear semi-lazyweb,

Given a git diff of a C/Rust codebase, how to best determine which functions/defines have been modified between the two versions? Yes, the diff itself sometimes gives hints as to what has changed, but it's not always correct. Think about when it modifies the start of a function, but the diffstat "name" shows the previous function, a correct marking, but not what is needed.

Is the correct answer really going to be "compile the two versions and compare the AST" or something like that? No "diff library" somewhere that "knows" how to parse C (and Rust) that can do this in a faster way? Surely I'm missing something obvious here...
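
For illustration, the "name" in question is presumably the text after the second `@@` in a unified-diff hunk header. Git fills it with the nearest function-like line that precedes the hunk, which is why a change to the start of a function can be labelled with the previous function. A minimal sketch that separates the label from the line ranges follows; `parse_hunk_header` is a hypothetical helper, not part of git or any library:

```python
import re

# Unified-diff hunk header:
#   @@ -old_start[,old_count] +new_start[,new_count] @@ [label]
# A missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$")


def parse_hunk_header(line):
    """Return the old/new (start, count) ranges and the label, or None."""
    match = HUNK_HEADER.match(line)
    if match is None:
        return None
    old_start, old_count, new_start, new_count, label = match.groups()
    return {
        "old": (int(old_start), int(old_count) if old_count is not None else 1),
        "new": (int(new_start), int(new_count) if new_count is not None else 1),
        # The label is only a hint: it names what precedes the hunk,
        # not necessarily what the hunk modifies.
        "label": label,
    }
```

The line ranges are the reliable part. Mapping them onto function spans taken from a real parser would give the list of modified symbols, while the label alone would not.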

@wilfredh labelling hunks with function names should be possible to add to difftastic, no? @gregkh


@gregkh @mxk that sounds like a problem that https://mergiraf.org/ must have already solved, maybe you can steal their code to do something similar? I don't think they have an API for that, though.

@gregkh how much do you care about handling preprocessor weirdness correctly?
@chinmay @wilfredh difftastic is great, but a visual tool, not something that can be hooked up into a script to generate the list of modified symbols as far as I can tell. But again, I might be missing an option there, am I?
@mei Not all that much, "best effort" is fine.

@gregkh @wilfredh it doesn't label hunks yet - if/when that feature is added you could probably grep for label markers to get the list

@neverpanic @mxk I thought mergiraf is for merges, not a "simple" diff, that has no merge conflicts, am I missing something?

And yes, "taking code from elsewhere" is great, and I can do that, but wanted to make sure there wasn't something out there that already handled all of this before going down that yak-shaving path...
@chinmay @wilfredh So, "if a new option was added to this tool" kind of doesn't solve the issue today, or even tomorrow, if no one is adding such an option. Are you suggesting I do that? That's fine if you are, maybe that's the best end result, don't know, just want to make sure I'm not missing something already out there today.

@gregkh right, i'm sorry i should have been more clear - wasn't even suggesting "you" do it, just a possibility that came to mind @wilfredh


@gregkh @mxk You're not missing anything, but their code-aware conflict resolution requires them to parse the code in all three states of the conflict. So yeah, this would be taking code from elsewhere, not a turnkey solution.

IIRC they use the treesitter parsers for that functionality.


@gregkh this seems like something Magit can do 🤔


@gregkh @neverpanic @mxk ping @pintoch ? I thought of saying something to that but maybe I'm not the best person for it ^^


@gregkh

The git diff pager 'delta' shows changed methods above diffing lines, also when they are not part of the diff. I suggest checking it out and integrating it into git.

Does it also only use the diffstat 'name' or does it do its own resolution?

https://dandavison.github.io/delta/introduction.html

via @oneiros


@neverpanic @gregkh @mxk The question is whether this suffices, as you also mentioned defines, which can cause a local change to change codegen of all usage sites.
The heavyweight way to do this is probably what cHash is doing (no matter at what representation level, so either at treesitter or LLVM-IR): paper: https://www.usenix.org/system/files/conference/atc17/atc17-dietrich.pdf code: https://github.com/luhsra/chash

So hash the AST of all toplevel entities in the translation unit before and after, and then make a diff of the hashes...
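
A rough sketch of that "hash every top-level entity before and after, then diff the hashes" idea. This is not how cHash works: a naive regex plus brace matching stands in for a real AST (treesitter, clang), it only handles C-style function definitions, and braces inside strings or comments will confuse it. All function names here are made up for the illustration:

```python
import hashlib
import re

# Best-effort match for the head of a C function definition: name(args) {
FUNC_HEAD = re.compile(r"([A-Za-z_]\w*)\s*\([^;{}]*\)\s*\{")


def toplevel_functions(source):
    """Map function name -> definition text (from the name to the closing brace)."""
    functions = {}
    pos = 0
    while True:
        match = FUNC_HEAD.search(source, pos)
        if match is None:
            break
        depth = 0
        end = len(source)  # fallback if braces are unbalanced
        for i in range(match.end() - 1, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        functions[match.group(1)] = source[match.start():end]
        pos = end  # continue after this body, so nested "if (...) {" is skipped
    return functions


def fingerprint(text):
    """Hash with whitespace collapsed, so reformatting alone is not a change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode()).hexdigest()


def changed_symbols(old_source, new_source):
    """Names that were added, removed, or whose hash differs."""
    old = {n: fingerprint(t) for n, t in toplevel_functions(old_source).items()}
    new = {n: fingerprint(t) for n, t in toplevel_functions(new_source).items()}
    return sorted(n for n in old.keys() | new.keys() if old.get(n) != new.get(n))


OLD = "int add(int a, int b)\n{\n    return a + b;\n}\n\nint twice(int x)\n{\n    return 2 * x;\n}\n"
NEW = "int add(int a, int b)\n{\n    return a + b;\n}\n\nint twice(int x)\n{\n    return x + x;\n}\n\nint zero(void) { return 0; }\n"
print(changed_symbols(OLD, NEW))  # ['twice', 'zero']
```

Because this hashes source text rather than preprocessed code, it does not capture the point about defines: a changed macro will not mark the functions that use it as modified.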


@neverpanic @gregkh @mxk concerning cHash, @stettberger is probably the person to talk to.

Otherwise, there is also difftastic in the "related tools" category: https://difftastic.wilfred.me.uk/
Basically diff on treesitter output; I'm not sure they offer the requested functionality as is, but it should hold all the basic ingredients by necessity.


@n0toose @gregkh @neverpanic @mxk what mergiraf uses is the GumTree algorithm, described quite succinctly here: https://mergiraf.org/architecture.html#matching and in more detail in https://hal.science/hal-01054552

This matching may or may not be the notion of diff that is useful for you…


@noctux @neverpanic @gregkh @mxk at this point, we also have IRhash, which is almost as performant (in terms of savings) at reusing build artifacts while being more precise (fewer false misses).


@gregkh maybe experiment with the different ways of generating diffs?

The git diff sub-command has an option, --diff-algorithm, which defines how the diff is computed. If I remember correctly Myers is the default and it's not that great. Histogram gave me the best results so far.
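
For anyone scripting this, a small sketch of switching the algorithm from Python. `--diff-algorithm` and its four values are real git options; `diff_command` and `run_diff` are hypothetical helper names, and the revisions are whatever you pass in:

```python
import subprocess

# Values accepted by `git diff --diff-algorithm=<name>`.
ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def diff_command(old_rev, new_rev, algorithm="histogram"):
    """Build the git argv for diffing two revisions with the given algorithm."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown diff algorithm: {algorithm!r}")
    return ["git", "diff", f"--diff-algorithm={algorithm}", old_rev, new_rev]


def run_diff(old_rev, new_rev, algorithm="histogram"):
    """Run the diff in the current repository and return its text output."""
    result = subprocess.run(
        diff_command(old_rev, new_rev, algorithm),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The same choice can be made persistent with the `diff.algorithm` configuration setting instead of passing the flag each time.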

@jlhertel That doesn't seem to actually provide the "what function/symbol is modified" logic, nor would I expect it to, unless I am missing something obvious?

@gregkh it doesn't provide an exact list of symbols modified, but it generates a better diff, which makes it easier to see what has been touched.

So you are correct, it doesn't solve your problem exactly, but I think it might help anyway.
