@tobinbaker If it makes you feel better, Linux-kernel RCU uses a similar optimization in its callback lists. We are actually looking towards a combined scheme where we use the global ->gp_seq, but only for groups of callbacks.
On the other hand, the polled APIs, poll_state_synchronize_rcu() and friends, access the global information. If these become heavily used, that global access might need to change. Though commodity systems are continuing to become more tightly integrated, with increasing fractions of hardware going to GPGPUs, so who knows?
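
For anyone following along, here is a rough userspace sketch of why the polled pattern touches global state. It is a toy model, not the kernel implementation: the `toy_*` names are hypothetical, the counter carries no in-progress state bits, and the real cookie computation (see rcu_seq_snap() in the kernel) is more involved. The shape is the same, though: take a cookie from a global sequence counter, then later compare that cookie against the counter again.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a global grace-period sequence counter. */
static unsigned long toy_gp_seq;

/*
 * Take a cookie naming the grace period that must end before the
 * caller's earlier readers are known to be done. Toy assumption: no
 * grace period is in flight, so the very next one suffices.
 */
static unsigned long toy_get_state(void)
{
	return toy_gp_seq + 1;	/* reads the global counter */
}

/*
 * Poll: has the grace period named by the cookie ended?
 * The unsigned subtraction keeps the comparison wraparound-safe.
 */
static bool toy_poll_state(unsigned long cookie)
{
	return toy_gp_seq - cookie <= ULONG_MAX / 2;	/* global read again */
}

/* Toy grace-period machinery: each call ends one grace period. */
static void toy_end_grace_period(void)
{
	toy_gp_seq++;
}
```

Both the snapshot and every poll read the one global counter, which is the access pattern in question. In the kernel the corresponding pair is get_state_synchronize_rcu() followed by poll_state_synchronize_rcu().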