Woke up to a very nice email from none other than @graydon about source control. Learned a bunch. Now I just need to find time to write the other parts of my initial blog post :).
That said, I so much enjoy having people like Graydon and Marc reach out to me and tell their part of the story. Graydon gave me his take on the fascinating story of how the idea of DAGs came to be in source control systems, originating in Monotone and thought up by a member of the community at the time.
@durin42 @dsp Well, you can draw a DAG on the structure of lots of distributed systems, I wasn't saying Monotone shipped the first or only system you can draw a DAG on (or did any sort of DAG-based reasoning)!
Merely that the narrow, specific DAG-construction technique of _naming each node in a history graph via a cryptographic hash function combining the current content and the node's immediate ancestors_ showed up first in Version Control in Monotone (due to Jerome Fisher's insight).
The two inputs had some prior art! Nothing happens in a vacuum. Recursive hashing of "trees in general" goes back to Ralph Merkle in 1979 and was present and known-to-us as a production technique to apply to filesystems such as the Venti filesystem, as a way of getting automatic deduplicating-CoW semantics. And recursive hashing of a history of states was present in R&D papers at least in the cryptography community under the name "Linked Timestamping" (usually just as a linear time sequence to prove a point-in-time).
But Monotone was the point where the two techniques were hybridized and this sort of "instantly and obviously very useful" hash-the-node-and-its-parents technique got started. And it's fairly clear if you look at the history of systems! None of the previous systems used it, and most later ones did.
Including tons of systems outside VC. It's a very general distributed-system-building construction technique that gets you a guaranteed global partial order on all events at O(1) storage and compute cost, requires no synchronized clocks or other coordination, and throws in strong immutable-history and integrity checking as a bonus.
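To make that concrete, here's a toy sketch in Python (my own illustration; not Monotone's or Git's actual object format, and the labels are invented) of naming each history node by hashing its content together with its parents' ids:

```python
import hashlib

def revision_id(content: bytes, parent_ids: list[str]) -> str:
    """Name a history node by hashing its content together with its parents' ids."""
    h = hashlib.sha256()
    h.update(content)
    for pid in parent_ids:
        h.update(pid.encode())
    return h.hexdigest()

# A tiny history DAG: a root, two concurrent changes, and a merge of both.
root = revision_id(b"initial tree", [])
a = revision_id(b"tree with change A", [root])
b = revision_id(b"tree with change B", [root])
m = revision_id(b"merged tree", [a, b])

# Each id commits (recursively) to its entire ancestry, so the ids alone give
# you a global partial order on events, plus tamper-evident immutable history,
# with no synchronized clocks or coordination required.
```

(Real systems also hash in metadata like author and date; the point is just that the name of a node covers both its content and its parents.)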
@durin42 @dsp TLA (the whole Arch lineage in general) used a different technique. It was -- as far as I can tell, Tom Lord kinda resisted using common terminology in his work -- organized around assigning each changeset a unique name on construction, via a complex UUID-like "unlikely to collide" name-assigning scheme, and then reasoning about history states as sets-of-such-changesets, and the ancestry relationship between them as the subset-inclusion lattice on those sets.
This is not a bad technique either! It has some scaling and edge-case-interpretability problems, but it often matches people's intuitions, and produces fewer spurious equivalent-content nodes than a strict communication-causality DAG would. E.g. in Monotone (or Git or Mercurial or Fossil or ..) if I merge two non-conflicting changes A and B in the order A,B on one machine and the order B,A on another and then merge those two machines, I wind up with 3 merge nodes in my history DAG, like merge(merge(A,B),merge(B,A)), whereas a subset-inclusion system just sees B and A as unordered by conflict-ordering, imposes a canonical order by identity, treats all 3 communication-causality events as equivalent, and doesn't even bother recording them as separate objects.
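A toy contrast of the two models (again my own Python sketch, not either system's actual encoding, with made-up labels):

```python
import hashlib

def dag_node(meta: bytes, parents: tuple[str, ...]) -> str:
    """Hash-DAG naming: the id covers the node's metadata and its exact parent list."""
    h = hashlib.sha256()
    h.update(meta)
    for p in parents:
        h.update(p.encode())
    return h.hexdigest()

A = dag_node(b"change A", ())
B = dag_node(b"change B", ())

# Two machines merge the same non-conflicting changes with different metadata
# and parent order, then merge each other: three distinct merge nodes appear.
merge_1 = dag_node(b"merged on machine 1", (A, B))
merge_2 = dag_node(b"merged on machine 2", (B, A))
merge_3 = dag_node(b"merge of the merges", (merge_1, merge_2))
assert len({merge_1, merge_2, merge_3}) == 3

# Changeset-set view: a history state is just the set of changesets it contains,
# so both machines are already in the same state and nothing new gets recorded.
state_1 = frozenset({"A", "B"})
state_2 = frozenset({"B", "A"})
assert state_1 == state_2
assert frozenset({"A"}) < state_1  # ancestry is just subset inclusion
```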
Anyway I _think_ from this perspective TLA/Arch fits into the history of the field as a relative of Darcs and Pijul and so forth. Though a very idiosyncratic one that's genealogically distinct -- I don't think the authors had much contact or copied one another much. And again, I could be fairly wrong about its internals. I'm just fairly sure it _didn't_ use a linked cryptographic DAG construct (at least until its final rewrite attempt, "revc", which kinda wanted to try adopting the technique but AIUI never got off the ground).
@durin42 @dsp Also worth mentioning that _all_ of these systems were taking place in the context of the obviously incredibly-useful and highly-coveted feature-set of BitKeeper: each clone (a) had the full history, (b) could be worked on offline, and (c) could have sub-clones. So everything was incredibly fast, you didn't need a powerful central server, and you could mostly ignore the root node or even your parent node as a political gatekeeper/bottleneck if you wanted to form a branch or record history concurrently.
People forget this about CVS. It was not just modestly centralized. It was _completely_ centralized. If you wanted to even get a _history log_ you had to connect to the central server. This had to be a powerful, fast, well-connected machine with 100% uptime if you didn't want to stall your dev team. And whoever controlled it controlled everything. God forbid you wanted to _commit_! You not only had to be online, you had to be up-to-date with the central server (i.e. have merged all conflicts), and this was a transient condition you were constantly racing on! And it wasn't atomic! If I committed concurrently with you, our commits would wind up interleaved, half the files getting my change first and half of them getting yours. Now officially neither historical state is reproducible!
When people _branched_ a CVS history, it was referred to as "cutting a branch" and took the server offline for hours while it churned away rewriting all the files to have a new level of branch-point-numbering in their version numbers. When people _forked_ a CVS history, it was a huge political event: it involved gaining access to the server filesystem and getting a copy of all the ",v" history files, which the admin might not let you do. And afterwards you could not heal or merge the fork. It was a one-way action.
In contrast to this sort of nightmare, BitKeeper's distributed-replicas model was incredibly appealing. It just didn't happen to be free software. And also many of us thought it would be nice to go one step further than BitKeeper and try to find ways to make the replica clone-DAG not enforce an order on future merges, as (IIRC) it did. I.e. to go further in the decentralization direction, "somehow" (using some other technique to create orders). But BK was already super decentralized by comparison to CVS.
@pervognsen @regehr @durin42 @dsp Oh yeah a proper history includes like dozens-to-hundreds of different systems and stretches back to the mainframe era.
@graydon @regehr @durin42 @dsp I used to joke that my first few programming jobs around high school used distributed version control: we didn't store the authoritative copy of the source code on a server, we used a peer-to-peer distributed system that involved synchronizing partial snapshots from peers with a voice-based locking protocol to prevent conflicts.
@pervognsen @graydon @durin42 @dsp email attachments were definitely the original decentralized version control
@regehr @pervognsen @durin42 @dsp I don't know how much you're joking by saying this but it's absolutely true: patches-by-email (and also netnews) are 100% ancestral and sometimes parallel-evolution / interop states for many of these systems. Git grew up in the culture where that had been the main Linux development approach for a long time, had only recently changed over to BitKeeper, and to this day Git has a ton of email patch interop stuff, lots of Git users run it as a subsystem they feed stuff to from their email client. And like .. even a "pull request" was so-named in relation to email requests people sent to LKML requesting someone pull from their git tree.
And not just Git! Arch was definitely oriented around patches-by-email interop. And Monotone originally included a (hand written by me) NNTP client and just posted tree-states and certificates to a local news server, then scanned it for new incoming states.
IMO it's kinda a tragedy of the web that we moved so many systems towards central hubs. The pre-web internet protocol landscape had far more systems that were meaningfully distributed-by-default (i.e. treated the exact server you were talking to as incidental and easily replaceable, or were fully peer-to-peer, making no distinction between client and server).
@regehr @pervognsen @graydon @durin42 @dsp memories of my advisor digging through his stale pine files to find specific versions of F77 code
@graydon @regehr @pervognsen @durin42 @dsp
A friend of mine fought a battle to get a team at a big-name consulting company you’ve heard of off of an “individual copies, patches by email, no other VCS” model in the mid-00s. They were •extremely resistant•, skeptical of svn. “It works for us!” was the attitude. (It worked terribly, in fact, but we accept and quickly stop seeing the problems we know.)
@steve @pervognsen @graydon @durin42 @dsp wait was it math-1.0-final.f or was it math-1.0-really-final-this-time.f?
@regehr @pervognsen @graydon @durin42 @dsp codes-1983-paper-revision-4-for-Mike-Baines-2.f
@inthehands @graydon @regehr @pervognsen @durin42 @dsp I like the spirit of "we should have kept email interop as a design goal" but (as someone who stress tested mail filters with incredibly elaborate Bugzilla work) that was extremely fragile, cumbersome, and anti-newcomer (since knowledge of how best to interoperate was hoarded at the edges).
Excuse me, I'm going to suggest to @monsieuricon that he should port the entire LKML process to ActivityPub and watch his head explode.
@regehr @durin42 @dsp I guess this is part of my "nothing in a vacuum" point though. Git didn't come from CVS. It came from BitKeeper and Monotone; and Monotone came from BitKeeper, Aegis, Venti, Linked Timestamps (and a dozen others we knew of); and BitKeeper came from TeamWare which came from SCCS. If you work through the history there's a nearly continuous "X plus an incremental change" pattern to development, not so much dramatic reorgs.
(IMO this is true of almost all intellectual history and one of the reasons I think it's both an interesting and frustrating field. People _remember_ large structural changes in how-things-happen as though they are anchored in single inventors and singular moments of invention, but when you zoom in on the history they never do. It's more like "incremental changes pile up to the point where a sea-change in the broader field becomes plausible and then later inevitable".)
@luis_in_brief @inthehands @regehr @pervognsen @durin42 @dsp @monsieuricon Oh, yeah, I don't mean to speak strictly in favour of decentralization either. I lean that way, but I think the distributed version control story has been really interesting as a sort of techno-socio-political natural experiment: people were given tools with a level of inherent radical decentralization and almost immediately re-centralized certain aspects of their work just for practicality's sake (the most evident being things like github, but also in subtler ways like adopting privileged or hierarchical branch relationships and rebasing, rather than treating the branch space as flat and equal and requiring everyone to deal with the resulting somewhat chaotic partial order).
@graydon @inthehands @regehr @pervognsen @durin42 @dsp @monsieuricon 💯. Your toot about the way we tell the history of innovation, and how that is reflected in the RCS discussion, got a standing ovation in my head.
@luis_in_brief @inthehands @regehr @pervognsen @durin42 @dsp @monsieuricon As @susannah put it: "people mostly want the simplicity of centralization, but only insofar as it can exist without causing themselves personally any harm or inconvenience"