Conversation
Woke up to a very nice email from none other than @graydon about source control. Learned a bunch. Now I just need to find time to write the other parts of my initial blog post :).

That said, I really enjoy having people like Graydon and Marc reach out and tell their part of the story. Graydon gave me his take on the fascinating story of how the idea of DAGs came to be in source control systems: it originated in Monotone, thought up by a member of the community at the time.

@dsp @graydon did tla not use DAGs at all? It's been too long since I used it.

@durin42 @graydon Maybe my toot was inexact; he mostly described how it came to be in Monotone. Maybe TLA had it. I never used TLA. It's quite tricky to do the forensics of who was first :)

@dsp @graydon ohh, looking forward to this blog post!

@djc @graydon sadly, I don't think it will happen anytime soon. I just started at Anthropic and deep into work. Work and a toddler at home just leaves too little time to write lengthy blog posts that need good research.

@durin42 @dsp Well, you can draw a DAG on the structure of lots of distributed systems, I wasn't saying Monotone shipped the first or only system you can draw a DAG on (or did any sort of DAG-based reasoning)!

Merely that the narrow, specific DAG-construction technique of _naming each node in a history graph via a cryptographic hash function combining the current content and the node's immediate ancestors_ showed up first in Version Control in Monotone (due to Jerome Fisher's insight).

The two inputs had some prior art! Nothing happens in a vacuum. Recursive hashing of "trees in general" goes back to Ralph Merkle in 1979 and was present and known-to-us as a production technique to apply to filesystems such as Venti, as a way of getting automatic deduplicating-CoW semantics. And recursive hashing of a history of states was present in R&D papers at least in the cryptography community under the name "Linked Timestamping" (usually just as a linear time sequence to prove a point-in-time).

But Monotone was the point where the two techniques were hybridized and this sort of "instantly and obviously very useful" hash-the-node-and-its-parents technique got started. And it's fairly clear if you look at the history of systems: no previous system used it, and most later ones did.

Including tons of systems outside VC. It's a very general distributed-system-building construction technique, that gets you a guaranteed global partial order on all events at O(1) storage and compute cost, requires no synchronized clocks or other coordination, and throws in strong immutable-history and integrity checking as a bonus.
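The construction described above can be sketched in a few lines. This is a hypothetical illustration, not Monotone's actual encoding: the function name, the byte layouts, and the choice of SHA-256 are all my own.

```python
import hashlib

def node_id(content: bytes, parent_ids: list[str]) -> str:
    """Name a history node by hashing its content plus its parents' IDs."""
    h = hashlib.sha256()
    h.update(content)
    for pid in sorted(parent_ids):  # canonical order makes naming deterministic
        h.update(bytes.fromhex(pid))
    return h.hexdigest()

# A tiny linear history: a root node and one descendant.
root = node_id(b"initial tree state", [])
child = node_id(b"tree state after one edit", [root])

# Because each ID transitively commits to its entire ancestry, altering
# anything upstream changes every descendant's ID; this is where the
# immutable-history and integrity-checking properties come from.
tampered = node_id(b"tampered initial tree state", [])
assert node_id(b"tree state after one edit", [tampered]) != child
```

The partial order falls out of the structure itself: a node's ID can only be computed after its parents' IDs exist, so ancestry is verifiable from the hashes alone, with no clocks or coordination.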

@durin42 @dsp TLA (the whole Arch lineage in general) used a different technique. It was -- as far as I can tell, Tom Lord kinda resisted using common terminology in his work -- organized around assigning each changeset a unique name on construction, via a complex UUID-like "unlikely to collide" name-assigning scheme, and then reasoning about history states as sets-of-such-changesets, and the ancestry relationship between them as the subset-inclusion lattice on those sets.

This is not a bad technique either! It has some scaling and edge-case-interpretability problems, but it often matches people's intuitions, and produces fewer spurious equivalent-content nodes than a strict communication-causality DAG would. E.g. in Monotone (or Git or Mercurial or Fossil or ..) if I merge two non-conflicting changes A and B in the order A,B on one machine and the order B,A on another, and then merge those two machines, I wind up with 3 merge nodes in my history DAG, like merge(merge(A,B),merge(B,A)). A subset-inclusion system, by contrast, just sees A and B as unordered by conflict-ordering, imposes a canonical order by identity, treats all 3 communication-causality events as equivalent, and doesn't even bother recording them as separate objects.
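The A,B vs. B,A example can be made concrete with a toy model. This is an illustrative sketch, not any real system's format; the `meta` argument stands in for the per-machine metadata (committer, timestamp, etc.) that makes the two merge nodes distinct in practice.

```python
import hashlib

def node(content: bytes, parents: list[str] = [], meta: bytes = b"") -> str:
    """Toy hash-DAG node: the ID commits to content, parents, and metadata."""
    h = hashlib.sha256(content)
    for p in sorted(parents):
        h.update(bytes.fromhex(p))
    h.update(meta)  # committer, timestamp, etc. -- differs per machine
    return h.hexdigest()

base = node(b"base tree")
a = node(b"tree with change A", [base])
b = node(b"tree with change B", [base])

# Machine 1 and machine 2 each merge A and B, producing identical trees
# but distinct nodes (their metadata differs); merging the two machines
# then needs a third node: merge(merge(A,B), merge(B,A)).
m1 = node(b"merged tree", [a, b], meta=b"machine1")
m2 = node(b"merged tree", [a, b], meta=b"machine2")
m3 = node(b"merged tree", [m1, m2], meta=b"machine1")
assert m1 != m2

# A subset-inclusion model identifies a state with its *set* of
# changesets, so both machines are already in the same state and no
# extra merge objects get recorded at all.
s1 = frozenset({"A", "B"})  # machine 1
s2 = frozenset({"B", "A"})  # machine 2
assert s1 == s2
```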

Anyway I _think_ from this perspective TLA/Arch fits into the history of the field as a relative of Darcs and Pijul and so forth. Though a very idiosyncratic one that's genealogically distinct -- I don't think the authors had much contact or copied one another much. And again, I could be fairly wrong about its internals. I'm just fairly sure it _didn't_ use a linked cryptographic DAG construct (at least until its final attempt-at-rewriting, "revc", which kinda wanted to try adopting the technique but AIUI never got off the ground).

@durin42 @dsp Also worth mentioning that _all_ of these systems were taking place in the context of the obviously incredibly-useful and highly-coveted feature-set of BitKeeper: that each clone had (a) the full history and (b) could be worked-on offline and (c) could have sub-clones. So everything was incredibly fast, you didn't need a powerful central server and you could mostly ignore the root node or even your parent node as a political gatekeeper/bottleneck if you wanted to form a branch or record history concurrently.

People forget this about CVS. It was not just modestly centralized. It was _completely_ centralized. If you wanted to even get a _history log_ you had to connect to the central server. This had to be a powerful, fast, well-connected machine with 100% uptime if you didn't want to stall your dev team. And whoever controlled it controlled everything. God forbid you wanted to _commit_! You not only had to be online, you had to be up-to-date with the central server (i.e. have merged all conflicts), and this was a transient condition you were constantly racing on! And it wasn't atomic! If I committed concurrently with you, our commits would wind up interleaved, half the files getting my change first and half of them getting yours. Now officially neither historical state is reproducible!

When people _branched_ a CVS history, it was referred to as "cutting a branch" and took the server offline for hours while it churned away rewriting all the files to have a new level of branch-point-numbering in their version numbers. When people _forked_ a CVS history, it was a huge political event: involved gaining access to the server filesystem and getting a copy of all the ",v" history files, which the admin might not let you do. And afterwards you could not heal or merge the fork. It was a one-way action.

In contrast to this sort of nightmare, BitKeeper's distributed-replicas model was incredibly appealing. It just didn't happen to be free software. And also many of us thought it would be nice to go one step further than BitKeeper and try to find ways to make the replica clone-DAG not enforce an order on future merges, as (IIRC) it did. I.e. to go further in the decentralization direction, "somehow" (using some other technique to create orders). But BK was already super decentralized by comparison to CVS.

@graydon @durin42 @dsp I enjoy keeping track of my own failures of imagination, and this is one of them. I could have pointed out a dozen ways to improve SVN but I would have never come up with something like git.

@graydon @durin42 @dsp however, I feel like I might well have come up with RCS given SCCS, CVS given RCS, etc.

@regehr @graydon @durin42 @dsp Left out of all this is Microsoft Visual SourceSafe which might be a championship contender with Super Safe Intelligence, Inc for non-truth in advertising.

@pervognsen @regehr @durin42 @dsp Oh yeah, a proper history includes like dozens-to-hundreds of different systems and stretches back to the mainframe era.

@graydon @regehr @durin42 @dsp I used to joke that my first few programming jobs around high school used distributed version control: we didn't store the authoritative copy of the source code on a server, we used a peer-to-peer distributed system that involved synchronizing partial snapshots from peers with a voice-based locking protocol to prevent conflicts.

@pervognsen @graydon @durin42 @dsp email attachments were definitely the original decentralized version control

@regehr @pervognsen @durin42 @dsp I don't know how much you're joking by saying this but it's absolutely true: patches-by-email (and also netnews) are 100% ancestral and sometimes parallel-evolution / interop states for many of these systems. Git grew up in the culture where that had been the main Linux development approach for a long time, had only recently changed over to BitKeeper, and to this day Git has a ton of email patch interop stuff, lots of Git users run it as a subsystem they feed stuff to from their email client. And like .. even a "pull request" was so-named in relation to email requests people sent to LKML requesting someone pull from their git tree.

And not just Git! Arch was definitely oriented around patches-by-email interop. And Monotone originally included a (hand written by me) NNTP client and just posted tree-states and certificates to a local news server, then scanned it for new incoming states.

IMO it's kinda a tragedy of the web that we moved so many systems towards central hubs. The pre-web internet protocol landscape had far more systems that were meaningfully distributed-by-default (i.e. treated the exact server you were talking to as incidental and easily replaceable, or were fully peer-to-peer, making no distinction between client and server).

@regehr @pervognsen @graydon @durin42 @dsp memories of my advisor digging through his stale pine files to find specific versions of F77 code

@graydon @regehr @pervognsen @durin42 @dsp
A friend of mine fought a battle to get a team at a big-name consulting company you’ve heard of off of an “individual copies, patches by email, no other VCS” model in the mid-00s. They were •extremely resistant•, skeptical of svn. “It works for us!” was the attitude. (It worked terribly, in fact, but we accept and quickly stop seeing the problems we know.)

@steve @pervognsen @graydon @durin42 @dsp wait was it math-1.0-final.f or was it math-1.0-really-final-this-time.f?

@regehr @pervognsen @graydon @durin42 @dsp codes-1983-paper-revision-4-for-Mike-Baines-2.f

@inthehands @graydon @regehr @pervognsen @durin42 @dsp I like the spirit of "we should have kept email interop as a design goal" but (as someone who stress tested mail filters with incredibly elaborate Bugzilla work) that was extremely fragile, cumbersome, and anti-newcomer (since knowledge of how best to interoperate was hoarded at the edges).

Excuse me, I'm going to suggest to @monsieuricon that he should port the entire LKML process to ActivityPub and watch his head explode.

@regehr @durin42 @dsp I guess this is part of my "nothing in a vacuum" point though. Git didn't come from CVS. It came from BitKeeper and Monotone; and Monotone came from BitKeeper, Aegis, Venti, Linked Timestamps (and a dozen others we knew of); and BitKeeper came from TeamWare which came from SCCS. If you work through the history there's a nearly continuous "X plus an incremental change" pattern to development, not so much dramatic reorgs.

(IMO this is true of almost all intellectual history and one of the reasons I think it's both an interesting and frustrating field. People _remember_ large structural changes in how-things-happen as though they are anchored in single inventors and singular moments of invention, but when you zoom in on the history they never do. It's more like "incremental changes pile up to the point where a sea-change in the broader field becomes plausible and then later inevitable".)

@graydon @durin42 @dsp good point! I think that I just paid much less attention to the systems outside of the rcs->cvs->svn->git pathway

@luis_in_brief @inthehands @graydon @regehr @pervognsen @durin42 @dsp It'll still be fragile, cumbersome and anti-newcomer, but in addition we will also lose all the abuse prevention layers of email, such as spam filtering, node IP reputation, etc. Plus, ActivityPub is great for one-to-many and one-to-one communication, but it's quite terrible at many-to-many interactions common on mailing lists. I mean, you have to run something like FediFetcher just to see all the replies on a thread that you didn't start.

@luis_in_brief @inthehands @regehr @pervognsen @durin42 @dsp @monsieuricon Oh, yeah, I don't mean to speak strictly in favour of decentralization either. I lean that way, but I think the distributed version control story has been really interesting as a sort of techno-socio-political natural experiment: people were handed tools with a level of inherent radical decentralization and almost immediately re-centralized certain aspects of their work just for practicality's sake (the most evident being things like GitHub, but also in subtler ways, like adopting privileged or hierarchical branch relationships and rebasing rather than treating the branch space as flat and equal and requiring everyone to deal with the resulting somewhat chaotic partial order).

@graydon @inthehands @regehr @pervognsen @durin42 @dsp @monsieuricon 💯. Your toot about the way we tell the history of innovation, and how that is reflected in the RCS discussion, got a standing ovation in my head.

@luis_in_brief @inthehands @regehr @pervognsen @durin42 @dsp @monsieuricon As @susannah put it: "people mostly want the simplicity of centralization, but only insofar as it can exist without causing themselves personally any harm or inconvenience"
