Running Open Source In A Distributed World

nuget, code, open source

When it comes to running an open source project, the book Producing Open Source Software - How to Run a Successful Free Software Project by Karl Fogel (free pdf available) is my bible (see my review and summary of the book).

The book is based on Karl Fogel’s experiences as the leader of the Subversion project and has heavily influenced how I run the projects I’m involved in. Lately though, I’ve noticed one problem with some of his advice. It’s so Subversion-y.

Take a look at this snippet on Committers.

As the only formally distinct class of people found in all open source projects, committers deserve special attention here. Committers are an unavoidable concession to discrimination in a system which is otherwise as non-discriminatory as possible. But “discrimination” is not meant as a pejorative here. The function committers perform is utterly necessary, and I do not think a project could succeed without it.

A Committer in this sense is someone who has direct commit access to the source code repository. This makes sense in a world where your source control is completely centralized, as it would be with a Subversion repository. But what about a world in which you’re using a completely decentralized version control system like Git or Mercurial? What does it mean to be a “committer” when anyone can clone the repository, commit to their local copy, and then send a pull request?
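To make the contrast concrete, here is a minimal sketch of that distributed model using git (the same idea applies to Mercurial). Everything runs in a throwaway temp directory; the repository names are hypothetical, and the maintainer’s “pull” is simulated with a local fetch.

```shell
# Anyone can clone and commit locally; changes only reach the official
# repository when someone with "commit access" pulls them in.
set -e
tmp=$(mktemp -d) && cd "$tmp"

# The project's official repository, controlled by the maintainers.
git init -q -b main seed   # -b requires git 2.28+
git -C seed -c user.email=m@example.com -c user.name=Maintainer \
    commit -qm "initial" --allow-empty
git clone -q --bare seed upstream.git

# A contributor clones it and commits freely to their own copy...
git clone -q upstream.git contributor
cd contributor
echo "fix" > fix.txt && git add fix.txt
git -c user.email=c@example.com -c user.name=Contributor \
    commit -qm "Fix a bug"

# ...but that commit exists only in the clone until a maintainer pulls it.
upstream_count=$(git -C ../upstream.git rev-list --count main)
local_count=$(git rev-list --count main)
echo "upstream: $upstream_count commit(s), contributor: $local_count"

# The maintainer "pulls" the change into the master copy.
git -C ../upstream.git fetch -q ../contributor main:main
echo "upstream after pull: $(git -C ../upstream.git rev-list --count main)"
```

Note that nothing technical stops the contributor from committing; the gate is entirely on the maintainer’s side, which is exactly the social-contract point the rest of this post is about.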

In the book, Mercurial: The Definitive Guide, Bryan O’Sullivan discusses different collaboration models. The one the Linux kernel uses for example is such that Linus Torvalds maintains the “master” repository and only pulls from his “trusted lieutenants”.

At first glance, it might seem reasonable for a project to accept a pull request from anyone, and thus to focus the “discrimination” Karl mentions on the technical merits of each pull request rather than on the history of a person’s involvement in the project.

On one level, that seems even more merit-based and egalitarian, but you have to wonder whether it scales. The Linux kernel’s experience suggests a completely flat model does not. As Karl points out,

Quality control requires, well, control. There are always many people who feel competent to make changes to a program, and some smaller number who actually are. The project cannot rely on people’s own judgement; it must impose standards and grant commit access only to those who meet them.

Many projects make a distinction between who may contribute a bug fix as opposed to who may contribute a feature. Such projects may require anyone contributing a feature or a non-trivial bug fix to sign a Contributor License Agreement. This agreement becomes the gate to being a contributor, which leaves me with a question: do we go through the process of getting this paperwork done for anyone who asks? Or do we set a bar to meet before we even consider it?

On one hand, if someone has a great feature idea, wouldn’t it be nice if we could just pull in their work without making them jump through hoops? On the other hand, if we have a hundred people go through this paperwork process, but only one actually ends up contributing anything, what a waste of our time. I would love to hear your thoughts on this.

NuGet, a package manager project I work on, currently follows the latter approach, as described in our guide to becoming a core contributor, but we’re open to refinements and improvements. I should point out that a hosted Mercurial solution does support the centralized committer model, where we provide direct commit access. It just so happens that while some developers on the NuGet project have direct commit access, most don’t, and by project policy they shouldn’t make use of it, because we’re still following a distributed model. We’re not letting the technical abilities or limitations of our source control system or project hosting define our collaboration model.

I know I’m late to the game when it comes to distributed source control, but it’s really striking to me how it’s turned the concept of committers on its head. In the centralized source control world, being a committer was enforced via a technical gate: either you had commit access or you didn’t. With distributed version control, it’s become more a matter of social contract and project policies.




9 responses

  1. wekempf, October 11th, 2010

    I haven't read the book to know exactly how the terms are being defined here, but at a high level I don't think DVCSs change things that much. A project still has an "official" repository. This repository still has a limited number of people who can "commit" to it. This would be your committers.
    Yes, forking is easier. Yes, contributing without being a committer is easier (and maybe most of a project's contributors won't be committers in this model). But quality control is squarely on the shoulders of an "elite" few. Someone makes a pull request and the "committer" is responsible for vetting the change set(s) to ensure they meet the requirements defined for the project. If they don't, either the committer "fixes" things before committing, or rejects the pull request and asks the requester to fix it and submit a new pull request. The process changes a little, but the responsibilities and checks and balances don't.
    There are definite benefits to the DVCS model. Initially I'd have new contributors on "probation", without commit authority. After they've contributed for a while and shown that they provide quality pull requests, I'd consider promoting them to committers. In theory this is what you do with any VCS; the problem is how difficult it is to contribute without commit access in a traditional VCS.

  2. Dylan Beattie, October 12th, 2010

    I think that for a project to gain traction, there needs to be a core developer (or developers) who maintains absolute control over the focus of the project, and an official codebase, with a sensible release schedule, that reflects that focus. It's a tough balance. Accept too many contributions and you end up with feature creep. Release official builds too frequently, and people get fed up with building against a moving target. Accept too few patches or don't release often enough, and people get impatient and fork your project, splintering your potential contributors, your user base, your sense of 'brand identity'.
    There are always going to be people running on the trunk - in fact, one of the things that's nice about GitHub is that it makes it easy to maintain your own personal fork of an open source project. Those people aren't the problem.
    I think the core committers' responsibility should be making sure that each official release is well-defined, bug-free, documented, stable, and sufficiently long-lived that there's time for the community to get to grips with it. The discipline of enforcing feature-freezes and release schedules is what makes an open-source project accessible to people who don't necessarily want to build from source and read the code in lieu of documentation.
    People who desperately need the latest bleeding-edge pre-alpha features can always cherry-pick from each others' repos. The rest of us just want a stable, documented API that we know isn't going to be deprecated tomorrow by a fresh nightly update that might, or might not, integrate cleanly into our project.
    In other words, put my vote against "benevolent dictatorship", keep the trunk lean and clean, and release often, but not too often :)

  3. Karl Fogel, October 12th, 2010

    Hi, Phil -- thanks for your review of the book from before, and for your thoughtful comments now.
    I can't remember if this is in the print edition or not, but in the online version there's a footnote about this exact issue in the section on committers:
    It's short; I'll just paste it in:

    Note that the commit access means something a bit different in decentralized version control systems, where anyone can set up a repository that is linked into the project, and give themselves commit access to that repository. Nevertheless, the concept of commit access still applies: "commit access" is shorthand for "the right to make changes to the code that will ship in the group's next release of the software." In centralized version control systems, this means having direct commit access; in decentralized ones, it means having one's changes pulled into the main distribution by default. It is the same idea either way; the mechanics by which it is realized are not terribly important.

    Having worked with distributed version control in some projects and centralized version control in others, I can't say I've felt a huge social difference between the two. Have you?
    I think the phrase "commit access" has become like the word "patch": a formerly specific word that has come to stand for the general concept for which the specific thing was once the only example.
    A "patch" used to be a patch file: output from the 'diff' program that could be fed through the 'patch' program to be applied to a base. But nowadays "patch" has come to mean "a discrete unit of change applied to source code", possibly including the addition or deletion of files or even directories.
    Likewise, "commit access" now means "having one's changes accepted into the master copy by default, instead of needing to be explicitly approved each time". How this acceptance is obtained is a question of project governance and, ultimately, physical control over the server(s) where the master copy is stored (and of course it requires agreement about which copy is the "master", which can be missing in, e.g., hostile fork situations).
    But my personal experience so far suggests it doesn't have much to do with the version control system the master copy is kept in, nor the version control systems the various developers use (which are usually, but not always, the same as the one used for the master copy -- think git-svn, for example).

  4. Karl Fogel, October 12th, 2010

    (My other, longer comment says it got flagged as spam -- I hope you moderate it through! :-) )
    To answer your question about collecting contributor license agreements:
    One simple way is, just get the agreement from each contributor the first time any change of theirs is approved for incorporation into the code. No matter whether it's a large feature or a small bugfix -- the contributor form is a small, one-time effort, so even for a tiny bugfix it's still worth it (on the theory that the person is likely to contribute again, and the cost of collecting the form is amortized over all the contributions that person ever makes anyway).
    The trick is not to collect it when someone starts talking about making their first patch. Wait until the patch is done & approved, then collect the form. No one who's gotten all the way to patch approval is going to balk at signing the form, unless they object to it in principle (but that's pretty rare).

  5. Jonathan van de Veen, October 12th, 2010

    Another way to make sure you're not missing out on that one great feature idea is to make it possible for people to make a one-time submission to a separate repository. Any contributions in that repository obviously require a code review (which involves work, I know).
    If someone posts a second piece of code to that repository, it then becomes time to discuss an agreement, as they are likely to contribute again.
    This way you potentially have a lot less administrative work and any work you have to put in reviewing code is likely to lead to a good addition to your project.

  6. Joe Brinkman, October 12th, 2010

    @Phil - I touched on many of the committer issues including the CLA in my post a few years back. DotNetNuke has followed the same rules that Karl notes above, whenever someone has some non-trivial, tangible change that is ready to commit, we ask them for a CLA if we don't already have one on record. We don't ask for CLAs for one or two line bug fixes, but when you start submitting whole functions and classes, you can be sure that we will ask for the CLA before anything gets into the central repository.
    Ultimately, most of the issues around Open Source governance have little to do with the tools you use or your development methodology.

  7. haacked, October 12th, 2010

    @Karl thanks for the feedback! I have a physical copy of your book so I probably missed that online note. I really like your tip about the CLA. I think we'll probably change to that approach.

  8. Jakub Narębski, October 13th, 2010

    Actually, nowadays Linux kernel development scales quite well, thanks to the fact that with a distributed version control system such as Git you can set up a network of trust, or in other words a hierarchical structure of repositories. For Linux kernel development this means Linus's lieutenants and subsystem repositories (and also repositories such as the kernel janitors project with trivial patches). Though there was an LWN article about how said hierarchy of repositories in Linux development is quite shallow, and should be deeper to scale better.
    Besides, many OSS projects accept not only pull requests (be it via email, or e.g. via the GitHub message/notification system) or (rarely) pushes, but also patches sent via email, attached to the issue tracker, or posted to a patch review board. With a distributed version control system, a leaf / accidental / new contributor can clone the full repository and do his/her work with all the advantages of a version control system, even if development of a feature takes more than one commit, all without needing to contact the main developers / maintainer / leader of said project. Then he/she can update the work to sit on top of the latest changes (in Git: rebase) and send a series of patches to be reviewed and, ultimately (if they are accepted), applied.
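    The clone-then-rebase-then-patch-series workflow described in this comment can be sketched in git as follows. This is a self-contained toy example in a temp directory; all repository and branch names are hypothetical.

    ```shell
    # Sketch of the contributor workflow described above: clone, develop a
    # feature across several commits, rebase onto the latest upstream, then
    # export a patch series for review. All names here are hypothetical.
    set -e
    work=$(mktemp -d) && cd "$work"

    # A stand-in for the project's main repository.
    git init -q -b main project   # -b requires git 2.28+
    git -C project -c user.email=m@example.com -c user.name=Maintainer \
        commit -qm "initial" --allow-empty

    # The contributor clones the full history and works on a feature branch.
    git clone -q project contributor
    cd contributor
    git checkout -q -b my-feature
    for i in 1 2; do
      echo "$i" > "part$i.txt" && git add "part$i.txt"
      git -c user.email=c@example.com -c user.name=Contributor \
          commit -qm "feature: part $i"
    done

    # Meanwhile, upstream moves on...
    git -C ../project -c user.email=m@example.com -c user.name=Maintainer \
        commit -qm "unrelated upstream change" --allow-empty

    # ...so replay the feature on top of the latest upstream (git rebase),
    # then produce one patch file per commit, ready to mail or attach.
    git fetch -q origin
    git -c user.email=c@example.com -c user.name=Contributor \
        rebase -q origin/main
    git format-patch -q origin/main
    ls *.patch
    ```

    The point of the sketch: none of this required any access to the main repository beyond a read-only clone, which is exactly why the contributor can work independently until review time.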

  9. haacked, October 13th, 2010

    @Jakub Thanks! Yeah, I pointed that out, though after reading what I said, it sounds like I was saying the opposite. My point was that the reason Linux scales is that they moved to this hierarchical model you described. I doubt having a completely flat model would scale, and I think Linux is perhaps evidence of that.