Distributed version control

From Citizendium

Distributed version control systems such as Git and Mercurial have emerged in the last few years as competitors to older centralized version control systems such as Subversion and CVS.

Terminology

Version control systems can be thought of as incremental backup systems, typically used for software code in development. Such systems go by many names.

A tool that manages and tracks different versions of software or other content is referred to generically as a version control system (VCS), a source code manager (SCM), a revision control system (RCS), and with several other permutations of the words "revision," "version," "code," "content," "control," "management," and "system." [...] [E]ach system addresses the same issues: develop and maintain a repository of content, provide access to historical editions of each datum, and record all changes in a log.[1]

A version control system is considered distributed if each client using the system has a full copy of the source code's history.

History

Local version control

Local version control comes in two flavors. The first is a simple file naming convention, such as file1.c, file2.c or file1.c.bak, in which the entire contents of the file are saved under a new name. While this is not really a version control "system," it is a common practice and worth mentioning to help those who are new to versioning. The Unix command 'rcs', on the other hand, is a true version control mechanism that keeps a database of changes. That is, it records the differences made within a file instead of a complete copy of the file at each revision. The rcs command also allows the author to roll back to an earlier point or undo changes.
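
The idea of recording differences rather than whole copies can be sketched in Python. This is only an illustration built on the standard difflib module; the helper names make_delta and apply_delta are hypothetical, and the sketch does not reproduce the storage format that rcs actually uses.

```python
import difflib

def make_delta(old_lines, new_lines):
    """Record only what changed between two revisions (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Store the affected range of the old file and the lines replacing it.
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta

def apply_delta(old_lines, delta):
    """Rebuild the newer revision from the older one plus the stored delta."""
    result = []
    pos = 0
    for i1, i2, replacement in delta:
        result.extend(old_lines[pos:i1])   # unchanged lines before the edit
        result.extend(replacement)         # lines introduced by the edit
        pos = i2                           # skip the old lines that were replaced
    result.extend(old_lines[pos:])         # unchanged tail
    return result

v1 = ["int main() {", "  return 0;", "}"]
v2 = ["int main() {", "  puts(\"hi\");", "  return 0;", "}"]

delta = make_delta(v1, v2)        # one small entry instead of a full copy of v2
rebuilt = apply_delta(v1, delta)  # same contents as v2
```

Only the first revision and the delta need to be kept; the second revision is reconstructed on demand.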

Centralized version control

As software projects grew in size, resources and scope, a need arose for more organized and collaborative version control. The first centralized versioning systems took the mechanics of RCS (Revision Control System) on a local workstation and applied them to a central repository server: developers push their changes to this central repository, which is shared by the entire development team.

Centralized version control is built in to many IDEs, such as Eclipse, NetBeans and Visual Studio. Because of its maturity and availability, it is the most popular method of version control.

Distributed version control

Linux kernel crisis

In 2002, the decision was made to use BitKeeper by BitMover as the version control solution for the Linux kernel. Up to that point the kernel had not used any kind of version control. The decision to use proprietary software encountered stiff opposition from many free software advocates, who argued that free software projects such as the Linux kernel should not employ any proprietary tools in their making. In 2005 the agreement between the Linux kernel project and BitMover collapsed amid allegations that BitKeeper was being reverse engineered by free software supporter Andrew Tridgell. On April 6 Linus Torvalds announced in an email[1] that the kernel would cease to use BitKeeper. Within days he had begun writing a replacement, the distributed version control system Git, and before the end of the month it was being used to manage the kernel source.

Popular DVCs

Git

Git has emerged as the most popular DVC, expanding from its origins as a Linux kernel tool to support other platforms and features required by other projects. Git is written in C, with some modules written in Perl. It is particularly influential among Ruby programmers and Linux users. Git is distinguished from other DVCs by its extremely high performance, cheap local branching, and the use of a "staging area" between the working files and the repository.

Mercurial

Mercurial is an open source DVC written in Python. Its command is "hg", after the chemical symbol for mercury. Although it trails Git in popularity, it is somewhat easier to learn. It is particularly popular among Python programmers and Windows users. Like Git, Mercurial was started in response to BitMover's decision to end support for its gratis version of BitKeeper. Despite not being chosen for the Linux kernel project, Mercurial's development continued and the program has established a substantial community.

Others

Other DVCs include Bazaar (bzr; used heavily by Ubuntu's developers and by their development platform, Launchpad.net), Darcs (written in Haskell) and Monotone (written in C++).

Comparison with centralized version control

Centralized version control emerged in response to a need for collaboration with users on other systems. The natural choice was a centralized system, essentially a magnification of the local version control model. This model consists of a central server containing versioned files, from which users can check out a file to manipulate on their local machines. They can then check the file back in once changes have been made. The benefits of this system include the ability for an administrator to closely monitor activity and control which users have access to files. However, the obvious downside is the single point of failure inherent in a central server. When the server is offline, no one can work on versioned files or save them to the central project. If the server has not been backed up properly, there is always the risk that all data will be lost. Additionally, as a project evolves, it becomes more complex, and different elements require their own separate development and isolated tests. Uploading previously branched changes simultaneously to a central location provides the elements of a perfect storm for potential problems down the line.[2]

In a centralized system, users must check in to the main trunk of a project and changes are cumulatively progressive in order of addition. Changes are invisible to other users until they are contained in the main trunk.

In a distributed system, all users have a complete repository on their local machine. All the changes made by each user live in these local repositories and can be shared among users or pushed into a common repository. Each change has a unique id. Recording, downloading and applying changes are separate steps in a DVC, while they are simultaneous in a centralized system. The key factors that distinguish DVCs from centralized systems are that a DVC works offline, with connectivity necessary only to share changes; that it is fast, because the work is local; and that it does not require any labor-intensive background software to operate.
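
The separation between recording, downloading and applying changes can be illustrated with a toy model in Python. The Repo class and its method names are hypothetical, and deriving the id from a SHA-1 hash of the parent id and the content is a simplifying assumption made for this sketch; real tools hash considerably more metadata.

```python
import hashlib

class Repo:
    """Toy repository mapping change ids to (parent id, content)."""

    def __init__(self):
        self.changes = {}   # change id -> (parent id, content)
        self.head = None    # id of the change the working copy is based on

    def commit(self, content):
        """Step 1: record a change locally; no network access is involved."""
        payload = f"{self.head}\n{content}".encode("utf-8")
        change_id = hashlib.sha1(payload).hexdigest()
        self.changes[change_id] = (self.head, content)
        self.head = change_id
        return change_id

    def fetch(self, other):
        """Step 2: download changes we lack; our own head is left untouched."""
        new_ids = [cid for cid in other.changes if cid not in self.changes]
        for cid in new_ids:
            self.changes[cid] = other.changes[cid]
        return new_ids

    def update(self, change_id):
        """Step 3: apply a previously downloaded change by moving head to it."""
        if change_id not in self.changes:
            raise KeyError("unknown change; fetch it first")
        self.head = change_id

alice, bob = Repo(), Repo()
first = alice.commit("add README")
second = alice.commit("fix typo in README")

fetched = bob.fetch(alice)     # download only
head_after_fetch = bob.head    # still None, since nothing has been applied
bob.update(second)             # apply as a separate, explicit step
```

In a centralized system the three methods would collapse into a single round trip to the server, which is why connectivity is required for every operation there.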

Merging

Merging is certainly possible in a centralized system, but it is tedious and can be complicated. Merging and branching are integral to the way a distributed system works, and so are fast and easy in a DVC. Essentially, a merge occurs when changes have been made to the same project by two different users, resulting in two recent versions that need to be combined coherently into one. Tools like Git and Mercurial can trace the parentage of the two branch tips in a merge to reason about the appropriate result, so most merges complete quickly and any conflict that does arise can typically be resolved efficiently. In other words, the two most recent versions of a file, each existing in a different location or having come from a different source, can be logically combined based on their respective parent versions.
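
Tracing the parentage of two branch tips amounts to finding a common ancestor in the history graph. The following Python sketch shows one simple way to do that; the function names and the dictionary representation of history are assumptions made for illustration, and real tools use more elaborate algorithms that also cope with histories having several candidate ancestors.

```python
from collections import deque

def ancestors(parents, tip):
    """Return every commit reachable from tip, including tip itself."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen

def merge_base(parents, tip_a, tip_b):
    """Breadth-first search from tip_a for the nearest commit also reachable from tip_b."""
    reachable_from_b = ancestors(parents, tip_b)
    queue = deque([tip_a])
    visited = set()
    while queue:
        commit = queue.popleft()
        if commit in reachable_from_b:
            return commit
        if commit in visited:
            continue
        visited.add(commit)
        queue.extend(parents.get(commit, []))
    return None  # the two tips share no history

# Hypothetical history: A <- B <- C on one branch, B <- D <- E on another.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
base = merge_base(history, "C", "E")
```

Here the two tips C and E diverged at B, so B is the version against which both sets of changes are compared during the merge.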

A basic example of a merge conflict is a situation in which two versions have modified the same part of the same file differently. Most tools, Git for example, will not merge such changes automatically but wait for the user to resolve the conflict.[3]
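
A minimal three-way merge makes the conflict rule concrete. The Python sketch below compares each line of two versions against their common parent. It assumes, purely to stay short, that all three versions have the same number of lines (edits in place only); the function name is hypothetical and real merge algorithms work on aligned regions of differing length.

```python
def three_way_merge(base, ours, theirs):
    """Merge line by line; assumes all three versions have the same line count."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            merged.append(o)         # both sides agree, or neither changed the line
        elif o == b:
            merged.append(t)         # only their side changed the line
        elif t == b:
            merged.append(o)         # only our side changed the line
        else:
            merged.append(None)      # both changed it differently: left to the user
            conflicts.append(index)
    return merged, conflicts

base   = ["title = Draft", "author = unknown", "license = MIT"]
ours   = ["title = Final", "author = alice",   "license = MIT"]
theirs = ["title = Draft", "author = bob",     "license = MIT"]

merged, conflicts = three_way_merge(base, ours, theirs)
```

The title change merges cleanly because only one side touched it, while the author line is reported as a conflict because both sides changed it to different values.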

Workflow

Because of the flexible and fluid nature of a distributed system, workflows built around branching become possible. A typical example is a workflow with different levels of stability. The main branch is the stable version of a project, available to the public or to others. The next branch down is slightly less stable: it is based on the stable branch, is constantly in development, and can be merged into the main branch any time it becomes stable. Below that are less stable sub-branches for individual development issues, and so on. Essentially, each branch can be as specific as a different way of working on the same problem, much like a very simple 'Save As' process.

Specific Workflow Models include:

Centralized Model

A logical first step for users transferring from a centralized system, a centralized workflow consists of a central repository and any number of users with their own private repositories. Users can pull from and push to the server directly. The benefit of a centralized workflow is that users are able to work within a familiar structure while enjoying all the benefits of a distributed version control system. Users typically cannot push if the file in question has been updated since the last time they fetched, so the most current version of a file is always in play.
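
The push restriction can be sketched as an ancestry check. The Python example below models it at the level of whole commits rather than individual files, which is an assumption based on how Git-style tools reject pushes that do not build on the server's latest commit; the function names and the history dictionary are hypothetical.

```python
def is_ancestor(parents, ancestor, commit):
    """Return True if ancestor is reachable from commit by following parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False

def try_push(parents, server_head, local_head):
    """Accept a push only if the local head already contains the server head."""
    return is_ancestor(parents, server_head, local_head)

# Hypothetical history: the server is at B. C builds on B, D was made from
# the older commit A, and M merges D with B after a fetch.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "M": ["D", "B"]}

accepted_c = try_push(history, "B", "C")   # up to date, so the push is accepted
rejected_d = try_push(history, "B", "D")   # stale, so the user must fetch first
accepted_m = try_push(history, "B", "M")   # accepted once B has been merged in
```

A user whose push is rejected fetches the newer commits, merges them locally, and pushes the result.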

Dictator Model

In a Dictator workflow, one person (the dictator) has write access to the central, or "blessed," repository. Other users can only commit to their own repositories and send pull requests to the dictator, who in turn pushes accepted changes to the blessed repository.

Dictator and Lieutenants Model

The Dictator and Lieutenants workflow is how the Linux kernel is run, and is typical of larger projects. One person has access to the blessed repository and can pull from his or her lieutenants, who in turn are in charge of specific branches of development and have the most current and coherent versions of those branches in their repositories. Lieutenants pull from developers and the dictator pulls from lieutenants. This allows a hierarchical organization supplemented by the benefits of distribution.

Integration Manager Model

For larger, but less centralized projects, the integration manager model is often used. Like the dictator model, one person has access to push to the blessed repository, and developers push to a public version of their own repositories, then send a pull request to the integration manager. This is how many open-source projects are run and how GitHub in particular operates.[4]

Non-Linear Model

The non-linear model is typical of developers who work primarily within their own repositories and share changes directly with one another. Much lateral development happens among developers, and because such projects are so active, this work may or may not be pushed to the central repository. Developers remain at liberty to push to and pull from the central repository if they choose.

Adoption

Open source software

Open source DVCs have swept the open source community. Within just a few years of their release, Git and Mercurial in particular boast some of the largest open source projects as partial or complete adopters. These include Android, Debian, Eclipse, GNOME, GTK+, Mozilla, NetBeans, the Linux kernel, OpenJDK, Perl, Qt, and Ruby on Rails. Likewise, all of the major open source code hosting services now support multiple DVCs.

DVCs were designed with open source development in mind, so many of their features mesh nicely with the open source workflow. For example, the emphasis on change sets (instead of versions) allows patterns of non-linear development that are less widespread in proprietary software.

GitHub and BitBucket

GitHub.com was launched in 2008 as a hosting service for Git repositories. Improving upon many of the perceived flaws of older hosting services, GitHub has captured the vast majority of the Git hosting market share. GitHub's popularity has been partly responsible for the rising popularity of Git itself. BitBucket.org was also launched in 2008, offering a GitHub-like interface for Mercurial repositories. GitHub, BitBucket, and similar sites have swept the open source community, with many of the largest open source projects migrating to DVC and new DVC hosts simultaneously. Many of these new DVC hosts offer free public repositories for open source projects, charging only for private repositories.

Barriers to adoption

Support

Many companies are wary of using software that doesn't offer vendor support. While users of open source version control systems will turn to mailing lists, Internet Relay Chat, forums, and the like for help, this type of informal support may be perceived as risky by large companies. This perception has fueled some of the demand for proprietary VCSes like Perforce and Microsoft Visual SourceSafe.

Since proprietary version control systems are mostly centralized (BitKeeper is a notable exception), and DVCs are nearly all open source, the lack of support may slow industry adoption. At present, companies seeking commercially supported DVCs can purchase BitKeeper's Pro or Enterprise licenses.[5] Kiln, from Fog Creek Software, offers a Mercurial-based DVC that simplifies deployment and code review. It too includes technical support.[6]

Auditing

Reliable auditing of the major DVCs is generally impossible. DVCs allow users to permanently delete data and alter saved history. In some cases it may be impossible to recover data or to determine which user introduced a given change. These abilities have little to do with the distributed model per se, but do separate the popular centralized systems from the popular decentralized ones. Some source control users may require reliable auditing in order to protect intellectual property or comply with record-keeping laws. git-svn is a tool that offers some Git features on top of an existing Subversion repository, and as such may offer some of Subversion's auditing abilities.

Access controls are difficult to enforce in a DVC, since DVCs are designed for each user to have a complete history of the repository stored locally on his or her machine. Gitolite and Gitosis were developed to offer per-repository and per-branch/tag access controls, but it is unclear what success, if any, they have had in driving corporate adoption of Git.

Platform support

Git, the most widely used DVC, was developed specifically for Linux kernel development. A Mac OS X port was achieved at a later date, but Windows lagged behind. Today Git can be run on Windows using Cygwin for POSIX emulation, or with the native port called msysGit. Both of these tools have greatly improved since their early releases, but the Git experience on Windows may still be behind that on Linux and Mac OS X. Mercurial and Bazaar do not suffer from the same platform fragmentation problems, probably owing to their multi-platform histories and largely Python (as opposed to C) implementations.

Development tool integration

Centralized systems like Subversion and CVS are widely used through graphical user interface (GUI) tools, especially in the Windows community. TortoiseSVN offers tight integration with Windows Explorer, and AnkhSVN brings Subversion to Microsoft Visual Studio through a plugin. GUIs are also widely used in the Java community; the Subclipse and Subversive plugins for Eclipse are examples. GUIs and IDE integration are standard for proprietary systems like Microsoft Visual SourceSafe or Team Foundation Server.

Many DVCs, however, were born from the Linux community, which values GUIs and Integrated Development Environments (IDEs) rather less. Those accustomed to working with IDE GUI plugins may have difficulty transitioning to a terminal-based workflow. Efforts are ongoing to bring GUI support to the popular DVCs.

TortoiseHg has succeeded in replicating many of the TortoiseSVN features on Windows, but its GNOME (Linux) offerings are less mature. An OS X port has not yet been released. JGit is a pure Java implementation of Git used by the Eclipse EGit plugin. EGit and JGit are officially supported by the Eclipse Foundation, so JGit-based plugins may become common in Java-based IDEs like Eclipse, NetBeans, and IntelliJ IDEA. To date, however, these tools have not been widely used. Early versions were blamed for corrupting Git repositories, and they are not supported by GitHub.[7] Many other GUIs are being actively developed, but they are generally not as mature as GUIs for CVS and Subversion.

References