Distributed version control

From Citizendium
imported>AnnaLisa Allegretti


Revision as of 13:08, 9 August 2010


This article is currently being developed as part of an Eduzendium student project. The course homepage can be found at CZ:Special_Topics_2010.
To provide students with experience in collaboration, you are warmly invited to join in here, or to leave comments on the discussion page. The anticipated date of course completion is 13 August 2010. One month after that date at the latest, this notice shall be removed.
Besides, many other Citizendium articles welcome your collaboration!


This article is a stub and thus not approved.

Distributed version control systems such as Git and Mercurial have emerged in the last few years as competitors to older centralized version control systems such as Subversion and CVS.

Overview

Terminology

A tool that manages and tracks different versions of software or other content is referred to generically as a version control system (VCS), a source code manager (SCM), a revision control system (RCS), and with several other permutations of the words "revision," "version," "code," "content," "control," "management," and "system." [...] [E]ach system addresses the same issues: develop and maintain a repository of content, provide access to historical editions of each datum, and record all changes in a log.[1]

History

Local version control

Centralized version control

Distributed version control

Linux kernel crisis

Popular DVCSes

Git

Git was started by Linus Torvalds for use in the development of the Linux kernel. Originally, Linus had been using the commercial BitKeeper software, but this caused controversy among advocates of free software. Git was developed as an alternative. It is written in C with some modules written in Perl.

Git has become popular with many developers, especially in the Ruby community, due in part to the availability of GitHub, a commercial Git hosting site that provides free hosting for open source projects, and Gitorious, an open source Git hosting system. Git integrations exist for TextMate, Vim, Redmine and many other systems.

In addition to the main implementation in C, there are implementations of Git in other languages: JGit (Java) and Dulwich (Python, named after the town in which "Mr and Mrs Git" live in a Monty Python sketch). Dulwich can be used for interoperability between Git and Mercurial, and JGit is used for Java IDE integration and to push and pull Git repositories to Amazon's S3 cloud storage platform.

Mercurial

Mercurial is an open source DVCS written in Python.

Others

Other DVCSes include Bazaar (bzr; used heavily by Ubuntu's developers and development platform, Launchpad.net), Darcs (written in Haskell) and Monotone (written in C++).

Comparison with centralized version control

Centralized version control emerged in response to the need to collaborate with users on other systems. The natural choice was a centralized system, essentially a scaled-up version of the local version control model. In this model, a central server holds the versioned files; users check out a file to work on it on their local machines and check it back in once their changes have been made. The benefits of this system include the ability for an administrator to closely monitor activity and to control which users have access to which files. However, the obvious downside is the single point of failure inherent in a central server. When the server is offline, no one can check out versioned files or save changes to the central project. If the server hasn't been backed up properly, there is always the risk that all data will be lost. Additionally, as a project evolves it becomes more complex, and different elements require their own separate development and isolated testing. When several users upload separately developed changes to a central location at the same time, the changes can interfere with one another and cause problems down the line.

In a centralized system, users must check their changes in to the main trunk of a project, where changes accumulate in the order in which they were added. Changes are invisible to other users until they are in the main trunk.

In a distributed system, all users have a complete repository on their local machine. All the changes made by each user live in these local repositories and can be shared among users or pushed into a common repository. Each change has a unique ID. Recording, downloading and applying changes are separate steps in distributed version control (DVC), while they happen simultaneously in a centralized system. The key factors that distinguish DVC from centralized version control (CVC) are that it works offline, since connectivity is only necessary to share changes; that it is fast, because the work is local; and that it doesn't require any labor-intensive background software to operate.
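The separation between recording, downloading and applying a change can be illustrated with a toy model. The following Python sketch is not how Git or Mercurial are implemented; the `Repository` class, its method names and the use of a SHA-1 hash as the change ID are all assumptions made for illustration, and "applying" is reduced to moving a pointer.

```python
import hashlib


class Repository:
    """Toy repository (hypothetical): a complete, self-contained store of changes."""

    def __init__(self):
        self.changes = {}   # change id -> (parent id, description)
        self.head = None    # id of the most recent change applied here

    def commit(self, description):
        """Record a change locally; no network access is involved."""
        payload = f"{self.head}:{description}".encode()
        change_id = hashlib.sha1(payload).hexdigest()  # assumed ID scheme
        self.changes[change_id] = (self.head, description)
        self.head = change_id
        return change_id

    def fetch(self, other):
        """Download changes from another repository without applying them."""
        new_ids = [cid for cid in other.changes if cid not in self.changes]
        for cid in new_ids:
            self.changes[cid] = other.changes[cid]
        return new_ids

    def apply(self, change_id):
        """Apply a previously downloaded change as a separate step."""
        if change_id not in self.changes:
            raise KeyError("change has not been downloaded yet")
        self.head = change_id


alice, bob = Repository(), Repository()
first = alice.commit("add README")   # recorded offline in alice's repository
fetched = bob.fetch(alice)           # bob downloads the change...
bob.apply(first)                     # ...and applies it in a separate step
```

In this sketch both users hold the full set of changes after the exchange, and neither needed a central server to record work.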

Merging

Merging is certainly possible in a centralized system, but it is tedious and can be complicated. Merging and branching, however, are integral to the way a distributed system works, and so are fast and easy in a DVC. Essentially, merging occurs when changes have been made to the same project by two different users, resulting in two recent versions that need to be resolved coherently into one. If a conflict does arise with a merge, it can typically be resolved quickly and efficiently, as tools like Git and Mercurial can trace the parentage of the two branch tips in a merge to reason about the appropriate result. In other words, the two most recent versions of a file, each existing in a different location or having come from a different source, can be logically combined based on their respective parent versions.
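Tracing the parentage of two branch tips amounts to finding their nearest common ancestors in the history graph. The Python sketch below shows one simple way to do this; the function names and the dictionary representation of history are assumptions for illustration, and real tools use more efficient algorithms.

```python
def ancestors(parents, tip):
    """Return every change reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        change = stack.pop()
        if change not in seen:
            seen.add(change)
            stack.extend(parents.get(change, []))
    return seen


def merge_bases(parents, tip_a, tip_b):
    """Return the common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(parents, tip_a) & ancestors(parents, tip_b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# Hypothetical history: A <- B <- C on one branch, and B <- D on another.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
base = merge_bases(history, "C", "D")
```

Here the two tips `C` and `D` share the ancestors `A` and `B`, and `B` is selected as the base because it is the most recent point at which the branches agreed.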

A basic example of a merge conflict is a situation in which two versions have modified the same part of the same file differently. In that case a tool such as Git does not merge the versions automatically, but waits for the user to resolve the conflict.
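The rule described above can be sketched as a line-by-line three-way merge against the common parent. This Python example is a deliberate simplification: it assumes all three versions have the same number of lines, and the function name and return format are hypothetical rather than taken from any real tool.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited versions of a file, line by line, against their common parent.

    Simplifying assumption: all three versions have the same number of lines.
    Returns (merged_lines, conflicting_line_numbers).
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles edits that keep the line count")
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:        # both sides agree, or neither changed the line
            merged.append(o)
        elif o == b:      # only "theirs" changed the line
            merged.append(t)
        elif t == b:      # only "ours" changed the line
            merged.append(o)
        else:             # both changed the same line differently
            merged.append(None)   # left unresolved for the user
            conflicts.append(number)
    return merged, conflicts


base = ["title", "intro", "body"]
ours = ["Title", "intro", "body v2"]
theirs = ["title", "Intro", "body v3"]
merged, conflicts = three_way_merge(base, ours, theirs)
```

Lines changed on only one side are taken automatically, while the third line, edited differently on both sides, is reported as a conflict instead of being merged.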

Workflow

Because of the diversified and fluid nature of the distributed system, new workflows become possible. A typical example is a workflow with different levels of stability. The main branch is the stable version of the project, available to the public or to others. The next branch down is slightly less stable but closely tracks the stable branch; it is constantly in development and can be merged into the main branch any time it becomes stable. Below that are less stable sub-branches for individual development issues, and so on. Essentially, each branch can be as specific as a different way of working on the same problem, much like a very simple 'Save As' process.
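The tiered workflow can be modelled as an ordered list of branches in which changes move one level up once they are judged stable. The Python sketch below is only an illustration: the branch names, the `promote` function and the `is_stable` callback are all hypothetical, and a merge is reduced to copying missing change IDs.

```python
# Hypothetical branch names, ordered from most to least stable.
TIERS = ["main", "develop", "topic"]


def promote(branches, source, is_stable):
    """Merge source into the next more stable tier if it is judged stable.

    branches maps a branch name to the list of change ids it contains.
    """
    index = TIERS.index(source)
    if index == 0:
        raise ValueError("the most stable branch has nothing above it")
    target = TIERS[index - 1]
    if not is_stable(branches[source]):
        return False
    for change in branches[source]:
        if change not in branches[target]:
            branches[target].append(change)
    return True


branches = {"main": ["c1"], "develop": ["c1", "c2"], "topic": ["c1", "c2", "c3"]}
promoted = promote(branches, "develop", is_stable=lambda changes: True)
```

In practice the stability check would be a test suite or a review step rather than a function that always returns true.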

Adoption

Open source software

Open source DVCs have swept the open source community. Within just a few years of their release, Git and Mercurial in particular count some of the largest open source projects as partial or complete adopters. These include Android, Debian, Eclipse, GNOME, GTK+, Mozilla, NetBeans, the Linux kernel, OpenJDK, Perl, Qt, and Ruby on Rails. Likewise, all of the major open source code hosting services now support multiple DVCs.

DVCs were designed with open source development in mind, so many of their features mesh nicely with the open source workflow. For example, the emphasis on change sets (instead of versions) allows patterns of non-linear development that are less widespread in proprietary software.

GitHub and BitBucket

GitHub.com was launched in 2008 as a hosting service for Git repositories. Improving upon many of the perceived flaws of older hosting services, GitHub has captured the vast majority of the Git hosting market share. GitHub's popularity has been partly responsible for the rising popularity of Git itself. BitBucket.org was also launched in 2008, offering a GitHub-like interface for Mercurial repositories. GitHub, BitBucket, and similar sites have swept the open source community, with many of the largest open source projects migrating to DVC and new DVC hosts simultaneously. Many of these new DVC hosts offer free public repositories for open source projects, charging only for private repositories.

Barriers to adoption

Support

Many companies are wary of using software that doesn't offer vendor support. While users of open source version control systems will turn to mailing lists, Internet Relay Chat, forums, and the like for help, this type of informal support may be perceived as risky by large companies. This perception has fueled some of the demand for proprietary VCSes like Perforce and Microsoft Visual SourceSafe.

Since proprietary version control systems are mostly centralized (BitKeeper is a notable exception), and DVCs are nearly all open source, the lack of support may slow industry adoption. At present, companies seeking commercially supported DVCs can purchase BitKeeper's Pro or Enterprise licenses.[2] Kiln, from Fog Creek Software, offers a Mercurial-based DVC that simplifies deployment and code review. It too includes technical support.[3]

Auditing

Reliable auditing of the major DVCs is generally impossible. DVCs allow users to permanently delete data and alter saved history. In some cases it may be impossible to recover data or to determine which user introduced a given change. These abilities have little to do with the distributed model per se, but do separate the popular centralized systems from the popular decentralized ones. Some source control users may require reliable auditing in order to protect intellectual property or comply with record-keeping laws. git-svn is a tool that offers some Git features on top of an existing Subversion repository, and as such may offer some of Subversion's auditing abilities.

Access controls are difficult to enforce in a DVCS, since DVCs are designed for each user to have a complete history of the repository stored locally on his or her machine. Gitolite and Gitosis were developed to offer per-repository and per-branch/tag access controls, but it is unclear what success, if any, they have had in driving corporate adoption of Git.

Platform support

Git, the most widely used DVC, was developed specifically for Linux kernel development. A Mac OS X port followed at a later date, but Windows lagged behind. Today Git can be run on Windows using Cygwin for POSIX emulation, or with a native port called msysGit. Both of these tools have greatly improved since their early releases, but the Git experience on Windows may still lag behind that on Linux and Mac OS X. Mercurial and Bazaar do not suffer from the same platform fragmentation problems, probably owing to their multi-platform histories and largely Python (as opposed to C) implementations.

Development tool integration

Centralized systems like Subversion and CVS are widely used through graphical user interface (GUI) tools, especially in the Windows community. TortoiseSVN offers tight integration with Windows Explorer, and AnkhSVN brings Subversion to Microsoft Visual Studio through a plugin. GUIs are also widely used in the Java community; the Subclipse and Subversive plugins for Eclipse are examples. GUIs and IDE integration are standard for proprietary systems like Microsoft Visual SourceSafe or Team Foundation Server.

Many DVCs, however, were born from the Linux community, which values GUIs and integrated development environments (IDEs) rather less. Those accustomed to working with IDE GUI plugins may have difficulty transitioning to a terminal-based workflow. Efforts are ongoing to bring GUI support to the popular DVCs.

TortoiseHg has succeeded in replicating many of the TortoiseSVN features on Windows, but its GNOME (Linux) offerings are less mature. An OS X port has not yet been released. JGit is a pure Java implementation of Git used by the Eclipse EGit plugin. EGit and JGit are officially supported by the Eclipse Foundation, so JGit-based plugins may become common in Java-based IDEs like Eclipse, NetBeans, and IntelliJ IDEA. To date, however, these tools have not been widely used. Early versions were blamed for corrupting Git repositories, and they are not supported by GitHub.[4] Many other GUIs are being actively developed, but they are generally not as mature as GUIs for CVS and Subversion.

References

  1. Jon Loeliger, Version Control with Git, chapter 1, ISBN 0596520123
  2. BitKeeper Sales
  3. Kiln Support
  4. GitHub Help - Fixing egit corruption