A database is a collection of computer-processable records used for storing information. One of the simplest database organizations, indexed sequential, stores records in an order set by some collating rule, and provides a mechanism for adding records whose sequence places them between existing records, as well as a mechanism for deleting records inside the database. More advanced databases have more complex ways of organizing records. Contrast this with so-called "flat files", such as text files in CSV format, where a single line represents a record. In a flat file, records cannot be inserted or deleted in place, except at the end of the file; any other change requires rewriting the file.
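The difference from a flat file can be illustrated with a minimal in-memory sketch. It is not a real indexed sequential file: the class name `SortedRecords` and its methods are hypothetical, and a sorted Python list stands in for the on-disk index. It shows only the two mechanisms described above, inserting a record between existing ones and deleting a record from the middle.

```python
import bisect


class SortedRecords:
    """Toy stand-in for an indexed sequential file (hypothetical, in memory only)."""

    def __init__(self):
        self._keys = []    # keys kept in collating order; acts as the index
        self._values = []  # record bodies, kept parallel to _keys

    def insert(self, key, value):
        # Binary search finds where the key belongs in the existing order.
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            self._values[pos] = value  # key already present: replace the record
        else:
            self._keys.insert(pos, key)
            self._values.insert(pos, value)

    def delete(self, key):
        # Returns True if a record was removed, False if the key was absent.
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            del self._keys[pos]
            del self._values[pos]
            return True
        return False

    def keys(self):
        return list(self._keys)
```

Inserting "adams", "clark" and then "baker" leaves the keys in collating order, with "baker" placed between the other two, and deleting "baker" afterwards removes it from the middle without touching the other records.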
Advanced databases may have capabilities beyond complex organization. They may comply with rules for transaction processing, which require (see ACID properties) that a unit of work either run to completion before the database is updated, or, if the work cannot be completed, that the incomplete unit be "rolled back" without changing the state of the database. Through both their logical organization and physical mechanisms such as Redundant Arrays of Inexpensive Disks (RAID), such databases can be engineered to tolerate damage to storage media, or even destruction of an entire copy.
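The all-or-nothing behavior of a transaction can be sketched with Python's standard `sqlite3` module. The `accounts` table, the account names, and the `transfer` and `balances` helpers are assumptions made for illustration; only the commit-or-rollback behavior of the connection used as a context manager comes from the library.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit of work; True if committed."""
    try:
        # Used as a context manager, the connection commits if the block
        # finishes normally and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # The unit of work cannot complete: abandon both updates.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical example data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()
```

With this data, a transfer of 30 from "alice" to "bob" commits both updates, while a transfer of 500 fails and leaves both balances exactly as they were before the attempt.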
To protect against loss of a physical copy, it is common practice to maintain copies at multiple locations. In the simplest approach, the copy remote from the main site is a real-time mirror of the data, or even a sequential backup file. More complex mechanisms either have complete copies at multiple sites, or have parts of the data at different sites.
Traditional databases have all content, other than backups, at one site. Only the main copy of the data is centralized; users may still be at multiple locations. There can be good reasons to break away from the centralized model, ranging from fast access to local copies (i.e., distributed data) to the economies of not needing all data at all sites.
Once the data are no longer centralized, administration becomes a greater challenge. There is also a significant synchronization problem: ensuring that every instance of a given piece of data is identical in each storage system.
There are useful tradeoffs among the cost of transmission bandwidth, the cost of storage, and the cost of transmission delay in retrieving remote data. Multicasting and peer-to-peer techniques should be evaluated for each design.
Especially when the data does not change frequently but will be accessed multiple times, a caching architecture may be attractive. This is quite common in applications such as entertainment content distribution.
In a caching system, the first time data is needed, it is requested from the remote site, but a copy is retained locally. That local copy may have an explicit time for which it remains valid (e.g., as in the Domain Name System), or there may be a finite cache size, from which the least recently used data is deleted first to make room for new information.
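Both retention policies can be combined in one small sketch. The class name `TtlLruCache`, its parameters, and the `fetch` callable that stands in for the request to the remote site are all hypothetical; a real cache would also need to handle concurrency and fetch failures.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Toy cache with a validity time per entry and least-recently-used eviction."""

    def __init__(self, fetch, max_entries=128, ttl_seconds=60.0,
                 clock=time.monotonic):
        self._fetch = fetch            # stand-in for the remote request
        self._max = max_entries        # finite cache size
        self._ttl = ttl_seconds        # how long a local copy stays valid
        self._clock = clock            # injectable so tests can control time
        self._entries = OrderedDict()  # key -> (value, expires_at), oldest use first

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            # Valid local copy: mark it as most recently used and return it.
            self._entries.move_to_end(key)
            return entry[0]
        # Miss or expired copy: go to the remote site and keep the result.
        value = self._fetch(key)
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used
        return value
```

Repeated calls to `get` for the same key reach the remote site only once until the entry either expires or is evicted to make room for more recently used keys.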
Distributed databases put a full copy of information at each location that uses it. As long as the data are kept synchronized, this is extremely fault-tolerant. Access delays are minimal, but when the data change rapidly, the cost and complexity of update and synchronization increase.
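The tradeoff between cheap reads and expensive updates can be seen in a deliberately naive sketch. The class `ReplicatedStore` and its methods are hypothetical, each site is modeled as a plain dictionary in one process, and the write-to-every-copy scheme shown is only one simple way to keep copies identical; it ignores network failures and concurrent writers.

```python
class ReplicatedStore:
    """Toy model of full replication: every site holds a complete copy."""

    def __init__(self, site_names):
        self._replicas = {name: {} for name in site_names}

    def write(self, key, value):
        # Every update must reach every site, so write cost grows with
        # the number of sites and with the rate of change.
        for replica in self._replicas.values():
            replica[key] = value

    def read(self, site, key):
        # Reads are served entirely from the local copy.
        return self._replicas[site][key]

    def in_sync(self):
        # True when all copies hold identical contents.
        copies = list(self._replicas.values())
        return all(copy == copies[0] for copy in copies[1:])
```

A read at any site touches one dictionary, while a single write touches all of them, which is why rapidly changing data makes this arrangement costly.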
Note that a distributed file system does not necessarily have the complex data organization of a true database, but such filesystems often can provide the infrastructure for a distributed database.
Bibliographic information systems that are updated at periodic intervals, such as MEDLINE, lend themselves to distributed storage.
In a federated database, different parts of the database reside in different locations, which does not preclude having redundant copies of those parts. Some type of directory service is necessary to locate data, which adds complexity beyond simple distribution. Federated databases thus retain the synchronization challenges of distributed databases while adding the need to maintain a directory.
A common commercial application would be for regional divisions of an enterprise to retain the data for their region, even distributing it within the region. When a region needs data that belongs to another region, it will treat the other region's database as a member of the federation.
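The role of the directory in such a regional arrangement can be sketched as follows. The class `Federation`, its methods, and the region and key names are hypothetical, and each regional database is reduced to a dictionary; the sketch shows only that a lookup first consults the directory to find which member holds the data.

```python
class Federation:
    """Toy federated store: each region keeps its own data, a directory locates it."""

    def __init__(self):
        self._members = {}    # region name -> that region's own records
        self._directory = {}  # record key -> region that holds the record

    def add_member(self, region):
        self._members[region] = {}

    def put(self, region, key, value):
        # Data stays with the owning region; only its location is shared.
        self._members[region][key] = value
        self._directory[key] = region

    def get(self, key):
        # Step 1: ask the directory where the record lives.
        region = self._directory[key]
        # Step 2: fetch it from that member of the federation.
        return region, self._members[region][key]
```

A request for a customer record made anywhere in the enterprise returns both the owning region and the record, without the requesting region holding a copy of the other region's data.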
While it is not a true database, the content of the World Wide Web has many of the characteristics of a federated database.