In computing, parallel computation is performed by divvying up computer tasks into multiple sub-tasks in order to perform the computation and by allowing the sub-tasks to run concurrently, i.e., at the same time. Each task is usually, but not necessarily, assigned a different CPU (or CPU core on multi-core CPUs) for execution. These tasks comprise programming instructions, the data the instructions are to operate on, and parameters (initial conditions) which instruct the program how to operate on the data or where to start. In most parallel programs, each task uses the same program instructions and parameters but different data ("SPMD"), or each task uses multiple instruction sets but the same data. Sometimes, everything is the same, but this is usually for the purpose of providing data security through redundancy, and rarely falls under the topic of parallelism. Sometimes, only the parameters are different, and other times, everything is different, such as with client-server programming.
When the computation problem requires (or performs best) that the tasks work extremely closely together, the parallelism is known as fine-grained. When the tasks work independently of each other, the parallelism is known as coarse-grained. As a rule-of-thumb, the finer the grain, the more complex the program and the greater the need for bandwidth between the different CPUs.
Well-known examples of parallel programs include the SETI@home project which used home PCs to analyze the universe's radio waves for signs of extra-terrestrial intelligence, Craig Venter's TIGR which used massively parallel machines to crack the human genome, and IBM's Deep Blue which beat legendary chess grandmaster Gary Kasparov.
Examples of Parallelism
An example of coarse parallelism is a ray tracer rendering a separate output pixel on each available CPU. An example of fine grained parallelism is found in multiplying an n-dimensional vector by a scalar. The vector is cut up and a chunk is distributed to each CPU, the multiplication is done on each chunk, and the result is reassembled.
These examples illustrate key limitations to parallel computing. First, there is the overhead of distributing and reassembling the computation. Second, there is the challenge of developing parallel algorithms. In the ideal case, such as in a ray tracer, elements of the problem (pixels to be rendered) do not depend on one another. That is, the computing one output pixel does not require information or affect any of the other output pixels. These uncoupled problems are referred to as Embarrassingly Parallel. Many problems do not fit in to this category, an example being the classic N-body.
Parallel algorithms and programs are significantly more difficult to write and debug. However, as multicore processors (many armed with SIMD commands) become increasingly common, parallel computation is finding its way in to more mass-consumer code . (e.g. mp3 encoders using SIMD, photoshop multiprocessor support, etc).
Note that it is generally possible to run parallel code on a single processor. Multiple processors are simulated through context switching. When running on a single processor, parallel algorithms typically have worse performance than their serial analogues, mainly because of the overhead of the distribution/communication logic.
A challenge in writing parallel software is decomposing the problem domain so that multiple portions can be solved simultaneously. In some cases, this is very easy. With ray tracing, the radiance computation for each output pixel is completely independent from the other output pixels. Thus a separate computation stream can be used to generate each pixels output. Because the streams do not need to communicate with each other, the only overhead is in collating the results. This scenario is often referred to as embarrassingly parallel.
Many problems do not decompose quite so easily. Computing the nth digit of Pi using a traditional series expansion requires computing all the preceding digits. Thus it is not possible the parallelize the computation of digits n,n+1,n+2,etc because each output depends on the one before. However, using the BBP digit-extraction algorithm allows digits of Pi to be computed independently, so now the problem has become parallelizable.
Another issue is managing the communication overhead between parallel processes. Having two processors does not make a computer twice as fast as having one. Only in the ideal case can double-performance be achieved. In reality, there are diminishing returns as more processors are added.
For critical sections of code to properly handle shared resources, some method for mutual exclusion must be in place. On single processor (and single core) machines, it may be possible to turn off interrupts on the CPU. This prevents the operating system's scheduler from swapping out the current process. Since the process can't be swapped out, it has exclusive access to all the system variables.
In situations where simple interrupt disabling is not possible, the CPU must typically offer some atomic operations to allow for parallel code to be written. A traditional example is the "test and set" operation, which can be used to build up semaphores. Note that there are methods which do not use the CPU for mutual exclusion. One of the first examples of such a code was Dekker's Algorithm , which takes advantage of memory interlock to achieve mutual exclusion.
Additional skeletal content:
- Compare and Swap (for Lockfree programming) 
- SIMD, MIMD instructions
- SIMD: Intel SSE3 instructions. Operate on 128-bit register which can hold 4-floats/2-doubles/etc.
Low Level Primitives
- Introduced by Dijkstra. 
- Normally represented by an integer initialized to either 0 or 1
- P operation - decreases value of semaphore by one; if result is nonnegative, process may continue; if result is negative, process stopped and added to process waiting list for semaphore
- V operation - increases value of semaphore by one; if result is positive, no further effect; if result is nonpositive, one of the processes on semaphore waiting list is removed from list (and this removed process may proceed with its execution).
- Supported in POSIX standard
- Special purpose languages for parallel computation
- General purpose languages for parallel computation
- Pure functional languages
- Auto vectorizing compilers
- Speciality languages for Parallel programming
To simplify the development of parallel programs, libraries and standards have been created which operate at higher levels of abstraction than raw semaphores and SIMD instructions.
OpenMP is an open standard which allows for C, C++, and FORTRAN source code to be annotated with parallelization hints. These annotations tell the compiler which loops can be parallelized and indicate what variable dependencies between interations, if any, exist. OpenMP is supported in many popular compilers, including the Intel family for C/C++/FORTRAN, the GCC version 4 series, and Sun's CC compiler.
The popular Message Passing Interface (MPI), provides another high level abstraction for writing parallel code.
A major recent development has been the advancement of Lock-Free programing. This allows for the development of thread-safe code with out the use of locking mechanisms such as semaphores . While implementing correct lock-free algorithms is considered extremely difficult, they show significant performance advantage over their traditional counterparts. For example, the lock-free dynamic memory allocator implemented in  shows up to a 331% speedup over a traditional lock-based allocator when operating on 16 processors under maximum contention. This lock-free allocator also offers better scalability and latency than traditional allocators.
- Sutter, Herb. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal Mar. 2005. <http://www.gotw.ca/publications/concurrency-ddj.htm>.
- E.W. Dijkstra, Cooperating Sequential Processes (Techniche Hogeschool, Eindhoven, 1965). Reprinted in: F. Genuys (ed.), Programming Languages, Academic Press, 1968, 43--112.
- Alexandrescu, Andrei. "Lock-Free Data Structures." C/C++ Users Journal 1 Sept. 2004. 27 Mar. 2007 <http://www.ddj.com/dept/cpp/184401865>.
- Dijkstra, E. W. 1967. The structure of the “the”-multiprogramming system. In Proceedings of the First ACM Symposium on Operating System Principles J. Gosden and B. Randell, Eds. SOSP '67. ACM Press, New York, NY, 10.1-10.6. DOI= http://doi.acm.org/10.1145/800001.811672
- May, D. 1983. OCCAM. SIGPLAN Not. 18, 4 (Apr. 1983), 69-79. DOI= http://doi.acm.org/10.1145/948176.948183
- Michael, M. M. 2004. "Scalable lock-free dynamic memory allocation". SIGPLAN Not. 39, 6 (Jun. 2004), 35-46. DOI= http://doi.acm.org/10.1145/996893.996848