Processing a highly structured and complex pattern of sensory input as a unified percept of "music" is probably one of the most elaborate features of the human brain. In recent years, attempts have been made to investigate the neural substrates of music perception in the brain. Though progress has been made with the use of rather simplified musical stimuli, we are still far from understanding how music is perceived and how it may elicit intense sensations.
Theoretical models of music perception face the challenge of explaining a wide variety of aspects connected to music, ranging from temporal pattern analysis such as metre and rhythm analysis, through syntactic analysis, for example the processing of harmonic sequences, to more abstract concepts like the semantics of music and the interplay between listeners' expectations and suspense. Attempts have been made to give some of these aspects a neural foundation; these are discussed below.
Several authors have proposed a modular framework for music perception. According to Fodor, mental "modules" have to fulfil certain conditions, the most important of which are information encapsulation and domain-specificity. Information encapsulation means that a (neural) system performs a specific information-processing task and does so independently of the activities of other modules. Domain-specificity means that the module reacts only to specific aspects of a sensory modality. Fodor defines further conditions for a mental module, such as rapidity of operation, automaticity, neural specificity and innateness, whose validity for music-processing modules has been debated.
However, there is evidence from various complementary approaches that music is processed independently from, for example, language, and that there is not even a single module for music itself, but rather sub-systems for different relevant tasks. Evidence for spatial modularity comes mainly from brain lesion studies where patients show selective neurological impairments. Peretz and colleagues list several cases in a meta-study in which patients were not able to recognize musical tunes but were completely unaffected in recognizing spoken language. Such "amusia" can be innate or acquired, for example after a stroke. On the other hand, there are cases of verbal agnosia where the patients can still recognize tunes and seem to have an unaffected sensation of music. Brain lesion studies have also revealed selective impairments for more specialized tasks such as rhythm detection or harmonic judgements.
The idea of modularity has also been strongly supported by the use of modern brain-imaging techniques like PET and fMRI. In these studies, participants usually perform music-related tasks (detecting changes in rhythm or out-of-key notes). The obtained brain activations are then compared to a reference task, so that one can detect brain regions which were especially active for a particular task. Using a similar paradigm, Platel and colleagues have found distinct brain regions for semantic, pitch, rhythm and timbre processing.
To find out the dependencies between different neural modules, brain imaging techniques with a high temporal resolution are usually used. Examples are EEG and MEG, which can reveal the delay between stimulus onset and the processing of specific features. These studies showed, for example, that pitch height is detected within 10-100 ms after stimulus onset, while irregularities in harmonic sequences elicit an enhanced brain response 200 ms after stimulus presentation. Another method to investigate the information flow between the modules in the brain is TMS. In principle, DTI or fMRI observations with causality analysis can also reveal those interdependencies.
Early auditory processing
A neural description of music perception has to start with the early auditory system in which the raw sensory input in the shape of sound waves is translated into an early neural representation.
Sound waves are described by their frequency components and the respective intensities. Both frequency (measured in Hz) and intensity (measured in energy per area contained in the air pressure variation) translate into psychophysical quantities, namely pitch and loudness. After the sound waves have passed the pinna and auditory canal, they hit the tympanic membrane, which is connected via the ossicles and the oval window to the fluid-filled cochlea. Here the basilar membrane responds to movements of the cochlear fluid. The position of maximal displacement of the basilar membrane depends on the frequency and intensity of the stimulus.
A special type of cell, the hair cell receptor, converts the mechanical displacement of the basilar membrane into an electrical signal, which is then transmitted by the auditory nerve, consisting of around 30,000 spiral ganglion cells. Because hair cells at a given location on the membrane are only excited when a particular frequency is present in the sound wave, one can think of the cochlea as performing a crude Fourier analysis of the input signal. Each ganglion cell signals by its firing rate how strongly a certain range of frequencies around a preferred characteristic frequency is contained in the original signal (the frequencies and intensities to which a neuron responds are called its receptive field). Theoretically, such a ganglion cell might be modelled by applying a bandpass filter which corresponds to its receptive field.
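The bandpass-filter picture can be illustrated with a minimal sketch. It assumes an idealized, extremely narrow filter per cell (a single-bin Fourier projection) and a linear response; real ganglion cells have broader, asymmetric receptive fields and nonlinear firing rates. The sample rate, the function names and the chosen characteristic frequencies are hypothetical and serve only as an illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def make_signal(frequencies, duration=0.1):
    """Sum of unit-amplitude sine waves at the given frequencies (Hz)."""
    n = round(SAMPLE_RATE * duration)
    return [
        sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in frequencies)
        for i in range(n)
    ]


def channel_response(signal, characteristic_freq):
    """Amplitude of the signal component at one characteristic frequency.

    This is a single-bin Fourier projection, i.e. an idealized, very
    narrow bandpass filter standing in for one ganglion cell.
    """
    re = im = 0.0
    for i, x in enumerate(signal):
        angle = 2 * math.pi * characteristic_freq * i / SAMPLE_RATE
        re += x * math.cos(angle)
        im += x * math.sin(angle)
    return 2 * math.hypot(re, im) / len(signal)


# A sound containing two frequency components.
signal = make_signal([440, 1000])

# A small "filter bank" of hypothetical characteristic frequencies.
responses = {cf: channel_response(signal, cf)
             for cf in (250, 440, 700, 1000, 2000)}

for cf, r in responses.items():
    print(f"{cf:5d} Hz channel: {r:.3f}")
```

Only the channels tuned to 440 Hz and 1000 Hz respond noticeably, which is the sense in which a bank of such filters performs a crude Fourier analysis.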
From the auditory nerve on, there are multiple pathways on which auditory information is conducted. One primary pathway goes via the brainstem (the cochlear nuclei, superior olive and inferior colliculus) and the thalamus (medial geniculate nuclei) to the primary auditory cortex in the temporal lobes.
A common principle of organization of sound representations along this pathway is known as tonotopy. Tonotopy means that neighbouring neurons encode nearby frequencies. Inside the cochlea, this is just a physical consequence of the resonance behavior of the basilar membrane. However, this mapping is also preserved in the later stages, though not in all areas of the auditory cortex. Sounds from around 200 Hz up to 20,000 Hz are represented tonotopically. For lower frequencies from 20 Hz to 4 kHz an additional mechanism is used to encode the frequency of the signal. This mechanism is called phase locking: a single neuron or a whole population responds with spikes whose timing is locked to the phase of the sound wave. For low frequencies, a single neuron fires one spike per wave period (the period is the inverse of the sound frequency).
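Phase locking can be made concrete with a small sketch. It assumes an idealized neuron that fires exactly one spike per period at a fixed phase, and quantifies the locking with the vector strength, a measure commonly used for this purpose (1 means perfect locking, 0 means no phase preference). The function names and the example spike trains are hypothetical.

```python
import cmath
import math


def phase_locked_spikes(freq_hz, duration_s, phase=0.25):
    """Spike times of an idealized neuron firing once per period.

    `phase` is the fraction of the period (0 <= phase < 1) at which
    the spike occurs.
    """
    period = 1.0 / freq_hz
    n_periods = round(duration_s * freq_hz)
    return [(k + phase) * period for k in range(n_periods)]


def vector_strength(spike_times, freq_hz):
    """Length of the mean phase vector of the spikes relative to freq_hz."""
    phasors = [cmath.exp(2j * math.pi * freq_hz * t) for t in spike_times]
    return abs(sum(phasors)) / len(phasors)


# Spikes locked to a 100 Hz tone: one spike every 10 ms.
locked = phase_locked_spikes(100.0, 1.0)

# Regular spikes every 3.7 ms, which bear no fixed relation to the 10 ms period.
unlocked = [k * 0.0037 for k in range(100)]

print(f"locked:   {vector_strength(locked, 100.0):.3f}")
print(f"unlocked: {vector_strength(unlocked, 100.0):.3f}")
```

In the locked case the inter-spike interval equals the period of the sound wave, so the stimulus frequency can in principle be read off from the spike timing alone.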
Before the next steps in music processing can take place, single melodic lines and temporal patterns have to be extracted from the incoming stream of auditory information. In the case of multiple physical sources, the interaural disparities help to localize and discriminate different sound sources. In polyphonic musical pieces, the extraction of single melodic streams can be explained with the application of Gestalt principles (such as similarity, proximity and continuity) with respect to melodic, rhythmic or timbral aspects.
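As an illustration of the proximity principle only, the following sketch splits a tone sequence into melodic streams by greedily attaching each tone to the stream whose last tone is closest in pitch. This is a toy procedure assumed for this example, not a model taken from the literature; pitches are given as MIDI note numbers, and the threshold `max_jump` is a hypothetical parameter.

```python
def segregate(pitches, max_jump=5):
    """Greedily group a pitch sequence into streams by pitch proximity.

    Each tone joins the stream whose most recent tone is nearest in
    pitch, provided the jump is at most `max_jump` semitones;
    otherwise it starts a new stream.
    """
    streams = []
    for pitch in pitches:
        best = None
        for stream in streams:
            jump = abs(stream[-1] - pitch)
            if jump <= max_jump and (best is None or jump < abs(best[-1] - pitch)):
                best = stream
        if best is None:
            streams.append([pitch])
        else:
            best.append(pitch)
    return streams


# Two interleaved lines, one high and one low, each rising stepwise.
interleaved = [72, 48, 74, 50, 76, 52]
print(segregate(interleaved))
```

The alternating high and low tones end up in two separate streams, which mirrors how listeners tend to hear such a sequence as two simultaneous melodic lines.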
In the case of tunes with lyrics, language is processed in parallel to the musical features in separate brain areas.
Pitch is the psychophysical equivalent of frequency and is defined as a percept according to which sounds may be ordered from low to high. For a pure tone, pitch corresponds to the frequency. For complex tones, the perceived pitch corresponds to the fundamental frequency, even if only the harmonics are present in the stimulus ("periodicity pitch" or "effect of the missing fundamental"). A tone is then defined as a sound to which a certain pitch height can be attributed. In a chromatic scale, the pitch scale is divided into octaves with 12 semitones each. The organization in octaves, and thus the percept of pitch differences, is logarithmic: the perceived difference between a 220 Hz and a 440 Hz tone is the same as between a 440 Hz and an 880 Hz tone.
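The logarithmic relation can be written out as a short calculation. The sketch below assumes that the 12 semitones of an octave are equally spaced on the logarithmic axis (equal temperament), which the text does not state explicitly; the function name is hypothetical.

```python
import math


def semitone_distance(f1_hz, f2_hz):
    """Pitch distance from f1 to f2 in semitones (12 per octave).

    Assumes equal temperament: each semitone is a frequency ratio
    of 2 ** (1 / 12).
    """
    return 12 * math.log2(f2_hz / f1_hz)


# Both steps span one octave, although the second covers twice as many Hz.
print(semitone_distance(220, 440))
print(semitone_distance(440, 880))

# The same 220 Hz difference starting from 440 Hz is much less than an octave.
print(round(semitone_distance(440, 660), 2))
```

Equal frequency ratios, not equal frequency differences, correspond to equal perceived pitch intervals.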
In the 1980s it was shown that cats seem to have a pitch perception like that of humans, in the sense that they cannot distinguish a complex tone lacking the fundamental frequency from a pure tone which contains only that frequency. Other experiments showed pitch perception also in species of birds and monkeys. Whitfield and colleagues showed in a follow-up study that the auditory cortex is necessary for a unifying pitch perception: cats whose auditory cortex was removed no longer confused the two test stimuli.
One might ask whether the tonotopy found in auditory cortex is then based on pitch rather than frequency. Pantev et al. showed that this might indeed be the case. In a human MEG study, they compared brain activations in auditory cortex when stimulating with 1) a complex tone with a 250 Hz fundamental and the 4th to 7th harmonics, 2) a pure tone at 250 Hz and 3) a pure tone at 1000 Hz (the fourth harmonic). They could show that the first two stimuli led to brain activations in the same cortical area, whereas the third stimulus activated a different region. Although this provides evidence that tonotopy on that level might be organized with respect to pitch, it does not say anything about the response properties of single neurons. In fact, single neurons only seem to encode frequency, but a population code was suggested for the encoding of pitch information.
Furthermore, auditory cortex performs a feature extraction which goes beyond pitch height. Within 10-100 ms after stimulus presentation, information about pitch chroma, timbre, intensity and roughness of the sound is processed.
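The logic of the three stimuli can be spelled out in a few lines. The sketch below uses the greatest common divisor of the component frequencies as a crude stand-in for the perceived (periodicity) pitch; this shortcut is an assumption for illustration and only works for exact integer harmonics, and the variable and function names are hypothetical.

```python
from functools import reduce
from math import gcd


def implied_pitch(components_hz):
    """Greatest common divisor of the component frequencies (integer Hz).

    A crude stand-in for periodicity pitch: the fundamental of which
    all components are harmonics.
    """
    return reduce(gcd, components_hz)


# Stimulus 1: 4th to 7th harmonics of 250 Hz, fundamental absent.
complex_tone = [250 * k for k in range(4, 8)]
# Stimuli 2 and 3: pure tones.
pure_250 = [250]
pure_1000 = [1000]

print(complex_tone)
print(implied_pitch(complex_tone), implied_pitch(pure_250), implied_pitch(pure_1000))

# Shared spectral components between the complex tone and each pure tone.
print(set(complex_tone) & set(pure_250))
print(set(complex_tone) & set(pure_1000))
```

The complex tone shares its pitch, but no spectral component, with the 250 Hz pure tone, and shares a spectral component, but not its pitch, with the 1000 Hz pure tone. A map organized by frequency would therefore group stimuli 1 and 3, while a map organized by pitch would group stimuli 1 and 2, as reported.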
The next step involves encoding a sequence of multiple tones which build up pitch contours, hence melodies. On a single-cell level, experiments have been done by McKenna et al. with cats (1989). They found neurons which responded only to the second tone in a pairing of two tones. Moreover, when presenting the cat with sequences of ascending and descending tones, they found that the characteristic frequency of a given neuron was dependent on the melodic context: it changed from 13 kHz in the descending sequence to a preferred frequency of 12 kHz in an ascending sequence. McKenna and colleagues argue that the complex pattern of excitation and inhibition of neighbouring cells leads to these adaptive effects.
Espinosa and Gerstein performed multi-unit recordings in anaesthetized cats using a three-tone stimulus in all possible permutations, and found a distinct activation pattern across local neural populations for each of the orders. This was a first hint of a distributed representation of melodies. It also showed that functional connections are highly plastic and change on timescales of the order of the experiments themselves. But the results also show that the analysis becomes quite complicated even for a simple melodic sequence.
Western tonal music often uses 5-7 pitches from the chromatic scale with an unequal spacing. There exists a hierarchy in this scale, from the tonic as the most dominant tone, followed by the fifth, the third, the other scale tones and finally any non-scale tones. The recognition of the octave and possibly of the fifth is innate for humans. Monkeys have also shown octave generalization. According to Peretz, a hierarchical order in scales "facilitates perception, memory, performance by creating expectancies".
When one describes music as "discrete elements [that] are organized into sequences that are structured according to syntactic regularities", it is natural to postulate a mental module for syntactic processing which analyzes harmonic structures, e.g. sequences of chords. Indeed, specific EEG signals have been found to correlate with the perception of syntactic irregularities in harmonic chord sequences. The knowledge of musical syntax is mostly implicit and processed automatically.
At least as important as the tonal analysis of music is the analysis of temporal structure, namely rhythm and metre. Rhythm analysis segments the auditory stream into temporal groups (possibly on the basis of Gestalt principles), while metre analysis deals with extracting the underlying beat (to which one would naturally tap one's fingers). In metre perception, one can also find a hierarchy of strongly and weakly accented beats.
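The scale hierarchy can be sketched as a small data structure. The example assumes a major scale (7 of the 12 chromatic pitch classes, with the step pattern 2-2-1-2-2-2-1 semitones) and takes "the third" to mean the major third; both choices are assumptions for this illustration, and the function names are hypothetical.

```python
# Semitone steps between successive tones of a major scale (assumed here).
MAJOR_STEPS = (2, 2, 1, 2, 2, 2, 1)


def major_scale(tonic):
    """Pitch classes (0-11) of the major scale built on `tonic`."""
    scale = [tonic % 12]
    for step in MAJOR_STEPS[:-1]:
        scale.append((scale[-1] + step) % 12)
    return scale


def hierarchy_level(pitch_class, tonic):
    """Rank of a pitch class in the tonal hierarchy (lower = more stable).

    0 = tonic, 1 = fifth, 2 = third, 3 = other scale tone,
    4 = non-scale tone.
    """
    interval = (pitch_class - tonic) % 12
    if interval == 0:
        return 0
    if interval == 7:
        return 1
    if interval == 4:
        return 2
    if (pitch_class % 12) in major_scale(tonic):
        return 3
    return 4


# C major: tonic = pitch class 0 (C).
print(major_scale(0))
print([hierarchy_level(pc, 0) for pc in range(12)])
```

The unequal spacing of the scale tones is visible in the step pattern, and the ranking assigns every chromatic pitch class a place in the hierarchy relative to the chosen tonic.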
A simple model of metre extraction from a temporal pattern comes from Povel and Essens (1985). They propose a set of rules according to which certain events in a pattern sound accented. The ticks of a group of internally generated clocks with different periods and phase offsets are then compared to the positions of the accented events, and the clock whose ticks have the most overlap is predicted to be the underlying metre. This model accounts for several finger tapping experiments and also successfully predicts how easy it is to memorize those patterns (patterns which do not strongly induce any clock are harder to memorize).
A more sophisticated model of metre perception should also explain dance movements and predict the notation of time signatures and barlines in scores. Examples of such models are those of Lerdahl and Jackendoff (Generative Theory of Tonal Music, 1983) with a set of 14 rules, those of Longuet-Higgins, Lee and Steedman, and that of Temperley (2001), which account for metre, phrasing, counterpoint, harmony and key prediction.
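The clock idea can be sketched as follows. This is a simplified illustration rather than the published model: the accent rules used here (an isolated tone is accented, the second tone of a pair is accented, the first and last tones of a longer run are accented) are the rules as they are commonly summarized and are not listed in the text above; the score simply measures the fraction of clock ticks that land on accented events, and the pattern is treated as a single non-repeating sequence. All names are hypothetical.

```python
def accents(pattern):
    """Grid positions that sound accented.

    `pattern` is a list of 0/1 values on an isochronous grid,
    where 1 marks a tone onset.
    """
    accented = set()
    i, n = 0, len(pattern)
    while i < n:
        if not pattern[i]:
            i += 1
            continue
        j = i
        while j + 1 < n and pattern[j + 1]:
            j += 1  # extend the run of adjacent onsets
        run_length = j - i + 1
        if run_length == 1:
            accented.add(i)            # isolated tone
        elif run_length == 2:
            accented.add(j)            # second tone of a pair
        else:
            accented.update((i, j))    # first and last tone of a longer run
        i = j + 1
    return accented


def best_clock(pattern, periods=(2, 3, 4)):
    """Return (period, phase, score) of the best-matching clock.

    The score is the fraction of the clock's ticks that coincide
    with accented events (a simplification assumed for this sketch).
    """
    acc = accents(pattern)
    best = None
    for period in periods:
        for phase in range(period):
            ticks = range(phase, len(pattern), period)
            score = sum(1 for t in ticks if t in acc) / len(ticks)
            if best is None or score > best[2]:
                best = (period, phase, score)
    return best


pattern = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(sorted(accents(pattern)))
print(best_clock(pattern))
```

For this pattern the clock with a period of four grid units and no phase offset wins, which corresponds to hearing the pattern in a metre of four. A pattern for which no clock reaches a high score would, in the spirit of the model, be predicted to be harder to memorize.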
There are further aspects which need to be considered in a complete framework of music processing. However, it is difficult to give these modules an exact position inside the proposed hierarchical framework. One of these aspects is memory. There is the need for a "musical lexicon" which contains representations of all the specific musical phrases to which one has been exposed during one's lifetime. Aside from working memory for several musical aspects, an "associative memory" is needed to connect music to further non-musical information, such as the composer, lyrics or situations which are remembered during listening.
A mental module for "emotion expression analysis" has been proposed whose task it is to recognize and experience emotion coded in the music. No one will doubt that music has the power to elicit strong emotional responses. Relevant factors may be the mode (major or minor), tempo, timbre, dynamics, the interplay of tension and release, and the building up of expectancies. However, one has to be careful about which of these elements have a universal meaning. Western listeners often claim that the minor third creates an intrinsically sad effect, but Irish, Hungarian and Spanish folk music contradict this claim (Aristotle said about the minor third that it "inspires enthusiasm").
Listening to music has also been shown to have an effect on the human immune system, and changes in heart rate, electrodermal activity and hormone levels have been reported. The origin of "shivers" and "chills" and its connection to certain musical elements is not yet understood.
Concerning the semantic content of music, a separate module was proposed. In the simplest case, music resembles gestures, objects or natural sounds and therefore conveys a meaning. It might also suggest a particular mood. There can be extra-musical associations as well (e.g. national anthems).
One important interdependence has been neglected so far, namely auditory-motor interactions. Here the brain is seen as a dynamical system which generates a behavior or motor action depending on its (sensory) input and internal state. This feedback cycle is of course relevant for active music performance, dancing or just "tapping to the beat".
Three main issues in auditory-motor interactions have been distinguished: timing, sequencing and spatial organization. For timing, neural clock and counter mechanisms based on pulses and neural oscillatory behaviour have been proposed. "Sequencing" refers to the task of ordering individual movements in temporal sequences and spatial relations. Several brain regions were identified for timing tasks: the cerebellum for tasks where short timings are relevant (below 1 s), the basal ganglia and supplementary motor area for longer timings, and premotor and prefrontal cortices for the timing of complex movements. Learning of motor patterns is believed to happen in the basal ganglia, whereas the execution of complex sequences and motor prediction are again performed in the premotor and prefrontal cortices.
Of course, auditory and motor areas are tightly connected and this is an active field of research. Just to mention a few studies: in an MEG study, Haueisen and Knösche observed involuntary activity in the motor areas of pianists who were listening to a familiar piano tune. Conversely, Haslinger (2005) showed that merely watching someone play the piano keyboard without auditory support triggered responses in the early auditory areas in pianists.
So far, we described a modular framework of music perception based on some neural evidence. However, apart from the early auditory processing stages, we did not give any explanation of how the brain is actually implementing the proposed modules.
This raises the question of what we expect from a theoretical model of music perception. From a functional viewpoint, one might construct models on different levels of abstraction: on the lowest level, one is interested in explaining the coding of auditory features or information processing with an explicit neural coding mechanism, with the aim of reproducing neurophysiological findings. On a higher level, one might propose biologically plausible (network) models in order to explain psychophysical findings. In a more abstract way, algorithmic models and computer programs can be written with the aim of explaining human behaviour resulting from percept and interpretation.
A complementary approach is descriptive modeling: proposing how music processing might be implemented on a coarse-grained scale in the brain by identifying brain regions and their interconnections. However, just by naming these "neural correlates of music perception" one does not learn much about the actual mechanisms.
As stated above, there are accepted neural coding mechanisms for early representations of auditory stimuli, up to perhaps primary auditory cortex. For tonal and temporal analysis, e.g. pitch discrimination and metre analysis, there exist algorithmic models and perhaps even models based on neurobiologically plausible structures. However, no theoretical model is able to explain our unified percept of music and the relations it has to our emotions, body state, memory and motor control. A modular model as described above can only give the framework in which unifying computational models have to be developed.
Another problem with finding the neural basis of music perception lies in the omnipresent but inevitable reductionism. Most of the experiments on the auditory system have been performed with simple sine wave stimuli. Other experiments which have claimed to investigate the effects of music have mostly used synthesized MIDI-like stimuli of monophonic tunes, which is still far from what could be called real music. Also, paradigms in sensory-motor experiments mainly involved finger tapping tasks or rhythmic limb movements - an amusing exception is "The Neural Basis of Human Dance" by Brown et al. The general hope is that by dividing music into its parts and learning how the brain deals with the single elements, the rest will just "scale up" with more complex stimuli. But the inherent non-linearities in the brain seem to make this a hopeless attempt. Philip Ball summarizes the dilemma in a recent essay in Nature:
"Trying to understand music is a little like trying to understand biology. The problem is so hard that you have to be reductionist, breaking it down into the building blocks and how they function. Then you find that the original problem has evaporated: in this atomistic view, 'life' or 'music' ceases to be visible at all."
- Koelsch, S.; Siebel, W.A. (2005). "Towards a neural basis of music perception". Trends in Cognitive Sciences 9 (12): 578-584. DOI:10.1016/j.tics.2005.10.001.
- Peretz, I.; Coltheart, M. (2003). "Modularity of music processing". Nat Neurosci 6 (7): 688-91. DOI:10.1038/nn1083.
- Platel, H.; Price, C.; Baron, J.C.; Wise, R.; Lambert, J.; Frackowiak, R.S.; Lechevalier, B.; Eustache, F. (1997). "The structural components of music perception. A functional anatomical study". Brain 120 (2): 229-243. DOI:10.1093/brain/120.2.229.
- (Chung and Colavita, 1976; Heffner and Whitfield, 1976)
- Wright, A.A.; Rivera, J.J.; Hulse, S.H.; Shyan, M. (2000). "Music Perception and Octave Generalization in Rhesus Monkeys". Music Perception 129 (3): 291-307.
- Fitch, W.T.; Rosenfeld, A.J. (2007). "Perception and Production of Syncopated Rhythms". Music Perception 25 (1): 43-58. DOI:10.1525/mp.2007.25.1.43.
- Zatorre, R.J.; Chen, J.L.; Penhune, V.B. (2007). "When the brain plays music: auditory-motor interactions in music perception and production". Nature Reviews Neuroscience 8: 547-558. DOI:10.1038/nrn2152.
- Haueisen, J.; Knösche, T.R. (2001). "Involuntary Motor Activity in Pianists Evoked by Music Perception". Journal of Cognitive Neuroscience 13 (6): 786-792. DOI:10.1162/08989290152541449.
- Brown, S.; Martinez, M.J.; Parsons, L.M. (2006). "The Neural Basis of Human Dance". Cerebral Cortex 16 (8): 1157-1167.
- Ball, P. (2008). "Science & Music: Facing the music". Nature 453: 160-162. DOI:10.1038/453160a.