MPEG-FAQ 4.0: What is MPEG-Audio then ?

What is MPEG-Audio then ?
From: "Harald Popp" <POPP@iis.fhg.de>
From: mortenh@oslonett.no
Date: Fri, 25 Mar 1994 19:09:06 +0100
Q. What is MPEG?
A. MPEG is an ISO committee that proposes standards for
compression of Audio and Video. MPEG deals with 3 issues:
Video, Audio, and System (the combination of the two into one
stream). You can find more info on the MPEG committee in other
parts of this document.
Q. I've heard about MPEG Video. So this is the same compression
applied to audio?
A. Definitely not. The eye and the ear, even though they are only
a few centimeters apart, work very differently. The ear has
a much higher dynamic range and resolution: it can pick out
more details, but it is "slower" than the eye.
The MPEG committee chose to recommend 3 compression methods
and named them Audio Layer-1, Layer-2, and Layer-3.
Q. What does it mean exactly?
A. MPEG-1, IS 11172-3, describes the compression of audio
signals using high performance perceptual coding schemes.
It specifies a family of three audio coding schemes,
simply called Layer-1,-2,-3, with increasing encoder
complexity and performance (sound quality per bitrate).
The three codecs are compatible in a hierarchical
way, i.e. a Layer-N decoder is able to decode bitstream data
encoded in Layer-N and all Layers below N (e.g., a Layer-3
decoder may accept Layer-1,-2 and -3, whereas a Layer-2
decoder may accept only Layer-1 and -2.)
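This hierarchical compatibility can be sketched as a one-line
check (a minimal illustration; the function name is ours, not
part of the standard):

```python
def can_decode(decoder_layer: int, stream_layer: int) -> bool:
    """A Layer-N decoder accepts bitstreams of Layer N and all Layers below."""
    return 1 <= stream_layer <= decoder_layer

# A Layer-3 decoder accepts everything; a Layer-2 decoder rejects Layer-3 streams.
```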
Q. So we have a family of three audio coding schemes. What does
the MPEG standard define, exactly?
A. For each Layer, the standard specifies the bitstream format
and the decoder. It does *not* specify the encoder to
allow for future improvements, but an informative chapter
gives an example for an encoder for each Layer.
Q. What have the three audio Layers in common?
A. All Layers use the same basic structure. The coding scheme can
be described as "perceptual noise shaping" or "perceptual
subband / transform coding".
The encoder analyzes the spectral components of the audio
signal by calculating a filterbank or transform and applies
a psychoacoustic model to estimate the just noticeable
noise-level. In its quantization and coding stage, the
encoder tries to allocate the available number of data
bits in a way to meet both the bitrate and masking
requirements.
The decoder is much less complex. Its only task is to
synthesize an audio signal out of the coded spectral
components.
All Layers use the same analysis filterbank (polyphase with
32 subbands). Layer-3 adds an MDCT transform to increase
the frequency resolution.
All Layers use the same "header information" in their
bitstream, to support the hierarchical structure of the
standard.
All Layers use a bitstream structure that contains parts that
are more sensitive to biterrors ("header", "bit
allocation", "scalefactors", "side information") and parts
that are less sensitive ("data of spectral components").
All Layers may use 32, 44.1 or 48 kHz sampling frequency.
All Layers are allowed to work with similar bitrates:
Layer-1: from 32 kbps to 448 kbps
Layer-2: from 32 kbps to 384 kbps
Layer-3: from 32 kbps to 320 kbps
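Relative to the 1411.2 kbps of uncompressed stereo CD audio
(44.1 kHz, 16 bit, 2 channels), these ranges correspond to the
following compression ratios (a small sketch; the variable
names are ours):

```python
# Bitrate ranges per Layer, in kbps, as listed above.
BITRATE_RANGES_KBPS = {
    "Layer-1": (32, 448),
    "Layer-2": (32, 384),
    "Layer-3": (32, 320),
}

CD_STEREO_KBPS = 44100 * 16 * 2 / 1000  # 1411.2 kbps uncompressed

for layer, (lo, hi) in BITRATE_RANGES_KBPS.items():
    print(f"{layer}: {CD_STEREO_KBPS / hi:.1f}:1 up to {CD_STEREO_KBPS / lo:.1f}:1")
```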
Q. What are the main differences between the three Layers, from a
global view?
A. From Layer-1 to Layer-3,
complexity increases (mainly true for the encoder),
overall codec delay increases, and
performance increases (sound quality per bitrate).
Q. Which Layer should I use for my application?
A. Good Question. Of course, it depends on all your requirements.
But as a first approach, you should consider the available
bitrate of your application as the Layers have been
designed to support certain areas of bitrates most
efficiently, i.e. with a minimum drop of sound quality.
Let us look a little closer at the strong domains of each
Layer.
Layer-1: Its ISO target bitrate is 192 kbps per audio
channel.
Layer-1 is a simplified version of Layer-2. It is most useful
at "high" bitrates, around or above 192 kbps. A version of
Layer-1 is used as "PASC" in the DCC recorder.
Layer-2: Its ISO target bitrate is 128 kbps per audio
channel.
Layer-2 is identical to MUSICAM. It has been designed as a
trade-off between sound quality per bitrate and encoder
complexity. It is most useful at "medium" bitrates of 128
or even 96 kbps per audio channel. The DAB (EU 147)
proponents have decided to use Layer-2 in the future
Digital Audio Broadcasting network.
Layer-3: Its ISO target bitrate is 64 kbps per audio channel.
Layer-3 merges the best ideas of MUSICAM and ASPEC. It has
been designed for best performance at "low" bitrates
around 64 kbps or even below. The Layer-3 format specifies
a set of advanced features that all address one goal: to
preserve as much sound quality as possible even at rather
low bitrates. Today, Layer-3 is already in use in various
telecommunication networks (ISDN, satellite links, and so
on) and speech announcement systems.
Q. So how does MPEG audio work?
A. Well, first you need to know how sound is stored in a
computer. Sound is pressure differences in air. When picked up
by a microphone and fed through an amplifier this becomes
voltage levels. The voltage is sampled by the computer a
number of times per second. For CD audio quality you need to
sample 44100 times per second and each sample has a resolution
of 16 bits. In stereo this gives you 1.4 Mbit per second
and you can probably see the need for compression.
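The CD figure works out like this (simple arithmetic, nothing
MPEG-specific):

```python
sample_rate = 44100   # samples per second for CD audio
bits_per_sample = 16
channels = 2          # stereo

bitrate = sample_rate * bits_per_sample * channels  # bits per second
print(bitrate)        # 1411200 bit/s, i.e. roughly 1.4 Mbit per second
```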
To compress audio MPEG tries to remove the irrelevant parts
of the signal and the redundant parts of the signal. Parts of
the sound that we do not hear can be thrown away. To do this
MPEG Audio uses psychoacoustic principles.
Q. Tell me more about sound quality. How good is MPEG audio
compression? And how do you assess that?
A. Today, there is no alternative to expensive listening tests.
During the ISO-MPEG-1 process, 3 international listening tests
have been performed, with a lot of trained listeners,
supervised by Swedish Radio. They took place in July 1990,
March 1991 and November 1991. Another international
listening test was performed by the CCIR, now ITU-R, in
1992.
All these tests used the "triple stimulus, hidden reference"
method and the so-called CCIR impairment scale to assess the
audio quality.
The listening sequence is "ABC", with A = the original and
B, C = the original and the coded signal in random order.
The listener has to rate both B and C with a number
between 1.0 and 5.0. The meaning of these values is:
5.0 = transparent (this should be the original signal)
4.0 = perceptible, but not annoying (first differences
noticeable)
3.0 = slightly annoying
2.0 = annoying
1.0 = very annoying
With perceptual codecs (like MPEG audio), all traditional
quality parameters (like SNR, THD+N, bandwidth) are
largely useless.
Fraunhofer-IIS (among others) is also working on objective
quality assessment tools, such as the NMR meter
(Noise-to-Mask-Ratio). If you need more information about
NMR, please contact nmr@iis.fhg.de
Q. Now that I know how to assess quality, come on, tell me the
results of these tests.
A. Well, for details you should study one of those AES papers
listed below. One main result is that for low bitrates (60
or 64 kbps per channel, i.e. a compression ratio of around
12:1), Layer-2 scored between 2.1 and 2.6, whereas Layer-3
scored between 3.6 and 3.8.
This is a significant increase in sound quality, indeed!
Furthermore, the selection process for critical sound material
showed that it was rather difficult to find worst-case
material for Layer-3 whereas it was not so hard to find
such items for Layer-2.
For medium and high bitrates (120 kbps or more per channel),
Layer-2 and Layer-3 scored rather similarly, i.e. even
trained listeners found it difficult to detect differences
between the original and the reconstructed signal.
Q. So how does MPEG achieve this compression ratio?
A. Well, with audio you basically have two alternatives. Either
you sample less often or you sample with less resolution (less
than 16 bit per sample). If you want quality you can't do much
with the sample frequency. Humans can hear sounds with
frequencies from about 20Hz to 20kHz. According to the Nyquist
theorem you must sample at least two times the highest
frequency you want to reproduce. Allowing for imperfect
filters, a 44.1 kHz sampling rate is a fair minimum. So
you either set out to prove the Nyquist theorem is wrong or
go to work on reducing the resolution. The MPEG committee
chose the latter.
Now, the real reason for using 16 bits is to get a good
signal-to-noise (s/n) ratio. The noise we're talking
about here is quantization noise from the digitizing
process. For each bit you add, you get 6dB
better s/n. (To the ear, a 6 dB increase corresponds roughly
to a doubling of the sound level.) CD audio achieves about
90 dB s/n. This
matches the dynamic range of the ear fairly well. That is, you
will not hear any noise coming from the system itself (well,
there are still some people arguing about that, but let's
not worry about them for the moment).
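The 6-dB-per-bit rule can be made concrete. For an ideal
uniform quantizer driven by a full-scale sine, the textbook
approximation (general signal-processing knowledge, not from
this FAQ) is about 6.02 dB per bit plus 1.76 dB:

```python
def peak_snr_db(bits: int) -> float:
    """Theoretical peak SNR of an ideal uniform quantizer
    with a full-scale sine input: ~6.02 dB per bit + 1.76 dB."""
    return 6.02 * bits + 1.76

print(peak_snr_db(16))  # ~98 dB in theory; practical CD systems reach about 90 dB
print(peak_snr_db(8))   # ~50 dB: the noise floor becomes clearly audible
```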
So what happens when you sample at 8-bit resolution? You get
a very noticeable noise floor in your recording. You can
easily hear this in silent moments in the music, or between
words or sentences if your recording is a human voice.
Wait a minute. You don't notice any noise in loud passages,
right? This is the masking effect and is the key to MPEG Audio
coding. Stuff like the masking effect belongs to a science
called psycho-acoustics that deals with the way the human
brain perceives sound.
And MPEG uses psychoacoustic principles when it does its
thing.
Q. Explain this masking effect.
A. OK, say you have a strong tone with a frequency of 1000Hz.
You also have a tone nearby of say 1100Hz. This second tone is
18 dB lower. You are not going to hear this second tone. It is
completely masked by the first 1000Hz tone. As a matter of
fact, any relatively weak sound near a strong sound is
masked. If you introduce another tone at 2000Hz, also 18 dB
below the first 1000Hz tone, you will hear it.
You will have to turn down the 2000Hz tone to something like
45 dB below the 1000Hz tone before it will be masked by the
first tone. So the further you get from a sound the less
masking effect it has.
The masking effect means that you can raise the noise floor
around a strong sound because the noise will be masked anyway.
And raising the noise floor is the same as using less bits
and using less bits is the same as compression. Do you get it?
Q. I don't get it.
A. Well, let me try to explain how the MPEG Audio Layer-2 encoder
goes about its thing. It divides the frequency spectrum (20Hz
to 20kHz) into 32 subbands. Each subband holds a little slice
of the audio spectrum. Say, in the upper region of subband 8,
a 6500Hz tone with a level of 60dB is present. OK, the
coder calculates the masking effect of this tone and finds
a masking threshold for the entire 8th subband (all sounds
with frequencies in this band) at 35dB. The acceptable s/n
ratio is thus 60 - 35 = 25 dB, which equals roughly 4-bit
resolution. In addition there are masking effects on bands
9-13 and on bands 5-7, the effect decreasing with the
distance from band 8.
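The bit-allocation arithmetic in this example can be sketched
as follows (a strong simplification of what a real Layer-2
encoder does; the ~6 dB-per-bit rule and the rounding choice
are ours):

```python
def subband_bits(signal_db: float, mask_db: float, db_per_bit: float = 6.0) -> int:
    """Quantizer resolution needed so quantization noise stays under
    the masking threshold: signal-to-mask ratio over ~6 dB per bit."""
    smr = signal_db - mask_db
    return max(0, round(smr / db_per_bit))

# A 60 dB tone over a 35 dB masking threshold: SMR = 25 dB -> 4 bits.
print(subband_bits(60.0, 35.0))
```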
In a real-life situation you have sounds in most bands and the
masking effects are additive. In addition the coder considers
the sensitivity of the ear at various frequencies. The ear
is a lot less sensitive at high and low frequencies. Peak
sensitivity is around 2 - 4kHz, the same region that the
human voice occupies.
The subbands should match the ear, that is each subband should
consist of frequencies that have the same psychoacoustic
properties. In MPEG Layer 2, each subband is 750Hz wide
(with 48 kHz sampling frequency). It would have been better if
the subbands were narrower in the low frequency range and
wider in the high frequency range. That is the trade-off
Layer-2 took in favour of a simpler approach.
Layer-3 has a much higher frequency resolution (18 times
more) - and that is one of the reasons why Layer-3 has a much
better low bitrate performance than Layer-2.
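The subband widths quoted here follow directly from the
sampling frequency (the 18-fold split is the MDCT of Layer-3
mentioned above):

```python
fs = 48000                          # sampling frequency in Hz
n_subbands = 32

layer2_width = fs / 2 / n_subbands  # polyphase filterbank: 750 Hz per subband
layer3_width = layer2_width / 18    # MDCT splits each subband 18 ways: ~41.7 Hz

print(layer2_width, layer3_width)
```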
But there is more to it. I have explained concurrent masking,
but the masking effect also occurs before and after a strong
sound (pre- and postmasking).
Q. Before?
A. Yes, if there is a significant (30 - 40dB) shift in level.
The reason is believed to be that the brain needs some
processing time. Premasking lasts only about 2 to 5 ms;
postmasking can last up to 100 ms.
Other bit-reduction techniques involve considering tonal and
non-tonal components of the sound. For a stereo signal you
may have a lot of redundancy between channels. All MPEG
Layers may exploit these stereo effects by using a
"joint-stereo" mode, with the most flexible approach in
Layer-3. Furthermore, only Layer-3 further reduces
redundancy by applying Huffman coding.
Q. What are the downsides?
A. The coder calculates masking effects by an iterative process
until it runs out of time. It is up to the implementor to
spend bits in the least obtrusive fashion.
For Layer 2 and Layer 3, the encoder works on 24 ms of sound
(with 1152 samples and fs = 48 kHz) at a time. For some
material, the time-window can be a problem. This is
normally in a situation with transients where there are large
differences in sound level over the 24 ms. The masking is
calculated on the strongest sound and the weak parts will
drown in quantization noise. This is perceived as a "noise-
echo" by the ear. Layer 3 addresses this problem
specifically by using a smaller analysis window (4 ms), if
the encoder encounters an "attack" situation.
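The 24 ms window follows from the frame sizes defined in the
standard (384 samples per frame for Layer-1, 1152 for Layer-2
and Layer-3):

```python
# Frame lengths in samples per channel, from the standard.
FRAME_SAMPLES = {"Layer-1": 384, "Layer-2": 1152, "Layer-3": 1152}

fs = 48000  # Hz
for layer, n in FRAME_SAMPLES.items():
    print(f"{layer}: {1000 * n / fs:g} ms per frame")
# At 48 kHz a Layer-2/-3 frame spans 24 ms; Layer-3 can additionally
# switch to short blocks (an analysis window of roughly 4 ms) on attacks.
```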
Q. Tell me about the complexity. What are the hardware demands?
A. Alright. First, we have to distinguish between decoder and
encoder.
Remember: MPEG coding is asymmetrical, with a much larger
workload on the encoder than on the decoder.
For a stereo decoder, various real-time implementations exist
for Layer-2 and Layer-3. They are either based on single-DSP
solutions or on dedicated MPEG audio decoder chips. So
you need not worry about decoder complexity.
For a stereo Layer-2 encoder, various DSP-based solutions
with one or more DSPs exist (with varying quality, too).
For a stereo Layer-3-encoder achieving ISO reference quality,
the current real-time implementations use two DSP32C and
two DSP56002.
Q. How many audio channels?
A. MPEG-1 allows for two audio channels. These can be either
single (mono), dual (two mono channels), stereo or
joint stereo (intensity stereo (Layer-2 and Layer-3) or m/s-
stereo (Layer-3 only)).
In normal (l/r) stereo one channel carries the left audio
signal and one channel carries the right audio signal. In
m/s stereo one channel carries the sum signal (l+r) and the
other the difference (l-r) signal. In intensity stereo the
high frequency part of the signal (above 2kHz) is combined.
The stereo image is preserved but only the temporal envelope
is transmitted.
In addition MPEG allows for pre-emphasis, copyright marks and
original/copy marks. MPEG-2 allows for several channels in
the same stream.
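The m/s matrixing described above is just a sum/difference
transform; it is lossless and trivially invertible (a sketch;
the 1/2 scaling is one common convention, not mandated here):

```python
def ms_encode(left: float, right: float) -> tuple[float, float]:
    """Mid/side matrixing: mid carries the sum, side the difference."""
    return (left + right) / 2, (left - right) / 2

def ms_decode(mid: float, side: float) -> tuple[float, float]:
    """Inverse matrix: reconstructs the original l/r channels exactly."""
    return mid + side, mid - side

mid, side = ms_encode(0.75, 0.25)
print(ms_decode(mid, side))  # (0.75, 0.25) -- perfect reconstruction
```

When left and right are similar, the side signal is close to
zero and cheap to code, which is where the bitrate saving
comes from.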
Q. What about the audio codec delay?
A. Well, the standard gives some figures of the theoretical
minimum delay:
Layer-1: 19 ms (<50 ms)
Layer-2: 35 ms (100 ms)
Layer-3: 59 ms (150 ms)
The practical values are significantly above that. As they
depend on the implementation, exact figures are hard to
give, so the figures in brackets are just rough
rule-of-thumb values.
Yes, for some applications, a very short delay is of critical
importance. E.g. in a feedback link, a reporter can only talk
intelligibly if the overall delay is below around 10 ms.
If broadcasters want to apply MPEG audio coding, they have to
use "N-1" switches in the studio to overcome this problem
(or appropriate echo-cancellers) - or they have to forget
about MPEG altogether.
But with most applications, these figures are small enough to
present no extra problem. At least, if one can accept a Layer-
2 delay, one can most likely also accept the higher Layer-3
delay.
Q. OK, I am hooked! Where can I find more technical
information about MPEG audio coding, especially about
Layer-3?
A. Well, there is a variety of AES papers, e.g.
K. Brandenburg, G. Stoll, et al.: "The ISO/MPEG-Audio Codec:
A Generic Standard for Coding of High Quality Digital
Audio", 92nd AES Convention, Vienna 1992, preprint 3336
E. Eberlein, H. Popp, et al.: "Layer-3, a Flexible Coding
Standard", 94th AES Convention, Berlin 1993, preprint 3493
K. Brandenburg, G. Zimmer, et al.: "Variable Data-Rate
Recording on a PC Using MPEG-Audio Layer-3", 95th AES
Convention, New York 1993
B. Grill, J. Herre, et al.: "Improved MPEG-2 Audio
Multi-Channel Encoding", 96th AES Convention, Amsterdam
1994
And for further information, please contact layer3@iis.fhg.de
Q. Where can I get more details about MPEG audio?
A. Still more details? No shit. You can get the full ISO spec
from Omnicom. The specs do a fairly good job of obscuring
exactly how these things are supposed to work... Jokes aside,
there is no description of the encoder in the specs. The
specs describe the bitstream in great detail and suggest
psychoacoustic models.
Originally written by Morten Hjerde <100034,663@compuserve.com>,
modified and updated by Harald Popp (layer3@iis.fhg.de).
Harald Popp
Audio & Multimedia ("Music is the *BEST*" - F. Zappa)
Fraunhofer-IIS-A, Weichselgarten 3, D-91058 Erlangen, Germany
Phone: +49-9131-776-340
Fax: +49-9131-776-399
email: popp@iis.fhg.de