EFFICIENT SPECTRAL ENVELOPE CODING USING VARIABLE TIME/FREQUENCY RESOLUTION AND TIME/FREQUENCY SWITCHING
TECHNICAL FIELD
The present invention relates to a new method and apparatus for efficient coding of spectral envelopes in audio coding systems. The method may be used both for natural audio coding and speech coding, and is especially suited for coders using SBR [WO 98/57436] and other high frequency reconstruction methods.
BACKGROUND OF THE INVENTION
Audio source coding techniques can be divided into two classes: natural audio coding and speech coding. Natural audio coding is commonly used for music or arbitrary signals at medium bitrates, and generally offers wide audio bandwidth. Speech coders are basically limited to speech reproduction but can on the other hand be used at very low bitrates, albeit with low audio bandwidth. In both classes, the signal is generally separated into two major signal components, the "spectral envelope" and the corresponding "residual" signal. Throughout the following description, the term "spectral envelope" refers to the coarse spectral distribution of the signal in a general sense, e.g. filter coefficients in a linear prediction based coder or a set of time-frequency averages of subband samples in a subband coder. The term "residual" refers to the fine spectral distribution in a general sense, e.g. the LPC error signal or subband samples normalized using the above time-frequency averages. "Envelope data" refers to the quantized and coded spectral envelope, and "residual data" to the quantized and coded residual. At medium and high bitrates, the residual data constitutes the main part of the bitstream while the envelope data is merely a fraction. At very low bitrates, the envelope data constitutes a comparably larger part of the bitstream. Hence, it is important to represent the spectral envelope compactly when using lower bitrates.
Older prior art audio coders and most speech coders use static, relatively short time segments in the generation of envelope data to achieve good temporal resolution. However, this prevents optimal utilisation of the frequency domain masking known from psycho-acoustics. To improve coding gain through the use of narrow filterbands with steep slopes, and still achieve good temporal resolution during transient passages, modern audio coders employ adaptive window switching, i.e. they switch time segment lengths depending on the signal statistics. Clearly, a minimum usage of the short segments is a prerequisite for maximum coding gain. Unfortunately, long transition windows are needed to alter the segment lengths, limiting the switching flexibility.
The spectral envelope is a function of two variables: time and frequency. The encoding can be done by exploiting redundancy in either direction of the time/frequency plane. Generally, coding of the spectral envelope is performed in the frequency direction using delta coding (DPCM), linear prediction (LPC), or vector quantization (VQ).
SUMMARY OF THE INVENTION
The present invention provides a new method and an apparatus for spectral envelope encoding. The invention teaches how to perform and compactly signal a time/frequency mapping of the envelope representation, and further, how to encode the spectral envelope data efficiently using adaptive time/frequency direction coding. In the absence of transients, i.e. for quasi-stationary signals, a time/frequency grid with low temporal and high frequency resolution is used as default. In the vicinity of transients, the temporal resolution is increased at the expense of frequency resolution. The invention describes two schemes for signalling of the time and frequency resolution used. One scheme allows arbitrary selection of instantaneous resolution by explicit signalling of time segment borders and frequency resolutions, whereas the other exploits the fact that transients are separated by at least a minimum time, $T_{nmin}$, in order to reduce the required number of control bits. In the encoder, a transient detector decides whether the current granule contains a transient, and if so, determines the position of the onset of the transient. The position within the granule is encoded and sent to the decoder. Both the encoder and decoder share rules that specify the time/frequency distribution of the spectral envelope samples, given a certain combination of subsequent control signals, ensuring an unambiguous decoding of the envelope data. The rules can be realised as a book of tables explicitly specifying the division of the current granule in terms of samples in the time/frequency plane. The variable time/frequency resolution method is also applicable to envelope encoding based on prediction. Instead of grouping of subband samples, predictor coefficients are generated for time segments of varying lengths according to the system. Different predictor orders may be used for transient and quasi-stationary (tonal) segments.
The present invention presents a new and efficient method for scalefactor redundancy coding. A Dirac pulse in the time domain transforms to a constant in the frequency domain, and a Dirac pulse in the frequency domain, i.e. a single sinusoid, corresponds to a signal with constant magnitude in the time domain. Simplified, on a short term basis, the signal shows less variation in one domain than in the other. Hence, using prediction or delta coding, coding efficiency is increased if the spectral envelope is coded in either the time or the frequency direction, depending on the signal characteristics.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will now be described by way of illustrative examples, not limiting the scope or spirit of the invention, with reference to the accompanying drawings, in which:
Figs. 1a - 1b illustrate uniform and non-uniform sampling in time of the spectral envelope, respectively;
Figs. 2a - 2c illustrate transient detector look-ahead and granule interdependency;
Figs. 3a - 3f illustrate segments with different time and frequency resolutions, and the corresponding control signals;
Fig. 4 illustrates time/frequency switched envelope coding;
Fig. 5 is a block diagram of an encoder using the envelope coding according to the invention; and
Fig. 6 is a block diagram of a decoder using the envelope coding according to the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
The below-described embodiments are merely illustrative of the principles of the present invention for efficient envelope coding. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Generation of Envelope Data
Most audio and speech coders have in common that both envelope data and residual data are transmitted and combined during the synthesis at the decoder. Two exceptions are coders employing PNS ["Improving Audio Codecs by Noise Substitution", D. Schultz, JAES, vol. 44, no. 7/8, 1996], and coders employing SBR. In the case of SBR, considering the highband, only the coarse spectral structure needs to be transmitted, since a residual signal is reconstructed from the lowband. This puts higher demands on how to generate envelope data, in particular due to the lack of the "timing" information contained in the original residual signal. This problem will now be demonstrated by means of an example.
Fig. 1 shows the time/frequency representation of a musical signal where sustained chords are combined with sharp transients with mainly high frequency content. In the lowband the chords have high energy and the transient energy is low, whereas the opposite is true in the highband. The envelope data that is generated during time intervals where transients are present is dominated by the high intermittent transient energy. In the SBR process in the decoder, the spectral envelope of the transposed signal is estimated using the same instantaneous time/frequency resolution as used for the analysis of the original highband. An equalization of the transposed signal is then performed, based on dissimilarities in the spectral envelopes. For example, amplification factors in an envelope adjusting filterbank are calculated as the quotients between original signal and transposed signal scalefactors. For this kind of signal, a problem arises: the transposed signal has the same chord to transient energy ratio as the lowband. The gains needed in order to adjust the transposed transients to the correct level thus cause the transposed chords to be amplified relative to the original highband level for the full duration of the envelope data containing transient energy. These momentarily too loud chord fragments are perceived as pre- and post-echoes to the transient, see Fig. 1a. This kind of distortion will hereinafter be referred to as "gain induced pre- and post-echoes". The phenomenon can be eliminated by constantly updating the envelope data at such a high rate that the time between an update and an arbitrarily located transient is guaranteed to be short enough not to be resolved by the human hearing. However, this approach would drastically increase the amount of data to be transmitted and is thus not practical.
Therefore a new envelope data generation scheme is presented. The principal solution is to maintain a low update rate during tonal passages, which make up the majority of typical programme material, and by means of a transient detector localize the transient positions and update the envelope data close to the leading flanks, see Fig. 1b. This eliminates gain induced pre-echoes. In order to represent the decay of the transients well, the update rate is momentarily increased in a time interval after the transient start. This eliminates gain induced post-echoes. The time segmenting during the decay is not as crucial as finding the start of the transient, as will be explained later. In order to compensate for the smaller time steps, a lower frequency resolution can be used during the transient, keeping the data size within limits. A non-uniform sampling in time and frequency as outlined above is applicable both to subband coders and to linear prediction based coders.
Some prior art coders employ variable time/frequency resolution as well. In the case of subband coders, this is commonly achieved through switching of the filterbank size. Such a change in size cannot take place immediately; so-called transition windows are needed, and thus the update points cannot be chosen freely. When using SBR, the filterbank can be designed to meet both the highest temporal and the highest frequency resolution needed. Thus the varying time and frequency sampling can be obtained by grouping the subband samples from a fixed filterbank in different ways. In other words, by keeping the filterbank size constant, high frequency resolution or high time resolution can be obtained instantaneously. In the case of prediction based coders, no elaborate time/frequency resolution switching schemes are known from prior art.
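The grouping of subband samples from a fixed filterbank can be illustrated with a minimal sketch. It assumes the filterbank output is available as a matrix of subband energies (time slots by subbands) and that the envelope value of each grid cell is the mean energy. The function name `group_envelope`, the border lists and the example values are hypothetical and are not taken from the description.

```python
import numpy as np

def group_envelope(energy, time_borders, band_borders):
    """Average subband energies over a non-uniform time/frequency grid.

    energy       : array of shape (num_time_slots, num_subbands)
    time_borders : segment borders in time slots, e.g. [0, 6, 7, 8]
    band_borders : one list of frequency band borders per time segment
    """
    envelopes = []
    segments = zip(time_borders[:-1], time_borders[1:])
    for seg, (t0, t1) in enumerate(segments):
        bands = band_borders[seg]
        # One envelope value per frequency band of this segment.
        env = [energy[t0:t1, b0:b1].mean()
               for b0, b1 in zip(bands[:-1], bands[1:])]
        envelopes.append(np.array(env))
    return envelopes

# Hypothetical granule: 8 time slots, 8 subbands, transient in slot 6
# with its energy concentrated in the upper four subbands.
energy = np.ones((8, 8))
energy[6, 4:] = 100.0

high_res = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # 8 bands (high frequency resolution)
low_res = [0, 2, 4, 6, 8]               # 4 bands (low frequency resolution)

# One long high-resolution segment before the transient, then two
# short low-resolution segments starting at the transient onset.
envs = group_envelope(energy, [0, 6, 7, 8], [high_res, low_res, low_res])
```

Because the filterbank size never changes, switching between the two resolutions only changes how the same samples are averaged, which is why the update points can be placed freely.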
Typical coders operate on a block basis, where every block represents a fixed time interval. Those blocks will be referred to as "granules". Let a granule have a length of q time quantization steps, hereinafter called "subgranules". In applications where delay restrictions are not critical, as in point to multipoint broadcasting, a transient detector look-ahead can be employed on the encoder side. Having this additional information, envelope data spanning across borders of granules can be accommodated. This enables a more flexible selection of time/frequency resolutions, and facilitates constant bitrate operation, since parts of the payload can be moved between consecutive granules. Referring to Fig. 2, the granules are divided into eight subgranules. The transient detector operates on blocks with the same time span as a granule, each overlapping 50% of two consecutive granules; that is, the transient detector look-ahead is half a granule. The transient detector has detected a transient in subgranule 6 at time n-1, and a transient in subgranule 7 at time n. With these values as input to the time/frequency resolution controlling algorithms, the corresponding time/frequency grid for granule n might be as shown in Fig. 2c. As seen from the figure, subgranule 7 of the granule at time n-1 is included in the time/frequency grid of granule n. Moreover, it is possible to use an analysis by synthesis approach, i.e. having a decoder in the encoder to assess the most beneficial time/frequency sampling.
Control Signalling
In order to correctly interpret the received envelope data, the segment borders and frequency resolutions (number of coefficients or scalefactors) must be signalled. If a non-uniform sampling according to Fig. 2 is to be employed, the problem of envelope data spanning over the granule borders must be dealt with. Furthermore, the signalling must be flexible enough to cover all combinations of interest, without generating too large an amount of control data.
Theoretically, transients can occur within a granule in C combinations, ranging from no transient at all to q transients, where C is given by
$C = \sum_{k=0}^{q} \binom{q}{k} = 2^q$ (Eq 1)
In order to signal C states, $\log_2(C) = \log_2(2^q) = q$ bits are required, corresponding to one bit per subgranule. If different frequency resolutions are to be used in the segments, even more bits might be required in order to signal the frequency resolution chosen. However, in low bitrate applications the number of control signal bits must be kept at a minimum. The first step towards an efficient signalling is to employ two time sampling modes, uniform and non-uniform sampling in time. The uniform mode is used during quasi-stationary passages, and employs high frequency resolution and relatively long time segments, both of which are predefined. Hence this mode does not require any signalling of segment borders or frequency resolutions. One bit is sufficient to signal the time sampling mode to the decoder. The non-uniform mode is used during transient passages and requires additional signalling. Two such signalling systems are proposed by the present invention.
The first system, hereinafter referred to as the "border-signalling system", uses one bit per subgranule to signal whether or not a segment border is present at the left border of the subgranule. Envelope data corresponding to a segment is always sent in the granule in which the segment starts. This means that the number of envelopes transmitted in a granule equals the number of left borders in the granule, i.e. the bit sum of the q border bits. The segment frequency resolutions are signalled with dynamically allocated control bits, e.g. one bit per envelope. Again, this number of bits is derived from the q border bits.
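How a decoder might derive the segments and the number of dynamically allocated bits from the border bits can be sketched as follows. This is only an illustration under stated assumptions: the function name and the example bit pattern are hypothetical, one frequency-resolution bit per envelope is assumed, and one extra bit is counted for the time sampling mode.

```python
def segments_from_border_bits(border_bits):
    """Return (start, end) subgranule indices of the segments that start
    in this granule. `end` is None for the last segment, because its right
    border is the first left border of the subsequent granule."""
    starts = [i for i, bit in enumerate(border_bits) if bit]
    ends = starts[1:] + [None]
    return list(zip(starts, ends))

# Hypothetical granule of q = 8 subgranules with left borders
# at subgranules 0, 6 and 7.
border_bits = [1, 0, 0, 0, 0, 0, 1, 1]

segments = segments_from_border_bits(border_bits)

# The number of envelopes sent in the granule is the bit sum.
num_envelopes = sum(border_bits)

# Assumed here: one frequency-resolution bit per envelope.
num_resolution_bits = num_envelopes

# One mode bit + q border bits + resolution bits.
control_bits = 1 + len(border_bits) + num_resolution_bits
```

Since the count of resolution bits follows from the border bits, no separate length field is needed for them.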
Some examples of grouping of subgranules into time segments are given in Fig. 3, where the subgranules are numbered from 000 to 111. L denotes low frequency resolution and H denotes high resolution. In the example, the number of scalefactors or coefficients in a high resolution segment is assumed to be two times that of a low resolution segment. Fig. 3a shows a reference system, constantly using the highest possible time and frequency resolution. The relative data matrix size is one by definition, and obviously no control signal bits are needed in this system. If no transient is present in or next to a specific granule, the granule is divided into two segments of equal length and the envelope representations are calculated using high frequency resolution. If the two envelope representations do not differ by more than a certain amount, only one set of high resolution envelope data is sent. Those cases are illustrated by Figs. 3b and 3c, where the control signal "Uniform" tells that uniform sampling in time is used, and the signal "LowTime" indicates whether one or two envelopes are sent. Hence, the control signal overhead is two bits. The - symbol means that the signal is not transmitted. Figs. 3d - 3f show some cases where a transient, denoted by T, is present. The border-signalling system uses 8 bits to signal subgranule left borders, and a varying number of bits to signal the frequency resolution within the subgranules. Those signals are called "Borders" and "LowFreq" respectively. The "TranPos" signal is not part of this system, and will be explained later. The right border of the last segment in a granule equals the first left border in the subsequent granule. P means that the corresponding envelope data was sent in the previous granule, see Fig. 3f. The signalling overhead varies between 12 and 13 bits in Figs. 3d - 3f. Notice that the transient cases d and f generate the same data matrix size as the non-transient case b. Furthermore, it is possible to design a scheme that keeps the matrix size constant, if desired. For typical programme material, the system has a performance similar to that of the reference system, at data matrix sizes of only 0.125 to 0.375 times the reference size. Hence a major data reduction is achieved when using the dynamic selection of time and frequency resolution according to the present invention.
The second system, hereinafter referred to as the "position-signalling system", is intended for very low bitrate applications and utilizes some musical signal properties in order to reduce the number of control signal bits. As will be shown below, many of the states described by Eq 1 are not very likely, and would also generate too large amounts of envelope data to be practical at a limited bitrate. According to the present invention, the following simplifications can be made with little or no sacrifice of quality for practical signals:
1. Only the transient start position needs to be transmitted. The time and frequency grouping around this position can be handled by employing a set of rules in the encoder and decoder, which are based on the properties of typical transients.
2. There exists a fixed minimum time span between consecutive transients, i.e. transients cannot be arbitrarily close to one another. It is thus possible to introduce a blocking time in the transient detection/signalling system, reducing the number of states.
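A blocking time of the kind mentioned in item 2 could be realised in many ways. The following minimal sketch assumes the detector delivers candidate onset times in milliseconds and simply discards any candidate that falls within the blocking time of the last accepted onset. The function name, the onset list and the 47 ms value are hypothetical choices for illustration only.

```python
def apply_blocking_time(onsets_ms, blocking_ms):
    """Keep only onsets that are at least `blocking_ms` after the
    previously accepted onset."""
    accepted = []
    for t in sorted(onsets_ms):
        if not accepted or t - accepted[-1] >= blocking_ms:
            accepted.append(t)
    return accepted

# Hypothetical candidate onsets (ms) from a transient detector.
candidates = [10, 25, 70, 90, 200]
kept = apply_blocking_time(candidates, blocking_ms=47)
```

With such a rule, at most one transient can be flagged per blocking interval, which is what allows the number of signalled states to be reduced.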
The minimum time span between consecutive transients in music programme material can be estimated in the following way. In musical notation, the rhythmic "pulse" is described by a time signature expressed as a fraction A/B, where A denotes the number of "beats" per bar and 1/B is the type of note corresponding to one beat, for example a 1/4 note, commonly referred to as a quarter note. Let t denote the tempo in Beats Per Minute (BPM). The time per note of type 1/C is then given by
$T_n = (60/t) \cdot (B/C)$ [s] (Eq 2)
Most music pieces fall within the 70 - 160 BPM range, and in 4/4 time signature the fastest rhythmical patterns are for most practical cases made up of 1/32 or 32nd notes. This yields a minimum time $T_{nmin} = (60/160) \cdot (4/32)$ s, i.e. approximately 47 ms. Of course shorter time periods than this may occur, but such fast sequences (more than 21 tones per second) almost take on the character of a buzz and need not be fully resolved.
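The arithmetic of Eq 2 for the 160 BPM example can be checked with a few lines. The function and variable names are illustrative only.

```python
def note_duration_s(tempo_bpm, beat_note, note):
    """Eq 2: duration in seconds of a note of type 1/note, when one beat
    is a note of type 1/beat_note and the tempo is tempo_bpm beats/min."""
    return (60.0 / tempo_bpm) * (beat_note / note)

# Fastest case considered in the text: 160 BPM, 4/4 time, 32nd notes.
t_nmin = note_duration_s(tempo_bpm=160, beat_note=4, note=32)

# Corresponding note rate in tones per second.
tones_per_second = 1.0 / t_nmin
```

Here `t_nmin` evaluates to 0.046875 s, i.e. about 47 ms, and the note rate is a little over 21 tones per second, in agreement with the figures quoted above.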
The necessary time resolution $T_q$ must also be established. In some cases a transient original signal has its main energy in the highband to be reconstructed. This means that the encoded spectral envelope must carry all the "timing" information. The desired timing precision thus determines the resolution needed for encoding of leading flanks. $T_q$ must be much smaller than the minimum note period $T_{nmin}$, since small time deviations within the period can clearly be heard. In most cases, however, the transient has significant energy in the lowband. The above described gain induced pre-echoes must fall within the so-called pre- or backward masking time $T_m$ of the human auditory system in order to be inaudible. Hence $T_q$ must satisfy two conditions:
$T_q \ll T_{nmin}$ (Eq 3)
$T_q < T_m$ (Eq 4)
Obviously $T_m < T_{nmin}$ (otherwise the notes would be so fast that they could not be resolved) and according to ["Modeling the Additivity of Nonsimultaneous Masking", Hearing Res., vol. 80, pp. 105-118 (1994)], $T_m$ amounts to 10-20 ms. Since $T_{nmin}$ is in the 50 ms range, a reasonable selection of $T_q$ according to Eq 3 means that the second condition is also met. Of course the precision of the transient detection in the encoder and the time resolution of the analysis/synthesis filterbank must also be considered when selecting $T_q$.
Tracking of trailing flanks is less crucial, for several reasons. First, the note-off position has little or no effect on the perceived rhythm. Second, most instruments do not exhibit sharp trailing flanks, but rather a smooth decay curve, i.e. a well defined note-off time does not exist. Third, the post- or forward masking time is substantially longer than the pre-masking time.
According to the present invention, the above transient start information can be used for implicit signalling of segment borders and frequency resolutions immediately after/between transients. This will now be described, again referring to Fig. 3, assuming a granule length selected according to $8T_q \le T_{nmin}$, i.e. a maximum of one transient is likely to occur within a granule. In this position-signalling system the "Borders" and "LowFreq" signals are replaced by a single signal, "TranPos", consisting of three bits. When a transient is present, the position within the granule is signalled by "TranPos", see Figs. 3d - 3f. This value, in combination with the control signals of the preceding granule, determines the time/frequency grid used for the current granule. These grids are described by rules or tables that are available to both the encoder and decoder. Given the common tables and the control signals "Uniform" and either "LowTime" or "TranPos" of the current and the previous granule, unambiguous decoding of the envelope data is ensured. To put the saving obtained by the use of the position-signalling system instead of the border-signalling system into perspective, a hypothetical low bitrate envelope encoder is studied. Assume granules of length $16T_q \le T_{nmin}$, an average number of scalefactors per granule of 40 and an average number of bits per scalefactor of 3 due to lossless coding. The average number of segments in granules containing transients, n, is assumed to be 3. For transients, the signalling overheads are $B_{border} = 1 + q + n = 1 + 16 + 3 = 20$ and $B_{position} = 1 + \lceil \log_2(16) \rceil = 1 + 4 = 5$. Thus the saving is around 20 - 5 = 15 bits, corresponding to about 5 scalefactors or 12.5 % of the envelope data, i.e. it is significant at such low bitrates.
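The overhead comparison for the hypothetical low bitrate encoder can be reproduced numerically. The sketch below only restates the arithmetic of the example: one mode bit in both systems, q border bits plus one resolution bit per segment in the border-signalling system, and a position field wide enough to address q subgranules in the position-signalling system. All names are illustrative.

```python
import math

def border_overhead_bits(q, n):
    """Mode bit + q border bits + one resolution bit per segment."""
    return 1 + q + n

def position_overhead_bits(q):
    """Mode bit + bits needed to address one of q subgranules."""
    return 1 + math.ceil(math.log2(q))

q = 16            # subgranules per granule
n = 3             # average segments in a granule containing a transient
b_border = border_overhead_bits(q, n)
b_position = position_overhead_bits(q)
saving = b_border - b_position

# Envelope payload of the example: 40 scalefactors at 3 bits each.
bits_per_scalefactor = 3
envelope_bits = 40 * bits_per_scalefactor
saved_scalefactors = saving / bits_per_scalefactor
saved_fraction = saving / envelope_bits
```

The result is 20 bits versus 5 bits, a saving of 15 bits, which equals 5 scalefactors or 12.5 % of the 120-bit envelope payload.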
Time/Frequency Switched Scalefactor Encoding
Utilising a time to frequency transform, it can be shown that a pulse in the time domain corresponds to a flat spectrum in the frequency domain, and a "pulse" in the frequency domain, i.e. a single sinusoid, corresponds to a quasi-stationary signal in the time domain. In other words, a signal usually shows more transient properties in one domain than in the other. In a spectrogram, i.e. a time/frequency matrix display, this property is evident, and can advantageously be used when coding spectral envelopes.
A tonal stationary signal can have a very sparse spectrum not suitable for delta coding in the frequency direction, but well suited for delta coding in the time direction, and vice versa. This is displayed in Fig. 4. Throughout the following description a vector of scalefactors calculated at time $n_0$ represents the spectral envelope
$Y(k, n_0) = [a_1, a_2, a_3, \ldots, a_k, \ldots, a_N]$ (Eq 5)
where $a_1, \ldots, a_N$ are the amplitude values for different frequencies. Common practice is to code the difference between adjacent values in the frequency direction at a given time, which yields
$D(k, n_0) = [a_2 - a_1, a_3 - a_2, \ldots, a_N - a_{N-1}]$ (Eq 6)
In order to be able to decode this, the start value $a_1$ needs to be transmitted. As stated above, this delta coding scheme can prove to be most inefficient if the spectrum only contains a few stationary tones. This can result in a delta coding yielding a higher bit rate than regular PCM coding. In order to deal with this problem, a time/frequency switching method, hereinafter referred to as T/F-coding, is proposed. The scalefactors are quantized and coded both in the time and in the frequency direction. For both cases, the required number of bits is calculated for a given coding error, or the error is calculated for a given number of bits. Based upon this, the most beneficial coding direction is selected.
As an example, DPCM and Huffman redundancy coding can be used. Two vectors are calculated, $D_f$ and $D_t$:
$D_f(k, n_0) = [a_2(n_0) - a_1(n_0), a_3(n_0) - a_2(n_0), \ldots, a_N(n_0) - a_{N-1}(n_0)]$ (Eq 7)
$D_t(k, n_0) = [a_1(n_0) - a_1(n_0 - 1), a_2(n_0) - a_2(n_0 - 1), \ldots, a_N(n_0) - a_N(n_0 - 1)]$ (Eq 8)
The corresponding Huffman tables, one for the frequency direction and one for the time direction, state the number of bits required in order to code the vectors. The coded vector requiring the least number of bits to code represents the preferable coding direction. The tables may initially be generated using some minimum distance as a time/frequency switching criterion.
Start values are transmitted whenever the spectral envelope is coded in the frequency direction, but not when it is coded in the time direction, since they are then available at the decoder through the previous envelope. The proposed algorithm also requires extra information to be transmitted, namely a time/frequency flag indicating in which direction the spectral envelope was coded. The T/F algorithm can advantageously be used with several different coding schemes for the scalefactor envelope representation apart from DPCM and Huffman, such as ADPCM, LPC and vector quantisation. The proposed T/F algorithm gives a significant bitrate reduction for the spectral envelope data, up to around 20% compared to commonly used delta coding techniques. If the number of scalefactors per octave is constant, it is possible to delta code on an octave basis instead of delta coding adjacent scalefactors.
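The direction decision can be sketched as follows. This is a simplified illustration and not the coder of the invention: the real Huffman tables are not given in the text, so a hypothetical code-length function (`code_bits`) stands in for the table lookup, and the cost of the start value (`START_VALUE_BITS`) is likewise an assumed figure. The time/frequency flag costs one bit in either direction and therefore does not affect the comparison.

```python
def delta_freq(env):
    """Differences between adjacent scalefactors at one time instant."""
    return [b - a for a, b in zip(env[:-1], env[1:])]

def delta_time(env, prev_env):
    """Differences between each scalefactor and its predecessor in time."""
    return [a - p for a, p in zip(env, prev_env)]

def code_bits(deltas):
    # Hypothetical stand-in for a Huffman table lookup:
    # a delta of magnitude m is assumed to cost 1 + 2*m bits.
    return sum(1 + 2 * abs(d) for d in deltas)

START_VALUE_BITS = 6  # assumed PCM cost of the start value a_1

def choose_direction(env, prev_env):
    """Return the cheaper coding direction and its bit count."""
    bits_f = START_VALUE_BITS + code_bits(delta_freq(env))
    bits_t = code_bits(delta_time(env, prev_env))
    return ("freq", bits_f) if bits_f < bits_t else ("time", bits_t)

# Sparse, stationary tonal envelope: identical to the previous one.
tonal_prev = [0, 9, 0, 0, 8, 0]
tonal_cur = [0, 9, 0, 0, 8, 0]
tonal_choice = choose_direction(tonal_cur, tonal_prev)

# Transient: flat spectrum that jumps in level between envelopes.
flat_prev = [1, 1, 1, 1, 1, 1]
flat_cur = [7, 7, 7, 7, 7, 7]
flat_choice = choose_direction(flat_cur, flat_prev)
```

With these assumed costs the tonal envelope is cheaper in the time direction and the flat transient envelope is cheaper in the frequency direction, which is the behaviour the switching is meant to exploit.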
Practical Implementations
An example of the encoder side of the invention is shown in Fig. 5. The analogue input signal is fed to an A/D converter 501, forming a digital signal. The digital audio signal is fed to a perceptual audio encoder 502, where source coding is performed. In addition, the digital signal is fed to a transient detector 503 and to an analysis filterbank 504, which splits the signal into its spectral equivalents (subband signals). The transient detector could operate on the subband signals from the analysis bank, but for generality it is here assumed to operate on the digital time domain samples directly. The transient detector divides the signal into granules and determines, according to the invention, whether subgranules within the granules are to be flagged as transient. This information is sent to the envelope grouping block 505, which specifies the time/frequency grid to be used for the current granule. According to the grid, the block combines the uniformly sampled subband signals to form the non-uniformly sampled envelope values. As an example, these values might be the average or maximum energy of the subband samples combined. The envelope values are, together with the grouping information, fed to the envelope encoder block 506. This block decides in which direction (time or frequency) to encode the envelope values. The resulting signals, i.e. the output from the audio encoder, the wideband envelope information and the control signals, are fed to the multiplexer 507, forming a serial bitstream that is transmitted or stored.
The decoder side of the invention is shown in Fig. 6. The demultiplexer 601 restores the signals and feeds the appropriate part to an audio decoder 602, which produces a low band digital audio signal. The envelope information is fed from the demultiplexer to the envelope decoding block 603, which, by use of control data, determines in which direction the current envelope is coded and decodes the data. The low band signal from the audio decoder is routed to the transposition module 604, which generates a replicated high band signal consisting of one or several harmonics from the low band signal. The high band signal is fed to an analysis filterbank 606, which is of the same type as on the encoder side. The subband signals are combined in the scalefactor grouping unit 607. By use of control data from the demultiplexer, the same type of combination and time/frequency distribution of the subband samples is adopted as on the encoder side. The envelope information from the demultiplexer and the information from the scalefactor grouping unit are processed in the gain control module 608. The module computes gain factors to be applied to the subband samples before recombination in the synthesis filterbank block 609. The output from the synthesis filterbank is thus an envelope adjusted high band audio signal. This signal is added to the output from the delay unit 605, which is fed with the low band audio signal. The delay compensates for the processing time of the high band signal. Finally, the obtained digital wideband signal is converted to an analogue audio signal in the digital to analogue converter 610.
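The gain computation in module 608 can be illustrated with a minimal sketch. It assumes that the scalefactors are amplitude-domain values, so that the gain is the plain quotient between the transmitted and the estimated scalefactor, as in the quotient rule mentioned for the envelope adjusting filterbank; if the scalefactors were energies, a square root would be needed. The function name, the `eps` guard and the numerical values are hypothetical.

```python
import numpy as np

def envelope_gains(target_env, transposed_env, eps=1e-12):
    """Gain per band: transmitted (original) scalefactor divided by the
    scalefactor estimated from the transposed high band signal.
    Assumes amplitude-domain scalefactors; eps avoids division by zero."""
    target = np.asarray(target_env, dtype=float)
    estimate = np.asarray(transposed_env, dtype=float)
    return target / np.maximum(estimate, eps)

# Hypothetical envelopes for one time segment with three bands.
target = np.array([2.0, 1.0, 0.5])       # decoded from the bitstream
estimated = np.array([1.0, 1.0, 2.0])    # measured on the transposed signal

gains = envelope_gains(target, estimated)

# Apply the same gain to every subband sample of the segment
# (4 time slots by 3 bands, one subband per band for simplicity).
subband_samples = np.ones((4, 3))
adjusted = subband_samples * gains
```

The gains are constant over the whole segment, which is why the segment borders must follow the transient onsets closely if gain induced pre- and post-echoes are to be avoided.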