US20020052734A1 - Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders - Google Patents
- Publication number
- US20020052734A1 (application US09/991,387)
- Authority
- US
- United States
- Prior art keywords
- pitch
- coder
- speech
- frame position
- plosive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/10—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
Definitions
- the present invention relates to speech signal coding using a parametric coder to model a speech waveform.
- the speech signal parameters are communicated via a communications channel and used to synthesize the speech waveform at the receiver. More specifically, the present invention enhances the speech quality and reduces the computations of the mixed excitation linear predictive (MELP) speech coder.
- FIG. 1A illustrates the general environment surrounding speech encoders and decoders as used in a one-way communications system. Full duplex communications are easily enabled by integrating both an encoder and decoder at both ends of the communications system.
- the first widely-used low bit-rate speech coder was the Federal Standard linear predictive coding (LPC) vocoder (FS1015) in which either a periodic pulse train or white noise excites an all-pole filter in order to synthesize speech. While the 2.4 kbps bit rate was attractive, the LPC vocoder was not acceptable for many speech applications as users characterized the synthesized speech as synthetic and buzzy.
- the LPC vocoder analyzes the speech waveform and extracts parameters such as filter coefficients, pitch period, voicing decision, and gain, which are updated every 20-30 ms and transmitted over the communications channel.
- the artifacts residing in the traditional LPC vocoder include buzzes, clicks, and tonal noise.
- the speech quality is very poor in the presence of background noise.
- the mixed excitation linear predictive (MELP) coder is one of these speech coders.
- the MELP coder is a linear-prediction-based speech coder, which includes five features not found in the LPC vocoder: mixed excitation, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. These features improve the synthesized speech quality by removing distortions resident in the LPC vocoder.
- FIG. 1B and FIG. 1C illustrate block diagrams of the MELP encoder and decoder respectively.
- the MELP still has some perceivable distortions, particularly around the non-stationary speech segments and for some low-pitch male speakers. These distortions can also be observed with other low bit-rate speech coders.
- the distortion around the non-stationary speech segments results from the update of speech parameters at a low frame rate (typically 30-50 frames/sec). It is known that increasing the frame rate helps to solve this problem. Unfortunately, this solution requires a much higher bit rate.
- Another possible solution is a variable frame-rate system that updates the speech parameters in the less stationary segments at a higher frame rate while maintaining a low frame rate in the stationary segments.
- Such an approach is provided by the delayed decision approach based on dynamic programming, which uses the future frame information to control the frame rate. This system can produce high-quality speech while maintaining a relatively low bit rate by reducing the average frame rate.
- this method requires a considerably longer algorithmic delay (around 150 ms), which is unacceptable in many applications (such as two-way voice communications).
- the distortion for low-pitch male speakers in the MELP is characterized by a high-pass filtered quality of the coded speech.
- the synthesized speech lacks “sound pressure” in the low frequencies.
- This distortion is caused by a post filter and a preprocessing high-pass filter, which are used in the modern low bit-rate speech coders to remove 60 Hz noise and to enhance the coded speech quality.
- These filters suppress the harmonic magnitudes in the low frequencies, particularly for low-pitch male speakers whose fundamental frequencies are less than 100 Hz. The suppression of these low frequency harmonics results in a high-pass filtered speech that is perceived as too synthetic.
- Plosive sounds are characterized by the sudden opening or closing of the vocal cords. Plosive phonemes are created when most English speaking persons create sounds such as: “b,” “d,” “g,” “k,” “p,” “t,” “th,” “ch,” or “tch.” It is important to note that the preceding list of plosive phonemes is not exhaustive and that not all speakers will create like sounds.
- Plosive phonemes may be created both at the start and at the end of syllables (e.g., “pop,” “tank,” “tot”), only at the end of syllables (e.g., “sound,” “sat,” “shrug”), or only at the start of syllables (e.g., “toy,” “boy,” “boss”).
- Plosive sounds are easily identified in a speech waveform but difficult to model and synthesize in low bit-rate speech coders. Plosive sounds are characterized by an impulse of energy followed by a brief period where the speech waveform is aperiodic. Prior art speech encoders have been unable to model and synthesize plosive sounds in a manner acceptable to the human ear.
- an object of the present invention is to enhance the coded speech quality of the existing low bit-rate speech coders including the MELP vocoder while maintaining its low bit rate, small algorithmic delay, and low computational complexity.
- Another object of the present invention is to provide an efficient mixed excitation algorithm to reduce the computational complexity of the existing MELP vocoder.
- Another object of the present invention is to provide bit-stream compatibility with the existing MELP vocoder in order to permit the introduction of the invention into systems where only the present MELP decoder is available. This would allow for backward compatibility through the introduction of an updated encoder while allowing for full system upgrades where both the encoder and the decoder could be updated.
- the present invention provides four embodiments.
- the first is a robust pitch-detection algorithm.
- the fixed-length pitch analysis window is manipulated around the original position to seek the position that contains the signal with the highest pitch correlation. Once the window position is determined, pitch is estimated using the signal that is contained in the selected window. Other parameters such as LPC coefficients, gain, and voicing decision are also estimated using the signal corresponding to the selected window. The estimated parameters are used to synthesize the coded speech in the decoder on each sample window in the same manner as earlier fixed-position windows in the prior art.
- the second embodiment is a plosive analysis/synthesis method.
- the system first detects the frame that contains the plosive signal.
- the plosive detection is performed with sliding-window peakiness analysis.
- the detected plosive signal is quantized to only a small number of bits and transmitted via the communication channel to the decoder.
- the plosive signal is synthesized independently and added back to the coded speech.
- the third embodiment is a post-processor for the Fourier magnitude model.
- the harmonic magnitudes of the coded speech in the low frequencies are emphasized to overcome the muffling effect of the high pass filter. In this way, the decoded speech is synthesized without the muffling effect often observed in the high-pass filtered speech of current low bit-rate speech encoders.
- the fourth embodiment is a new mixed-excitation algorithm.
- a pulse train is mixed with random noise in the frequency domain in unvoiced frequency bands to eliminate the band-pass filtering operations, which are required to generate the mixed-excitation signal in the existing MELP coder.
- the elimination of the filters results in a significant reduction of computational complexity in the MELP decoder.
- the present system is shown to be compatible in terms of bit-stream and is interchangeable with the coder/decoder of the existing MELP speech coder.
- FIG. 1A is a block diagram of a communications system having a MELP speech encoder and decoder
- FIG. 1B is a block diagram illustrating the MELP encoder of FIG. 1A;
- FIG. 1C is a block diagram illustrating the MELP decoder of FIG. 1A;
- FIG. 2A is a block diagram highlighting the new embodiments of the present system
- FIG. 2B is a block diagram illustrating the new encoder of FIG. 2A;
- FIG. 2C is a block diagram illustrating the new decoder of FIG. 2A;
- FIG. 3A illustrates plosive signal types and locations in a sample sentence and reveals how plosive sounds remain undetected in the prior art
- FIG. 3B illustrates plosive signal synthesis in coded speech
- FIG. 3C illustrates a typical LPC residual waveform for a plosive signal
- FIG. 3D illustrates the Fourier spectrums of an original plosive sound along with the replacement plosive model
- FIG. 3E illustrates the Fourier spectrums of an original plosive sound with a click with the replacement plosive model
- FIG. 4 illustrates the relative time shifting in the robust pitch detector shown in FIG. 2B;
- FIG. 5 illustrates a block diagram of the plosive analysis/synthesis system of the present invention as shown in FIG. 2B and FIG. 2C;
- FIG. 6 illustrates the plosive detector of the present invention as shown in FIG. 5;
- FIG. 7 illustrates a block diagram of the plosive synthesizer of the present invention as shown in FIG. 5;
- FIG. 8 illustrates a block diagram of the post-processor for the Fourier magnitude of the present invention as shown in FIG. 2C;
- FIG. 9 illustrates a block diagram of the new mixed excitation method of the present invention as shown in FIG. 2C;
- FIG. 10 illustrates the flow diagram of bit packing for the plosive signal parameters within voiced and unvoiced frames
- FIG. 11 illustrates the flow diagram of the bit unpacking for the plosive signal parameters for voiced and unvoiced frames.
- FIG. 12 illustrates words with plosive sounds
- FIG. 13 illustrates the replacement of different plosive types in the present invention
- FIG. 14 reveals the bit allocation for the plosive signal model
- FIG. 15 reveals the 99-level pitch and voicing level quantization in the existing MELP
- FIG. 16A reveals the bit allocation in the existing MELP frame
- FIG. 16B reveals the bit transmission order in the existing MELP frame.
- the present invention is embedded in the existing MELP coder as shown in FIG. 2A to enhance coded speech quality. It will be apparent to those skilled in the art that the MELP coder can be replaced with other low bit-rate speech coders that are based on a parametric speech coding algorithm in order to practice the current invention.
- the present invention consists of four embodiments. The first embodiment, a robust pitch detector, is shown as 52 in FIG. 2A. The robust pitch detector 52 replaces a portion of the refinement of pitch and voicing decision 37 in the MELP coder and does not require additional bits for transmission.
- the second embodiment, the plosive analysis/plosive synthesis function, is illustrated in FIG. 2A.
- Plosive analysis 55 is added to the encoder.
- Plosive synthesis 59 is added to the decoder and requires two bits for transmission.
- the third embodiment, a post-processor for the Fourier magnitude 62, is shown in FIG. 2A. It is added to the decoder and does not require additional bits for transmission.
- the fourth embodiment, a new mixed excitation 35, is also shown in FIG. 2A. It replaces the mixed excitation method of the prior art.
- the new mixed excitation 35 is embedded in the decoder, and does not require additional bits for transmission.
- FIG. 1B illustrates a block diagram of the processing flow within the MELP encoder.
- a frame of speech data is processed by the MELP coder every 22.5 ms. Each frame contains 180 voice samples taken at a rate of 8,000 samples per second.
- the MELP is a parametric speech coder that creates a 54-bit per frame concatenated code that is used by the MELP decoder to synthesize the speech waveform at the receiver. Each frame contains the following parameters: Line Spectral Frequencies (LSFs), Fourier Magnitudes, Gain, Pitch, Band-pass voicing, Aperiodic Flag, Error Protection (in unvoiced frames only), and a synchronization bit.
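- As a quick consistency check on the figures above, the frame duration and the resulting bit rate can be worked out directly. The constant names below are illustrative and do not come from the specification.

```python
# Figures stated in the text; the names are illustrative only.
SAMPLE_RATE_HZ = 8000      # samples per second
SAMPLES_PER_FRAME = 180    # samples per frame
BITS_PER_FRAME = 54        # bits per frame

frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ
bit_rate_bps = BITS_PER_FRAME * SAMPLE_RATE_HZ / SAMPLES_PER_FRAME

print(f"frame duration: {frame_ms} ms")
print(f"bit rate: {bit_rate_bps} bit/s")
```

- The result agrees with the 22.5 ms frame and with the 2,400 bit/second rate quoted elsewhere in this description.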
- Input speech is encoded as follows. First, the input speech signal is processed through high-pass filter 11 with a cut-off frequency of 60 Hz to remove low-frequency noise. A buffer containing the most recent samples of the actual input speech signal is maintained in the encoder. One of the samples is identified as the last sample of the current frame. The buffer contains samples that extend beyond the current frame both in the past and into the future to enable the coding process. This designated last sample of the frame is the reference point for many of the encoder calculations.
- the speech signal is band-pass filtered into 5 frequency bands of 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz for voicing analysis.
- An initial pitch estimation is made using the 0-500 Hz filter output signal. The measurement is centered on the filter output produced when its input is the last sample in the current frame.
- the initial pitch estimation from the first band-pass filter is used as the initial reference point for robust pitch detector 52 (FIG. 2B).
- the band-pass voicing strength is determined using the pitch determined by the robust pitch detector 52 described below.
- the time envelopes of each of the band-pass filters are calculated by full-wave rectification followed by a smoothing filter.
- the analysis windows for each of the remaining frequency bands are centered on the last sample in the current frame as in the case of the first band.
- s k is the kth sample in the fixed-position window
- s 0 is the signal at the center of the fixed-position window
- T is a pitch lag
- N is the number of samples accumulated for the correlation computation.
- the binary voicing decision forces the MELP to use either periodic pulse or noise excitation for each frequency band even in frames containing an irregular or ill-defined pitch.
- noise excitation for bands inappropriately designated as noise or pitch excitation inappropriately matched with an inaccurate pitch lag leads to distortion in transitions.
- a sliding-sample window is used in the present invention. This method seeks the pitch analysis window position that provides the highest pitch correlation by sliding the window around the original position. This is equivalent to using a more periodically stable signal rather than using a portion of the signal with an irregular pitch for pitch analysis.
- the present invention avoids inappropriate voicing decisions and pitch estimates, thus reducing the artifactual noise in the non-periodically stable signal segments.
- FIG. 4 shows a robust pitch detector used in the present invention.
- the normalized pitch correlation in the window 43 is first computed in the same manner as the fixed-window pitch detection, as shown in Equation (1), where s k is the kth signal and s 0 is the signal at the center of the original fixed-position window.
- Ns is the maximum window-sliding range from the original fixed-position window.
- an LPC parameter, a gain, band-pass voicing decision, and fractional pitch are computed using the signal in the window that maximizes the normalized pitch correlation.
- solving for r i (T) for all values of i would result in a significant increase in the computational complexity.
- the recursion in Equation (2) for c T (i, j) is used to compute the autocorrelation.
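- The window-sliding search described above can be sketched as follows. This is only an illustration: Equation (1) is not reproduced in this text, so the usual normalized cross-correlation is assumed, the recursion of Equation (2) is replaced by a direct sum for clarity, and the function names and arguments are hypothetical.

```python
import math

def normalized_correlation(s, start, n, lag):
    """Normalized correlation between s[start:start+n] and the same window delayed by lag.

    Assumed form; the patent's Equation (1) may differ in its summation limits.
    """
    num = e0 = e1 = 0.0
    for k in range(start, start + n):
        num += s[k] * s[k + lag]
        e0 += s[k] * s[k]
        e1 += s[k + lag] * s[k + lag]
    denom = math.sqrt(e0 * e1)
    return num / denom if denom > 0.0 else 0.0

def best_window_shift(s, start, n, lag, max_shift):
    """Slide the window by -max_shift..+max_shift samples around `start`.

    Returns (shift, correlation) for the position with the highest correlation.
    The caller must make sure every index stays inside the buffer `s`.
    """
    best = (0, normalized_correlation(s, start, n, lag))
    for shift in range(-max_shift, max_shift + 1):
        r = normalized_correlation(s, start + shift, n, lag)
        if r > best[1]:
            best = (shift, r)
    return best
```

- In a buffer that begins with an irregular segment and then becomes periodic, the search moves the window into the periodic part, which is the behavior the text attributes to the robust pitch detector.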
- the aperiodic flag is set to 1 if V bp1 , determined in the voicing analysis for the 0 to 500 Hz band, is less than 0.5, and set to 0 otherwise. When set, the flag informs the decoder that the voiced component of the excitation should be aperiodic.
- a 10th order linear prediction analysis is performed on the input speech signal using a 200 sample (25 ms) Hamming window centered on the last sample in the current frame.
- a traditional autocorrelation analysis procedure is implemented using Levinson-Durbin recursion.
- a bandwidth expansion constant of 0.994 (15 Hz) is applied to the prediction coefficients by multiplying each coefficient by the bandwidth expansion constant.
- a linear prediction residual signal is calculated by filtering the input speech signal with the prediction filter using the coefficients determined above and an inverse of the prediction filter using those same coefficients. The two resulting signals are summed to create the linear prediction residual signal.
- the plosive analysis/synthesis system of the current invention consists of three parts: plosive detection, plosive modeling, and plosive synthesis.
- FIG. 5 shows the plosive analysis/synthesis system.
- the plosive detector 56 uses a sliding window for “peakiness” computation to detect the frame that contains a plosive signal.
- the peakiness value is sensitive to the phase of the plosive signal.
- r n is an LPC residual signal and N is the frame size.
- the plosive detector slides the peakiness analysis window 63 to find the maximum peakiness value in all windows.
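- $p_i = \dfrac{\sqrt{\frac{1}{N} B_i}}{\frac{1}{N} A_i}$, Eq. (5), where the sums $A_i$ and $B_i$ for window position $i$ are updated recursively from $A_{i-1}$ and $B_{i-1}$ using the residual sample $r_{N-1+i}$ that enters the window.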
- N s is the maximum window-sliding range, which is also used for the pitch detector of the present invention.
- the peakiness value with the sliding window is illustrated in FIG. 3A along with that of the fixed position window and a corresponding speech input waveform.
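- A minimal sketch of the sliding-window peakiness search is given below. It assumes the common definition of peakiness as the ratio of the RMS value to the mean absolute value of the LPC residual, and it keeps running sums so that each one-sample slide costs only a few operations. The function name and the window indexing are illustrative, not taken from the specification.

```python
import math

def sliding_peakiness(r, n, max_shift):
    """Return (offset, peakiness) of the window r[i:i+n] with the largest peakiness.

    Offsets i run from 0 to 2*max_shift, so r needs at least n + 2*max_shift samples.
    """
    a = sum(abs(x) for x in r[:n])   # running sum of |r|
    b = sum(x * x for x in r[:n])    # running sum of r^2
    best_i, best_p = 0, 0.0
    for i in range(2 * max_shift + 1):
        if i > 0:
            # Slide by one sample: add the entering sample, drop the leaving one.
            a += abs(r[i + n - 1]) - abs(r[i - 1])
            b += r[i + n - 1] ** 2 - r[i - 1] ** 2
        # max() guards against tiny negative values from rounding.
        p = math.sqrt(max(b, 0.0) / n) / (a / n) if a > 0.0 else 0.0
        if p > best_p:
            best_i, best_p = i, p
    return best_i, best_p
```

- A flat residual gives a peakiness near 1, while a window that captures a single large impulse gives a clearly larger value, which is what makes the measure useful for locating plosives.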
- the low pass energy is computed and used to distinguish the rapid onset of a vowel from the plosive signal.
- FIG. 12 shows the plosive signals detectable in the English language. Analysis of the frequency spectrums associated with the identified plosive sounds in FIG. 12 reveals that the 28 separate plosive sounds could be closely represented by the frequency spectrums of 18 replacement plosive sounds by aligning the maximum amplitude positions of each plosive signal. Near transparent replacement requires at least a rough spectral fit for each frequency.
- FIG. 13 illustrates the replacement matrix for the plosive sounds in the current invention.
- g p is the scaling factor based on the energy of the input plosive signal
- a i are the LPC coefficients computed from the input plosive signal.
- the template plosive signal v(n) was chosen arbitrarily and filtered with the 14th order inverse linear prediction filter. Since only a rough spectral fit between the input and the synthesized plosive signals provides a near transparent sound, an accurate LPC analysis is not required for the input plosive signal. In order to minimize the additional bits required for the plosive model, the same 10th order LPC model used for voiced pitch modeling is used for the production of the plosive signal.
- the parameters for transmission are a plosive flag, a plosive location, and plosive gain.
- the gain is computed by comparing the energy of the LPC residual of the plosive signal with that of the template signal.
- the gain is quantized with two bits.
- the position of the plosive signal is identified by seeking the maximum amplitude position in the frame and representing the plosive signal position with one bit in either the first half or the second half of the current frame.
- the plosive signal is quantized with only four bits including one bit for a plosive flag, two bits for a plosive gain and one bit for plosive position as is shown in FIG. 14.
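- The position bit and the four-bit layout can be illustrated as follows. The bit order inside the four-bit word is an assumption made for this sketch (FIG. 14 is not reproduced here), and both function names are hypothetical.

```python
def plosive_position_bit(frame):
    """Return 0 if the largest-magnitude sample is in the first half of the frame, else 1."""
    peak_index = max(range(len(frame)), key=lambda k: abs(frame[k]))
    return 0 if peak_index < len(frame) // 2 else 1

def pack_plosive_bits(flag, gain_index, position_bit):
    """Pack 1 flag bit, 2 gain bits and 1 position bit into one 4-bit integer.

    Assumed layout (most significant first): flag, gain, gain, position.
    """
    if flag not in (0, 1) or position_bit not in (0, 1) or not 0 <= gain_index <= 3:
        raise ValueError("field out of range")
    return (flag << 3) | (gain_index << 1) | position_bit
```

- For a 180-sample frame whose peak sits at sample 120, the position bit is 1 (second half); with the flag set and a gain index of 2 the packed word under this assumed layout is 0b1101.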
- plosive synthesis is performed in the MELP decoder and will be disclosed in the description of the decoder.
- the input speech signal gain is measured twice per frame using a pitch adaptive window length.
- This adaptive length is identical for both gain measurements and is determined as follows.
- if V bp1 > 0.6, the length is the shortest multiple of P 2 which is longer than 120 samples. If this length exceeds 320 samples, it is divided by 2.
- if V bp1 is less than or equal to 0.6, the window length is 120 samples.
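- The window-length rule can be written as a short function. This is a sketch of the rule as stated, with a hypothetical function name; `p2` stands for the pitch value P 2 in samples.

```python
import math

def gain_window_length(v_bp1, p2):
    """Pitch-adaptive gain window length in samples, following the rule in the text."""
    if v_bp1 > 0.6:
        # Smallest multiple of p2 that is strictly longer than 120 samples.
        multiples = math.floor(120 / p2) + 1
        length = multiples * p2
        if length > 320:
            length = length / 2
        return length
    return 120
```

- For example, a strongly voiced frame with P 2 = 50 gives a 150-sample window, while any frame with V bp1 at or below 0.6 uses 120 samples.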
- the gain calculation for the first window produces G 1 and is centered 90 samples before the last sample of the current frame.
- the calculation for the second window produces G 2 and is centered on the last sample of the current frame.
- L is the window length.
- the 0.01 offset prevents the log argument from approaching zero. If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes that the input signal range is -32768 to 32767.
- the encoder performs a quantization of the LPC coefficients.
- the LPC coefficients are converted into line spectrum frequencies (LSFs). All adjacent pairs of the LSF components are organized such that each is in ascending frequency order with a minimum of 50 Hz separation.
- the resulting LSF vector f is quantized using a multi-stage vector quantizer. The resulting vector is used in the Fourier magnitude calculation in the decoder.
- the final pitch value, P 3 , is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a lookup table. The all zero codeword represents the unvoiced state and is sent if V bp1 is less than or equal to 0.6. All 28 codewords with Hamming weight of 1 or 2 are reserved for error protection.
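- The numbers in this step fit together neatly: a 7-bit codeword has 128 values, one is the all-zero unvoiced code, 28 have Hamming weight 1 or 2, and 99 remain for the pitch levels. The sketch below checks that count and shows one way a uniform quantizer on a logarithmic scale could look. The rounding rule and function names are assumptions, and the actual codeword lookup table is not reproduced.

```python
import math

PITCH_MIN, PITCH_MAX, LEVELS = 20.0, 160.0, 99
LOG_STEP = (math.log10(PITCH_MAX) - math.log10(PITCH_MIN)) / (LEVELS - 1)

def quantize_pitch(pitch):
    """Index 0..98 of the nearest level on a uniform log10 scale (assumed rounding)."""
    pitch = min(max(pitch, PITCH_MIN), PITCH_MAX)
    return round((math.log10(pitch) - math.log10(PITCH_MIN)) / LOG_STEP)

def dequantize_pitch(index):
    """Pitch value in samples for a quantizer index."""
    return 10 ** (math.log10(PITCH_MIN) + index * LOG_STEP)

# Codeword budget for the 7-bit pitch field.
reserved = sum(1 for c in range(128) if bin(c).count("1") in (1, 2))
available = 128 - 1 - reserved   # minus the all-zero unvoiced codeword
```

- With these definitions `reserved` is 28 and `available` is 99, matching the figures in the text.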
- the two gain values are quantized as follows.
- G 2 is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB.
- G 1 is quantized to 3 bits using the following adaptive algorithm. If G 2 for the current frame is within 5 dB of G 2 for the previous frame, and G 1 is within 3 dB of the average of G 2 values for the current and previous frames, then the frame is steady-state and a code of all zeros is sent to indicate that the decoder should set G 1 to the mean of G 2 values for the current and previous frames. Otherwise, the frame represents a transition and G 1 is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G 2 values for the current and previous frames to 6 dB above the maximum of those G 2 values.
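- The two-step gain quantization can be sketched as below. Several details are assumptions made for illustration: "within" is read as a strict inequality, the seven transition levels are mapped to codes 1 to 7, and the unquantized G 2 values are used in the comparison. The function names are hypothetical.

```python
def quantize_uniform(value, lo, hi, levels):
    """Index of the nearest level of a uniform quantizer spanning [lo, hi]."""
    value = min(max(value, lo), hi)
    step = (hi - lo) / (levels - 1)
    return round((value - lo) / step)

def quantize_g2(g2):
    """5-bit uniform quantizer from 10 to 77 dB."""
    return quantize_uniform(g2, 10.0, 77.0, 32)

def quantize_g1(g1, g2, g2_prev):
    """3-bit adaptive code for G1: 0 means steady state, 1..7 are transition levels."""
    mean_g2 = 0.5 * (g2 + g2_prev)
    if abs(g2 - g2_prev) < 5.0 and abs(g1 - mean_g2) < 3.0:
        return 0  # decoder reconstructs G1 as the mean of the two G2 values
    lo = min(g2, g2_prev) - 6.0
    hi = max(g2, g2_prev) + 6.0
    return 1 + quantize_uniform(g1, lo, hi, 7)
```

- A frame with G 1 = 40 dB between G 2 values of 40 and 41 dB is coded as steady state (code 0), whereas a jump from 40 to 60 dB forces the 7-level transition quantizer.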
- Fourier Magnitude calculation and quantization occurs as follows.
- FFT Fast Fourier Transform
- a set of quantized predictor coefficients are calculated from the quantized LSF vector.
- the residual window is generated using the quantized prediction coefficients.
- a 200 sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed.
- the complex FFT output is transformed into magnitudes and the harmonics found with a spectral peak-selecting algorithm.
- the peak-selecting algorithm finds the maximum within a width of 512/P frequency samples centered around the initial estimate for each pitch harmonic, where P is the quantized pitch. This width is truncated to an integer.
- the initial estimate for the location of the i th harmonic is 512 i/P.
- the number of harmonic magnitudes searched for is limited to the smaller of 10 or P/4. These magnitudes are then normalized to have a RMS value of 1.0. If fewer than 10 harmonics are found, the remaining magnitudes are set to 1.0.
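- The peak-picking and normalization steps can be illustrated with a short function. It is a sketch under stated assumptions: the input is a list of FFT magnitudes for bins 0 to 256, the harmonic location is rounded to the nearest bin, and the function name is hypothetical.

```python
import math

def harmonic_magnitudes(spectrum, pitch, fft_size=512, max_harmonics=10):
    """Pick one spectral peak per pitch harmonic and normalize to unit RMS.

    spectrum: magnitudes of FFT bins 0..fft_size/2; pitch: quantized pitch in samples.
    """
    width = int(fft_size / pitch)                  # search width, truncated to an integer
    count = min(max_harmonics, int(pitch / 4))     # number of harmonics searched
    mags = []
    for i in range(1, count + 1):
        center = round(fft_size * i / pitch)       # initial estimate of the i-th harmonic
        lo = max(0, center - width // 2)
        hi = min(len(spectrum) - 1, center + width // 2)
        mags.append(max(spectrum[lo:hi + 1]))
    rms = math.sqrt(sum(m * m for m in mags) / len(mags)) if mags else 0.0
    if rms > 0.0:
        mags = [m / rms for m in mags]
    # Harmonics that were not searched are set to 1.0.
    return mags + [1.0] * (max_harmonics - len(mags))
```

- With a pitch of 32 samples only eight harmonics are searched, so the last two of the ten returned magnitudes are the default value 1.0.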
- the 10 magnitudes are quantized with an 8-bit quantizer.
- the codebook is searched for a perceptually weighted Euclidean distance, with fixed weights that emphasize low frequencies over higher frequencies.
- FIG. 16A shows the bit allocation for the MELP coder.
- the unused coder parameters for the unvoiced mode are replaced with forward error correction.
- Three Hamming (7,4) codes and one Hamming (8,4) code may be used.
- the (7,4) code corrects single bit errors, while the (8,4) code detects double bit errors.
- the (8,4) code is applied to the 4 most significant bits (MSBs) of the first multi-stage vector quantization index, and the 4 parity bits are written over the band-pass voicing.
- the remaining three bits of the first multi-stage vector quantization index along with the reserved bit, are covered by a (7,4) code with the resulting 3 parity bits written to the MSBs of the Fourier series vector quantization index.
- the 4 MSBs of the G 2 codeword are protected with 3 parity bits which are written to the next 3 bits of the Fourier magnitudes.
- the least significant bit (LSB) of the second gain index and the 3 bit G 1 codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag bit.
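- To make the single-error correction property of the (7,4) code concrete, a textbook Hamming (7,4) encoder and decoder is sketched below. The bit ordering used here (parity bits at positions 1, 2 and 4) is the classic layout and is an assumption; the MELP bit-stream places its parity bits in the unused parameter fields as described above, and its exact generator matrix is not reproduced here.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4] (textbook layout)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

- Flipping any single bit of a codeword still decodes to the original four data bits, which is the property the unvoiced-frame error protection relies on.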
- FIG. 16A illustrates the bit allocation across the parameters communicated in the 54 bits of each MELP frame.
- FIG. 16B shows the transmission order for the 54 bits of each MELP frame for both voiced and unvoiced frame modes.
- the received bit stream is unpacked from the communications channel 18 and assembled into the parametric codewords.
- Parameter decoding differs for the voiced and unvoiced frames.
- Pitch is decoded first as it contains the voiced/unvoiced mode information. If the pitch code is all zeros or has only 1 bit set, then the unvoiced mode is used. If two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used.
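- The mode decision from the 7-bit pitch codeword depends only on its Hamming weight, as the following sketch shows. The function name and the returned labels are illustrative.

```python
def classify_pitch_code(code):
    """Classify a 7-bit pitch codeword by the number of bits set."""
    weight = bin(code & 0x7F).count("1")
    if weight <= 1:
        return "unvoiced"   # all zeros, or a single bit error on the all-zero code
    if weight == 2:
        return "erasure"    # two bits set signals a frame erasure
    return "voiced"         # otherwise the pitch value is decoded
```

- Because every valid pitch codeword has a Hamming weight of at least 3, a single bit error on the all-zero unvoiced code is still read as unvoiced.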
- the (8,4) Hamming code is decoded to correct single bit errors and to detect double bit errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single bit errors.
- the remaining parameters are decoded.
- the LSFs are checked for ascending order and a minimum separation of 50 Hz.
- default parameter values are used for the pitch, jitter, band-pass voicing, and Fourier magnitudes.
- the pitch value is set to 50 samples, the jitter is set to 25%, the band-pass voicing strengths are set to 0, and the Fourier magnitudes are set to 1.0.
- V bp1 is set to 1; jitter is set to 25% if the aperiodic flag is set; otherwise, jitter is set to 0%.
- the band-pass voicing strength for the upper four bands is set to 1.0 if the corresponding bit is a 1; otherwise, the voicing strength is set to 0.
- Gain, G 1 is then modified by subtracting a positive correction term, G att , given in dB by:
- All MELP speech synthesis parameters are interpolated pitch synchronously for each synthesized pitch period.
- the interpolated parameters are the gain in dB, LSFs, pitch, jitter, Fourier magnitudes, pulse and noise coefficients for mixed excitation, and spectral-tilt coefficient for the adaptive spectral-enhancement filter.
- the other parameters are linearly interpolated between the past and current frame values.
- the interpolation factor, int, for these parameters is based on the starting point of the new pitch period:
- G int is the interpolated gain. This interpolation factor is then clamped between 0 and 1.
- FIG. 9 shows a new mixed-excitation algorithm in the present invention.
- the existing MELP uses the Fourier magnitudes to generate a pulse train.
- the pulse train is mixed with random noise in time domain by band-pass filtering.
- noise is mixed with a pulse train in the frequency domain by adding a random phase to the Fourier magnitudes.
- Block 64 shows the random phase generator. The random phase is added only to the Fourier magnitudes in unvoiced frequency bands.
- cc is an interpolation coefficient between 0 and 1. Since the existing MELP coder generates a pulse pitch-synchronously, the band-pass voicing decision needs to be linearly interpolated between 0 (voiced) and 1 (unvoiced).
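- The idea of mixing in the frequency domain can be sketched as follows: each harmonic keeps its Fourier magnitude, harmonics in voiced bands keep zero phase so that they add up to a pulse, and harmonics in unvoiced bands receive a random phase so that they behave like noise. This is a simplified illustration only. The per-harmonic voicing flags, the cosine-sum synthesis, and the function name are assumptions, and the interpolation of the voicing decision is omitted.

```python
import math
import random

def mixed_excitation_period(magnitudes, unvoiced, pitch, rng=None):
    """Synthesize one pitch period of excitation from harmonic magnitudes.

    magnitudes[i] and unvoiced[i] describe harmonic i+1; the number of harmonics
    should be smaller than pitch / 2. A random phase is used where unvoiced[i] is True.
    """
    rng = rng or random.Random()
    phases = [rng.uniform(-math.pi, math.pi) if u else 0.0 for u in unvoiced]
    period = []
    for n in range(pitch):
        x = 0.0
        for h, (m, ph) in enumerate(zip(magnitudes, phases), start=1):
            x += m * math.cos(2.0 * math.pi * h * n / pitch + ph)
        period.append(x)
    return period
```

- Randomizing the phases changes the waveform shape but not its energy, so no band-pass filtering is needed to combine the pulse and noise components.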
- the adaptive spectral enhancement filter is then applied to the mixed excitation signal.
- This filter is a 10th order pole/zero filter with additional first order tilt compensation.
- the coefficients are generated by bandwidth expansion of the LPC filter transfer function A(z), corresponding to the interpolated LSFs.
- the tilt coefficient, μ, is first calculated as max(0.5 k 1 , 0), then interpolated and multiplied by p, the signal probability.
- the first reflection coefficient, k 1 , is calculated from the decoded LSFs. By the MELP predictor coefficient sign convention, k 1 is usually negative for the voiced spectra.
- Linear prediction synthesis is performed by applying the coefficients corresponding to the interpolated LSFs to a direct form filter.
- this scale factor is linearly interpolated between the previous and current values for the first ten samples of the pitch period.
- the pulse dispersion filter is a 65th order FIR filter derived from a spectrally flattened triangular pulse.
- the coefficients used in the filter are provided in the Specification for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction herein enclosed for reference.
- a post-processor for the Fourier magnitude model 62 is added to the MELP decoder as shown in FIG. 2A.
- the preprocessing high-pass filter 11 in FIG. 2B and the adaptive spectral enhancement filter (ASEF) 30 in FIG. 2C suppress the harmonic magnitudes in the low frequencies. It was found that this effect led to a high-pass filtered quality for low-pitch male speakers.
- the present invention adaptively emphasizes the harmonic magnitudes in low frequencies by removing the effect of the two filters.
- $|\hat{S}(e^{j\omega_i})| = |S(e^{j\omega_i})| \, G_H(e^{j\omega_i})$, Eq. (21)
- H 1 (e j ⁇ ) and H 2 (e j ⁇ ) are the magnitude responses of the ASEF 30 and preprocessing high-pass filter 11 respectively.
- the harmonic magnitude emphasis is applied to only the harmonics that are 200 Hz less than the first formant frequency of the frame.
- FIG. 7 shows the block diagram of the plosive synthesis 66 .
- all plosive signals are produced by scaling and LPC synthesis/filtering 32 the plosive residual template 71 , which is pre-stored in the synthesizer.
- This plosive residual template 71 was chosen arbitrarily and filtered with the 14th order LPC inverse filter.
- the LPC coefficients for the frame that contains the plosive 81 are also used for the plosive signal synthesis.
- the gain of synthesized plosive signal is adjusted by applying plosive gain 76 to the MELP gain 34 .
- the length of synthesized plosive signal is a half of the frame length, and the synthesized plosive is added back to either the first half or the second half of the coded speech frame according to the plosive position as shown in block 73 .
- the gain of the coded speech is adjusted in gain suppressor 75 such that the gain of the half frame to which the plosive is added back is suppressed. It is realized by simply replacing the gain of the half frame to which the plosive is added back with that of the previous half frame:
- Another advantage of the present invention is bit-stream compatibility with the existing MELP coder.
- the present invention consists of four embodiments including a robust pitch detector, a plosive analysis/synthesis system, a post-processor for the Fourier magnitude model and a new mixed-excitation algorithm.
- As shown in FIG. 14, only the plosive analysis/synthesis system requires additional bits for transmission.
- the additional bits for the plosive can be packed into the bit-stream of the existing MELP.
- the existing MELP coder sets only the first and fifth band to voiced and the index for a pitch lag is set less than three so as to indicate that the frame is unvoiced.
- the index for the pitch lag is less than three, the frame is regarded as unvoiced. Otherwise, the frame is regarded as voiced.
- a frame that contains a plosive is assumed to be an unvoiced frame.
- FIG. 10 shows the bit packing flow diagram for the plosive signal. To identify the plosive frame in the decoder of the present invention, the first and the fifth bands are set to voiced but the pitch is set to three as a dummy.
- FIG. 11 shows the bit unpacking flow diagram for the plosive signal.
- the decoder of the existing MELP regards the frame as unvoiced if the pitch index is less than three. If the pitch index is equal to or greater than three, the combination that only the first and the fifth bands are unvoiced will never occur in the existing MELP.
- the frame is regarded as the plosive frame if this combination occurs.
- the plosive parameters such as a gain and position are extracted from the bits for the Fourier magnitude. Since the bit-stream specification is maintained in the present invention, the present system can interchange the encoder/decoder with the existing MELP.
Abstract
A system and method for enhancing the speech quality of the mixed-excitation linear predictive (MELP) coder and other low bit-rate speech coders are disclosed. The system includes a robust pitch-detection algorithm, which adjusts or slides a pitch-analysis window to provide the speech coder with more reliable pitch information. In addition, the system is shown to be compatible with the existing MELP coder in terms of the bit stream.
Description
- This application is a divisional application of a co-pending U.S. Utility Application, entitled, “Apparatus and Quality Enhancement Algorithm for Mixed Excitation Linear Predictive (MELP) and Other Speech Coders,” to Unno et al., filed Sep. 29, 1999, granted Ser. No. 09/408,195, which is incorporated herein by reference in its entirety.
- The present invention relates to speech signal coding using a parametric coder to model a speech waveform. The speech signal parameters are communicated via a communications channel and used to synthesize the speech waveform at the receiver. More specifically, the present invention enhances the speech quality and reduces the computations of the mixed excitation linear predictive (MELP) speech coder.
- Low bit-rate speech coding technology is widely used for digital voice communication in narrow-bandwidth channels. The common objective of this technology is to transfer the digital speech signal information at a low bit rate (typically 2,400 bits/sec) while providing good quality speech synthesis at the destination. This technology also strives to provide low computational complexity, low memory requirements, and a small algorithmic delay particularly for real-time low-cost voice communications. FIG. 1A illustrates the general environment surrounding speech encoders and decoders as used in a one-way communications system. Full duplex communications are easily enabled by integrating both an encoder and decoder at both ends of the communications system.
- The first widely-used low bit-rate speech coder was the Federal Standard linear predictive coding (LPC) vocoder (FS1015) in which either a periodic pulse train or white noise excites an all-pole filter in order to synthesize speech. While the 2.4 kbps bit rate was attractive, the LPC vocoder was not acceptable for many speech applications as users characterized the synthesized speech as synthetic and buzzy.
- The LPC vocoder analyzes the speech waveform and extracts such parameters as filter coefficients, pitch period, voicing decision, and gain, which are updated every 20-30 ms and transmitted over the communications channel. The artifacts residing in the traditional LPC vocoder include buzzes, clicks, and tonal noise. In addition, the speech quality is very poor in the presence of background noise. These unintended additions to the synthesized speech are the result of the simple excitation model and the binary voicing decision error.
- Over the years, several low bit-rate speech coding algorithms have been developed, and some state-of-the-art coders now provide a good natural quality. The mixed excitation linear predictive (MELP) coder is one of these speech coders. The MELP coder is a linear-prediction-based speech coder, which includes five features not found in the LPC vocoder: mixed excitation, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. These features improve the synthesized speech quality by removing distortions resident in the LPC vocoder. FIG. 1B and FIG. 1C illustrate block diagrams of the MELP encoder and decoder respectively.
- However, the MELP still has some perceivable distortions, particularly around the non-stationary speech segments and for some low-pitch male speakers. These distortions can also be observed with other low bit-rate speech coders. The distortion around the non-stationary speech segments results from the update of speech parameters at a low frame rate (typically 30-50 frames/sec). It is known that increasing the frame rate helps to solve this problem. Unfortunately, this solution requires a much higher bit rate. Another possible solution is a variable frame-rate system that updates the speech parameters in the less stationary segments at a higher frame rate while maintaining a low frame rate in the stationary segments. Such an approach is provided by the delayed decision approach based on dynamic programming, which uses the future frame information to control the frame rate. This system can produce high-quality speech while maintaining a relatively low bit rate by reducing the average frame rate. However, this method requires a considerably longer algorithmic delay (around 150 ms), which is unacceptable in many applications (such as two-way voice communications).
- The distortion for low-pitch male speakers in the MELP is characterized by a high-pass filtered quality of the coded speech. In other words, the synthesized speech lacks “sound pressure” in the low frequencies. This distortion is caused by a post filter and a preprocessing high-pass filter, which are used in the modern low bit-rate speech coders to remove 60 Hz noise and to enhance the coded speech quality. These filters suppress the harmonic magnitudes in the low frequencies, particularly for low-pitch male speakers whose fundamental frequencies are less than 100 Hz. The suppression of these low frequency harmonics results in a high-pass filtered speech that is perceived as too synthetic.
- The most significant speech distortion present in the prior art results from the lack of a suitable model or method to accurately synthesize a plosive sound. Plosive sounds are characterized by the sudden opening or closing of the vocal cords. Plosive phonemes are created when most English speaking persons create sounds such as: “b,” “d,” “g,” “k,” “p,” “t,” “th,” “ch,” or “tch.” It is important to note that the preceding list of plosive phonemes is not exclusive and that not all speakers will create like sounds. Plosive phonemes may be created both at the start and at the end of syllables (e.g., “pop,” “tank,” “tot”), at the end of syllables (e.g., “sound,” “sat,” “shrug”), or at the start of syllables (e.g., “toy,” “boy,” “boss”). Plosive sounds are easily identified in a speech waveform but difficult to model and synthesize in low bit-rate speech coders. Plosive sounds are characterized by an impulse of energy followed by a brief period where the speech waveform is aperiodic. Prior art speech encoders have been unable to model and synthesize plosive sounds in a manner acceptable to the human ear.
- As described briefly, an object of the present invention is to enhance the coded speech quality of the existing low bit-rate speech coders including the MELP vocoder while maintaining its low bit rate, small algorithmic delay, and low computational complexity.
- Another object of the present invention is to provide an efficient mixed excitation algorithm to reduce the computational complexity of the existing MELP vocoder. Another object of the present invention is to provide bit-stream compatibility with the existing MELP vocoder in order to permit the introduction of the invention into systems where only the present MELP decoder is available. This would allow for backward compatibility through the introduction of an updated encoder while allowing for full system upgrades where both the encoder and the decoder could be updated.
- The present invention provides four embodiments. The first is a robust pitch-detection algorithm. In the encoder, the fixed-length pitch analysis window is manipulated around the original position to seek the position that contains the signal with the highest pitch correlation. Once the window position is determined, pitch is estimated using the signal that is contained in the selected window. Other parameters such as LPC coefficients, gain, and voicing decision are also estimated using the signal corresponding to the selected window. The estimated parameters are used to synthesize the coded speech in the decoder on each sample window in the same manner as earlier fixed-position windows in the prior art.
- The second embodiment is a plosive analysis/synthesis method. In the encoder, the system first detects the frame that contains the plosive signal. The plosive detection is performed with sliding-window peakiness analysis. The detected plosive signal is quantized to only a small number of bits and transmitted via the communication channel to the decoder. In the decoder, the plosive signal is synthesized independently and added back to the coded speech.
- The third embodiment is a post-processor for the Fourier magnitude model. In the decoder, the harmonic magnitudes of the coded speech in the low frequencies are emphasized to overcome the muffling effect of the high pass filter. In this way, the decoded speech is synthesized without the muffling effect often observed in the high-pass filtered speech of current low bit-rate speech encoders.
- The fourth embodiment is a new mixed-excitation algorithm. In the decoder, a pulse train is mixed with random noise in the frequency domain in unvoiced frequency bands to eliminate the band-pass filtering operations, which are required to generate the mixed-excitation signal in the existing MELP coder. The elimination of the filters results in a significant reduction of computational complexity in the MELP decoder. As a result, the present system is shown to be compatible in terms of bit-stream and is interchangeable with the coder/decoder of the existing MELP speech coder.
- The present invention will be more fully understood from the accompanying drawings of the embodiments of the invention, which however, should not be taken to limit the invention to the specific embodiments enumerated, but are for explanation and for better understanding only. Finally, like reference numerals in the figures designate corresponding parts throughout the drawings.
- FIG. 1A is a block diagram of a communications system having a MELP speech encoder and decoder;
- FIG. 1B is a block diagram illustrating the MELP encoder of FIG. 1A;
- FIG. 1C is a block diagram illustrating the MELP decoder of FIG. 1A;
- FIG. 2A is a block diagram highlighting the new embodiments of the present system;
- FIG. 2B is a block diagram illustrating the new encoder of FIG. 2A;
- FIG. 2C is a block diagram illustrating the new decoder of FIG. 2A;
- FIG. 3A illustrates plosive signal types and locations in a sample sentence and reveals how plosive sounds remain undetected in the prior art;
- FIG. 3B illustrates plosive signal synthesis in coded speech;
- FIG. 3C illustrates a typical LPC residual waveform for a plosive signal;
- FIG. 3D illustrates the Fourier spectrums of an original plosive sound along with the replacement plosive model;
- FIG. 3E illustrates the Fourier spectrums of an original plosive sound with a click with the replacement plosive model;
- FIG. 4 illustrates the relative time shifting in the robust pitch detector shown in FIG. 2B;
- FIG. 5 illustrates a block diagram of the plosive analysis/synthesis system of the present invention as shown in FIG. 2B and FIG. 2C;
- FIG. 6 illustrates the plosive detector of the present invention as shown in FIG. 5;
- FIG. 7 illustrates a block diagram of the plosive synthesizer of the present invention as shown in FIG. 5;
- FIG. 8 illustrates a block diagram of the post-processor for the Fourier magnitude of the present invention as shown in FIG. 2C;
- FIG. 9 illustrates a block diagram of the new mixed excitation method of the present invention as shown in FIG. 2C;
- FIG. 10 illustrates the flow diagram of bit packing for the plosive signal parameters within voiced and unvoiced frames;
- FIG. 11 illustrates the flow diagram of the bit unpacking for the plosive signal parameters for voiced and unvoiced frames.
- FIG. 12 illustrates words with plosive sounds;
- FIG. 13 illustrates the replacement of different plosive types in the present invention;
- FIG. 14 reveals the bit allocation for the plosive signal model;
- FIG. 15 reveals the 99-level Pitch and Voicing level quantization in the existing MELP;
- FIG. 16A reveals the bit allocation in the existing MELP frame; and
- FIG. 16B reveals the bit transmission order in the existing MELP frame.
- The present invention is embedded in the existing MELP coder as shown in FIG. 2A to enhance coded speech quality. It will be apparent to those skilled in the art that the MELP coder can be replaced with other low bit-rate speech coders that are based on a parametric speech coding algorithm in order to practice the current invention. The present invention consists of four embodiments. The first embodiment, a robust pitch detector, is shown as 52 in FIG. 2A. The robust pitch detector 52 replaces a portion of the refinement of pitch and voicing decision 37 in the MELP coder and does not require additional bits for transmission.
- The second embodiment, the plosive analysis/plosive synthesis function, is illustrated in FIG. 2A. Plosive analysis 55 is added to the encoder. Plosive synthesis 59 is added to the decoder and requires two bits for transmission.
- The third embodiment, a post-processor for the Fourier magnitude 62, is shown in FIG. 2A. It is added to the decoder and does not require additional bits for transmission.
- The fourth embodiment, a new mixed excitation 35, is also shown in FIG. 2A. It replaces the mixed excitation method of the prior art. The new mixed excitation 35 is embedded in the decoder, and does not require additional bits for transmission.
- MELP Encoder
- FIG. 1B illustrates a block diagram of the processing flow within the MELP encoder. A frame of speech data is processed by the MELP coder every 22.5 ms. Each frame contains 180 voice samples taken at a rate of 8,000 samples per second. The MELP is a parametric speech coder that creates a 54-bit per frame concatenated code that is used by the MELP decoder to synthesize the speech waveform at the receiver. Each frame contains the following parameters: Line Spectral Frequencies (LSFs), Fourier Magnitudes, Gain, Pitch, Band-pass Voicing, Aperiodic Flag, Error Protection (in unvoiced frames only), and a synchronization bit.
- Input speech is encoded as follows. First, the input speech signal is processed through high-pass filter 11 with a cut-off frequency of 60 Hz to remove low-frequency noise. A buffer containing the most recent samples of the actual input speech signal is maintained in the encoder. One of the samples is identified as the last sample of the current frame. The buffer contains samples that extend beyond the current frame both in the past and into the future to enable the coding process. This designated last sample of the frame is the reference point for many of the encoder calculations.
- Next, the speech signal is band-pass filtered into five frequency bands of 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz for voicing analysis. An initial pitch estimation is made using the 0-500 Hz filter output signal. The measurement is centered on the filter output produced when its input is the last sample in the current frame. The initial pitch estimation from the first band-pass filter is used as the initial reference point for robust pitch detector 52 (FIG. 2B). For each of the remaining frequency bands, the band-pass voicing strength is determined using the pitch determined by the robust pitch detector 52 described below. The time envelopes of each of the band-pass filter outputs are calculated by full-wave rectification followed by a smoothing filter. The analysis windows for each of the remaining frequency bands are centered on the last sample in the current frame as in the case of the first band.
- Robust Pitch Detection
- Most low bit-rate speech coders use the normalized pitch correlation to estimate pitch lag. In the MELP coder, the pitch correlation is also used to make band-pass voicing decisions. The normalized pitch correlation r(T) is computed with the signal in the fixed-position analysis window in the prior art as follows:
- where s_k is the kth sample in the fixed-position window, s_0 is the sample at the center of the fixed-position window, T is a pitch lag, and N is the number of samples accumulated for the correlation computation.
- The binary voicing decision forces the MELP to use either periodic pulse or noise excitation for each frequency band even in frames containing an irregular or ill-defined pitch. As a result, noise excitation for bands inappropriately designated as noise, or pitch excitation inappropriately matched with an inaccurate pitch lag, leads to distortion in transitions. To solve this problem, a sliding-sample window is used in the present invention. This method seeks the pitch analysis window position that provides the highest pitch correlation by sliding the window around the original position. This is equivalent to using a more periodically stable signal rather than using a portion of the signal with an irregular pitch for pitch analysis. By using a periodically stable portion of the signal for pitch analysis, the present invention avoids inappropriate voicing decisions and pitch estimates, thus reducing the artifactual noise in the non-periodically stable signal segments.
- FIG. 4 shows a robust pitch detector used in the present invention. In FIG. 4, the normalized pitch correlation in the window 43 is first computed in the same manner as the fixed window pitch detection as shown in Equation (1), where s_k is the kth signal sample and s_0 is the signal sample at the center of the original fixed-position window. The normalized pitch correlation in the window 43 is then computed recursively as follows:
-
- where, Ns is the maximum window-sliding range from the original fixed-position window. In the present invention, an LPC parameter, a gain, band-pass voicing decision, and fractional pitch are computed using the signal in the window that maximizes the normalized pitch correlation. A direct implementation of Equation (2) solving for ri (T) for all values of i would result in a significant increase in the computational complexity. To reduce the additional complexity, the recursion Equation (2) for cT (i, j) is used to compute the autocorrelation.
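- The window search described above can be sketched as follows. This is a minimal brute-force illustration, not the recursive update of Equation (2): it assumes the normalized correlation is the inner product of a window with its lag-shifted copy divided by the geometric mean of their energies, and the function names `normalized_correlation` and `best_window` are hypothetical.

```python
import math

def normalized_correlation(signal, start, length, lag):
    """Normalized correlation between a window and its copy shifted by `lag`."""
    a = signal[start:start + length]
    b = signal[start + lag:start + lag + length]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def best_window(signal, original_start, length, lag, max_slide):
    """Slide the window by up to +/- max_slide samples around its original
    position and return (best correlation, shift that achieved it)."""
    best_r, best_shift = -2.0, 0
    for shift in range(-max_slide, max_slide + 1):
        start = original_start + shift
        # Skip positions where the window or its lagged copy leaves the buffer.
        if start < 0 or start + lag + length > len(signal):
            continue
        r = normalized_correlation(signal, start, length, lag)
        if r > best_r:
            best_r, best_shift = r, shift
    return best_r, best_shift
```

- When a frame straddles an irregular segment and a periodic one, the search moves the window toward the periodic segment, which is the behavior the text attributes to the robust pitch detector. A recursive update would avoid recomputing the sums at every shift.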
- The aperiodic flag is set to 1 if Vbp1, determined in the voicing analysis for the 0 to 500 Hz band, is less than 0.5 and set to 0 otherwise. When set, the flag informs the decoder that the voiced component of the excitation should be aperiodic.
- A 10th order linear prediction analysis is performed on the input speech signal using a 200 sample (25 ms) Hamming window centered on the last sample in the current frame. A traditional autocorrelation analysis procedure is implemented using Levinson-Durbin recursion. In addition, a bandwidth expansion constant of 0.994 (15 Hz) is applied to the prediction coefficients by multiplying each coefficient by the bandwidth expansion constant.
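- The autocorrelation analysis and bandwidth expansion step can be sketched as follows. This is an illustrative sketch under stated assumptions: the predictor polynomial is written as A(z) = 1 + a_1 z^-1 + ... , and the expansion is assumed to take the common form in which the ith coefficient is scaled by the ith power of the constant. The function names are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve for prediction coefficients from autocorrelation values r[0..order].
    Returns ([1, a_1, ..., a_order], final prediction error energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def bandwidth_expand(a, gamma=0.994):
    """Scale coefficient i by gamma**i (assumed form of the expansion)."""
    return [c * gamma ** i for i, c in enumerate(a)]
```

- In use, the autocorrelation values would come from the Hamming-windowed 200-sample segment, with `order=10`.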
- Next, a linear prediction residual signal is calculated by filtering the input speech signal with the prediction filter using the coefficients determined above and an inverse of the prediction filter using those same coefficients. The two resulting signals are summed to create the linear prediction residual signal.
- Plosive Analysis
- The plosive analysis/synthesis system of the current invention consists of three parts: plosive detection, plosive modeling, and plosive synthesis. FIG. 5 shows the plosive analysis/synthesis system.
- Plosive Detection
- With reference to FIG. 5, the plosive detector 56 uses a sliding window for “peakiness” computation to detect the frame that contains a plosive signal. The peakiness value is sensitive to the phase of the plosive signal. By using a sliding window to detect a window position that maximizes the peakiness value, the phase sensitivity of the plosive is reduced. The peakiness, P, is defined as a ratio of the L2 norm to the L1 norm of the signal:
-
- where P_i is the peakiness of the ith window from the past, and r_0 is the first LPC residual sample in the original fixed-position window. In FIG. 6, the peakiness in the window 63 (P_{-Ns}) is first computed. The peakiness in the window 63 is then computed recursively as follows:
- $A_i = A_{i-1} + |r_{N-1+i}| - |r_{i-1}|$
- $B_i = B_{i-1} + r_{N-1+i}^2 - r_{i-1}^2$, Eq. (6)
-
- where, Ns is the maximum window-sliding range, which is also used for the pitch detector of the present invention. The peakiness value with the sliding window is illustrated in FIG. 3A along with that of the fixed position window and a corresponding speech input waveform. In addition to the peakiness value, the low pass energy is computed and used to distinguish the rapid onset of a vowel from the plosive signal.
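- The sliding-window peakiness computation can be sketched as follows. This is a minimal sketch that assumes both norms are normalized by the window length N, so that P = sqrt(N·B)/A with A the running sum of absolute values and B the running sum of squares. The function names are hypothetical, and the low-pass energy test mentioned above is not included.

```python
import math

def peakiness(window):
    """Ratio of the length-normalized L2 norm to the length-normalized L1 norm."""
    n = len(window)
    l1 = sum(abs(v) for v in window) / n
    l2 = math.sqrt(sum(v * v for v in window) / n)
    return l2 / l1 if l1 > 0 else 0.0

def sliding_peakiness(residual, n):
    """Peakiness of every length-n window of the residual, using running sums
    A (absolute values) and B (squares) that are updated one sample at a time."""
    a = sum(abs(v) for v in residual[:n])
    b = sum(v * v for v in residual[:n])
    out = []
    for i in range(len(residual) - n + 1):
        if i > 0:
            # Add the sample entering the window, drop the one leaving it.
            a += abs(residual[n - 1 + i]) - abs(residual[i - 1])
            b += residual[n - 1 + i] ** 2 - residual[i - 1] ** 2
        out.append(math.sqrt(n * b) / a if a > 0 else 0.0)
    return out
```

- A flat residual gives a peakiness of 1, while a window containing a single large impulse gives a clearly larger value, which is what makes the measure useful for flagging plosive frames.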
- Plosive Modeling
- In the present invention, a simple model is applied to the plosive signal expression in plosive modeling 57 of FIG. 5 so as to minimize the additional transmission bits. FIG. 12 shows the plosive signals detectable in the English language. Analysis of the frequency spectrums associated with the identified plosive sounds in FIG. 12 reveals that the 28 separate plosive sounds could be closely represented by the frequency spectrums of 18 replacement plosive sounds by aligning the maximum amplitude positions of each plosive signal. Near transparent replacement requires at least a rough spectral fit for each frequency. FIG. 13 illustrates the replacement matrix for the plosive sounds in the current invention.
-
- where, gp is the scaling factor based on the energy of the input plosive signal, and ai are the LPC coefficients computed from the input plosive signal. The template plosive signal v(n) was chosen arbitrarily and filtered with the 14th order inverse linear prediction filter. Since only a rough spectral fit between the input and the synthesized plosive signals provides a near transparent sound, an accurate LPC analysis is not required for the input plosive signal. In order to minimize the additional bits required for the plosive model, the same 10th order LPC model used for voiced pitch modeling is used for the production of the plosive signal.
- The parameters for transmission are a plosive flag, a plosive location, and plosive gain. The gain is computed by comparing the energy of the LPC residual of the plosive signal with that of the template signal. For the specific embodiment of the present invention, the gain is quantized with two bits. The position of the plosive signal is identified by seeking the maximum amplitude position in the frame and representing the plosive signal position with one bit in either the first half or the second half of the current frame. Thus, for the specific embodiment of the present invention, the plosive signal is quantized with only four bits including one bit for a plosive flag, two bits for a plosive gain and one bit for plosive position as is shown in FIG. 14. In the present invention, plosive synthesis is performed in the MELP decoder and will be disclosed in the description of the decoder.
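- The four-bit plosive parameter set described above can be sketched as follows. The bit order inside the four-bit field and the function names are assumptions made for illustration; the text specifies only the field widths (one flag bit, two gain bits, one position bit).

```python
def locate_plosive(frame):
    """Return 0 if the largest-magnitude sample lies in the first half of the
    frame, 1 if it lies in the second half."""
    peak = max(range(len(frame)), key=lambda n: abs(frame[n]))
    return 0 if peak < len(frame) // 2 else 1

def pack_plosive(flag, gain_index, position):
    """Pack flag (1 bit), gain index (2 bits), and position (1 bit) into 4 bits.
    The ordering flag|gain|position is an assumed layout."""
    if not 0 <= gain_index <= 3:
        raise ValueError("gain index must fit in two bits")
    if position not in (0, 1):
        raise ValueError("position must be 0 or 1")
    return (int(bool(flag)) << 3) | (gain_index << 1) | position

def unpack_plosive(bits):
    """Inverse of pack_plosive: returns (flag, gain_index, position)."""
    return bool((bits >> 3) & 1), (bits >> 1) & 3, bits & 1
```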
-
- where, L is the window length. The 0.01 offset prevents the log argument from approaching zero. If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes that the input signal range is −32768 to 32767.
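- The gain measurement can be sketched as follows, assuming it is ten times the base-10 logarithm of the mean-square value of the window plus the 0.01 offset, as the surrounding description suggests. The function name is hypothetical.

```python
import math

def frame_gain_db(window):
    """Gain in dB of a window of samples in the range -32768..32767.
    The 0.01 offset keeps the log argument away from zero, and negative
    results are clamped to 0.0."""
    mean_square = sum(s * s for s in window) / len(window)
    gain = 10.0 * math.log10(0.01 + mean_square)
    return max(gain, 0.0)
```

- For an all-zero window the raw value would be -20 dB, so the clamp returns 0.0.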
- Next, the encoder performs a quantization of the LPC coefficients. First, the LPC coefficients are converted into line spectrum frequencies (LSFs). All adjacent pairs of the LSF components are organized such that each is in ascending frequency order with a minimum of 50 Hz separation. The resulting LSF vector f is quantized using a multi-stage vector quantizer. The resulting vector is used in the Fourier magnitude calculation in the decoder.
- The final pitch value, P3, is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a lookup table. The all zero codeword represents the unvoiced state and is sent if Vbp1 is less than or equal to 0.6. All 28 codewords with Hamming weight of 1 or 2 are reserved for error protection.
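- The logarithmic pitch quantizer can be sketched as follows. This sketch covers only the 99-level index; the lookup table that maps an index to a 7-bit codeword is not reproduced here, and the function names are hypothetical.

```python
import math

LEVELS = 99
P_MIN, P_MAX = 20.0, 160.0
LOG_SPAN = math.log10(P_MAX) - math.log10(P_MIN)

def quantize_pitch(pitch):
    """Index 0..98 of the nearest level on a uniform grid in log10(pitch)."""
    pitch = min(max(pitch, P_MIN), P_MAX)
    position = (math.log10(pitch) - math.log10(P_MIN)) / LOG_SPAN
    return round((LEVELS - 1) * position)

def dequantize_pitch(index):
    """Pitch in samples represented by a quantizer index."""
    return 10.0 ** (math.log10(P_MIN) + index * LOG_SPAN / (LEVELS - 1))
```

- Because the grid is uniform in the log domain, the relative error is roughly constant (about 1%) across the whole pitch range.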
- The two gain values are quantized as follows. G2 is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB. G1 is quantized to 3 bits using the following adaptive algorithm. If G2 for the current frame is within 5 dB of G2 for the previous frame, and G1 is within 3 dB of the average of G2 values for the current and previous frames, then the frame is steady-state and a code of all zeros is sent to indicate that the decoder should set G1 to the mean of G2 values for the current and previous frames. Otherwise, the frame represents a transition and G1 is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G2 values for the current and previous frames to 6 dB above the maximum of those G2 values.
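- The two gain quantizers can be sketched as follows. This is a simplified illustration: it works on unquantized G2 values, reserves code 0 for the steady-state case, and maps the seven transition levels to codes 1 through 7. Those code assignments and the function names are assumptions.

```python
def quantize_g2(g2, lo=10.0, hi=77.0, bits=5):
    """Uniform quantizer for G2 over 10..77 dB; returns an index 0..31."""
    levels = (1 << bits) - 1
    index = round((g2 - lo) / (hi - lo) * levels)
    return min(max(index, 0), levels)

def quantize_g1(g1, g2, g2_prev):
    """3-bit adaptive code for G1.
    Code 0: steady state (decoder uses the mean of the two G2 values).
    Codes 1..7: 7-level uniform quantizer spanning 6 dB below the smaller
    G2 value to 6 dB above the larger one."""
    steady = abs(g2 - g2_prev) < 5.0 and abs(g1 - 0.5 * (g2 + g2_prev)) < 3.0
    if steady:
        return 0
    lo = min(g2, g2_prev) - 6.0
    hi = max(g2, g2_prev) + 6.0
    level = round((g1 - lo) / (hi - lo) * 6)
    return 1 + min(max(level, 0), 6)
```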
- Band-pass voicing quantization occurs as follows. When Vbp1 is less than or equal to 0.6 (unvoiced state), the remaining strengths Vbpi, i=2, 3, 4, 5 are set to 0. When Vbp1 is greater than 0.6, the remaining voicing strengths are quantized to 1.
- Fourier magnitude calculation and quantization occurs as follows. The Fourier magnitudes of the first 10 pitch harmonics are measured from the prediction residual generated by the quantized prediction coefficients. The measurement uses a 512-point Fast Fourier Transform (FFT) of a 200-sample window centered at the end of the frame. First, a set of quantized predictor coefficients is calculated from the quantized LSF vector. Then, the residual window is generated using the quantized prediction coefficients. Next, a 200-sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed. Finally, the complex FFT output is transformed into magnitudes and the harmonics are found with a spectral peak-selecting algorithm.
- The peak-selecting algorithm finds the maximum within a width of 512/P frequency samples centered around the initial estimate for each pitch harmonic, where P is the quantized pitch. This width is truncated to an integer. The initial estimate for the location of the ith harmonic is 512 i/P. The number of harmonic magnitudes searched for is limited to the smaller of 10 or P/4. These magnitudes are then normalized to have a RMS value of 1.0. If fewer than 10 harmonics are found, the remaining magnitudes are set to 1.0.
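- The peak-selecting step can be sketched as follows, given a list of FFT magnitudes. The rounding of the harmonic center to the nearest bin and the function name are assumptions; the search width, harmonic count limit, RMS normalization, and padding follow the description above.

```python
import math

def harmonic_magnitudes(spectrum_mag, pitch, fft_size=512, max_harmonics=10):
    """Pick the peak magnitude near each pitch harmonic.
    spectrum_mag: FFT magnitudes indexed by bin (at least fft_size/2 + 1 values).
    pitch: quantized pitch period in samples."""
    width = int(fft_size / pitch)               # search width, truncated
    count = min(max_harmonics, int(pitch / 4))  # number of harmonics searched
    found = []
    for i in range(1, count + 1):
        center = round(fft_size * i / pitch)    # initial estimate, nearest bin
        lo = max(0, center - width // 2)
        hi = min(len(spectrum_mag) - 1, center + width // 2)
        found.append(max(spectrum_mag[lo:hi + 1]))
    # Normalize the found magnitudes to an RMS value of 1.0.
    rms = math.sqrt(sum(m * m for m in found) / len(found)) if found else 0.0
    if rms > 0:
        found = [m / rms for m in found]
    # Harmonics that were not searched for default to 1.0.
    return found + [1.0] * (max_harmonics - len(found))
```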
-
- where f_i = 8000i/60 is the frequency in Hz corresponding to the ith harmonic for a default pitch period of 60 samples. The weights are applied to the squared difference between the input Fourier magnitudes and the codebook values.
- Lastly, the MELP encoder adds error protection and structures the 54-bit frame as follows. FIG. 16A shows the bit allocation for the MELP coder. To improve performance under channel errors, the unused coder parameters for the unvoiced mode are replaced with forward error correction. Three Hamming (7,4) codes and one Hamming (8,4) code may be used. The (7,4) code corrects single bit errors, while the (8,4) code additionally detects double bit errors. The (8,4) code is applied to the 4 most significant bits (MSBs) of the first multi-stage vector quantization index, and the 4 parity bits are written over the band-pass voicing. The remaining three bits of the first multi-stage vector quantization index, along with the reserved bit, are covered by a (7,4) code with the resulting 3 parity bits written to the MSBs of the Fourier magnitude vector quantization index. The 4 MSBs of the G2 codeword are protected with 3 parity bits which are written to the next 3 bits of the Fourier magnitudes. Finally, the least significant bit (LSB) of the second gain index and the 3 bit G1 codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag bit. The parity generator matrix for the Hamming (7,4) code is:
-
- FIG. 16A illustrates the bit allocation across the parameters communicated in the 54 bits of each MELP frame. FIG. 16B shows the transmission order for the 54 bits of each MELP frame for both voiced and unvoiced frame modes.
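- The single-error correction provided by a Hamming (7,4) code can be sketched as follows. This uses the textbook code with parity bits at positions 1, 2, and 4; it is not the parity generator matrix of the MELP specification, which is not reproduced in this text, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct up to one bit error and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 1-based index of the flipped bit, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

- An (8,4) code adds one overall parity bit to this structure, which is what allows double bit errors to be detected rather than miscorrected.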
- MELP Decoder
- The received bit stream is unpacked from the communications channel 18 and assembled into the parametric codewords. Parameter decoding differs for the voiced and unvoiced frames. Pitch is decoded first as it contains the voiced/unvoiced mode information. If the pitch code is all zeros or has only 1 bit set, then the unvoiced mode is used. If two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used.
- In the unvoiced mode, the (8,4) Hamming code is decoded to correct single bit errors and to detect double bit errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single bit errors.
- If an erasure is indicated in the current frame, by the Hamming code, by the pitch code, or directly signaled from the communication channel 18, then a frame repeat mechanism is implemented. All of the parameters for the current frame are replaced with the parameters from the previous frame. In addition, the first gain term is set equal to the second gain term so that no gain transitions are permitted.
- If an erasure is not indicated, the remaining parameters are decoded. The LSFs are checked for ascending order and a minimum separation of 50 Hz. In the unvoiced mode, default parameter values are used for the pitch, jitter, band-pass voicing, and Fourier magnitudes. The pitch value is set to 50 samples, the jitter is set to 25%, the band-pass voicing strengths are set to 0, and the Fourier magnitudes are set to 1.0. In the voiced mode, Vbp1 is set to 1; jitter is set to 25% if the aperiodic flag is set; otherwise, jitter is set to 0%. The band-pass voicing strength for the upper four bands is set to 1.0 if the corresponding bit is a 1; otherwise, the voicing strength is set to 0.
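- The mode decision made from the 7-bit pitch codeword can be sketched as follows. The function name and the string labels are hypothetical; the rule itself (Hamming weight 0 or 1 for unvoiced, 2 for erasure, anything else for voiced) follows the description above.

```python
def pitch_code_mode(codeword):
    """Classify a 7-bit pitch codeword by the number of bits set."""
    weight = bin(codeword & 0x7F).count("1")
    if weight <= 1:
        return "unvoiced"
    if weight == 2:
        return "erasure"
    return "voiced"
```

- Of the 128 possible codewords, 8 map to the unvoiced mode and 21 to an erasure, which leaves exactly 99 codewords for the 99 pitch levels.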
- When the special all zero code for the first gain parameter G1, is received, some errors in the second gain parameter, G2, can be detected and corrected. This correction process provides improved performance in channel errors.
- For quiet input signals, a small amount of gain attenuation is applied to both gain parameters using a power subtraction rule. This attenuation is a simplified, frequency invariant case of a smooth spectral subtraction noise suppression method. The background noise estimate is also used in the adaptive spectral enhancement calculation.
- Gain, G1, is then modified by subtracting a positive correction term, Gatt, given in dB by:
- $G_{att} = -10\log_{10}\left(1 - 10^{0.1\,[G_n + 3 - G_1]}\right)$. Eq. (13)
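- A worked sketch of the attenuation term, reading Eq. (13) as the power-subtraction rule G_att = -10 log10(1 - 10^(0.1 (G_n + 3 - G_1))) with G_n the background noise estimate in dB. The logarithm is undefined once G_1 falls to G_n + 3 dB or below, so the sketch caps the attenuation; the 6 dB cap and the function names are assumptions, not values stated in this text.

```python
import math


def gain_attenuation_db(g1_db: float, gn_db: float, max_att_db: float = 6.0) -> float:
    """Positive correction term G_att in dB for the first gain G1.

    max_att_db is an assumed safety cap, not taken from the text.
    """
    exponent_db = gn_db + 3.0 - g1_db
    if exponent_db >= 0.0:
        # 1 - 10^(0.1 * x) would be <= 0 here, so fall back to the cap.
        return max_att_db
    att = -10.0 * math.log10(1.0 - 10.0 ** (0.1 * exponent_db))
    return min(att, max_att_db)


def attenuated_gain_db(g1_db: float, gn_db: float) -> float:
    """G1 after subtracting the correction term."""
    return g1_db - gain_attenuation_db(g1_db, gn_db)
```

For example, with G_1 = 23 dB and G_n = 10 dB the exponent is -1, so G_att = -10 log10(0.9), about 0.46 dB. The correction only becomes significant when the frame gain is within a few dB of the noise floor.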
- All MELP speech synthesis parameters are interpolated pitch synchronously for each synthesized pitch period. The interpolated parameters are the gain in dB, LSFs, pitch, jitter, Fourier magnitudes, pulse and noise coefficients for mixed excitation, and spectral-tilt coefficient for the adaptive spectral-enhancement filter. Gain is linearly interpolated between the gain of the prior frame, G2p, and the first gain of the current frame, G1, if the starting point, t0, t0=0, 1, . . . , 179, of the new pitch period is less than 90; otherwise, gain is interpolated between the G1 and G2. Normally, the other parameters are linearly interpolated between the past and current frame values. The interpolation factor, int, for these parameters is based on the starting point of the new pitch period:
- $\mathrm{int} = t_0 / 180$ Eq. (14)
-
- where Gint is the interpolated gain. This interpolation factor is then clamped between 0 and 1.
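- The pitch-synchronous interpolation can be sketched as follows. The factor t0/180 and the choice of gain endpoints come from the description above. The way the gain fraction is computed inside each half frame is an assumption made for this illustration, and all function names are hypothetical.

```python
FRAME_LEN = 180  # samples per frame, so t0 ranges over 0..179


def interp_factor(t0: int) -> float:
    """Interpolation factor t0/180, clamped between 0 and 1."""
    return min(max(t0 / FRAME_LEN, 0.0), 1.0)


def lerp(previous: float, current: float, factor: float) -> float:
    """Linear interpolation between the past and current frame values."""
    return previous + factor * (current - previous)


def interp_gain_db(t0: int, g2_prev_db: float, g1_db: float, g2_db: float) -> float:
    """Gain for a pitch period starting at t0.

    Endpoints follow the text: (G2p, G1) when t0 < 90, otherwise (G1, G2).
    The fraction used within each half frame is an assumption.
    """
    half = FRAME_LEN // 2
    if t0 < half:
        return lerp(g2_prev_db, g1_db, t0 / half)
    return lerp(g1_db, g2_db, (t0 - half) / half)
```

With this choice the gain contour is continuous at t0 = 90, where both branches give G1.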
- New Mixed Excitation Algorithm
- Although the mixed excitation method in the existing MELP coder minimizes the band-pass filtering operations, it still requires two 32nd order FIR filtering operations for a pulse train and noise. The present invention removes these filters to reduce the computational complexity of the existing MELP. FIG. 9 shows a new mixed-excitation algorithm in the present invention. The existing MELP uses the Fourier magnitudes to generate a pulse train. The pulse train is mixed with random noise in time domain by band-pass filtering. In the present invention, noise is mixed with a pulse train in the frequency domain by adding a random phase to the Fourier magnitudes.
Block 64 shows the random phase generator. The random phase is added to only the Fourier magnitudes in unvoiced frequency bands. The mixed excitation signal in the present method is given by: - If ω=0, ω=π, or ω is in the voiced band,
- otherwise,
- $E_m(e^{j\omega}) = E_0(e^{j\omega})\,e^{j\varphi}, \quad \varphi = U[-\alpha\pi, \alpha\pi]$, Eq. (16)
- where α is an interpolation coefficient between 0 and 1. Since the existing MELP coder generates a pulse pitch-synchronously, the band-pass voicing decision needs to be linearly interpolated between 0 (voiced) and 1 (unvoiced).
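- The frequency-domain mixing can be sketched as below. A uniformly distributed phase in [-απ, απ] is applied to spectral bins in unvoiced bands. The sketch assumes that bins at ω = 0, ω = π and in voiced bands pass through unchanged, since the equation for that case is not reproduced in this text. The function name, the bin layout and the per-bin voicing flags are illustrative assumptions.

```python
import cmath
import math
import random
from typing import List, Optional, Sequence


def mix_excitation(
    spectrum: Sequence[complex],
    unvoiced: Sequence[bool],
    alpha: float,
    rng: Optional[random.Random] = None,
) -> List[complex]:
    """Add a random phase to the unvoiced bins of a half spectrum.

    spectrum[0] is taken as omega = 0 and spectrum[-1] as omega = pi
    (an assumed layout); unvoiced[k] flags whether bin k is in an unvoiced band.
    """
    rng = rng or random.Random()
    last = len(spectrum) - 1
    mixed = []
    for k, e0 in enumerate(spectrum):
        if k == 0 or k == last or not unvoiced[k]:
            # Assumed pass-through for DC, pi and voiced-band bins.
            mixed.append(e0)
        else:
            phi = rng.uniform(-alpha * math.pi, alpha * math.pi)
            mixed.append(e0 * cmath.exp(1j * phi))
    return mixed
```

Because only the phase changes, the Fourier magnitudes are preserved, which is why the two band-pass FIR filtering operations are no longer needed in this scheme.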
-
- where,
- $\alpha = 0.5p, \quad \beta = 0.8p'$ Eq. (18)
-
- This signal probability is clamped between 0 and 1.
- Linear prediction synthesis is performed by applying the coefficients corresponding to the interpolated LSFs to a direct-form filter.
-
- To prevent discontinuities in the synthesized speech, this scale factor is linearly interpolated between the previous and current values for the first ten samples of the pitch period.
- The pulse dispersion filter is a 65th order FIR filter derived from a spectrally flattened triangular pulse. The coefficients used in the filter are provided in the Specification for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction herein enclosed for reference.
- Post-Processor for the Fourier Magnitude Model
- In the present invention, a post-processor for the Fourier magnitude model 62 is added to the MELP decoder as shown in FIG. 2A. In the prior art, it was observed that the first few harmonic magnitudes of the coded speech for some low-pitch male speakers were suppressed by the preprocessing high-pass filter 11 in FIG. 2B and the adaptive spectral enhancement filter (ASEF) 30 in FIG. 2C. It was found that this effect led to a high-pass filtered quality for low-pitch male speakers. To provide more natural speech quality for such speakers, the present invention adaptively emphasizes the harmonic magnitudes in low frequencies by removing the effect of the two filters. The emphasized harmonic magnitude is given by:
-
- where h(n) is the impulse response of the filter $H(e^{j\omega})$, and N is the length of the impulse response. The magnitude response of the filter, $|H(e^{j\omega})|$, is given by:
- $|H(e^{j\omega})| = |H_1(e^{j\omega})|\,|H_2(e^{j\omega})|$, Eq. (23)
- where $|H_1(e^{j\omega})|$ and $|H_2(e^{j\omega})|$ are the magnitude responses of the ASEF 30 and the preprocessing high-pass filter 11, respectively. To avoid losing the advantage of the ASEF 30 in the prior art, the harmonic magnitude emphasis is applied only to the harmonics that are 200 Hz less than the first formant frequency of the frame. The first formant frequency F1 is roughly estimated using quantized line spectrum frequencies (LSFs) as follows:
-
- where $\hat{f}_i$ is the ith quantized LSF. Based on experimental results, the emphasized harmonic magnitude $|\tilde{S}(e^{j\omega})|$ is further emphasized by 2 dB in the present invention.
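- The post-processing step can be sketched as follows, under several assumptions that are not confirmed by this text: "removing the effect of the two filters" is taken to mean dividing each harmonic magnitude by the combined response |H| evaluated at that harmonic; the eligible harmonics are taken to be those below F1 minus 200 Hz; the 2 dB boost is applied as an amplitude gain; and F1 is passed in rather than estimated, because the estimation formula is not reproduced here. All names and the 8 kHz sampling rate are illustrative.

```python
import cmath
import math
from typing import List, Sequence


def filter_magnitude(h: Sequence[float], omega: float) -> float:
    """|H(e^{j omega})| for an impulse response h(n), n = 0..N-1."""
    return abs(sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h)))


def emphasize_harmonics(
    mags: Sequence[float],
    f0_hz: float,
    h: Sequence[float],
    f1_hz: float,
    fs_hz: float = 8000.0,
    margin_hz: float = 200.0,
    extra_db: float = 2.0,
) -> List[float]:
    """Undo the combined filter response on low harmonics, then boost them.

    mags[k-1] is the magnitude of harmonic k at frequency k * f0_hz.
    """
    boost = 10.0 ** (extra_db / 20.0)  # assumed amplitude-domain dB
    out = []
    for k, mag in enumerate(mags, start=1):
        freq_hz = k * f0_hz
        if freq_hz < f1_hz - margin_hz:
            resp = filter_magnitude(h, 2.0 * math.pi * freq_hz / fs_hz)
            if resp > 0.0:
                mag = mag / resp * boost
        out.append(mag)
    return out
```

Restricting the correction to harmonics below the first formant leaves the ASEF's effect on the formant region untouched.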
- Plosive Synthesis
- FIG. 7 shows the block diagram of the plosive synthesis 66. As shown in FIG. 7, all plosive signals are produced by scaling and LPC synthesis/filtering 32 the plosive residual template 71, which is pre-stored in the synthesizer. This plosive residual template 71 was chosen arbitrarily and filtered with the 14th order LPC inverse filter. The LPC coefficients for the frame that contains the plosive 81 are also used for the plosive signal synthesis. The gain of the synthesized plosive signal is adjusted by applying plosive gain 76 to the MELP gain 34. In the present invention, the length of the synthesized plosive signal is half of the frame length, and the synthesized plosive is added back to either the first half or the second half of the coded speech frame according to the plosive position as shown in block 73. Before the plosive is added back to the coded speech, the gain of the coded speech is adjusted in gain suppressor 75 such that the gain of the half frame to which the plosive is added back is suppressed. This is realized by simply replacing the gain of the half frame to which the plosive is added back with that of the previous half frame:
-
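- The template-based synthesis can be sketched as follows: a stored residual template is scaled, passed through the all-pole LPC synthesis filter, and added to one half of the coded frame. The sign convention A(z) = 1 + Σ a_k z^-k, the linear (non-dB) gain, and the function names are assumptions, and the gain suppression of the target half frame is omitted because its equation is not reproduced in this text.

```python
from typing import List, Sequence


def lpc_synthesize(residual: Sequence[float], a: Sequence[float]) -> List[float]:
    """All-pole filter 1/A(z): y[n] = x[n] - sum_k a[k-1] * y[n-k]."""
    out: List[float] = []
    for n, x in enumerate(residual):
        acc = x
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                acc -= coeff * out[n - k]
        out.append(acc)
    return out


def add_plosive(
    frame: Sequence[float],
    template: Sequence[float],
    a: Sequence[float],
    gain: float,
    second_half: bool,
) -> List[float]:
    """Add a synthesized plosive to the first or second half of a coded frame."""
    half = len(frame) // 2
    if len(template) != half:
        raise ValueError("template must be half a frame long")
    plosive = lpc_synthesize([gain * r for r in template], a)
    start = half if second_half else 0
    mixed = list(frame)
    for i, sample in enumerate(plosive):
        mixed[start + i] += sample
    return mixed
```

Storing only a residual template means the spectral shape of each plosive comes from the LPC coefficients already transmitted for that frame.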
- Bit Allocation
- Another advantage of the present invention is bit-stream compatibility with the existing MELP coder. The present invention consists of four embodiments: a robust pitch detector, a plosive analysis/synthesis system, a post-processor for the Fourier magnitude model, and a new mixed-excitation algorithm. As shown in FIG. 14, only the plosive analysis/synthesis system requires additional bits for transmission. In the present invention, the additional bits for the plosive can be packed into the bit-stream of the existing MELP. There are two different modes for the bit allocation of the existing MELP: one voiced, the other unvoiced. The mode is selected as voiced if the first band is voiced and as unvoiced if the first band is unvoiced. For unvoiced mode, the existing MELP coder sets only the first and fifth bands to voiced, and the index for the pitch lag is set to less than three so as to indicate that the frame is unvoiced. In the decoder, if the index for the pitch lag is less than three, the frame is regarded as unvoiced. Otherwise, the frame is regarded as voiced. In the present invention, a frame that contains a plosive is assumed to be an unvoiced frame. FIG. 10 shows the bit packing flow diagram for the plosive signal. To identify the plosive frame in the decoder of the present invention, the first and fifth bands are set to voiced but the pitch is set to three as a dummy. Then, the plosive gain and position are packed into the bits for the Fourier magnitude, which are used only for voiced frames in the existing MELP. FIG. 11 shows the bit unpacking flow diagram for the plosive signal. The decoder of the existing MELP regards the frame as unvoiced if the pitch index is less than three. If the pitch index is equal to or greater than three, the combination in which only the first and fifth bands are voiced will never occur in the existing MELP. In the decoder of the present invention, the frame is regarded as the plosive frame if this combination occurs. Then, the plosive parameters such as a gain and position are extracted from the bits for the Fourier magnitude. Since the bit-stream specification is maintained in the present invention, the present system can interchange the encoder/decoder with the existing MELP.
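- The decoder-side frame classification can be sketched as below. The sketch reads the reserved combination as "pitch index of three or more with only the first and fifth bands flagged voiced", which is the pattern the encoder description sets for a plosive frame. The function name and the five-element flag list are illustrative, and the layout of the plosive gain and position inside the Fourier magnitude bits is not specified here, so it is not modeled.

```python
from typing import Sequence


def classify_frame(pitch_index: int, band_voiced: Sequence[bool]) -> str:
    """Classify a decoded frame as 'unvoiced', 'plosive', or 'voiced'.

    band_voiced holds the five band-pass voicing flags, lowest band first.
    """
    if len(band_voiced) != 5:
        raise ValueError("expected five band-pass voicing flags")
    if pitch_index < 3:
        # Same rule as the existing decoder: small pitch index means unvoiced.
        return "unvoiced"
    if list(band_voiced) == [True, False, False, False, True]:
        # Combination reserved for signaling a plosive frame.
        return "plosive"
    return "voiced"
```

Because the existing encoder never emits the reserved combination, streams it produces are never classified as plosive frames by this rule.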
- While preferred embodiments of the invention have been disclosed in detail in the foregoing description and drawings, it will be understood by those skilled in the art that variations and modifications thereof can be made without departing from the spirit and scope of the invention as set forth in the following claims.
Claims (18)
1. A method of enhancing the speech quality of a speech coder comprising the steps of:
digitally sampling speech to create a speech waveform over a multiplicity of frames;
using a sliding-sample window to locate a frame position with the highest pitch correlation; and
formulating at least one synthesized voice parameter in response to the speech waveform within the located frame position.
2. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch over multiple frame positions defined by the sliding-sample window.
3. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch for a fixed-length sliding-sample window.
4. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch for up to a predetermined number of frames.
5. The method of claim 1, wherein the step of formulating comprises estimating a frame pitch in response to the signal contained within the located frame position.
6. The method of claim 5, further comprising the step of:
estimating linear predictive coding (LPC) coefficients in response to the signal contained within the located frame position.
7. The method of claim 5, further comprising the step of:
estimating gain in response to the signal contained within the located frame position.
8. The method of claim 5, further comprising the step of:
estimating a voicing decision in response to the signal contained within the located frame position.
9. The method of claim 5, further comprising the step of:
estimating a fractional pitch in response to the signal contained within the located frame position.
10. A speech coder comprising:
means for sampling a speech waveform to generate a discrete representation of the speech waveform over a multiplicity of frames; and
means for locating a pitch-analysis window over that frame position with the highest pitch correlation.
11. The coder of claim 10, wherein the means for locating a frame position with the highest pitch correlation compares pitch analysis results associated with multiple frames.
12. The coder of claim 10, wherein the means for locating a frame position with the highest pitch correlation performs a recurrence calculation on the autocorrelation of the pitch for multiple frame positions defined by the sliding-sample window.
13. The coder of claim 10, wherein the means for locating a pitch-analysis window comprises a fixed-length window.
14. The coder of claim 13, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of pitch results from multiple frame positions defined by the fixed-length window.
15. The coder of claim 10, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch from up to a predetermined number of frames defined by the sliding-sample window.
16. The coder of claim 10, further comprising:
means for estimating a plurality of speech parameters in response to the signal contained within the located frame position.
17. The coder of claim 16, wherein the means for estimating comprises at least one digital signal processor in the mixed-excitation linear predictive (MELP) coder.
18. The coder of claim 16, wherein the means for estimating comprises at least one algorithm stored within the mixed-excitation linear predictive (MELP) coder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/991,387 US20020052734A1 (en) | 1999-02-04 | 2001-11-16 | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11864499P | 1999-02-04 | 1999-02-04 | |
US09/408,195 US6453287B1 (en) | 1999-02-04 | 1999-09-29 | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders |
US09/991,387 US20020052734A1 (en) | 1999-02-04 | 2001-11-16 | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/408,195 Division US6453287B1 (en) | 1999-02-04 | 1999-09-29 | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders |
Publications (1)
Publication Number | Publication Date |
---|---|
US20020052734A1 (en) | 2002-05-02 |
Family
ID=26816592
Family Applications (2)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|
US09/408,195 Expired - Fee Related US6453287B1 (en) | 1999-02-04 | 1999-09-29 | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders |
US09/991,387 Abandoned US20020052734A1 (en) | 1999-02-04 | 2001-11-16 | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders |
Country Status (1)
Country | Link |
---|---|
US (2) | US6453287B1 (en) |
Cited By (64)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6968309B1 (en) * | 2000-10-31 | 2005-11-22 | Nokia Mobile Phones Ltd. | Method and system for speech frame error concealment in speech decoding |
US20030220783A1 (en) * | 2002-03-12 | 2003-11-27 | Sebastian Streich | Efficiency improvements in scalable audio coding |
US7277849B2 (en) * | 2002-03-12 | 2007-10-02 | Nokia Corporation | Efficiency improvements in scalable audio coding |
US20050131680A1 (en) * | 2002-09-13 | 2005-06-16 | International Business Machines Corporation | Speech synthesis using complex spectral modeling |
US8280724B2 (en) * | 2002-09-13 | 2012-10-02 | Nuance Communications, Inc. | Speech synthesis using complex spectral modeling |
US20070055502A1 (en) * | 2005-02-15 | 2007-03-08 | Bbn Technologies Corp. | Speech analyzing system with speech codebook |
US8219391B2 (en) * | 2005-02-15 | 2012-07-10 | Raytheon Bbn Technologies Corp. | Speech analyzing system with speech codebook |
US20070094009A1 (en) * | 2005-10-26 | 2007-04-26 | Ryu Sang-Uk | Encoder-assisted frame loss concealment techniques for audio coding |
US8620644B2 (en) * | 2005-10-26 | 2013-12-31 | Qualcomm Incorporated | Encoder-assisted frame loss concealment techniques for audio coding |
US20070299659A1 (en) * | 2006-06-21 | 2007-12-27 | Harris Corporation | Vocoder and associated method that transcodes between mixed excitation linear prediction (melp) vocoders with different speech frame rates |
US8589151B2 (en) * | 2006-06-21 | 2013-11-19 | Harris Corporation | Vocoder and associated method that transcodes between mixed excitation linear prediction (MELP) vocoders with different speech frame rates |
US20090024398A1 (en) * | 2006-09-12 | 2009-01-22 | Motorola, Inc. | Apparatus and method for low complexity combinatorial coding of signals |
US8495115B2 (en) | 2006-09-12 | 2013-07-23 | Motorola Mobility Llc | Apparatus and method for low complexity combinatorial coding of signals |
US9256579B2 (en) | 2006-09-12 | 2016-02-09 | Google Technology Holdings LLC | Apparatus and method for low complexity combinatorial coding of signals |
US20080069364A1 (en) * | 2006-09-20 | 2008-03-20 | Fujitsu Limited | Sound signal processing method, sound signal processing apparatus and computer program |
US20100106509A1 (en) * | 2007-06-27 | 2010-04-29 | Osamu Shimada | Audio encoding method, audio decoding method, audio encoding device, audio decoding device, program, and audio encoding/decoding system |
US8788264B2 (en) * | 2007-06-27 | 2014-07-22 | Nec Corporation | Audio encoding method, audio decoding method, audio encoding device, audio decoding device, program, and audio encoding/decoding system |
US20090100121A1 (en) * | 2007-10-11 | 2009-04-16 | Motorola, Inc. | Apparatus and method for low complexity combinatorial coding of signals |
US8576096B2 (en) | 2007-10-11 | 2013-11-05 | Motorola Mobility Llc | Apparatus and method for low complexity combinatorial coding of signals |
US20090112607A1 (en) * | 2007-10-25 | 2009-04-30 | Motorola, Inc. | Method and apparatus for generating an enhancement layer within an audio coding system |
US8209190B2 (en) | 2007-10-25 | 2012-06-26 | Motorola Mobility, Inc. | Method and apparatus for generating an enhancement layer within an audio coding system |
US20090234642A1 (en) * | 2008-03-13 | 2009-09-17 | Motorola, Inc. | Method and Apparatus for Low Complexity Combinatorial Coding of Signals |
US8639519B2 (en) | 2008-04-09 | 2014-01-28 | Motorola Mobility Llc | Method and apparatus for selective signal coding based on core encoder performance |
US20090259477A1 (en) * | 2008-04-09 | 2009-10-15 | Motorola, Inc. | Method and Apparatus for Selective Signal Coding Based on Core Encoder Performance |
US20100169087A1 (en) * | 2008-12-29 | 2010-07-01 | Motorola, Inc. | Selective scaling mask computation based on peak detection |
US8219408B2 (en) | 2008-12-29 | 2012-07-10 | Motorola Mobility, Inc. | Audio signal decoder and method for producing a scaled reconstructed audio signal |
US8175888B2 (en) | 2008-12-29 | 2012-05-08 | Motorola Mobility, Inc. | Enhanced layered gain factor balancing within a multiple-channel audio coding system |
US8340976B2 (en) | 2008-12-29 | 2012-12-25 | Motorola Mobility Llc | Method and apparatus for generating an enhancement layer within a multiple-channel audio coding system |
US8200496B2 (en) | 2008-12-29 | 2012-06-12 | Motorola Mobility, Inc. | Audio signal decoder and method for producing a scaled reconstructed audio signal |
US20100169100A1 (en) * | 2008-12-29 | 2010-07-01 | Motorola, Inc. | Selective scaling mask computation based on peak detection |
US8140342B2 (en) | 2008-12-29 | 2012-03-20 | Motorola Mobility, Inc. | Selective scaling mask computation based on peak detection |
US20100169099A1 (en) * | 2008-12-29 | 2010-07-01 | Motorola, Inc. | Method and apparatus for generating an enhancement layer within a multiple-channel audio coding system |
US20100169101A1 (en) * | 2008-12-29 | 2010-07-01 | Motorola, Inc. | Method and apparatus for generating an enhancement layer within a multiple-channel audio coding system |
US20110156932A1 (en) * | 2009-12-31 | 2011-06-30 | Motorola | Hybrid arithmetic-combinatorial encoder |
US8149144B2 (en) | 2009-12-31 | 2012-04-03 | Motorola Mobility, Inc. | Hybrid arithmetic-combinatorial encoder |
US8423355B2 (en) | 2010-03-05 | 2013-04-16 | Motorola Mobility Llc | Encoder for audio signal including generic audio and speech frames |
US20110218799A1 (en) * | 2010-03-05 | 2011-09-08 | Motorola, Inc. | Decoder for audio signal including generic audio and speech frames |
US20110218797A1 (en) * | 2010-03-05 | 2011-09-08 | Motorola, Inc. | Encoder for audio signal including generic audio and speech frames |
US8428936B2 (en) | 2010-03-05 | 2013-04-23 | Motorola Mobility Llc | Decoder for audio signal including generic audio and speech frames |
US9502048B2 (en) | 2010-04-19 | 2016-11-22 | Knowles Electronics, Llc | Adaptively reducing noise to limit speech distortion |
US9699554B1 (en) | 2010-04-21 | 2017-07-04 | Knowles Electronics, Llc | Adaptive signal equalization |
US9343056B1 (en) | 2010-04-27 | 2016-05-17 | Knowles Electronics, Llc | Wind noise detection and suppression |
US9438992B2 (en) | 2010-04-29 | 2016-09-06 | Knowles Electronics, Llc | Multi-microphone robust noise suppression |
US9245538B1 (en) * | 2010-05-20 | 2016-01-26 | Audience, Inc. | Bandwidth enhancement of speech signals assisted by noise reduction |
US9431023B2 (en) | 2010-07-12 | 2016-08-30 | Knowles Electronics, Llc | Monaural noise suppression based on computational auditory scene analysis |
US9129600B2 (en) | 2012-09-26 | 2015-09-08 | Google Technology Holdings LLC | Method and apparatus for encoding an audio signal |
US20160203826A1 (en) * | 2013-07-12 | 2016-07-14 | Orange | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10672412B2 (en) | 2013-07-12 | 2020-06-02 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US20180018982A1 (en) * | 2013-07-12 | 2018-01-18 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US20180082699A1 (en) * | 2013-07-12 | 2018-03-22 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10943594B2 (en) | 2013-07-12 | 2021-03-09 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10438599B2 (en) * | 2013-07-12 | 2019-10-08 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10438600B2 (en) * | 2013-07-12 | 2019-10-08 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10446163B2 (en) * | 2013-07-12 | 2019-10-15 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10943593B2 (en) | 2013-07-12 | 2021-03-09 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
US10783895B2 (en) | 2013-07-12 | 2020-09-22 | Koninklijke Philips N.V. | Optimized scale factor for frequency band extension in an audio frequency signal decoder |
CN105118513B (en) * | 2015-07-22 | 2018-12-28 | 重庆邮电大学 | A kind of 1.2kb/s low bit rate speech coding method based on mixed excitation linear prediction MELP |
CN105118513A (en) * | 2015-07-22 | 2015-12-02 | 重庆邮电大学 | 1.2kb/s low-rate speech encoding and decoding method based on mixed excitation linear prediction MELP |
WO2019142513A1 (en) * | 2018-01-17 | 2019-07-25 | 日本電信電話株式会社 | Encoding device, decoding device, fricative determination device, and method and program thereof |
CN111602197A (en) * | 2018-01-17 | 2020-08-28 | 日本电信电话株式会社 | Decoding device, encoding device, methods thereof, and program |
WO2019142514A1 (en) * | 2018-01-17 | 2019-07-25 | 日本電信電話株式会社 | Decoding device, encoding device, method and program thereof |
CN110610713A (en) * | 2019-08-28 | 2019-12-24 | 南京梧桐微电子科技有限公司 | Vocoder residue spectrum amplitude parameter reconstruction method and system |
CN110503966A (en) * | 2019-09-06 | 2019-11-26 | 成都理工大学 | MELP/CELP mixing voice navamander and coding method based on rail |
US11587573B2 (en) | 2019-09-17 | 2023-02-21 | Acer Incorporated | Speech processing method and device thereof |
Also Published As
Publication number | Publication date |
---|---|
US6453287B1 (en) | 2002-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6453287B1 (en) | Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders | |
Supplee et al. | MELP: the new federal standard at 2400 bps | |
KR100388388B1 (en) | Method and apparatus for synthesizing speech using regenerated phase information | |
KR100389178B1 (en) | Voiced/unvoiced classification of speech for use in speech decoding during frame erasures | |
KR101032119B1 (en) | Method and device for efficient frame erasure concealment in linear predictive based speech codecs | |
KR100389179B1 (en) | Pitch delay modification during frame erasures | |
US5754974A (en) | Spectral magnitude representation for multi-band excitation speech coders | |
JP5149198B2 (en) | Method and device for efficient frame erasure concealment within a speech codec | |
US8315860B2 (en) | Interoperable vocoder | |
US6931373B1 (en) | Prototype waveform phase modeling for a frequency domain interpolative speech codec system | |
KR100433608B1 (en) | Improved adaptive codebook-based speech compression system | |
US7286982B2 (en) | LPC-harmonic vocoder with superframe structure | |
US7013269B1 (en) | Voicing measure for a speech CODEC system | |
US6996523B1 (en) | Prototype waveform magnitude quantization for a frequency domain interpolative speech codec system | |
EP0673013B1 (en) | Signal encoding and decoding system | |
EP1598811B1 (en) | Decoding apparatus and method | |
KR20020052191A (en) | Variable bit-rate celp coding of speech with phonetic classification | |
JPH08328591A (en) | Method for adaptation of noise masking level to synthetic analytical voice coder using short-term perception weighting filter | |
JP4040126B2 (en) | Speech decoding method and apparatus | |
JP2004310088A (en) | Half-rate vocoder | |
US20130246055A1 (en) | System and Method for Post Excitation Enhancement for Low Bit Rate Speech Coding | |
EP0747884B1 (en) | Codebook gain attenuation during frame erasures | |
Wang et al. | Phonetic segmentation for low rate speech coding | |
KR100220783B1 (en) | Speech quantization and error correction method | |
Stegmann et al. | CELP coding based on signal classification using the dyadic wavelet transform | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: GEORGIA TECH RESEARCH CORPORATION, GEORGIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: UNNO, TAKAHIRO; BARNWELL III, THOMAS P.; TRUONG, KWAN K.; REEL/FRAME: 012323/0384; SIGNING DATES FROM 19990923 TO 19990926 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |