US20020052734A1 - Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders - Google Patents


Info

Publication number
US20020052734A1
US20020052734A1 (application US09/991,387)
Authority
US
United States
Prior art keywords
pitch
coder
speech
frame position
plosive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US09/991,387
Inventor
Takahiro Unno
Thomas Barnwell
Kwan Truong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Georgia Tech Research Corp
Original Assignee
Georgia Tech Research Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Georgia Tech Research Corp filed Critical Georgia Tech Research Corp
Priority to US09/991,387
Assigned to GEORGIA TECH RESEARCH CORPORATION reassignment GEORGIA TECH RESEARCH CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BARNWELL III, THOMAS P., TRUONG, KWAN K., UNNO, TAKAHIRO
Publication of US20020052734A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters, the excitation function being a multipulse excitation

Definitions

  • the input speech signal gain is measured twice per frame using a pitch adaptive window length.
  • This adaptive length is identical for both gain measurements and is determined as follows.
  • If Vbp1 > 0.6, the window length is the shortest multiple of P2 that is longer than 120 samples. If this length exceeds 320 samples, it is divided by 2.
  • If Vbp1 is less than or equal to 0.6, the window length is 120 samples.
  • the gain calculation for the first window produces G 1 and is centered 90 samples before the last sample of the current frame.
  • the calculation for the second window produces G 2 and is centered on the last sample of the current frame.
  • L is the window length.
  • the 0.01 offset prevents the log argument from approaching zero. If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes that the input signal range is -32768 to 32767.
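  • For illustration, a minimal sketch of the two gain measurements described above, assuming the conventional log-power form implied by the 0.01 offset; the names and window bookkeeping are illustrative, not the standard's reference code:

```python
import numpy as np

def gain_window_length(v_bp1, p2):
    # Pitch-adaptive length: shortest multiple of the pitch P2 longer than
    # 120 samples when the low band is voiced (Vbp1 > 0.6), else 120 samples.
    if v_bp1 > 0.6:
        length = p2 * (120 // p2 + 1)
        if length > 320:
            length //= 2
    else:
        length = 120
    return length

def measure_gain(s, center, length):
    # Log power in dB over `length` samples centered at `center`; the 0.01
    # offset keeps the log argument away from zero, negative measurements
    # are clamped to 0.0, and samples are assumed to span -32768..32767.
    half = length // 2
    w = np.asarray(s[center - half:center - half + length], dtype=np.float64)
    g = 10.0 * np.log10(0.01 + np.dot(w, w) / length)
    return max(g, 0.0)

# G1 is centered 90 samples before the last sample of the frame, G2 on it:
#   L = gain_window_length(v_bp1, p2)
#   g1 = measure_gain(s, last - 90, L)
#   g2 = measure_gain(s, last, L)
```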
  • the encoder performs a quantization of the LPC coefficients.
  • the LPC coefficients are converted into line spectrum frequencies (LSFs). The LSF components are then ordered so that adjacent components are in ascending frequency with a minimum separation of 50 Hz.
  • the resulting LSF vector f is quantized using a multi-stage vector quantizer. The resulting vector is used in the Fourier magnitude calculation in the decoder.
  • the final pitch value, P3, is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a lookup table. The all-zero codeword represents the unvoiced state and is sent if Vbp1 is less than or equal to 0.6. All 28 codewords with a Hamming weight of 1 or 2 are reserved for error protection.
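  • A sketch of the 99-level logarithmic pitch quantizer and its 7-bit codeword table; the reserved-codeword counts follow the text (the all-zero word plus the 28 words of weight 1 or 2 leave exactly 99 usable codewords), while the assignment order of codewords to levels is an assumption:

```python
import math

# 99 pitch levels, uniform on a log scale from 20 to 160 samples.
LEVELS = [20.0 * (160.0 / 20.0) ** (i / 98.0) for i in range(99)]

# Codewords of Hamming weight >= 3: the all-zero word (unvoiced) and the
# 28 words of weight 1 or 2 (error protection) are excluded, leaving 99.
CODEWORDS = [c for c in range(128) if bin(c).count("1") >= 3]
assert len(CODEWORDS) == 99

def quantize_pitch(p3):
    # Nearest level on the logarithmic scale, then table lookup.
    p3 = min(max(p3, 20.0), 160.0)
    idx = round(98.0 * math.log(p3 / 20.0) / math.log(160.0 / 20.0))
    return CODEWORDS[idx]
```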
  • the two gain values are quantized as follows.
  • G 2 is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB.
  • G 1 is quantized to 3 bits using the following adaptive algorithm. If G 2 for the current frame is within 5 dB of G 2 for the previous frame, and G 1 is within 3 dB of the average of G 2 values for the current and previous frames, then the frame is steady-state and a code of all zeros is sent to indicate that the decoder should set G 1 to the mean of G 2 values for the current and previous frames. Otherwise, the frame represents a transition and G 1 is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G 2 values for the current and previous frames to 6 dB above the maximum of those G 2 values.
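  • The adaptive 3-bit G 1 quantizer above can be sketched as follows; the code assignment (0 for steady state, 1-7 for the uniform levels) is an assumption:

```python
def quantize_g1(g1, g2_cur, g2_prev):
    # Steady-state test: G2 within 5 dB of the previous G2, and G1 within
    # 3 dB of the average of the two G2 values.
    avg = 0.5 * (g2_cur + g2_prev)
    if abs(g2_cur - g2_prev) < 5.0 and abs(g1 - avg) < 3.0:
        return 0  # decoder sets G1 to the mean of the two G2 values
    # Transition: 7-level uniform quantizer spanning 6 dB below the smaller
    # G2 value to 6 dB above the larger one.
    lo = min(g2_cur, g2_prev) - 6.0
    hi = max(g2_cur, g2_prev) + 6.0
    step = (hi - lo) / 6.0
    idx = int(round((g1 - lo) / step))
    return 1 + min(max(idx, 0), 6)
```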
  • Fourier magnitude calculation and quantization, based on the Fast Fourier Transform (FFT), occur as follows.
  • a set of quantized predictor coefficients is calculated from the quantized LSF vector.
  • the residual window is generated using the quantized prediction coefficients.
  • a 200 sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed.
  • the complex FFT output is transformed into magnitudes and the harmonics found with a spectral peak-selecting algorithm.
  • the peak-selecting algorithm finds the maximum within a width of 512/P frequency samples centered around the initial estimate for each pitch harmonic, where P is the quantized pitch. This width is truncated to an integer.
  • the initial estimate for the location of the i th harmonic is 512 i/P.
  • the number of harmonic magnitudes searched for is limited to the smaller of 10 or P/4. These magnitudes are then normalized to have a RMS value of 1.0. If fewer than 10 harmonics are found, the remaining magnitudes are set to 1.0.
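  • A sketch of the harmonic peak-selection and normalization steps, assuming the 200-sample Hamming window and 512-point zero-padded FFT described above; buffer handling is illustrative:

```python
import numpy as np

def fourier_magnitudes(residual, p):
    # Hamming window, zero-padded 512-point FFT, then magnitudes.
    x = np.abs(np.fft.fft(residual * np.hamming(len(residual)), 512))
    width = int(512 / p)               # peak-search width, truncated
    n_harm = min(10, int(p / 4))       # smaller of 10 or P/4 harmonics
    mags = np.ones(10)                 # unfound harmonics default to 1.0
    for i in range(1, n_harm + 1):
        center = int(512 * i / p)      # initial estimate of harmonic i
        lo = max(center - width // 2, 0)
        hi = min(center + width // 2 + 1, 257)
        mags[i - 1] = x[lo:hi].max()
    rms = np.sqrt(np.mean(mags[:n_harm] ** 2))
    mags[:n_harm] /= rms               # normalize to an RMS value of 1.0
    return mags
```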
  • the 10 magnitudes are quantized with an 8-bit quantizer.
  • the codebook is searched using a perceptually weighted Euclidean distance, with fixed weights that emphasize low frequencies over higher frequencies.
  • FIG. 16A shows the bit allocation for the MELP coder.
  • the unused coder parameters for the unvoiced mode are replaced with forward error correction.
  • Three Hamming (7,4) codes and one Hamming (8,4) code may be used.
  • the (7,4) codes correct single bit errors, while the (8,4) code corrects single bit errors and also detects double bit errors.
  • the (8,4) code is applied to the 4 most significant bits (MSBs) of the first multi-stage vector quantization index, and the 4 parity bits are written over the band-pass voicing.
  • the remaining three bits of the first multi-stage vector quantization index, along with the reserved bit, are covered by a (7,4) code, with the resulting 3 parity bits written to the MSBs of the Fourier series vector quantization index.
  • the 4 MSBs of the G 2 codeword are protected with 3 parity bits which are written to the next 3 bits of the Fourier magnitudes.
  • the least significant bit (LSB) of the second gain index and the 3 bit G 1 codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag bit.
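  • As a concrete illustration of the FEC arithmetic, a systematic Hamming (7,4) encoder with an (8,4) extension by overall parity; the particular parity equations and bit placement here are assumptions, as the standard fixes its own bit ordering:

```python
def hamming74_encode(nibble):
    # Four data bits d0..d3 produce three parity bits (one valid choice of
    # parity equations for a distance-3 code).
    d = [(nibble >> i) & 1 for i in range(4)]
    p0 = d[0] ^ d[1] ^ d[3]
    p1 = d[0] ^ d[2] ^ d[3]
    p2 = d[1] ^ d[2] ^ d[3]
    return (nibble << 3) | (p2 << 2) | (p1 << 1) | p0

def hamming84_encode(nibble):
    # Extended code: an overall parity bit upgrades the (7,4) code so that
    # double-bit errors become detectable.
    cw = hamming74_encode(nibble)
    return (cw << 1) | (bin(cw).count("1") & 1)
```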
  • FIG. 16A illustrates the bit allocation across the parameters communicated in the 54 bits of each MELP frame.
  • FIG. 16B shows the transmission order for the 54 bits of each MELP frame for both voiced and unvoiced frame modes.
  • the received bit stream is unpacked from the communications channel 18 and assembled into the parametric codewords.
  • Parameter decoding differs for the voiced and unvoiced frames.
  • Pitch is decoded first as it contains the voiced/unvoiced mode information. If the pitch code is all zeros or has only 1 bit set, then the unvoiced mode is used. If two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used.
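  • This mode decision reduces to the Hamming weight of the 7-bit pitch codeword, as in the following sketch:

```python
def frame_mode(pitch_code):
    # All-zero or a single set bit: unvoiced; exactly two set bits: frame
    # erasure; anything else carries a valid pitch value.
    weight = bin(pitch_code & 0x7F).count("1")
    if weight <= 1:
        return "unvoiced"
    if weight == 2:
        return "erasure"
    return "voiced"
```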
  • the (8,4) Hamming code is decoded to correct single bit errors and to detect double bit errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single bit errors.
  • the remaining parameters are decoded.
  • the LSFs are checked for ascending order and a minimum separation of 50 Hz.
  • for unvoiced frames, default parameter values are used for the pitch, jitter, band-pass voicing, and Fourier magnitudes.
  • the pitch value is set to 50 samples, the jitter is set to 25%, the band-pass voicing strengths are set to 0, and the Fourier magnitudes are set to 1.0.
  • for voiced frames, Vbp1 is set to 1; jitter is set to 25% if the aperiodic flag is set; otherwise, jitter is set to 0%.
  • the band-pass voicing strength for the upper four bands is set to 1.0 if the corresponding bit is a 1; otherwise, the voicing strength is set to 0.
  • Gain G 1 is then modified by subtracting a positive correction term, G att, given in dB.
  • All MELP speech synthesis parameters are interpolated pitch synchronously for each synthesized pitch period.
  • the interpolated parameters are the gain in dB, LSFs, pitch, jitter, Fourier magnitudes, pulse and noise coefficients for mixed excitation, and spectral-tilt coefficient for the adaptive spectral-enhancement filter.
  • the other parameters are linearly interpolated between the past and current frame values.
  • the interpolation factor, int, for these parameters is based on the starting point of the new pitch period and on G int, the interpolated gain; the interpolation factor is then clamped between 0 and 1.
  • FIG. 9 shows a new mixed-excitation algorithm in the present invention.
  • the existing MELP uses the Fourier magnitudes to generate a pulse train.
  • the pulse train is mixed with random noise in the time domain by band-pass filtering.
  • in the present invention, noise is instead mixed with the pulse train in the frequency domain by adding a random phase to the Fourier magnitudes.
  • Block 64 shows the random phase generator. The random phase is added only to the Fourier magnitudes in unvoiced frequency bands.
  • cc is an interpolation coefficient between 0 and 1. Since the existing MELP coder generates a pulse pitch-synchronously, the band-pass voicing decision needs to be linearly interpolated between 0 (voiced) and 1 (unvoiced).
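  • A sketch of this frequency-domain mixing for one pitch period follows; it uses a binary per-band voicing decision for simplicity (the text interpolates the decision pitch-synchronously), and the harmonic-to-band mapping and output scaling are assumptions:

```python
import numpy as np

BAND_EDGES_HZ = [500, 1000, 2000, 3000]   # upper edges of the first 4 bands

def mixed_excitation(mags, band_voiced, pitch, fs=8000):
    # Assign each harmonic a phase: zero in voiced bands, random in
    # unvoiced bands, replacing the time-domain band-pass filter bank.
    phases = np.zeros(len(mags))
    for k in range(len(mags)):
        f = (k + 1) * fs / pitch                  # harmonic frequency in Hz
        band = int(np.searchsorted(BAND_EDGES_HZ, f))
        if band < len(band_voiced) and not band_voiced[band]:
            phases[k] = np.random.uniform(-np.pi, np.pi)
    # One pitch period synthesized from the harmonic magnitudes and phases.
    n = int(pitch)
    t = np.arange(n)
    e = np.zeros(n)
    for k in range(len(mags)):
        e += mags[k] * np.cos(2 * np.pi * (k + 1) * t / n + phases[k])
    return e * (2.0 / n)                          # scaling is illustrative
```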
  • the adaptive spectral enhancement filter is then applied to the mixed excitation signal.
  • This filter is a 10th order pole/zero filter with additional first-order tilt compensation.
  • the coefficients are generated by bandwidth expansion of the LPC filter transfer function A(z), corresponding to the interpolated LSFs.
  • the tilt coefficient, μ, is first calculated as max(0.5 k1, 0), then interpolated and multiplied by p, the signal probability.
  • the first reflection coefficient, k1, is calculated from the decoded LSFs. By the MELP predictor coefficient sign convention, k1 is usually negative for voiced spectra.
  • Linear prediction synthesis is performed by applying the coefficients corresponding to the interpolated LSFs directly in a direct-form filter.
  • the gain scale factor applied to the synthesized signal is linearly interpolated between the previous and current values for the first ten samples of the pitch period.
  • the pulse dispersion filter is a 65th order FIR filter derived from a spectrally flattened triangular pulse.
  • the coefficients used in the filter are provided in the Specification for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction, which is incorporated herein by reference.
  • a post-processor for the Fourier magnitude model 62 is added to the MELP decoder as shown in FIG. 2A.
  • the preprocessing high-pass filter 11 in FIG. 2B and the adaptive spectral enhancement filter (ASEF) 30 in FIG. 2C suppress the low-frequency harmonic magnitudes; it was found that this effect leads to a high-pass filtered quality for low-pitch male speakers.
  • the present invention adaptively emphasizes the harmonic magnitudes in low frequencies by removing the effect of the two filters.
  • the emphasized harmonic magnitudes are computed as

    $$|\hat{S}(e^{j\omega_i})| = |S(e^{j\omega_i})|\,G_H(e^{j\omega_i}) \qquad \text{Eq. (21)}$$

  • where G_H(e^{jω}) is the emphasis gain derived from H1(e^{jω}) and H2(e^{jω}), the magnitude responses of the ASEF 30 and the preprocessing high-pass filter 11, respectively.
  • the harmonic magnitude emphasis is applied only to the harmonics lying at least 200 Hz below the first formant frequency of the frame.
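  • A sketch of the post-processor, assuming the emphasis gain G_H is taken as the inverse of the product of the two filter magnitude responses (one way to remove the effect of the two filters, as the text describes); h_ase and h_hpf are hypothetical callables:

```python
def emphasize_harmonics(mags, pitch, first_formant_hz, h_ase, h_hpf, fs=8000):
    # Emphasize only harmonics at least 200 Hz below the first formant,
    # dividing out the combined response of the two filters (Eq. (21)).
    out = list(mags)
    for k in range(len(mags)):
        f = (k + 1) * fs / pitch          # harmonic frequency in Hz
        if f < first_formant_hz - 200.0:
            g_h = 1.0 / max(h_ase(f) * h_hpf(f), 1e-3)   # emphasis gain
            out[k] = mags[k] * g_h
    return out
```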
  • FIG. 7 shows the block diagram of the plosive synthesis 66 .
  • all plosive signals are produced by scaling and LPC synthesis filtering 32 of the plosive residual template 71, which is pre-stored in the synthesizer.
  • This plosive residual template 71 was chosen arbitrarily and filtered with the 14th order LPC inverse filter.
  • the LPC coefficients for the frame that contains the plosive 81 are also used for the plosive signal synthesis.
  • the gain of synthesized plosive signal is adjusted by applying plosive gain 76 to the MELP gain 34 .
  • the length of the synthesized plosive signal is half the frame length, and the synthesized plosive is added back to either the first half or the second half of the coded speech frame according to the plosive position, as shown in block 73.
  • the gain of the coded speech is adjusted in gain suppressor 75 such that the gain of the half frame to which the plosive is added back is suppressed. It is realized by simply replacing the gain of the half frame to which the plosive is added back with that of the previous half frame:
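  • A sketch of the add-back with gain suppression, assuming the suppression is realized by rescaling the target half frame to the RMS level of the previous half frame:

```python
import numpy as np

def add_plosive(frame, plosive, position):
    # `position` selects the first (0) or second (1) half of the frame.
    half = len(frame) // 2
    start = position * half
    out = np.array(frame, dtype=float)
    seg = out[start:start + half]
    prev = out[start - half:start] if start >= half else seg
    # Suppress the target half frame to the previous half frame's level.
    scale = np.sqrt((np.mean(prev ** 2) + 1e-9) / (np.mean(seg ** 2) + 1e-9))
    out[start:start + half] = seg * scale + plosive
    return out
```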
  • Another advantage of the present invention is bit-stream compatibility with the existing MELP coder.
  • the present invention consists of four embodiments including a robust pitch detector, a plosive analysis/synthesis system, a post-processor for the Fourier magnitude model and a new mixed-excitation algorithm.
  • As shown in FIG. 14, only the plosive analysis/synthesis system requires additional bits for transmission.
  • the additional bits for the plosive can be packed into the bit-stream of the existing MELP.
  • in the existing MELP coder, the frame is regarded as unvoiced if the index for the pitch lag is less than three; otherwise, the frame is regarded as voiced.
  • a frame that contains a plosive is assumed to be an unvoiced frame.
  • FIG. 10 shows the bit packing flow diagram for the plosive signal. To identify the plosive frame in the decoder of the present invention, the band-pass voicing bits for the first and the fifth bands are set to voiced, and the pitch index is set to three as a dummy.
  • FIG. 11 shows the bit unpacking flow diagram for the plosive signal.
  • the decoder of the existing MELP regards the frame as unvoiced if the pitch index is less than three. If the pitch index is equal to or greater than three, the combination in which only the first and the fifth bands are voiced never occurs in the existing MELP, so the frame is regarded as a plosive frame when this combination occurs.
  • the plosive parameters, such as gain and position, are extracted from the bits for the Fourier magnitude. Since the bit-stream specification is maintained in the present invention, the present system can interchange the encoder/decoder with the existing MELP.
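  • The packing/unpacking logic reduces to a small amount of bookkeeping, sketched below with an assumed bit layout:

```python
def mark_plosive_frame(bits):
    # Encoder side: flag a plosive frame with a band-pass voicing pattern
    # (only bands 1 and 5 voiced) that never occurs in normal MELP frames,
    # plus the dummy pitch index 3. The bit layout is an assumption.
    bits["bpvc"] = 0b10001
    bits["pitch_index"] = 3
    return bits

def classify_frame(bits):
    # Decoder side: unvoiced if the pitch index is below 3; plosive if the
    # impossible voicing pattern is seen; otherwise an ordinary voiced frame.
    if bits["pitch_index"] < 3:
        return "unvoiced"
    if bits["bpvc"] == 0b10001:
        return "plosive"   # gain/position ride in the Fourier magnitude bits
    return "voiced"
```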

Abstract

A system and method for enhancing the speech quality of the mixed-excitation linear predictive (MELP) coder and other low bit-rate speech coders are disclosed. The system includes a robust pitch-detection algorithm, which adjusts or slides a pitch-analysis window to provide the speech coder with more reliable pitch information. In addition, the system is shown to be compatible with the existing MELP coder in terms of the bit stream.

Description

    CLAIM OF PRIORITY
  • This application is a divisional application of a co-pending U.S. Utility Application, entitled, “Apparatus and Quality Enhancement Algorithm for Mixed Excitation Linear Predictive (MELP) and Other Speech Coders,” to Unno et al., filed Sep. 29, 1999, granted Ser. No. 09/408,195, which is incorporated herein by reference in its entirety.[0001]
  • FIELD OF THE INVENTION
  • The present invention relates to speech signal coding using a parametric coder to model a speech waveform. The speech signal parameters are communicated via a communications channel and used to synthesize the speech waveform at the receiver. More specifically, the present invention enhances the speech quality and reduces the computations of the mixed excitation linear predictive (MELP) speech coder. [0002]
  • BACKGROUND OF THE INVENTION
  • Low bit-rate speech coding technology is widely used for digital voice communication in narrow-bandwidth channels. The common objective of this technology is to transfer the digital speech signal information at a low bit rate (typically 2,400 bits/sec) while providing good quality speech synthesis at the destination. This technology also strives to provide low computational complexity, low memory requirements, and a small algorithmic delay particularly for real-time low-cost voice communications. FIG. 1A illustrates the general environment surrounding speech encoders and decoders as used in a one-way communications system. Full duplex communications are easily enabled by integrating both an encoder and decoder at both ends of the communications system. [0003]
  • The first widely-used low bit-rate speech coder was the Federal Standard linear predictive coding (LPC) vocoder (FS1015) in which either a periodic pulse train or white noise excites an all-pole filter in order to synthesize speech. While the 2.4 kbps bit rate was attractive, the LPC vocoder was not acceptable for many speech applications as users characterized the synthesized speech as synthetic and buzzy. [0004]
  • The LPC vocoder analyzes the speech waveform and extracts parameters such as filter coefficients, pitch period, voicing decision, and gain, which are updated every 20-30 ms and transmitted over the communications channel. The artifacts residing in the traditional LPC vocoder include buzzes, clicks, and tonal noise. In addition, the speech quality is very poor in the presence of background noise. These unintended additions to the synthesized speech are the result of the simple excitation model and binary voicing decision errors. [0005]
  • Over the years, several low bit-rate speech coding algorithms have been developed, and some state-of-the-art coders now provide a good natural quality. The mixed excitation linear predictive (MELP) coder is one of these speech coders. The MELP coder is a linear-prediction-based speech coder that includes five features not found in the LPC vocoder: mixed excitation, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. These features improve the synthesized speech quality by removing distortions resident in the LPC vocoder. FIG. 1B and FIG. 1C illustrate block diagrams of the MELP encoder and decoder, respectively. [0006]
  • However, the MELP still has some perceivable distortions, particularly around the non-stationary speech segments and for some low-pitch male speakers. These distortions can also be observed with other low bit-rate speech coders. The distortion around the non-stationary speech segments results from the update of speech parameters at a low frame rate (typically 30-50 frames/sec). It is known that increasing the frame rate helps to solve this problem. Unfortunately, this solution requires a much higher bit rate. Another possible solution is a variable frame-rate system that updates the speech parameters in the less stationary segments at a higher frame rate while maintaining a low frame rate in the stationary segments. Such an approach is provided by the delayed decision approach based on dynamic programming, which uses the future frame information to control the frame rate. This system can produce high-quality speech while maintaining a relatively low bit rate by reducing the average frame rate. However, this method requires a considerably longer algorithmic delay (around 150 ms), which is unacceptable in many applications (such as two-way voice communications). [0007]
  • The distortion for low-pitch male speakers in the MELP is characterized by a high-pass filtered quality of the coded speech. In other words, the synthesized speech lacks “sound pressure” in the low frequencies. This distortion is caused by a post filter and a preprocessing high-pass filter, which are used in the modern low bit-rate speech coders to remove 60 Hz noise and to enhance the coded speech quality. These filters suppress the harmonic magnitudes in the low frequencies, particularly for low-pitch male speakers whose fundamental frequencies are less than 100 Hz. The suppression of these low frequency harmonics results in a high-pass filtered speech that is perceived as too synthetic. [0008]
  • The most significant speech distortion present in the prior art is the lack of a suitable model or method to accurately synthesize a plosive sound. Plosive sounds are characterized by the sudden opening or closing of the vocal tract. Plosive phonemes are created when most English-speaking persons produce sounds such as "b," "d," "g," "k," "p," "t," "th," "ch," or "tch." It is important to note that the preceding list of plosive phonemes is not exclusive and that not all speakers will create like sounds. Plosive phonemes may be created both at the start and at the end of syllables (e.g., "pop," "tank," "tot"), at the end of syllables (e.g., "sound," "sat," "shrug"), or at the start of syllables (e.g., "toy," "boy," "boss"). Plosive sounds are easily identified in a speech waveform but difficult to model and synthesize in low bit-rate speech coders. Plosive sounds are characterized by an impulse of energy followed by a brief period where the speech waveform is aperiodic. Prior art speech encoders have been unable to model and synthesize plosive sounds in a manner acceptable to the human ear. [0009]
  • SUMMARY OF THE INVENTION
  • As described briefly, an object of the present invention is to enhance the coded speech quality of the existing low bit-rate speech coders including the MELP vocoder while maintaining its low bit rate, small algorithmic delay, and low computational complexity. [0010]
  • Another object of the present invention is to provide an efficient mixed excitation algorithm to reduce the computational complexity of the existing MELP vocoder. Another object of the present invention is to provide bit-stream compatibility with the existing MELP vocoder in order to permit the introduction of the invention into systems where only the present MELP decoder is available. This would allow for backward compatibility through the introduction of an updated encoder while allowing for full system upgrades where both the encoder and the decoder could be updated. [0011]
  • The present invention provides four embodiments. The first is a robust pitch-detection algorithm. In the encoder, the fixed-length pitch analysis window is manipulated around the original position to seek the position that contains the signal with the highest pitch correlation. Once the window position is determined, pitch is estimated using the signal that is contained in the selected window. Other parameters such as LPC coefficients, gain, and voicing decision are also estimated using the signal corresponding to the selected window. The estimated parameters are used to synthesize the coded speech in the decoder on each sample window in the same manner as earlier fixed-position windows in the prior art. [0012]
  • The second embodiment is a plosive analysis/synthesis method. In the encoder, the system first detects the frame that contains the plosive signal. The plosive detection is performed with sliding-window peakiness analysis. The detected plosive signal is quantized to only a small number of bits and transmitted via the communication channel to the decoder. In the decoder, the plosive signal is synthesized independently and added back to the coded speech. [0013]
  • The third embodiment is a post-processor for the Fourier magnitude model. In the decoder, the harmonic magnitudes of the coded speech in the low frequencies are emphasized to overcome the muffling effect of the high pass filter. In this way, the decoded speech is synthesized without the muffling effect often observed in the high-pass filtered speech of current low bit-rate speech encoders. [0014]
  • The fourth embodiment is a new mixed-excitation algorithm. In the decoder, a pulse train is mixed with random noise in the frequency domain in unvoiced frequency bands to eliminate the band-pass filtering operations, which are required to generate the mixed-excitation signal in the existing MELP coder. The elimination of the filters results in a significant reduction of computational complexity in the MELP decoder. As a result, the present system is shown to be compatible in terms of bit-stream and is interchangeable with the coder/decoder of the existing MELP speech coder.[0015]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention will be more fully understood from the accompanying drawings of the embodiments of the invention, which, however, should not be taken to limit the invention to the specific embodiments enumerated but are for explanation and better understanding only. Like reference numerals in the figures designate corresponding parts throughout the drawings. [0016]
  • FIG. 1A is a block diagram of a communications system having a MELP speech encoder and decoder; [0017]
  • FIG. 1B is a block diagram illustrating the MELP encoder of FIG. 1A; [0018]
  • FIG. 1C is a block diagram illustrating the MELP decoder of FIG. 1A; [0019]
  • FIG. 2A is a block diagram highlighting the new embodiments of the present system; [0020]
  • FIG. 2B is a block diagram illustrating the new encoder of FIG. 2A; [0021]
  • FIG. 2C is a block diagram illustrating the new decoder of FIG. 2A; [0022]
  • FIG. 3A illustrates plosive signal types and locations in a sample sentence and reveals how plosive sounds remain undetected in the prior art; [0023]
  • FIG. 3B illustrates plosive signal synthesis in coded speech; [0024]
  • FIG. 3C illustrates a typical LPC residual waveform for a plosive signal; [0025]
  • FIG. 3D illustrates the Fourier spectrums of an original plosive sound along with the replacement plosive model; [0026]
  • FIG. 3E illustrates the Fourier spectrums of an original plosive sound with a click with the replacement plosive model; [0027]
  • FIG. 4 illustrates the relative time shifting in the robust pitch detector shown in FIG. 2B; [0028]
  • FIG. 5 illustrates a block diagram of the plosive analysis/synthesis system of the present invention as shown in FIG. 2B and FIG. 2C; [0029]
  • FIG. 6 illustrates the plosive detector of the present invention as shown in FIG. 5; [0030]
  • FIG. 7 illustrates a block diagram of the plosive synthesizer of the present invention as shown in FIG. 5; [0031]
  • FIG. 8 illustrates a block diagram of the post-processor for the Fourier magnitude of the present invention as shown in FIG. 2C; [0032]
  • FIG. 9 illustrates a block diagram of the new mixed excitation method of the present invention as shown in FIG. 2C; [0033]
  • FIG. 10 illustrates the flow diagram of bit packing for the plosive signal parameters within voiced and unvoiced frames; [0034]
  • FIG. 11 illustrates the flow diagram of the bit unpacking for the plosive signal parameters for voiced and unvoiced frames. [0035]
  • FIG. 12 illustrates words with plosive sounds; [0036]
  • FIG. 13 illustrates the replacement of different plosive types in the present invention; [0037]
  • FIG. 14 reveals the bit allocation for the plosive signal model; [0038]
  • FIG. 15 reveals the 99-level Pitch and Voicing level quantization in the existing MELP; [0039]
  • FIG. 16A reveals the bit allocation in the existing MELP frame; and [0040]
  • FIG. 16B reveals the bit transmission order in the existing MELP frame. [0041]
  • DESCRIPTION OF THE PREFERRED EMBODIMENT
  • The present invention is embedded in the existing MELP coder as shown in FIG. 2A to enhance coded speech quality. It will be apparent to those skilled in the art that the MELP coder can be replaced with other low bit-rate speech coders that are based on a parametric speech coding algorithm in order to practice the current invention. The present invention consists of four embodiments. The first embodiment, a robust pitch detector, is shown as 52 in FIG. 2A. The robust pitch detector 52 replaces a portion of the refinement of pitch and voicing decision 37 in the MELP coder and does not require additional bits for transmission. [0042]
  • The second embodiment, the plosive analysis/plosive synthesis function, is illustrated in FIG. 2A. Plosive analysis 55 is added to the encoder. Plosive synthesis 59 is added to the decoder and requires two bits for transmission. [0043]
  • The third embodiment, a post-processor for the Fourier magnitude 62, is shown in FIG. 2A. It is added to the decoder and does not require additional bits for transmission. [0044]
  • The fourth embodiment, a new mixed excitation 35, is also shown in FIG. 2A. It replaces the mixed excitation method of the prior art. The new mixed excitation 35 is embedded in the decoder and does not require additional bits for transmission. [0045]
  • MELP Encoder [0046]
  • FIG. 1B illustrates a block diagram of the processing flow within the MELP encoder. A frame of speech data is processed by the MELP coder every 22.5 ms. Each frame contains 180 speech samples at a sampling rate of 8,000 samples per second. The MELP is a parametric speech coder that creates a 54-bit per frame concatenated code that is used by the MELP decoder to synthesize the speech waveform at the receiver. Each frame contains the following parameters: Line Spectral Frequencies (LSFs), Fourier Magnitudes, Gain, Pitch, Band-pass Voicing, Aperiodic Flag, Error Protection (in unvoiced frames only), and a synchronization bit. [0047]
  • Input speech is encoded as follows. First, the input speech signal is processed through high-pass filter 11 with a cut-off frequency of 60 Hz to remove low-frequency noise. A buffer containing the most recent samples of the actual input speech signal is maintained in the encoder. One of the samples is identified as the last sample of the current frame. The buffer contains samples that extend beyond the current frame both in the past and into the future to enable the coding process. This designated last sample of the current frame is the reference point for many of the encoder calculations. [0048]
  • Next, the speech signal is band-pass filtered into 5 frequency bands of 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz for voicing analysis. An initial pitch estimation is made using the 0-500 Hz filter output signal. The measurement is centered on the filter output produced when its input is the last sample in the current frame. The initial pitch estimation from the first band-pass filter is used as the initial reference point for robust pitch detector 52 (FIG. 2B). For each of the remaining frequency bands, the band-pass voicing strength is determined using the pitch determined by the robust pitch detector 52 described below. The time envelopes of each of the band-pass filters are calculated by full-wave rectification followed by a smoothing filter. The analysis windows for each of the remaining frequency bands are centered on the last sample in the current frame as in the case of the first band. [0049]
  • Robust Pitch Detection [0050]
  • Most low bit-rate speech coders use the normalized pitch correlation to estimate pitch lag. In the MELP coder, the pitch correlation is also used to make band-pass voicing decisions. The normalized pitch correlation r(T) is computed with the signal in the fixed-position analysis window in the prior art as follows: [0051]

    $$r(T)=\frac{c_T(0,T)}{\sqrt{c_T(0,0)\,c_T(T,T)}},\qquad c_T(m,n)=\sum_{k=-T/2-N/2}^{-T/2+N/2-1}s_{k+m}\,s_{k+n}\qquad\text{Eq. (1)}$$
  • where s_k is the kth sample in the fixed-position window, s_0 is the signal at the center of the fixed-position window, T is a pitch lag, and N is the number of samples accumulated for the correlation computation. [0052]
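  • A direct sketch of Eq. (1), with the signal buffer and window-center indexing as assumptions:

```python
import numpy as np

def pitch_correlation(s, center, t, n):
    # c_T(m1, m2) accumulated over the N samples around the window center.
    lo = -(t // 2) - (n // 2)
    k = np.arange(lo, lo + n) + center
    c = lambda m1, m2: float(np.dot(s[k + m1], s[k + m2]))
    return c(0, t) / np.sqrt(c(0, 0) * c(t, t))
```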
  • The binary voicing decision forces the MELP to use either periodic pulse or noise excitation for each frequency band even in frames containing an irregular or ill-defined pitch. As a result, noise excitation for bands inappropriately designated as noise, or pitch excitation inappropriately matched with an inaccurate pitch lag, leads to distortion in transitions. To solve this problem, a sliding-sample window is used in the present invention. This method seeks the pitch analysis window position that provides the highest pitch correlation by sliding the window around the original position. This is equivalent to using a more periodically stable signal rather than using a portion of the signal with an irregular pitch for pitch analysis. By using a periodically stable portion of the signal for pitch analysis, the present invention avoids inappropriate voicing decisions and pitch estimates, thus reducing the artifactual noise in the non-periodically stable signal segments. [0053]
  • FIG. 4 shows a robust pitch detector used in the present invention. In FIG. 4, the normalized pitch correlation in the window 43 is first computed in the same manner as the fixed-window pitch detection shown in Equation (1), where s_k is the kth signal and s_0 is the signal at the center of the original fixed-position window. The normalized pitch correlation in each shifted window is then computed recursively as follows: [0054]

    $$r_i(T)=\frac{c_T(i,T+i)}{\sqrt{c_T(i,i)\,c_T(T+i,T+i)}},\quad\text{where}\quad c_T(i,j)=c_T(i-1,j-1)+s_{i-T/2+N/2-1}\,s_{j-T/2+N/2-1}-s_{i-1-T/2-N/2}\,s_{j-1-T/2-N/2}\qquad\text{Eq. (2)}$$
  • In each window, the maximum normalized pitch correlation r_i(T_i) and the associated pitch lag T_i are determined, and the final pitch lag is selected as the one associated with the maximum normalized pitch correlation r(T) over all windows: [0055]

    $$r(T)=\max_{i=-N_s}^{N_s-1}\left[\max_T\{r_i(T)\}\right]\qquad\text{Eq. (3)}$$
  • where Ns is the maximum window-sliding range from the original fixed-position window. In the present invention, an LPC parameter, a gain, band-pass voicing decision, and fractional pitch are computed using the signal in the window that maximizes the normalized pitch correlation. A direct implementation of Equation (2) solving for r_i(T) for all values of i would result in a significant increase in the computational complexity. To reduce the additional complexity, the recursion of Equation (2) for c_T(i, j) is used to compute the autocorrelation. [0056]
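  • The sliding-window search of Eqs. (2) and (3) can be sketched as follows; for clarity this version recomputes each correlation directly rather than using the recursion of Eq. (2), which yields the same values at lower cost:

```python
import numpy as np

def sliding_pitch_search(s, center, lags, n, ns):
    # Try every window shift i in -Ns .. Ns-1 and every candidate lag T,
    # keeping the (correlation, lag, shift) triple with the highest r_i(T).
    best = (-1.0, None, None)
    for i in range(-ns, ns):
        for t in lags:
            lo = -(t // 2) - (n // 2)
            k = np.arange(lo, lo + n) + center
            c = lambda m1, m2: float(np.dot(s[k + m1], s[k + m2]))
            denom = np.sqrt(c(i, i) * c(t + i, t + i))
            r = c(i, t + i) / denom if denom > 0.0 else 0.0
            if r > best[0]:
                best = (r, t, i)
    return best
```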
  • The aperiodic flag is set if Vbp1, determined in the voicing analysis for the 0 to 500 Hz band-pass, is less than 0.5, and set to 0 otherwise. When set, the flag informs the decoder that the voiced component of the excitation should be aperiodic. [0057]
  • A 10th order linear prediction analysis is performed on the input speech signal using a 200 sample (25 ms) Hamming window centered on the last sample in the current frame. A traditional autocorrelation analysis procedure is implemented using Levinson-Durbin recursion. In addition, a bandwidth expansion constant of 0.994 (15 Hz) is applied by multiplying the ith prediction coefficient by the ith power of the bandwidth expansion constant. [0058]
  • Next, a linear prediction residual signal is calculated by filtering the input speech signal with the prediction filter using the coefficients determined above and an inverse of the prediction filter using those same coefficients. The two resulting signals are summed to create the linear prediction residual signal. [0059]
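  • A sketch of the LPC analysis steps above; here the residual is obtained by inverse filtering with A(z), a common equivalent of the two-filter description given in the text:

```python
import numpy as np

def lpc_analysis(s, center, order=10, win_len=200, gamma=0.994):
    # Autocorrelation method with Levinson-Durbin recursion and 15 Hz
    # (0.994) bandwidth expansion, per the steps described above.
    w = np.asarray(s[center - win_len // 2:center + win_len // 2], float)
    w = w * np.hamming(win_len)
    r = np.array([np.dot(w[:win_len - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:m] = a_prev[1:m] + k * a_prev[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    a *= gamma ** np.arange(order + 1)     # bandwidth expansion: a_i *= 0.994^i
    # Residual by inverse filtering with A(z).
    residual = np.convolve(np.asarray(s, float), a)[:len(s)]
    return a, residual
```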
  • Plosive Analysis [0060]
  • The plosive analysis/synthesis system of the current invention consists of three parts: plosive detection, plosive modeling, and plosive synthesis. FIG. 5 shows the plosive analysis/synthesis system. [0061]
  • Plosive Detection [0062]
  • With reference to FIG. 5, the plosive detector 56 uses a sliding window for "peakiness" computation to detect the frame that contains a plosive signal. The peakiness value is sensitive to the phase of the plosive signal. By using a sliding window to detect a window position that maximizes the peakiness value, the phase sensitivity of the plosive is reduced. The peakiness, P, is defined as the ratio of the L2 norm to the L1 norm of the signal: [0063]

    $$P=\frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}r_n^2}}{\frac{1}{N}\sum_{n=0}^{N-1}|r_n|}\qquad\text{Eq. (4)}$$
  • where rn is the LPC residual signal and N is the frame size. As shown in FIG. 6, the plosive detector slides the peakiness analysis window 63 to find the maximum peakiness value over all windows. The peakiness of each window is given by: [0064]

$$P_i = \frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1} r_{n+i}^2}}{\frac{1}{N}\sum_{n=0}^{N-1} |r_{n+i}|} = \frac{\sqrt{\frac{1}{N} B_i}}{\frac{1}{N} A_i}, \qquad \text{Eq. (5)}$$
  • where Pi is the peakiness of the ith window from the past, and r0 is the first LPC residual sample in the original fixed-position window. In FIG. 6, the peakiness in the window 63 (P−Ns) is first computed; the peakiness in each subsequent window is then computed recursively as follows: [0065]

$$A_i = A_{i-1} + |r_{N-1+i}| - |r_{i-1}|,$$
$$B_i = B_{i-1} + r_{N-1+i}^2 - r_{i-1}^2. \qquad \text{Eq. (6)}$$
  • Then, the maximum peakiness value over all windows is used as the peakiness value P of the frame: [0066]

$$P = \max_{i=-N_s}^{N_s-1} \left[ P_i \right], \qquad \text{Eq. (7)}$$
  • where Ns is the maximum window-sliding range, the same range used by the pitch detector of the present invention. The peakiness value obtained with the sliding window is illustrated in FIG. 3A, along with that of the fixed-position window and the corresponding speech input waveform. In addition to the peakiness value, the low-pass energy is computed and used to distinguish the rapid onset of a vowel from a plosive signal. [0067]
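The recursive update of Equation (6) makes each one-sample slide an O(1) operation. A sketch under stated assumptions (NumPy, a nonzero residual, and in-range indices; `start` indexes r0):

```python
import numpy as np

def max_peakiness(res, start, N, Ns):
    """Sliding-window peakiness (Eqs. (5)-(7)): the L1 (A) and L2 (B)
    accumulators are updated per Eq. (6) as the window slides."""
    r = np.asarray(res, float)
    w = r[start - Ns : start - Ns + N]     # leftmost window (i = -Ns)
    A, B = np.sum(np.abs(w)), np.sum(w * w)
    best = np.sqrt(B / N) / (A / N)
    for i in range(-Ns + 1, Ns):           # slide right one sample at a time
        new, old = r[start + i + N - 1], r[start + i - 1]
        A += abs(new) - abs(old)           # Eq. (6), L1 update
        B += new * new - old * old         # Eq. (6), L2 update
        best = max(best, np.sqrt(B / N) / (A / N))
    return best                            # Eq. (7)
```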
  • Plosive Modeling [0068]
  • In the present invention, a simple model is applied to the plosive signal representation in plosive modeling 57 of FIG. 5 so as to minimize the additional transmission bits. FIG. 12 shows the plosive signals detectable in the English language. Analysis of the frequency spectra associated with the identified plosive sounds in FIG. 12 reveals that the 28 separate plosive sounds can be closely represented by the frequency spectra of 18 replacement plosive sounds by aligning the maximum-amplitude positions of each plosive signal. Near-transparent replacement requires at least a rough spectral fit at each frequency. FIG. 13 illustrates the replacement matrix for the plosive sounds in the current invention. [0069]
  • In this model, all plosive signals p(n) are produced by scaling and LPC synthesis filtering of the single pre-stored template LPC residual signal v(n) as follows: [0070]

$$p(n) = g_p\, v(n) + \sum_{i=1}^{P} a_i\, p(n-i), \qquad \text{Eq. (8)}$$
  • where gp is the scaling factor based on the energy of the input plosive signal, and ai are the LPC coefficients computed from the input plosive signal. The template plosive signal v(n) was chosen arbitrarily and filtered with a 14th order inverse linear prediction filter. Since only a rough spectral fit between the input and synthesized plosive signals is needed for a near-transparent sound, an accurate LPC analysis of the input plosive signal is not required. To minimize the additional bits required for the plosive model, the same 10th order LPC model used for voiced pitch modeling is used for the production of the plosive signal. [0071]
  • The parameters for transmission are a plosive flag, a plosive position, and a plosive gain. The gain is computed by comparing the energy of the LPC residual of the plosive signal with that of the template signal. For the specific embodiment of the present invention, the gain is quantized with two bits. The position of the plosive signal is identified by seeking the maximum-amplitude position in the frame, and one bit indicates whether the plosive lies in the first half or the second half of the current frame. Thus, for the specific embodiment of the present invention, the plosive signal is quantized with only four bits: one bit for the plosive flag, two bits for the plosive gain, and one bit for the plosive position, as shown in FIG. 14. In the present invention, plosive synthesis is performed in the MELP decoder and is disclosed in the description of the decoder. [0072]
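As a small illustration of the 4-bit budget, the sketch below packs the three plosive parameters into one nibble. The bit ordering used here is an assumption for illustration only; FIG. 14 defines the actual layout.

```python
def pack_plosive_field(flag, gain_index, in_first_half):
    """Pack 1 flag bit, 2 gain bits, and 1 position bit (cf. FIG. 14).
    The ordering flag|gain|position used here is illustrative only."""
    assert flag in (0, 1) and 0 <= gain_index < 4
    return (flag << 3) | (gain_index << 1) | (0 if in_first_half else 1)
```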
  • Next, the input speech signal gain is measured twice per frame using a pitch-adaptive window length. This length is identical for both gain measurements and is determined as follows. When Vbp1 > 0.6, the length is the shortest multiple of P2 that is longer than 120 samples; if this length exceeds 320 samples, it is divided by 2. When Vbp1 is less than or equal to 0.6, the window length is 120 samples. The first gain measurement, G1, uses a window centered 90 samples before the last sample of the current frame; the second, G2, uses a window centered on the last sample of the current frame. The gain is the RMS value, measured in dB, of the signal sn in the window: [0073]

$$G_i = 10 \log_{10}\!\left(0.01 + \frac{1}{L}\sum_{n=1}^{L} s_n^2\right), \qquad \text{Eq. (9)}$$
  • where L is the window length. The 0.01 offset prevents the log argument from approaching zero. If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes an input signal range of −32768 to 32767. [0074]
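A one-function sketch of Equation (9), assuming NumPy:

```python
import numpy as np

def frame_gain_db(s):
    """RMS gain in dB per Eq. (9); the 0.01 offset keeps the log
    argument away from zero, and negative results are clamped to 0.0."""
    g = 10.0 * np.log10(0.01 + np.mean(np.asarray(s, float) ** 2))
    return max(g, 0.0)
```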
  • Next, the encoder quantizes the LPC coefficients. First, the LPC coefficients are converted into line spectrum frequencies (LSFs). The LSF components are kept in ascending frequency order with a minimum separation of 50 Hz between adjacent pairs. The resulting LSF vector f is quantized using a multi-stage vector quantizer, and the resulting vector is used in the Fourier magnitude calculation in the decoder. [0075]
  • The final pitch value, P3, is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a lookup table. The all-zero codeword represents the unvoiced state and is sent if Vbp1 is less than or equal to 0.6. All 28 codewords with a Hamming weight of 1 or 2 are reserved for error protection. [0076]
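A sketch of the logarithmic pitch quantizer described above; the mapping of the resulting index to one of the 7-bit codewords is a lookup table not reproduced here.

```python
import numpy as np

def quantize_pitch_log(P3, levels=99, lo=20.0, hi=160.0):
    """Uniform quantization of log10(P3) over the 20..160 sample range."""
    t = (np.log10(P3) - np.log10(lo)) / (np.log10(hi) - np.log10(lo))
    return int(np.clip(round(t * (levels - 1)), 0, levels - 1))
```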
  • The two gain values are quantized as follows. G2 is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB. G1 is quantized to 3 bits using the following adaptive algorithm. If G2 for the current frame is within 5 dB of G2 for the previous frame, and G1 is within 3 dB of the average of the G2 values for the current and previous frames, then the frame is steady-state, and a code of all zeros is sent to indicate that the decoder should set G1 to the mean of the G2 values for the current and previous frames. Otherwise, the frame represents a transition, and G1 is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G2 values for the current and previous frames to 6 dB above the maximum of those values. [0077]
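The adaptive G1 rule above can be sketched as follows; index 0 is the steady-state all-zero code, and indices 1 through 7 span the 7-level uniform quantizer.

```python
def quantize_g1(G1, G2, G2_prev):
    """3-bit adaptive G1 quantizer per the description above."""
    avg = 0.5 * (G2 + G2_prev)
    if abs(G2 - G2_prev) < 5.0 and abs(G1 - avg) < 3.0:
        return 0                        # steady state: decoder uses the mean
    lo = min(G2, G2_prev) - 6.0         # transition: 7-level uniform range
    hi = max(G2, G2_prev) + 6.0
    step = (hi - lo) / 6.0              # 7 levels -> 6 intervals
    idx = int(round((min(max(G1, lo), hi) - lo) / step))
    return idx + 1                      # codes 1..7
```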
  • Band-pass voicing quantization occurs as follows. When Vbp1 is less than or equal to 0.6 (the unvoiced state), the remaining voicing strengths Vbpi, i = 2, 3, 4, 5, are set to 0. When Vbp1 > 0.6, each of the remaining voicing strengths is quantized to 1 bit. [0078]
  • Fourier magnitude calculation and quantization occur as follows. The Fourier magnitudes of the first 10 pitch harmonics are computed from the prediction residual signal generated with the quantized prediction coefficients. The calculation uses a 512-point fast Fourier transform (FFT) of a 200-sample window centered at the end of the frame. First, a set of quantized predictor coefficients is calculated from the quantized LSF vector. Then, the residual window is generated using the quantized prediction coefficients. Next, a 200-sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed. Finally, the complex FFT output is transformed into magnitudes, and the harmonics are found with a spectral peak-selecting algorithm. [0079]
  • The peak-selecting algorithm finds the maximum within a width of 512/P frequency samples centered on the initial estimate for each pitch harmonic, where P is the quantized pitch; this width is truncated to an integer. The initial estimate for the location of the ith harmonic is 512i/P. The number of harmonic magnitudes searched for is limited to the smaller of 10 or P/4. These magnitudes are then normalized to have an RMS value of 1.0. If fewer than 10 harmonics are found, the remaining magnitudes are set to 1.0. [0080]
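A sketch of the magnitude extraction and peak picking, assuming NumPy; whether normalization happens before or after the 1.0 defaults are filled in is not spelled out above, so this sketch normalizes all ten values together.

```python
import numpy as np

def harmonic_magnitudes(residual, P, n_fft=512, n_harm=10):
    """Hamming window, zero-padded FFT, peak picking of width n_fft/P
    around each initial estimate 512*i/P, then RMS normalization."""
    w = residual[:200] * np.hamming(200)
    mag = np.abs(np.fft.rfft(w, n_fft))
    n_harm = min(n_harm, int(P // 4))
    width = int(n_fft / P)
    mags = np.ones(10)                       # unfound harmonics default to 1.0
    for i in range(1, n_harm + 1):
        c = int(round(n_fft * i / P))        # initial harmonic estimate
        lo, hi = max(c - width // 2, 0), min(c + width // 2 + 1, len(mag))
        mags[i - 1] = mag[lo:hi].max()
    rms = np.sqrt(np.mean(mags ** 2))
    return mags / rms if rms > 0 else mags
```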
  • The 10 magnitudes are quantized with an 8-bit vector quantizer. The codebook is searched using a perceptually weighted Euclidean distance, with fixed weights that emphasize low frequencies over high frequencies. The weights are given by: [0081]

$$w_i = \left[\frac{117}{25 + 75\left(1 + 1.4\left(\frac{f_i}{1000}\right)^2\right)^{0.69}}\right]^2, \quad i = 1, 2, \ldots, 10, \qquad \text{Eq. (10)}$$
  • where fi = 8000i/60 is the frequency in Hz corresponding to the ith harmonic for a default pitch period of 60 samples. The weights are applied to the squared difference between the input Fourier magnitudes and the codebook values. [0082]
  • Lastly, the MELP encoder adds error protection and structures the 54-bit frame as follows. FIG. 12 shows the bit allocation for the MELP coder. To improve performance in the presence of channel errors, the unused coder parameters in the unvoiced mode are replaced with forward error correction. Three Hamming (7,4) codes and one Hamming (8,4) code may be used. The (7,4) code corrects single bit errors, while the (8,4) code corrects single bit errors and detects double bit errors. The (8,4) code is applied to the 4 most significant bits (MSBs) of the first multi-stage vector quantization index, and the 4 parity bits are written over the band-pass voicing bits. The remaining three bits of the first multi-stage vector quantization index, along with the reserved bit, are covered by a (7,4) code whose 3 parity bits are written to the MSBs of the Fourier magnitude vector quantization index. The 4 MSBs of the G2 codeword are protected with 3 parity bits written to the next 3 bits of the Fourier magnitudes. Finally, the least significant bit (LSB) of the second gain index and the 3-bit G1 codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag bit. The parity generator matrix for the Hamming (7,4) code is: [0083]

$$G_{7,4} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}. \qquad \text{Eq. (11)}$$
  • The parity generator matrix for the Hamming (8,4) code is: [0084]

$$G_{8,4} = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}. \qquad \text{Eq. (12)}$$
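Parity computation with these generator matrices reduces to a modulo-2 vector-matrix product. The sketch below uses the matrices as reconstructed in Eqs. (11) and (12) above; the orientation (rows are data bits, columns are parity bits) is an assumption consistent with that rendering.

```python
import numpy as np

# Parity generator matrices of Eqs. (11)-(12), as rendered above.
G74 = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
G84 = np.array([[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 0]])

def parity_bits(data4, G):
    """Parity of a 4-bit message: p = d G (mod 2)."""
    return np.mod(np.dot(data4, G), 2)

print(parity_bits([1, 0, 1, 1], G74))  # three parity bits for a (7,4) block
```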
  • FIG. 16A illustrates the bit allocation across the parameters communicated in the 54 bits of each MELP frame. FIG. 16B shows the transmission order for the 54 bits of each MELP frame for both voiced and unvoiced frame modes. [0085]
  • MELP Decoder [0086]
  • The received bit stream is unpacked from the communications channel 18 and assembled into the parametric codewords. Parameter decoding differs for voiced and unvoiced frames. The pitch is decoded first, as it carries the voiced/unvoiced mode information. If the pitch code is all zeros or has only one bit set, the unvoiced mode is used. If exactly two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used. [0087]
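The mode decision reduces to counting set bits in the pitch codeword, as sketched here:

```python
def decode_pitch_mode(pitch_code):
    """Zero or one bit set -> unvoiced; exactly two -> erasure;
    otherwise the pitch is decoded and voiced mode is used."""
    ones = bin(pitch_code).count("1")
    if ones <= 1:
        return "unvoiced"
    return "erasure" if ones == 2 else "voiced"
```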
  • In the unvoiced mode, the (8,4) Hamming code is decoded to correct single bit errors and to detect double bit errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single bit errors. [0088]
  • If an erasure is indicated in the current frame, whether by the Hamming code, by the pitch code, or directly signaled from the communication channel 18, then a frame repeat mechanism is implemented. All of the parameters for the current frame are replaced with the parameters from the previous frame. In addition, the first gain term is set equal to the second gain term so that no gain transitions occur. [0089]
  • If an erasure is not indicated, the remaining parameters are decoded. The LSFs are checked for ascending order and a minimum separation of 50 Hz. In the unvoiced mode, default parameter values are used for the pitch, jitter, band-pass voicing, and Fourier magnitudes: the pitch value is set to 50 samples, the jitter to 25%, the band-pass voicing strengths to 0, and the Fourier magnitudes to 1.0. In the voiced mode, Vbp1 is set to 1; jitter is set to 25% if the aperiodic flag is set and to 0% otherwise. The band-pass voicing strength for each of the upper four bands is set to 1.0 if the corresponding bit is 1; otherwise, it is set to 0. [0090]
  • When the special all-zero code for the first gain parameter, G1, is received, some errors in the second gain parameter, G2, can be detected and corrected. This correction process improves performance in the presence of channel errors. [0091]
  • For quiet input signals, a small amount of gain attenuation is applied to both gain parameters using a power subtraction rule. This attenuation is a simplified, frequency invariant case of a smooth spectral subtraction noise suppression method. The background noise estimate is also used in the adaptive spectral enhancement calculation. [0092]
  • Gain G1 is then modified by subtracting a positive correction term Gatt, given in dB by: [0093]

$$G_{att} = -10 \log_{10}\!\left(1 - 10^{0.1 (G_n + 3 - G_1)}\right). \qquad \text{Eq. (13)}$$
  • All MELP speech synthesis parameters are interpolated pitch-synchronously for each synthesized pitch period. The interpolated parameters are the gain in dB, the LSFs, the pitch, the jitter, the Fourier magnitudes, the pulse and noise coefficients for mixed excitation, and the spectral-tilt coefficient for the adaptive spectral-enhancement filter. Gain is linearly interpolated between the gain of the prior frame, G2p, and the first gain of the current frame, G1, if the starting point t0 (t0 = 0, 1, . . . , 179) of the new pitch period is less than 90; otherwise, gain is interpolated between G1 and G2. Normally, the other parameters are linearly interpolated between the past and current frame values. The interpolation factor int for these parameters is based on the starting point of the new pitch period: [0094]

$$int = t_0 / 180. \qquad \text{Eq. (14)}$$
  • There are two exceptions to the interpolation procedure. First, if there is an onset with a high pitch frequency, pitch interpolation is disabled and the new pitch is used immediately. This condition is met when G1 is more than 6 dB greater than G2 and the current frame's pitch period is less than half the prior frame's pitch period. The second exception also involves a gain onset: if G2 differs from G2p by more than 6 dB, then the LSFs, spectral tilt, and pitch are interpolated using the interpolated gain trajectory as a basis, since the gain is transmitted twice per frame and has a more accurate interpolation path. In this case, the interpolation factor is given by: [0095]

$$int = \frac{G_{int} - G_{2p}}{G_2 - G_{2p}}, \qquad \text{Eq. (15)}$$
  • where Gint is the interpolated gain. This interpolation factor is then clamped between 0 and 1. [0096]
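Both the normal rule of Equation (14) and the gain-onset exception of Equation (15) are collapsed into one helper in this sketch; in the coder itself the exception applies only to the LSFs, spectral tilt, and pitch.

```python
def interp_factor(t0, G_int, G2, G2_prev):
    """Eq. (14) normally; Eq. (15), clamped to [0, 1], on a gain onset."""
    if abs(G2 - G2_prev) > 6.0:
        f = (G_int - G2_prev) / (G2 - G2_prev)   # gain-trajectory basis
    else:
        f = t0 / 180.0                           # pitch-period start basis
    return min(max(f, 0.0), 1.0)
```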
  • New Mixed Excitation Algorithm [0097]
  • Although the mixed excitation method in the existing MELP coder minimizes the band-pass filtering operations, it still requires two 32nd order FIR filtering operations, one for the pulse train and one for the noise. The present invention removes these filters to reduce the computational complexity of the existing MELP. FIG. 9 shows the new mixed-excitation algorithm of the present invention. The existing MELP uses the Fourier magnitudes to generate a pulse train, which is mixed with random noise in the time domain by band-pass filtering. In the present invention, noise is instead mixed with the pulse train in the frequency domain by adding a random phase to the Fourier magnitudes. Block 64 shows the random phase generator. The random phase is added only to the Fourier magnitudes in unvoiced frequency bands. The mixed excitation signal in the present method is given by: [0098]

$$e_m(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} E_M(e^{j\omega})\, e^{j\omega n}\, d\omega,$$

$$E_M(e^{j\omega}) = E_0(e^{j\omega}) \quad \text{if } \omega = 0,\ \omega = \pi, \text{ or } \omega \text{ is in a voiced band;}$$

$$E_M(e^{j\omega}) = E_0(e^{j\omega})\, e^{j\phi}, \quad \phi = U[-\alpha\pi,\, \alpha\pi] \quad \text{otherwise,} \qquad \text{Eq. (16)}$$
  • where α is an interpolation coefficient between 0 and 1. Since the existing MELP coder generates each pulse pitch-synchronously, the band-pass voicing decision needs to be linearly interpolated between 0 (voiced) and 1 (unvoiced). [0101]
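The frequency-domain mixing of Equation (16) can be sketched with an inverse FFT, assuming a one-sided magnitude spectrum; the voiced-bin mask and the default alpha are illustrative.

```python
import numpy as np

def mixed_excitation(mags, n_fft=512, alpha=0.5, voiced_bins=None):
    """Add a uniform random phase U[-alpha*pi, alpha*pi] in unvoiced
    bins only; DC, Nyquist, and voiced bins keep zero phase (Eq. (16))."""
    phase = np.random.uniform(-alpha * np.pi, alpha * np.pi, len(mags))
    phase[0] = phase[-1] = 0.0                  # omega = 0 and omega = pi
    if voiced_bins is not None:
        phase[voiced_bins] = 0.0                # voiced bands untouched
    return np.fft.irfft(mags * np.exp(1j * phase), n_fft)
```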
  • The adaptive spectral enhancement filter is then applied to the mixed excitation signal. This filter is a 10th order pole/zero filter with additional first-order tilt compensation. The coefficients are generated by bandwidth expansion of the LPC filter transfer function A(z) corresponding to the interpolated LSFs. The transfer function of the enhancement filter, Hase(z), is given by: [0102]

$$H_{ase}(z) = \frac{A(\alpha z^{-1})}{A(\beta z^{-1})} \left(1 + \mu z^{-1}\right), \qquad \text{Eq. (17)}$$
  • where: [0103]

$$\alpha = 0.5p, \qquad \beta = 0.8p, \qquad \text{Eq. (18)}$$
  • and the tilt coefficient μ is first calculated as max(0.5k1, 0), then interpolated and multiplied by p, the signal probability. The first reflection coefficient, k1, is calculated from the decoded LSFs. By the MELP predictor coefficient sign convention, k1 is usually negative for voiced spectra. The signal probability p is estimated by comparing the current interpolated gain, Gint, to the background noise estimate, Gn, using the formula: [0104]

$$p = \frac{G_{int} - G_n - 12}{18}. \qquad \text{Eq. (19)}$$
  • This signal probability is clamped between 0 and 1. [0105]
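A sketch of Equations (17) through (19) as a filtering step, assuming SciPy; bandwidth expansion scales the kth LPC coefficient by alpha^k or beta^k.

```python
import numpy as np
from scipy.signal import lfilter

def spectral_enhance(x, lpc, p, mu):
    """Pole/zero enhancement H(z) = A(alpha z^-1)/A(beta z^-1) (1 + mu z^-1)
    with alpha = 0.5p, beta = 0.8p; lpc holds [1, a1, ..., a10]."""
    a = np.asarray(lpc, float)
    k = np.arange(len(a))
    num = a * (0.5 * p) ** k               # A(alpha z^-1): filter zeros
    den = a * (0.8 * p) ** k               # A(beta z^-1): filter poles
    return lfilter([1.0, mu], [1.0], lfilter(num, den, x))
```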
  • Linear prediction synthesis is performed by applying the coefficients corresponding to the interpolated LSFs directly in a direct-form synthesis filter. [0106]
  • Since the excitation of the synthesized voice signal is generated at an arbitrary level, a speech gain adjustment must be performed on the synthesized speech. The correct scaling factor, Sgain, is computed for each synthesized pitch period of length T by dividing the desired RMS value (Gint converted from dB to linear) by the RMS value of the unscaled synthetic speech signal sn: [0107]

$$S_{gain} = \frac{10^{G_{int}/20}}{\sqrt{\frac{1}{T}\sum_{n=1}^{T} s_n^2}}. \qquad \text{Eq. (20)}$$
  • To prevent discontinuities in the synthesized speech, this scale factor is linearly interpolated between the previous and current values for the first ten samples of the pitch period. [0108]
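A sketch of Equation (20) for one pitch period; the ten-sample interpolation of the scale factor mentioned above is omitted for brevity.

```python
import numpy as np

def scale_pitch_period(s, G_int_db):
    """Scale one synthesized pitch period so its RMS matches the
    interpolated gain converted from dB to linear (Eq. (20))."""
    s = np.asarray(s, float)
    rms = np.sqrt(np.mean(s ** 2))
    return (10 ** (G_int_db / 20.0) / rms) * s if rms > 0 else s
```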
  • The pulse dispersion filter is a 65th order FIR filter derived from a spectrally flattened triangular pulse. The coefficients used in the filter are provided in the Specification for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction, incorporated herein by reference. [0109]
  • Post-Processor for the Fourier Magnitude Model [0110]
  • In the present invention, a post-processor for the Fourier magnitude model 62 is added to the MELP decoder, as shown in FIG. 2A. In the prior art, it was observed that the first few harmonic magnitudes of the coded speech for some low-pitch male speakers were suppressed by the preprocessing high-pass filter 11 in FIG. 2B and the adaptive spectral enhancement filter (ASEF) 30 in FIG. 2C. It was found that this effect led to a high-pass-filtered quality for low-pitch male speakers. To provide more natural speech quality for such speakers, the present invention adaptively emphasizes the harmonic magnitudes at low frequencies by removing the effect of the two filters. The emphasized harmonic magnitude is given by: [0111]

$$\left|\tilde{S}(e^{j\omega_i})\right| = \frac{G\,\left|S(e^{j\omega_i})\right|}{\left|H(e^{j\omega_i})\right|}, \qquad \text{Eq. (21)}$$
  • where ωi is the ith harmonic frequency, G is the average Fourier spectrum energy, and |S(e^{jωi})| is the non-emphasized Fourier magnitude of the ith harmonic. As shown in FIG. 8, the present invention uses the MELP Fourier magnitude parameters, which are the Fourier magnitudes of the LPC residual signal 23, for the harmonic magnitude emphasis, rather than the harmonic magnitudes of the synthesized speech S(e^{jω}). From Parseval's theorem, the average Fourier spectrum magnitude G is given by: [0112]

$$G = \sqrt{\sum_{n=0}^{N-1} h(n)^2}, \qquad \text{Eq. (22)}$$
  • where h(n) is the impulse response of the filter H(e^{jω}) and N is the length of the impulse response. The magnitude response of the filter, |H(e^{jω})|, is given by: [0113]

$$\left|H(e^{j\omega})\right| = \left|H_1(e^{j\omega})\right| \left|H_2(e^{j\omega})\right|, \qquad \text{Eq. (23)}$$
  • where H1(e^{jω}) and H2(e^{jω}) are the magnitude responses of the ASEF 30 and the preprocessing high-pass filter 11, respectively. To avoid losing the advantage of the ASEF 30 of the prior art, the harmonic magnitude emphasis is applied only to the harmonics that are more than 200 Hz below the first formant frequency of the frame. The first formant frequency F1 is roughly estimated from the quantized line spectrum frequencies (LSFs) as follows: [0114]

$$F_1 = \frac{\hat{f}_1 + \hat{f}_2}{2};$$

otherwise, [0115]

$$F_1 = \frac{\hat{f}_2 + \hat{f}_3}{2}, \qquad \text{Eq. (24)}$$
  • where f̂i is the ith quantized LSF. Based on experimental results, the emphasized harmonic magnitude |S̃(e^{jω})| is further boosted by 2 dB in the present invention. [0116]
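The post-processing can be sketched per harmonic as below, assuming the harmonic frequencies in Hz and the combined magnitude response |H| of Eq. (23) are available; the +2 dB term is the experimental boost noted above.

```python
import numpy as np

def emphasize_low_harmonics(S, H, G, w_hz, F1_hz):
    """Eq. (21) applied only to harmonics more than 200 Hz below the
    estimated first formant, followed by the fixed +2 dB boost."""
    S = np.array(S, float)
    sel = np.asarray(w_hz) < (F1_hz - 200.0)     # harmonics to emphasize
    S[sel] = (10 ** (2.0 / 20.0)) * G * S[sel] / np.asarray(H, float)[sel]
    return S
```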
  • Plosive Synthesis [0117]
  • FIG. 7 shows the block diagram of the plosive synthesis 66. As shown in FIG. 7, all plosive signals are produced by scaling and LPC synthesis filtering 32 of the plosive residual template 71, which is pre-stored in the synthesizer. This plosive residual template 71 was chosen arbitrarily and filtered with the 14th order LPC inverse filter. The LPC coefficients for the frame that contains the plosive 81 are also used for the plosive signal synthesis. The gain of the synthesized plosive signal is adjusted by applying the plosive gain 76 to the MELP gain 34. In the present invention, the length of the synthesized plosive signal is half the frame length, and the synthesized plosive is added back to either the first half or the second half of the coded speech frame according to the plosive position, as shown in block 73. Before the plosive is added back to the coded speech, the gain of the half frame to which the plosive is added is suppressed in gain suppressor 75. This is realized by simply replacing the gain of that half frame with the gain of the previous half frame: [0118]
$$g_i(0) = g_{i-1}(1) \quad \text{if the plosive position is the first half of the frame;}$$
$$g_i(1) = g_i(0) \quad \text{if the plosive position is the second half of the frame,}$$

  • where gi(j) is the jth gain (j = 0, 1) in the ith frame. Since plosive detection, modeling, and synthesis are performed independently of the MELP coder, as shown in FIG. 5, this embodiment can be applied to other low bit-rate speech coders. [0119]
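The half-frame gain replacement reads directly as:

```python
def suppress_half_frame_gain(gains, i, first_half):
    """gains[i][j] is the j-th gain (j = 0, 1) of frame i; the half
    frame receiving the plosive inherits the previous half's gain."""
    if first_half:
        gains[i][0] = gains[i - 1][1]    # g_i(0) = g_{i-1}(1)
    else:
        gains[i][1] = gains[i][0]        # g_i(1) = g_i(0)
    return gains
```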
  • Bit Allocation [0120]
  • Another advantage of the present invention is bit-stream compatibility with the existing MELP coder. The present invention comprises four embodiments: a robust pitch detector, a plosive analysis/synthesis system, a post-processor for the Fourier magnitude model, and a new mixed-excitation algorithm. As shown in FIG. 14, only the plosive analysis/synthesis system requires additional bits for transmission. In the present invention, the additional bits for the plosive can be packed into the bit-stream of the existing MELP. The existing MELP has two bit-allocation modes, voiced and unvoiced; the mode is selected as voiced if the first band is voiced and as unvoiced if the first band is unvoiced. To signal the unvoiced mode, the existing MELP coder sets the index for the pitch lag to a value less than three; in the decoder, if the pitch lag index is less than three, the frame is regarded as unvoiced, and otherwise as voiced. In the present invention, a frame that contains a plosive is treated as an unvoiced frame. FIG. 10 shows the bit-packing flow diagram for the plosive signal. To identify a plosive frame in the decoder of the present invention, only the first and the fifth bands are set to voiced and the pitch index is set to three as a dummy value. The plosive gain and position are then packed into the bits normally used for the Fourier magnitudes in a voiced frame of the existing MELP. FIG. 11 shows the bit-unpacking flow diagram for the plosive signal. The decoder of the existing MELP regards the frame as unvoiced if the pitch index is less than three. If the pitch index is equal to or greater than three, the combination in which only the first and the fifth bands are voiced never occurs in the existing MELP; in the decoder of the present invention, the frame is therefore regarded as a plosive frame if this combination occurs. The plosive parameters, gain and position, are then extracted from the Fourier magnitude bits. Since the bit-stream specification is maintained in the present invention, the present system can interchange its encoder/decoder with the existing MELP. [0121]
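A sketch of the decoder-side check of FIG. 11, under the assumption that the five band-pass voicing decisions are available as booleans; the exact bit-level representation is defined by the MELP bit-stream, not by this sketch.

```python
def is_plosive_frame(pitch_index, band_voiced):
    """Pitch index >= 3 plus the otherwise-impossible pattern in which
    only the first and fifth bands are voiced marks a plosive frame."""
    return pitch_index >= 3 and band_voiced == [True, False, False, False, True]
```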
  • While preferred embodiments of the invention have been disclosed in detail in the foregoing description and drawings, it will be understood by those skilled in the art that variations and modifications thereof can be made without departing from the spirit and scope of the invention as set forth in the following claims. [0122]

Claims (18)

Therefore, having thus described the invention, we claim:
1. A method of enhancing the speech quality of a speech coder comprising the steps of:
digitally sampling speech to create a speech waveform over a multiplicity of frames;
using a sliding-sample window to locate a frame position with the highest pitch correlation; and
formulating at least one synthesized voice parameter in response to the speech waveform within the located frame position.
2. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch over multiple frame positions defined by the sliding-sample window.
3. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch for a fixed-length sliding-sample window.
4. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch for up to a predetermined number of frames.
5. The method of claim 1, wherein the step of formulating comprises estimating a frame pitch in response to the signal contained within the located frame position.
6. The method of claim 5, further comprising the step of:
estimating linear predictive coding (LPC) coefficients in response to the signal contained within the located frame position.
7. The method of claim 5, further comprising the step of:
estimating gain in response to the signal contained within the located frame position.
8. The method of claim 5, further comprising the step of:
estimating a voicing decision in response to the signal contained within the located frame position.
9. The method of claim 5, further comprising the step of:
estimating a fractional pitch in response to the signal contained within the located frame position.
10. A speech coder comprising:
means for sampling a speech waveform to generate a discrete representation of the speech waveform over a multiplicity of frames; and
means for locating a pitch-analysis window over that frame position with the highest pitch correlation.
11. The coder of claim 10, wherein the means for locating a frame position with the highest pitch correlation compares pitch analysis results associated with multiple frames.
12. The coder of claim 10, wherein the means for locating a frame position with the highest pitch correlation performs a recurrence calculation on the autocorrelation of the pitch for multiple frame positions defined by the sliding-sample window.
13. The coder of claim 10, wherein the means for locating a pitch-analysis window comprises a fixed-length window.
14. The coder of claim 13, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of pitch results from multiple frame positions defined by the fixed-length window.
15. The coder of claim 10, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch from up to a predetermined number of frames defined by the sliding-sample window.
16. The coder of claim 10, further comprising:
means for estimating a plurality of speech parameters in response to the signal contained within the located frame position.
17. The coder of claim 16, wherein the means for estimating comprises at least one digital signal processor in the mixed-excitation linear predictive (MELP) coder.
18. The coder of claim 16, wherein the means for estimating comprises at least one algorithm stored within the mixed-excitation linear predictive (MELP) coder.
US09/991,387 1999-02-04 2001-11-16 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders Abandoned US20020052734A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/991,387 US20020052734A1 (en) 1999-02-04 2001-11-16 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11864499P 1999-02-04 1999-02-04
US09/408,195 US6453287B1 (en) 1999-02-04 1999-09-29 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders
US09/991,387 US20020052734A1 (en) 1999-02-04 2001-11-16 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/408,195 Division US6453287B1 (en) 1999-02-04 1999-09-29 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders

Publications (1)

Publication Number Publication Date
US20020052734A1 true US20020052734A1 (en) 2002-05-02

Family

ID=26816592

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/408,195 Expired - Fee Related US6453287B1 (en) 1999-02-04 1999-09-29 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders
US09/991,387 Abandoned US20020052734A1 (en) 1999-02-04 2001-11-16 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/408,195 Expired - Fee Related US6453287B1 (en) 1999-02-04 1999-09-29 Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders

Country Status (1)

Country Link
US (2) US6453287B1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPQ366799A0 (en) * 1999-10-26 1999-11-18 University Of Melbourne, The Emphasis of short-duration transient speech features
US6963833B1 (en) * 1999-10-26 2005-11-08 Sasken Communication Technologies Limited Modifications in the multi-band excitation (MBE) model for generating high quality speech at low bit rates
WO2001059766A1 (en) * 2000-02-11 2001-08-16 Comsat Corporation Background noise reduction in sinusoidal based speech coding systems
US6910007B2 (en) * 2000-05-31 2005-06-21 At&T Corp Stochastic modeling of spectral adjustment for high quality pitch modification
EP1203369B1 (en) * 2000-06-20 2005-08-31 Koninklijke Philips Electronics N.V. Sinusoidal coding
US6947888B1 (en) * 2000-10-17 2005-09-20 Qualcomm Incorporated Method and apparatus for high performance low bit-rate coding of unvoiced speech
US20030028386A1 (en) * 2001-04-02 2003-02-06 Zinser Richard L. Compressed domain universal transcoder
US6789058B2 (en) * 2002-10-15 2004-09-07 Mindspeed Technologies, Inc. Complexity resource manager for multi-channel speech processing
US7310597B2 (en) * 2003-01-31 2007-12-18 Harris Corporation System and method for enhancing bit error tolerance over a bandwidth limited channel
US20040172307A1 (en) * 2003-02-06 2004-09-02 Gruber Martin A. Electronic medical record method
WO2004084181A2 (en) * 2003-03-15 2004-09-30 Mindspeed Technologies, Inc. Simple noise suppression model
US7596488B2 (en) * 2003-09-15 2009-09-29 Microsoft Corporation System and method for real-time jitter control and packet-loss concealment in an audio signal
US20050091044A1 (en) * 2003-10-23 2005-04-28 Nokia Corporation Method and system for pitch contour quantization in audio coding
US20050091041A1 (en) * 2003-10-23 2005-04-28 Nokia Corporation Method and system for speech coding
US20070118364A1 (en) * 2005-11-23 2007-05-24 Wise Gerald B System for generating closed captions
US20070118372A1 (en) * 2005-11-23 2007-05-24 General Electric Company System and method for generating closed captions
US20080195381A1 (en) * 2007-02-09 2008-08-14 Microsoft Corporation Line Spectrum pair density modeling for speech applications
US8688441B2 (en) * 2007-11-29 2014-04-01 Motorola Mobility Llc Method and apparatus to facilitate provision and use of an energy value to determine a spectral envelope shape for out-of-signal bandwidth content
US8433582B2 (en) * 2008-02-01 2013-04-30 Motorola Mobility Llc Method and apparatus for estimating high-band energy in a bandwidth extension system
US20090201983A1 (en) * 2008-02-07 2009-08-13 Motorola, Inc. Method and apparatus for estimating high-band energy in a bandwidth extension system
US8463412B2 (en) * 2008-08-21 2013-06-11 Motorola Mobility Llc Method and apparatus to facilitate determining signal bounding frequencies
CN101599272B (en) * 2008-12-30 2011-06-08 华为技术有限公司 Keynote searching method and device thereof
US8463599B2 (en) * 2009-02-04 2013-06-11 Motorola Mobility Llc Bandwidth extension method and apparatus for a modified discrete cosine transform audio coder
US8340965B2 (en) * 2009-09-02 2012-12-25 Microsoft Corporation Rich context modeling for text-to-speech engines
US8594993B2 (en) 2011-04-04 2013-11-26 Microsoft Corporation Frame mapping approach for cross-lingual voice transformation
WO2013019562A2 (en) * 2011-07-29 2013-02-07 Dts Llc. Adaptive voice intelligibility processor

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3836717A (en) * 1971-03-01 1974-09-17 Scitronix Corp Speech synthesizer responsive to a digital command input
US4618985A (en) * 1982-06-24 1986-10-21 Pfeiffer J David Speech synthesizer
US4771465A (en) * 1986-09-11 1988-09-13 American Telephone And Telegraph Company, At&T Bell Laboratories Digital speech sinusoidal vocoder with transmission of only subset of harmonics
US5278943A (en) * 1990-03-23 1994-01-11 Bright Star Technology, Inc. Speech animation and inflection system
US5839102A (en) * 1994-11-30 1998-11-17 Lucent Technologies Inc. Speech coding parameter sequence reconstruction by sequence classification and interpolation
WO1999010719A1 (en) * 1997-08-29 1999-03-04 The Regents Of The University Of California Method and apparatus for hybrid coding of speech at 4kbps
US6304842B1 (en) * 1999-06-30 2001-10-16 Glenayre Electronics, Inc. Location and coding of unvoiced plosives in linear predictive coding of speech

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6968309B1 (en) * 2000-10-31 2005-11-22 Nokia Mobile Phones Ltd. Method and system for speech frame error concealment in speech decoding
US20030220783A1 (en) * 2002-03-12 2003-11-27 Sebastian Streich Efficiency improvements in scalable audio coding
US7277849B2 (en) * 2002-03-12 2007-10-02 Nokia Corporation Efficiency improvements in scalable audio coding
US20050131680A1 (en) * 2002-09-13 2005-06-16 International Business Machines Corporation Speech synthesis using complex spectral modeling
US8280724B2 (en) * 2002-09-13 2012-10-02 Nuance Communications, Inc. Speech synthesis using complex spectral modeling
US20070055502A1 (en) * 2005-02-15 2007-03-08 Bbn Technologies Corp. Speech analyzing system with speech codebook
US8219391B2 (en) * 2005-02-15 2012-07-10 Raytheon Bbn Technologies Corp. Speech analyzing system with speech codebook
US20070094009A1 (en) * 2005-10-26 2007-04-26 Ryu Sang-Uk Encoder-assisted frame loss concealment techniques for audio coding
US8620644B2 (en) * 2005-10-26 2013-12-31 Qualcomm Incorporated Encoder-assisted frame loss concealment techniques for audio coding
US20070299659A1 (en) * 2006-06-21 2007-12-27 Harris Corporation Vocoder and associated method that transcodes between mixed excitation linear prediction (melp) vocoders with different speech frame rates
US8589151B2 (en) * 2006-06-21 2013-11-19 Harris Corporation Vocoder and associated method that transcodes between mixed excitation linear prediction (MELP) vocoders with different speech frame rates
US20090024398A1 (en) * 2006-09-12 2009-01-22 Motorola, Inc. Apparatus and method for low complexity combinatorial coding of signals
US8495115B2 (en) 2006-09-12 2013-07-23 Motorola Mobility Llc Apparatus and method for low complexity combinatorial coding of signals
US9256579B2 (en) 2006-09-12 2016-02-09 Google Technology Holdings LLC Apparatus and method for low complexity combinatorial coding of signals
US20080069364A1 (en) * 2006-09-20 2008-03-20 Fujitsu Limited Sound signal processing method, sound signal processing apparatus and computer program
US20100106509A1 (en) * 2007-06-27 2010-04-29 Osamu Shimada Audio encoding method, audio decoding method, audio encoding device, audio decoding device, program, and audio encoding/decoding system
US8788264B2 (en) * 2007-06-27 2014-07-22 Nec Corporation Audio encoding method, audio decoding method, audio encoding device, audio decoding device, program, and audio encoding/decoding system
US20090100121A1 (en) * 2007-10-11 2009-04-16 Motorola, Inc. Apparatus and method for low complexity combinatorial coding of signals
US8576096B2 (en) 2007-10-11 2013-11-05 Motorola Mobility Llc Apparatus and method for low complexity combinatorial coding of signals
US20090112607A1 (en) * 2007-10-25 2009-04-30 Motorola, Inc. Method and apparatus for generating an enhancement layer within an audio coding system
US8209190B2 (en) 2007-10-25 2012-06-26 Motorola Mobility, Inc. Method and apparatus for generating an enhancement layer within an audio coding system
US20090234642A1 (en) * 2008-03-13 2009-09-17 Motorola, Inc. Method and Apparatus for Low Complexity Combinatorial Coding of Signals
US8639519B2 (en) 2008-04-09 2014-01-28 Motorola Mobility Llc Method and apparatus for selective signal coding based on core encoder performance
US20090259477A1 (en) * 2008-04-09 2009-10-15 Motorola, Inc. Method and Apparatus for Selective Signal Coding Based on Core Encoder Performance
US20100169087A1 (en) * 2008-12-29 2010-07-01 Motorola, Inc. Selective scaling mask computation based on peak detection
US8219408B2 (en) 2008-12-29 2012-07-10 Motorola Mobility, Inc. Audio signal decoder and method for producing a scaled reconstructed audio signal
US8175888B2 (en) 2008-12-29 2012-05-08 Motorola Mobility, Inc. Enhanced layered gain factor balancing within a multiple-channel audio coding system
US8340976B2 (en) 2008-12-29 2012-12-25 Motorola Mobility Llc Method and apparatus for generating an enhancement layer within a multiple-channel audio coding system
US8200496B2 (en) 2008-12-29 2012-06-12 Motorola Mobility, Inc. Audio signal decoder and method for producing a scaled reconstructed audio signal
US20100169100A1 (en) * 2008-12-29 2010-07-01 Motorola, Inc. Selective scaling mask computation based on peak detection
US8140342B2 (en) 2008-12-29 2012-03-20 Motorola Mobility, Inc. Selective scaling mask computation based on peak detection
US20100169099A1 (en) * 2008-12-29 2010-07-01 Motorola, Inc. Method and apparatus for generating an enhancement layer within a multiple-channel audio coding system
US20100169101A1 (en) * 2008-12-29 2010-07-01 Motorola, Inc. Method and apparatus for generating an enhancement layer within a multiple-channel audio coding system
US20110156932A1 (en) * 2009-12-31 2011-06-30 Motorola Hybrid arithmetic-combinatorial encoder
US8149144B2 (en) 2009-12-31 2012-04-03 Motorola Mobility, Inc. Hybrid arithmetic-combinatorial encoder
US8423355B2 (en) 2010-03-05 2013-04-16 Motorola Mobility Llc Encoder for audio signal including generic audio and speech frames
US20110218799A1 (en) * 2010-03-05 2011-09-08 Motorola, Inc. Decoder for audio signal including generic audio and speech frames
US20110218797A1 (en) * 2010-03-05 2011-09-08 Motorola, Inc. Encoder for audio signal including generic audio and speech frames
US8428936B2 (en) 2010-03-05 2013-04-23 Motorola Mobility Llc Decoder for audio signal including generic audio and speech frames
US9502048B2 (en) 2010-04-19 2016-11-22 Knowles Electronics, Llc Adaptively reducing noise to limit speech distortion
US9699554B1 (en) 2010-04-21 2017-07-04 Knowles Electronics, Llc Adaptive signal equalization
US9343056B1 (en) 2010-04-27 2016-05-17 Knowles Electronics, Llc Wind noise detection and suppression
US9438992B2 (en) 2010-04-29 2016-09-06 Knowles Electronics, Llc Multi-microphone robust noise suppression
US9245538B1 (en) * 2010-05-20 2016-01-26 Audience, Inc. Bandwidth enhancement of speech signals assisted by noise reduction
US9431023B2 (en) 2010-07-12 2016-08-30 Knowles Electronics, Llc Monaural noise suppression based on computational auditory scene analysis
US9129600B2 (en) 2012-09-26 2015-09-08 Google Technology Holdings LLC Method and apparatus for encoding an audio signal
US20160203826A1 (en) * 2013-07-12 2016-07-14 Orange Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10672412B2 (en) 2013-07-12 2020-06-02 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US20180018982A1 (en) * 2013-07-12 2018-01-18 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US20180082699A1 (en) * 2013-07-12 2018-03-22 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10943594B2 (en) 2013-07-12 2021-03-09 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10438599B2 (en) * 2013-07-12 2019-10-08 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10438600B2 (en) * 2013-07-12 2019-10-08 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10446163B2 (en) * 2013-07-12 2019-10-15 Koniniklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10943593B2 (en) 2013-07-12 2021-03-09 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
US10783895B2 (en) 2013-07-12 2020-09-22 Koninklijke Philips N.V. Optimized scale factor for frequency band extension in an audio frequency signal decoder
CN105118513B (en) * 2015-07-22 2018-12-28 重庆邮电大学 A kind of 1.2kb/s low bit rate speech coding method based on mixed excitation linear prediction MELP
CN105118513A (en) * 2015-07-22 2015-12-02 重庆邮电大学 1.2kb/s low-rate speech encoding and decoding method based on mixed excitation linear prediction MELP
WO2019142513A1 (en) * 2018-01-17 2019-07-25 日本電信電話株式会社 Encoding device, decoding device, fricative determination device, and method and program thereof
CN111602197A (en) * 2018-01-17 2020-08-28 日本电信电话株式会社 Decoding device, encoding device, methods thereof, and program
WO2019142514A1 (en) * 2018-01-17 2019-07-25 日本電信電話株式会社 Decoding device, encoding device, method and program thereof
CN110610713A (en) * 2019-08-28 2019-12-24 南京梧桐微电子科技有限公司 Vocoder residue spectrum amplitude parameter reconstruction method and system
CN110503966A (en) * 2019-09-06 2019-11-26 成都理工大学 MELP/CELP mixing voice navamander and coding method based on rail
US11587573B2 (en) 2019-09-17 2023-02-21 Acer Incorporated Speech processing method and device thereof

Also Published As

Publication number Publication date
US6453287B1 (en) 2002-09-17

Similar Documents

Publication Publication Date Title
US6453287B1 (en) Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders
Supplee et al. MELP: the new federal standard at 2400 bps
KR100388388B1 (en) Method and apparatus for synthesizing speech using regerated phase information
KR100389178B1 (en) Voice/unvoiced classification of speech for use in speech decoding during frame erasures
KR101032119B1 (en) Method and device for efficient frame erasure concealment in linear predictive based speech codecs
KR100389179B1 (en) Pitch delay modification during frame erasures
US5754974A (en) Spectral magnitude representation for multi-band excitation speech coders
JP5149198B2 (en) Method and device for efficient frame erasure concealment within a speech codec
US8315860B2 (en) Interoperable vocoder
US6931373B1 (en) Prototype waveform phase modeling for a frequency domain interpolative speech codec system
KR100433608B1 (en) Improved adaptive codebook-based speech compression system
US7286982B2 (en) LPC-harmonic vocoder with superframe structure
US7013269B1 (en) Voicing measure for a speech CODEC system
US6996523B1 (en) Prototype waveform magnitude quantization for a frequency domain interpolative speech codec system
EP0673013B1 (en) Signal encoding and decoding system
EP1598811B1 (en) Decoding apparatus and method
KR20020052191A (en) Variable bit-rate celp coding of speech with phonetic classification
JPH08328591A (en) Method for adaptation of noise masking level to synthetic analytical voice coder using short-term perception weightingfilter
JP4040126B2 (en) Speech decoding method and apparatus
JP2004310088A (en) Half-rate vocoder
US20130246055A1 (en) System and Method for Post Excitation Enhancement for Low Bit Rate Speech Coding
EP0747884B1 (en) Codebook gain attenuation during frame erasures
Wang et al. Phonetic segmentation for low rate speech coding
KR100220783B1 (en) Speech quantization and error correction method
Stegmann et al. CELP coding based on signal classification using the dyadic wavelet transform

Legal Events

Date Code Title Description
AS Assignment

Owner name: GEORGIA TECH RESEARCH CORPORATION, GEORGIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:UNNO, TAKAHIRO;BARNWELL III, THOMAS P.;TRUONG, KWAN K.;REEL/FRAME:012323/0384;SIGNING DATES FROM 19990923 TO 19990926

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION