EP0059880A2

EP0059880A2 - Text-to-speech synthesis system

Info

Publication number: EP0059880A2
Application number: EP82101379A
Authority: EP
Inventors: Kun-Shan Lin; Kathleen M. Goudie; Gene A. Frantz
Original assignee: Texas Instruments Inc
Current assignee: Texas Instruments Inc
Priority date: 1981-03-05
Filing date: 1982-02-24
Publication date: 1982-09-15
Also published as: EP0059880A3

Abstract

A text-to-speech synthesis system receives digital code representative of characters from a local or remote source, and converts those character codes into speech. A set of allophone rules (16) is contained in a memory and each incoming character set is matched with the proper character set to describe the sound of that particular character set. A microcontroller (17) is dedicated to the comparison procedure which provides allophonic code when a match is made. The allophonic code is provided to a speech producing system which has a system microcontroller (11) for controlling the retrieval, from a read-only memory (12), of digital signals representative of the individual allophone parameters. The addresses at which such allophone parameters are located are directly related to the allophonic code. A dedicated microcontroller (13) concatenates the digital signals representative of the allophone parameters, including code indicating stress and intonation patterns for the allophones. The allophones are divided into a plurality of frames with one digital position indicating whether the frame is the last frame in the allophone, in which event an extra frame is introduced to provide smoothing between allophones when no stop is present and when the present allophone is voiced and the subsequent allophone is voiced, or when the present allophone is unvoiced and the subsequent allophone is unvoiced. A linear predictive coding speech synthesizer (14) receives the digital signals and provides analog signals corresponding thereto to a loudspeaker (15) to produce speechlike sounds with stress and intonation.

Description

Background of the Invention

This invention pertains to an electronic text-to-speech synthesizing system and to an electronic speech producing system which may be included as a component thereof. More particularly, this invention concerns a text-to-speech synthesizing system which receives digital code such as ASCII representative of characters, determines an allophonic code for each incoming character set and sends such allophonic code to the speech producing system which decodes the allophonic code and assigns pitch for synthesizing, in a linear predictive coding speech synthesizer, speechlike sound, having unlimited vocabulary.

Description of the Prior Art

Waveform encoding and parameter encoding generally categorize the prior art techniques. Waveform encoding includes uncompressed digital data (pulse code modulation, PCM), delta modulation (DM), continuous variable slope delta modulation (CVSD) and a technique developed by Mozer (see U.S. Patent No. 4,214,125). Parameter encoding includes channel vocoder, formant synthesis, and linear predictive coding (LPC).
PCM involves converting a speech signal into digital information using an A/D converter. Digital information is stored in memory and played back through a D/A converter followed by a lowpass filter, amplifier and speaker. The advantage of this approach is its simplicity. Both A/D converters and D/A converters are available and relatively inexpensive. The problem involved is the amount of data storage required. Assuming a maximum frequency of 4 kHz, and therefore a sampling rate of 8,000 samples per second, and further assuming that each speech sample is represented by 8 to 12 bits, one second of speech requires 64K to 96K bits of memory.
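The storage figure quoted for PCM can be checked with a short calculation. The sketch below assumes sampling at exactly twice the maximum frequency (the Nyquist rate); the function name is illustrative only.

```python
def pcm_bits_per_second(max_freq_hz: int, bits_per_sample: int) -> int:
    """Bits needed to store one second of uncompressed PCM speech."""
    # Assumption: the signal is sampled at twice its highest frequency.
    sample_rate_hz = 2 * max_freq_hz
    return sample_rate_hz * bits_per_sample


low = pcm_bits_per_second(4000, 8)    # 8-bit samples
high = pcm_bits_per_second(4000, 12)  # 12-bit samples
print(low, high)
```

With a 4 kHz maximum frequency this reproduces the 64K and 96K bit figures given in the text.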
DM is a technique for compressing the speech data by assuming that the analog-speech signal is either increasing or decreasing in amplitude. The speech signal is sampled at a rate of approximately 64,000 times per second. Each sample is then compared to the estimated value of the previous sample. If the first value is greater than the estimated value of the latter, then the slope of the signal generated by the model is positive. If not, the slope is then negative. The magnitude of the slope is chosen such that it is at least as large as the maximum expected slope of the signal.
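The comparison step of delta modulation can be sketched as follows. This is a minimal illustration with a fixed step size and an initial estimate of zero, both of which are assumptions; it is not taken from any particular prior art device.

```python
def delta_modulate(samples, step):
    """Encode each sample as one bit: 1 for a positive slope, 0 for a negative slope."""
    estimate = 0.0  # assumption: the running estimate starts at zero
    bits = []
    for sample in samples:
        bit = 1 if sample > estimate else 0
        estimate += step if bit else -step  # the model follows the signal by one step
        bits.append(bit)
    return bits


def delta_demodulate(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    estimate = 0.0
    output = []
    for bit in bits:
        estimate += step if bit else -step
        output.append(estimate)
    return output


bits = delta_modulate([0.5, 1.0, 1.5, 1.0, 0.5], step=0.5)
print(bits, delta_demodulate(bits, step=0.5))
```

Because only one bit is produced per sample, the bit rate equals the sampling rate, which is why a high sampling rate is needed for the fixed step to keep up with the signal.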
CVSD is a technique that is an extension of DM which is accomplished by allowing the slope of the generated signal to vary. The data rate in DM is typically in the order of 64K bits per second and in CVSD it is approximately 16K-32K bits per second.
The Mozer technique takes advantage of the periodicity of voiced speech waveform and the perceptual insensitivity to the phase information of the speech signal. Compressing the information in the speech waveform requires phase-angle adjustment to obtain a time-symmetrical pitch waveform which makes one-half of the waveform redundant; half period zeroing to eliminate relatively low-power segments of the waveform; digital compression using DM and repetition of pitch periods to eliminate redundant (or similar) speech segments. The data rate of this technique is approximately 2.4K bits per second.
In parameter encoding schemes, speech characteristics other than the original speech waveform are used in the analysis and synthesis. These characteristics are used to control the synthesis model to create an output speech signal which is similar to the original. The commonly used techniques attempt to describe the spectral response, the spectral peaks or the vocal tract.
The channel vocoder has a bank of band-pass filters which are designed so that the frequency range of the speech signal can be divided into relatively narrow frequency ranges. After the signal has been divided into the narrow bands the energy is detected and stored for each band. The production of the speech signal is accomplished by a bank of narrow band frequency generators, which correspond to the frequencies of the band-pass filters, controlled by pitch information extracted from the original speech signal. The signal amplitude of each of the frequency generators is determined by the energy of the original speech signal detected during the analysis. The data rate of the channel vocoder is typically in the order of 2.4K bits per second.
In formant synthesis, the short time frequency spectrum is analyzed to the extent that the spectral shape is recreated using the formant center frequencies, their band-widths and the pitch period as the inputs. The formants are the peaks in a frequency spectrum envelope. The data rate for formant synthesis is typically 500 bits per second.
Linear predictive coding (LPC) can best be described as a mathematical model of the human vocal tract. The parameters used to control the model represent the amount of energy delivered by the lungs (amplitude), the vibration of the vocal cords (pitch period and the voiced/unvoiced decision), and the shape of the vocal tract (reflection coefficients). In the prior art, LPC synthesis has been accomplished through computer simulation techniques. More recently, LPC synthesizers have been fabricated in a semiconductor, integrated circuit chip such as that described and claimed in United States Patent No. 4,209,836 entitled "Speech Synthesis Integrated Circuit Device" and assigned to the assignee of this invention.
This invention is a combination of a speech construction technique and a speech synthesis technique. The prior art set out above involves synthesis techniques.
With respect to the speech construction techniques, the library of available component sounds includes phonemes, allophones, diphones, demisyllables, morphs and combinations of these sounds.
Speech construction techniques involving phonemes are flexible techniques in the prior art. In English, there are 16 vowel phonemes and 24 consonant phonemes making a total of 40. Theoretically, any word or phrase desired should be capable of being constructed from these phonemes. However, when each phoneme is actually pronounced there are many minor variations that may occur between sounds, which may in turn modify the pronunciation of the phoneme. This inaccuracy in representing sounds causes difficulty in understanding the resulting speech produced by the synthesis device.
Another prior art construction technique involves the use of diphones. A diphone is defined as the sound that extends from the middle of one phoneme to the middle of the next phoneme. It is chosen as a component sound to reduce smoothing requirements between adjacent diphones. However, to encompass many of the coarticulation effects in English, a large inventory of diphones is usually required. The storage requirement is in the order of 250K bytes, with a computer required to handle the construction program.
Demisyllables have been used in the prior art as component sounds for speech construction. A syllable in any language may be divided into an initial demisyllable, final demisyllable and possible phonetic affixes. The initial demisyllable consists of any initial consonants and the transition into the vowel. The final demisyllable consists of the vowel and any co-final consonants. The phonetic affixes consist of all syllable-final non-core consonants. The prior art system requires a library of 841 initial and final demisyllables and 5 phonetic affixes. The memory requirement is in the order of 50K bytes.
A morph is the smallest unit of sound that has a meaning. In a prior art system, for unrestricted English text, a dictionary of 12,000 morphs was used which required approximately 600K bytes of memory. The speech generated is intelligible and quite natural but the memory requirement is prohibitive.
An allophone is a subset of a phoneme, which is modified by the environment in which it occurs. For example, the aspirated /p/ in "push" and the unaspirated /p/ in "Spain" are different allophones of the phoneme /p/. Thus, allophones are more accurate in representing sounds than phonemes. According to the present invention, 127 allophones are stored in 3,000 bytes of memory. The storage requirement is much less than the aforementioned systems using diphones, demisyllables and morphs.
Text-to-speech synthesizer systems have been fabricated using phonemes and formant synthesis. This invention utilizes the flexibility of allophones coupled with LPC synthesis.
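The storage comparison can be made concrete with the figures quoted above. The sketch assumes that "K" means 1,000 bytes; the variable names are illustrative.

```python
ALLOPHONE_BYTES = 3000
ALLOPHONE_COUNT = 127

# Memory figures quoted for the prior art construction techniques.
# Assumption: "K" is read as 1,000 bytes.
PRIOR_ART_BYTES = {"diphones": 250_000, "demisyllables": 50_000, "morphs": 600_000}

bytes_per_allophone = ALLOPHONE_BYTES / ALLOPHONE_COUNT
ratios = {name: size / ALLOPHONE_BYTES for name, size in PRIOR_ART_BYTES.items()}

print(f"average bytes per allophone: {bytes_per_allophone:.1f}")
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} times the allophone library")
```

On these figures each allophone averages under 24 bytes, and the morph dictionary is 200 times larger than the allophone library.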

Brief Summary of the Invention

In this preferred embodiment, digital information in the form of ASCII code is serially entered into the system. The ASCII code may be entered from a local or remote terminal, a keyboard, a computer, etc. Of course, the particular code is simply a matter of choice and is not important to this invention. The character code is received by a microcontroller which interrogates a set of rules located in a read-only memory (ROM) to get a match for a particular character set. The rules are made up of characters which are dependent upon neighboring characters for the selection of allophonic codes. Each character set is compared with its appropriate rule character sets until a match is found. In this preferred embodiment, the information is set in the ROM in the form of ASCII code so that a direct comparison of ASCII code is made. When a match is found, the allophonic code corresponding to the matched allophone is retrieved. The allophonic code is presented to a speech producing system which synthesizes sound through the use of a digital semiconductor LPC synthesizer. It is to be understood, however, that other sound components such as the aforementioned phonemes, diphones, demisyllables and morphs in coded forms are also contemplated for use with this LPC synthesizer. Furthermore, the allophonic code in this preferred embodiment is contemplated for use in other digital synthesizers as well as the LPC synthesizer of this preferred embodiment.
An allophone library is stored in a ROM. A microprocessor receives the allophonic code and addresses the ROM at the address corresponding to the particular allophonic code entered. An allophone, represented by its speech parameters, is retrieved from the ROM, followed by other allophones forming the words and phrases. A dedicated microcontroller is used for concatenating (stringing) the allophones to form the words and phrases. When stringing allophones, an interpolation frame of 25 ms is created between allophones to smooth out sound transitions in LPC parameters. However, no interpolation is required when the voicing transition occurs. Energy is another parameter that must be smoothed. To obtain an overall smooth energy contour for the strung phrases, interpolation frames are usually created at both ends of the string with energy tapered toward zero. The smoothing technique described subsequently herein reduces the abrupt changes in sound which are usually perceived as pops, squeaks, squeals, etc.
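The context-dependent rule lookup described above can be sketched as a small table search. The rule table here is hypothetical: only the pairing of "AW" with the allophonic code AW3 appears in this document, and the other entries, the tuple layout and the code names are made up for illustration.

```python
# Hypothetical rule table: (left context, character set, right context, allophonic codes).
# Only the AW -> AW3 pairing comes from the document; the rest are illustrative.
RULES = [
    ("", "AW", "", ["AW3"]),
    ("", "S", "", ["S"]),
    (" ", "P", "", ["P-aspirated"]),  # word-initial P
    ("", "P", "", ["P"]),             # any other P
]


def text_to_allophones(text):
    """Scan the text left to right, emitting codes for the first rule that matches."""
    text = " " + text.upper() + " "  # pad so word boundaries are visible as spaces
    pos = 1
    codes = []
    while pos < len(text) - 1:
        for left, match, right, out in RULES:
            end = pos + len(match)
            if (text.startswith(match, pos)
                    and text[:pos].endswith(left)
                    and text.startswith(right, end)):
                codes.extend(out)
                pos = end
                break
        else:
            pos += 1  # no rule matched this character; skip it
    return codes


print(text_to_allophones("saw"))
print(text_to_allophones("paw"))
```

Rules are tried in order, so more specific contexts are listed before general ones. The leading space added to the input lets a rule test whether a character starts a word.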
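The stringing step can be sketched as follows. This is an assumption-laden illustration: frames are modelled as dictionaries, the inserted frame is a simple midpoint of the neighbouring frames, and a frame is inserted only when the voicing on both sides of the boundary is the same. The actual interpolation is performed by the synthesizer circuitry, and the tapering of energy at the ends of the string is not modelled.

```python
def interpolation_frame(prev, nxt):
    """Return a midpoint frame between two allophones, or None at a voicing transition."""
    if prev["voiced"] != nxt["voiced"]:
        return None  # no interpolation across a voicing transition
    return {
        "voiced": prev["voiced"],
        "energy": (prev["energy"] + nxt["energy"]) / 2,
        "k": [(a + b) / 2 for a, b in zip(prev["k"], nxt["k"])],
    }


def string_allophones(allophones):
    """Concatenate allophones (lists of frames), smoothing at each boundary."""
    out = []
    for i, frames in enumerate(allophones):
        out.extend(frames)
        if i + 1 < len(allophones):
            extra = interpolation_frame(frames[-1], allophones[i + 1][0])
            if extra is not None:
                out.append(extra)
    return out


a = [{"voiced": True, "energy": 8, "k": [0.2, 0.4]}]
b = [{"voiced": True, "energy": 4, "k": [0.6, 0.0]}]
c = [{"voiced": False, "energy": 2, "k": [0.1, 0.1]}]
print(len(string_allophones([a, b])), len(string_allophones([a, c])))
```

In this example the voiced pair gains one smoothing frame, while the voiced-to-unvoiced pair is joined directly.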
Stress and intonation greatly contribute to the perceptual naturalness and contextual meaning of constructive speech. Stress means the emphasis of a certain syllable within a word, whereas intonation applies to the overall up-and-down patterns of pitch within a multi-syllable word, phrase or sentence. The contextual meaning of a sentence may be changed completely by assigning stress and intonation differently. Therefore, English does not sound natural if it is randomly intoned. The stress and intonation patterns which are a part of the speech construction technique herein contribute to the understandability and naturalness of the resulting speech. Stress and intonation are based on gradient pitch control of the stressed syllables preceding the primary stress of the phrase. All the secondary stress syllables of the sentence are thought of as lying along a line of pitch values tangent to the line of the pitch values of the unstressed syllables. The unstressed syllables lie on a mid-level line of pitch, with the stress syllables lying on a downward slanted tangent to produce an overall down drift sensation. The user is required to mark stressed syllables in the allophonic code. The stressed syllables then become the anchor point of the pitch patterns. A microprocessor automatically assigns the appropriate pitch values to the allophones which have been strung.
At this point, there exists an inventory of LPC parameters which have been strung together and designated in pitch as set out above. The LPC parameters are then sent to the speech synthesis device, which in this preferred embodiment is the device described in U.S. Patent No. 4,209,836 mentioned earlier and which is incorporated herein by reference. The smoothing mentioned above is accomplished by circuitry on the synthesizer chip. The smoothing could also be accomplished through the microprocessor.
The principal object of this invention is to provide a text-to-speech system with a speech producing system as a component thereof that has unlimited vocabulary in any language.
It is another object of this invention to provide an economic mechanism for producing speech-like sounds that are good in quality, with an unlimited vocabulary, from a textual code input.
Another object of this invention is to provide a text-to-speech system which is low cost in terms of storage and yet provides understandable synthetic speech.
It is still another object of this invention to provide a text-to-speech system which employs a digital, semiconductor integrated circuit LPC synthesizer in combination with concatenated sound input originated through text code to provide an unlimited vocabulary.
A further object of this invention is to provide a stress and intonation pattern to the input textual material so that the pitch is adjusted automatically according to a natural sounding intonation pattern at the output.
An all encompassing object of this invention is to provide a highly flexible, low cost text-to-speech system with the advantages of unlimited vocabulary and good speech quality.
These and other objects will be made evident in the detailed description that follows.

Brief Description of the Drawings

Figure 1 is a block diagram of the inventive text-to-speech system.
Figures 2a-2p illustrate the allophone rules.
Figures 3a-3f form a flowchart illustrating the operation of the rules processor.
Figure 4 is a block diagram of the speech producing system.
Figures 5a-5c form a description of the allophone library.
Figure 6 illustrates the synthesizer frame bit content.
Figure 7 illustrates the allophone library bit content.
Figures 8a and 8b form a flowchart describing the operation of the microprocessor of the system.
Figures 9a-9i form a flowchart describing the intonation pattern structuring.

Detailed Description of the Invention

Figure 1 illustrates the text-to-speech system 10 having a 420 rules processor 17 with a digital character input (ASCII) for comparison to the rules 16 which are stored in a ROM. The 420 rules processor 17 is a Texas Instruments Incorporated Type TMC0420 microcomputer described in detail in Appendix A which includes 26 sheets of specification and 9 sheets of drawings. The rules ROM 16 is a Texas Instruments Type TMS6100 (TMC350) voice synthesis memory which is a ROM internally organized as 16Kx8 bits.
The allophonic code retrieved from rules ROM 16 is entered in the system 420 microprocessor 11 which is connected to control the stringer controller 13 and synthesizer 14. Allophone library 12 is accessed through the stringer controller 13. The output of synthesizer 14 is through speaker 15 which produces speech-like sounds in response to the input allophonic code.
The 356 stringer controller 13 is a Texas Instruments TMC0356, which is described in detail in Appendix B which comprises 21 specification sheets and 11 sheets of drawings. Allophone library 12 is a Texas Instruments Type TMS6100 also. It may or may not be included because the 356 stringer controller 13 has an internal ROM which may be used to contain the library. The 420 system microprocessor 11 also is a Type TMC0420 microcomputer. Appendices A and B are enclosed herewith and incorporated by reference.
Synthesizer 14 is fully described in previously mentioned United States Patent No. 4,209,836. However, in addition, 286 synthesizer 14 has the facility for selectively smoothing between allophones and has circuitry for providing a selection of speech rate which is not part of this invention.
Figures 2a-2p set out the allophone rules. For example, in the A rules, [AW]b is the allophonic code /AW3/ which is pronounced as the "a" in "saw". Note that the "A" sounds are categorized in one group followed by the "B" sounds, etc. These are listed as "A" rules, "B" rules, "C" rules and so on.
Figures 3a-3f form the flowchart detailing the operation of the 420 rules processor 17 in searching the rules ROM 16 for each of the incoming digital characters. The appropriate allophonic code is retrieved and stress is assigned.
Referring first to Fig. 3a, the system is initialized, and the rule file is opened. The 420 rules processor 17 is thereby instructed to read information from the rules ROM 16 and to do the matching. The first character input (in ASCII code in this preferred embodiment) is shifted to the right and then the first character is skipped. The first character is a space because of the shift to the right and is skipped so that when a comparison is made, it is noted that the neighboring character to the left of the next character is a space and the proper allophonic code can be assigned. Then the next character is read and the question "end of text?" is asked. If the answer is yes, the routine goes to "STRESS" on Figure 3b. If the answer is no, the rules are read out until a match is made. Each rule contains the ASCII characters set to define an allophone and the corresponding allophonic code. When a match is made, the allophonic code is read out to "STRESS" of Figure 3b and the next character is obtained.
Coming in through "STRESS" on Figure 3b, a pointer receives the beginning of the display buffer. The pointer also gets the beginning of the allophone buffer. Then the question "?" is asked. If the answer is true, "?" is deleted from queue 1 and it is determined whether the allophone starts with "wh". If the answer is true, then a question bit flag is set. If the answer is "no", the question bit flag is cleared. Then a reset word/phrase bit is set, and a reset allophone/allophone-bit flag is reset, followed by beginning of allophone buffer sent to the pointer. Figure 3b, it is seen, is dedicated to concatenating the flags.
In Figure 3c, the question is asked if the allophone=00. If the answer is false, a pointer is incremented until the allophone=00. When it does, it is determined whether the allophone number is less than 48 hexadecimal. If the answer is true, a vowel is indicated (as shown in the allophone library) and the last vowel gets the value of the pointer. If the answer is false, then the pointer is decremented because the first vowel received will actually be the last vowel.
The pointer receives the beginning of the allophone buffer and the primary receives a 1 with the vowel receiving a 0 in an initialization process.
In Figure 3d, it is determined whether the allophone is a vowel. If the answer is true, then the vowel number is incremented by 1 and the next allophone is called. If the answer is false, it is then determined whether the allophone is a "^". If the answer is true, a primary stress is indicated and the code for "^" must be eliminated from the assembly queue 1. Then the pointer is incremented and it is determined whether the next allophone is a ">". If the answer is true, ">" is eliminated from queue 1 and it is indicated that the primary stress will be skipped to the next vowel. Therefore, the primary notation is incremented and the pointer is again incremented. If the answer is false, the primary is increased by the sum of primary+vowel to determine which vowel gets the primary stress. If it is determined that there is no "^", then no primary stress is indicated and it is determined whether the allophone is the end of frame. If the answer is false, the pointer is incremented and the routine shown in Fig. b is repeated. If it is an end of frame, then the primary is reset to 0 and it is determined whether the last vowel receives the primary stress. If the answer is yes, then a vowel bit flag is set. If the answer is no, the vowel bit flag is not set. In either event, the information thus derived (overhead) is sent to queue 2 which is the speaking queue. Next, the pointer is set to the beginning of the allophone buffer. The secondary bit flag is initialized and then, in Figure 3f, it is determined whether the allophone is a "-", indicating a secondary stress. If the answer is true, then the "-" must be removed from queue 1 and the pointer is indexed. Next it is determined whether the following allophone is a ">", indicating that the next vowel is to receive the secondary stress. If the answer is true, then the code for ">" must be deleted from queue 1 and the secondary flag is incremented by 1 and the question whether a skip is to be performed is again asked.
If there is no skip, then it is determined whether the allophone is a vowel. If the answer is false, the pointer is incremented by 1 until a vowel is reached. If the answer is true, then the secondary stress flag is decremented by 1 and the question is asked whether the secondary is now equal to 0. If the answer is true, a secondary stress flag is set as indicated on Figure 3e. If the answer is false, the pointer is incremented.
If it is indicated that the allophone is the end of the frame, then the allophone buffer is downloaded to queue 2, the speaking queue.
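The vowel test and the backward search for the last vowel can be sketched as below. The sketch assumes that the buffer ends with a zero code and that the allophone numbers quoted elsewhere in this document (18 for AW3, 80 for GG) are decimal; the function names are illustrative.

```python
VOWEL_LIMIT = 0x48  # allophone numbers below 48 hexadecimal are vowels


def is_vowel(allophone_number):
    return allophone_number < VOWEL_LIMIT


def last_vowel_index(buffer):
    """Find the end marker, then step backwards to the first vowel met."""
    end = buffer.index(0)  # assumption: a zero code terminates the buffer
    for i in range(end - 1, -1, -1):
        if is_vowel(buffer[i]):
            return i
    return None  # no vowel in the buffer


print(is_vowel(18), is_vowel(80))
print(last_vowel_index([80, 18, 80, 0]))
```

Searching backwards from the end marker means the first vowel encountered is the last vowel of the word, which matches the reasoning given for decrementing the pointer.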
Figure 4 is a block diagram of the speech producing system which has been described in association with Figure 1.
Figures 5a through 5c illustrate the allophones within the allophone library 12. For example, allophone 18 is coded within ROM 12 as "AW3" which is pronounced as the "a" in the word "saw". Allophone 80 is set in the ROM 12 as code corresponding to allophone "GG" which is pronounced as the "g" in the word "bag". Pronunciation is given for all of the allophones stored in the allophone library 12.
Each allophone is made up of as many as 10 frames, the frames varying from four bits for a zero energy frame, to ten bits for a "repeat frame", to 28 bits for an "unvoiced frame", to 49 bits for a "voiced frame". Fig. 6 illustrates this frame structure. A detailed description is present in previously mentioned United States Patent No. 4,209,836.
In this preferred embodiment, the number of frames in a given allophone is determined by a well-known LPC analysis of a speaker's voice. That is, the analysis provides the breakdown of the frames required, the energy for each frame, and the reflection coefficients for each frame. This information is stored then to represent the allophone sounds set out in Figs. 5a-5c.
Smoothing between certain allophones is accomplished by circuitry illustrated in Figs. 7a and 7a (cont'd) of U.S. Patent No. 4,209,836. In Figs. 7a and 7a (cont'd), signal SLOW D is applied to parameter counter 513, which causes a frame width of 25 ms to be slowed to 50 ms. Interpolation (smoothing) is performed by the circuitry shown in Figs. 9a, 9a (cont'd), 9b, 9b (cont'd) over a 50 ms period when signal SLOW D is present and over a 25 ms period when signal SLOW D is absent. In the invention of U.S. Patent No. 4,209,836, a switch was set to cause slow speech through signal SLOW D. All frames were lengthened in duration.
In the present invention, SLOW D is present only when the last frame in an allophone is indicated by a single bit in the frame. The actual interpolation (smoothing) circuitry and its operation are described in detail in U.S. Patent No. 4,209,836.

Figure 6 illustrates the bit formation of the allophone frame received by the 286 synthesizer 14. As shown, the MSB is the end of allophone (EOA) bit. When EOA=1, it is the last frame in the allophone. When EOA=0, it is not the last frame in the allophone. Figure 6 illustrates a total of 50 bits (including EOA) for the voiced frame, 29 bits for the unvoiced frame, 11 bits for the repeat frame and 5 bits for the zero energy frame or energy equal to 15.
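The frame sizes given for Figure 6 can be tabulated and used to total the bits sent for one allophone. The sketch represents a frame as a string of "0" and "1" characters with the EOA bit first, which is an illustrative choice and not the hardware format.

```python
# Bits per synthesizer frame, including the leading EOA bit, as listed for Figure 6.
FRAME_BITS = {"voiced": 50, "unvoiced": 29, "repeat": 11, "zero_energy": 5}


def is_last_frame(frame_bits):
    """EOA is the most significant (first) bit: 1 marks the last frame of an allophone."""
    return frame_bits[0] == "1"


def allophone_bit_count(frame_kinds):
    """Total bits sent to the synthesizer for one allophone."""
    return sum(FRAME_BITS[kind] for kind in frame_kinds)


print(allophone_bit_count(["voiced", "voiced", "repeat"]))
print(is_last_frame("1" + "0" * 49))
```

An allophone made of two voiced frames and one repeat frame therefore occupies 111 bits at the synthesizer input.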
Figure 7 illustrates an allophone frame from the allophone library 12. F1-F5 are each one bit flags with F5 being the EOA bit which is transferred to the 286 synthesizer 14. The combination of flags F1 and F2 and the combination of flags F3 and F4 are shown in Fig. 7 and the meaning of those combinations set out.
Figures 8a and 8b form a flowchart illustrating the details of control exerted by the 420 microcomputer 11 over, primarily, the 356 stringer 13. Beginning at "word/phrase", the first-in, first-out (FIFO) register of the 356 stringer 13 is initialized to receive the allophonic code from 420 microprocessor 11. Next it is determined whether the incoming information is simply a word or a phrase. If it is simply a word, then the call routine is brought up to send flag information representative of allophones, the primary stress and which vowel is the last in the word. The number of allophones is set in a countdown register and the number of allophones is sent to the 356 stringer 13.

The primary stress to be given is sent, followed by the information as to which vowel is the last one in the word. Finally, a send 2 is called to send the entire 8 bits (7 bits allophone, 1 bit stress flag). It should be noted that the previous send routine involved sending only 4 bits.
A send 2 flag is set and a status command is sent to the 356 stringer 13. Then, if the 356 FIFO is ready to receive information, the FIFO is loaded.
Four bits are then sent from the 420 microcomputer 11 queue register to the FIFO of the 356 stringer 13. The queue is incremented and checked to determine whether it has been emptied. If it has been emptied, there is an error. If it has not been emptied, then the send 2 flag is interrogated. If it is not set, then the routine returns to the send 2 call mentioned above. If the flag is set, then it is cleared and the next four bits are brought in to go through the same routine as indicated above.
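The two four-bit transfers that carry one eight-bit code (7 bits of allophone, 1 bit of stress flag) can be sketched as follows. The position of the stress flag within the byte and the order in which the two halves are sent are assumptions made for illustration.

```python
def pack_allophone(allophone, stressed):
    """Combine a 7-bit allophone number with a 1-bit stress flag."""
    if not 0 <= allophone < 128:
        raise ValueError("allophone number must fit in 7 bits")
    return (int(stressed) << 7) | allophone  # assumption: stress flag is the top bit


def to_nibbles(byte):
    """Split a byte into two 4-bit transfers (assumption: high half first)."""
    return [(byte >> 4) & 0xF, byte & 0xF]


def from_nibbles(nibbles):
    """Reassemble the byte on the receiving side."""
    return (nibbles[0] << 4) | nibbles[1]


code = pack_allophone(18, stressed=True)
print(hex(code), to_nibbles(code))
```

The send 2 flag in the flowchart plays the role of remembering which half has already been sent.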
When the return is made, an execute command is sent to the 356 stringer 13 after which a status command is sent. If the 356 stringer 13 is ready, a speak command is given. If it is not ready, the status command is again sent until the stringer 13 is ready. Then the allophone is sent and the countdown register containing the number of allophones is decremented. If the countdown equals zero, the routine is again started at word/phrase. If the countdown is not equal to zero, then the send 2 routine is again called and the next allophone is brought with the procedure being repeated until the entire word has been completed.
If, a phrase had been sent rather than a word, then and similar to the case of the single word, status flags are sent, and the call routine is sent, indicating first the number of words, then the primary stress, and then the base pitch and the delta pitch. At that point, the routine returns to word/phrase and is identical to that set out above.
Figure 9a-9i form a flowchart of the details of the control of the action of the 356 stringer 13 on the allophones. Beginning in Figure 9a, the starting point is to "read an allophone address" and then to "read a frame of allophone speech data". On path 31 to Figure 9b, a decision block inquiring "first frame of the allophone" is reached. If the answer is "yes", then it is necessary to decode the flags F1-F5. If the answer is "no", then it is necessary to only decode flags F3, F4 and F5. As indicated above, flags F1 and F2 determine the nature of the allophone and need not be further decoded. After the decoding, in either case, a decision block is reached where it is necessary to determine whether F3 F4=₀₀. If the answer is "yes" then the energy is O and a decision is made as to whether F5=1, indicating the last frame in the allophone. If the answer is yes, then the decision is reached as to whether it is the last allophone. If the answer is "yes", the routine has ended. If F5 is not equal to 1, then E=O is sent to the 286 synthesizer 14 and the next frame is brought in as indicated on Figure 9a. If F5=1, and it is not the last allophone, then the information =O and F5=1 is sent to the 286 synthesizer 14 and the next allophone is called starting at the beginning of the routine.
If F3 F4 is not equal to 00, then it is determined whether F3 F4=01, indicating a 9 bit word because a repeat, using the same K parameters, is to follow. If the answer is "no", then on path 32 to Fig. 6c, it is determined whether F3 F4=10, indicating 27 bits for an unvoiced frame. If the answer is "yes", the first four bits are read as energy. Five bits for pitch are created as O and the next four bits are read as K1-K4. Then energy and pitch=O and K1-K4 are sent to the 286 synthesizer 14. If F3 F4≠10, then F3 F4=11 indicating a voiced 48 bit frame and the first four bits are read as energy, the next five bits are created as pitch and the ten K parameters are read.
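The decoding of flags F3 and F4 into a frame type can be sketched as a lookup. The type names and the helper for the number of K parameters are illustrative; the counts follow the description (none for zero energy and repeat frames, K1-K4 for unvoiced frames, ten for voiced frames).

```python
# Frame type selected by the flag pair (F3, F4), as described in the text.
FRAME_KINDS = {
    (0, 0): "zero_energy",
    (0, 1): "repeat",
    (1, 0): "unvoiced",
    (1, 1): "voiced",
}

# Number of K parameters read from the library for each frame type.
K_PARAMETER_COUNT = {"zero_energy": 0, "repeat": 0, "unvoiced": 4, "voiced": 10}


def classify_frame(f3, f4):
    return FRAME_KINDS[(f3, f4)]


def needs_pitch_adjustment(kind):
    """Pitch is adjusted for voiced frames and for repeat frames."""
    return kind in ("voiced", "repeat")


kind = classify_frame(1, 0)
print(kind, K_PARAMETER_COUNT[kind], needs_pitch_adjustment(kind))
```

An unvoiced frame carries four K parameters and has its pitch field set to zero, so no pitch adjustment applies to it.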
Turning to Fig. 9b, if it was determined that F3 F4=01, then on path 33 into Fig. 9c, the next four bits are read as energy, five bits space is created for pitch and repeat (R)=1. At this point, if F3 F4-11 or if F3 F4=01, a pitch adjustment is to be made. The inquiry "base pitch=_O?" is made. If the answer is "yes", then the speech is a whisper and pitch is set to O. At that point, energy and pitch=O and K1 to K4 is sent to the 286 synthesizer 14. The next frame is brought in as indicated on Fig. 9a.
If the base pitch#O, then a decision is made as to whether the delta pitch=_O. If the answer is "yes", then the pitch is made equal to the base pitch. The energy, and pitch equal to the monotone base pitch, and the parameters K1-K10 are sent to the 286 synthesizer 14 and the next frame is brought in.
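The two early exits in the pitch adjustment, whisper and monotone, can be sketched as below. Returning None to mean "continue into the intonation flowchart" is an illustrative convention, not part of the described system.

```python
def simple_pitch(base_pitch, delta_pitch):
    """Handle the whisper and monotone cases; None means intonation logic is needed."""
    if base_pitch == 0:
        return 0           # whisper: pitch is set to zero
    if delta_pitch == 0:
        return base_pitch  # monotone: every frame uses the base pitch
    return None            # otherwise the stress and intonation paths decide


print(simple_pitch(0, 4), simple_pitch(20, 0), simple_pitch(20, 4))
```

Only when both the base pitch and the delta pitch are non-zero does the stringer go on to the stress-dependent pitch selection.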
If the delta pitch#O, then on path 34 into Fig. 9e, it is determined whether F1 F2=00, indicating a vowel. If the answer is "yes", then the question "a primary in the phrase" is asked. If the answer is "no" it is asked whether there is a secondary in the phrase. If the answer is "no", then the vowel is unstressed and the question is asked "is this vowel before the primary stress". If the answer is "no", then on path 38 to Fig. 9e, the decision is made as to whether this is the last vowel. If the answer is "no", then the decision is made as to whether it is a statement or a question type phrase. If the answer is that it is a statement, the decision is made to determine whether it is immediately after the primary stress. If the answer is "no", then the pitch is made equal to the base pitch and on path 51 to Fig. 9i, it is seen that path 40 returns to Fig. 9g where it is indicated that all parameters are sent to the 286 synthesizer 14 for reading and another frame is brought in. This particular path was chosen because of its simplicity of explanation. The multitude of remaining paths shown illustrate in great detail the selection of pitch at the required points.
The assignment of descending or ascending base pitch is shown in Figure 9h. Path 37 from Fig. 9d indicates that there is a primary stress in the particular string and if it is the last vowel, then it is determined whether the phrase is a question or statement. If it is a question, it is determined whether it is the first frame of the allophone. If the answer is "yes", then pitch is assigned as indicated equal to BP+D-2. If it is a statement, and it is the first frame, then pitch is assigned as BP-D+2. This assignment of pitch is set out in Section 4.6 of Appendix B.
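As a worked illustration of the two first-frame assignments just stated, the sketch below takes BP as the base pitch and D as the delta pitch. The function name is an assumption, and the sketch covers only the first frame of the last vowel in a phrase with a primary stress; later frames follow Figure 9h and are not modeled.

```python
def last_vowel_first_frame_pitch(bp: int, d: int, is_question: bool) -> int:
    """Pitch for the first frame of the last vowel (primary stress present)."""
    if is_question:
        return bp + d - 2  # question: BP+D-2
    return bp - d + 2      # statement: BP-D+2
```

With BP=20 and D=4, a question yields 22 and a statement yields 18.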

Mode of Operation

The operation of this invention is primarily shown in Figures 3a-3f, 8a-8b and 9a-9i. In broad terms, however, the text-to-speech system accepts ASCII code, looks up the appropriate allophonic code in the allophone rules, and assigns stress and pitch. The allophonic code is then received through the 420 microprocessor 11 shown in Figure 1. The code received is related to an address in the allophone library 12. The code is sent by the 420 microprocessor 11 to the 356 stringer 13, where the address is read and the allophone is brought out and handled as indicated in Figures 9a-9i. The basic control by the 420 microprocessor 11 in causing the action by the 356 stringer 13 is shown in Figures 8a-8b. The 286 synthesizer 14 receives the allophone parameters from the 356 stringer 13 and supplies an analog signal representative of the allophone to the speaker 15, which then provides speech-like sound.
The inventive speech producing system, in its preferred embodiment, employs an LPC synthesizer on an integrated circuit chip with LPC parameter inputs provided through allophones read from the allophonic library. It is of course contemplated that other waveform encoding types of code inputs may be used as inputs to a speech synthesizer. Also, the specific implementation shown herein is not to be considered as limiting. For example, a single computer could be used for the functions of the microcomputer, the allophone library, and the stringer of this invention without departing from its scope. The breadth and scope of this invention are limited only by the appended claims.
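The rule lookup step can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the rule entries, the numeric codes and the names `RULES` and `text_to_codes` are hypothetical and are not the rule set of Figures 3a-3f. It shows only the general idea of searching the section of rules for the current character, in order, until a character set and its neighboring context match.

```python
from typing import Dict, List, Tuple

# (left context, target character set, right context, allophonic code).
# An empty context string matches anything. All entries are hypothetical.
Rule = Tuple[str, str, str, int]

RULES: Dict[str, List[Rule]] = {
    "C": [("", "CH", "", 40), ("", "C", "E", 41), ("", "C", "", 42)],
    "A": [("", "A", "", 10)],
    "E": [("", "E", "", 20)],
    "T": [("", "T", "", 50)],
}


def text_to_codes(text: str) -> List[int]:
    """Convert text to a list of (hypothetical) allophonic codes."""
    text = text.upper()
    codes: List[int] = []
    i = 0
    while i < len(text):
        # Search only the section for the current character.
        for left, target, right, code in RULES.get(text[i], []):
            end = i + len(target)
            if (text.startswith(target, i)
                    and text[:i].endswith(left)
                    and text.startswith(right, end)):
                codes.append(code)
                i = end  # consume the matched character set
                break
        else:
            i += 1  # no rule matched: skip this character
    return codes
```

In this toy table the more specific rules are listed first within a section, so "CH" is tried before a bare "C".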

Claims

1. A text-to-speech synthesis system, having audible output means, for synthesizing speech from digital characters comprising:

(a) means for receiving the digital character;

(b) speech unit rule means, for storing parameter encoding, coded signals corresponding to the digital characters;

(c) rules processor means, for searching the speech unit rule means to provide parameter encoding, coded signals corresponding to the digital characters; and

(d) speech producing means, connected to receive the coded signals and to produce speech-like sound.

2. The system of claim 1 wherein the speech unit rule means comprises digital storage means.

3. The system of claim 2 wherein the digital storage means comprises a read-only memory.

4. The system of claim 3 wherein the rules processor means comprises a rules microprocessor.

5. The system of claim 4 wherein the speech unit rule means further comprises a plurality of rules stored in a common section of the read-only memory for each of the digital characters that may be input to the system.

6. The system of claim 5 wherein the rules comprise units of speech representative of the digital characters that are each assigned a parameter encoding code determined by the character and the neighboring characters on each side.

7. The system of claim 5 wherein the rules microprocessor comprises means for searching a common section corresponding to an input digital character set until a match is found and for providing the assigned code for the matched character set.

8. The system of claim 3 wherein the speech producing means comprises:

(d) (i) parameter encoding library means, responsive to the coded signals for providing digital signals representative of parameters of unit of speech;

(ii) means for concatenating the digital signals for designating stress and intonation patterns and for designating pitch to the unit of speech wherein the unit of speech comprises a plurality of frames and wherein pitch is designated for each frame;

(iii) LPC speech synthesizing means, for receiving the digital signals and for providing analog signals, corresponding to the digital signals, to audible output means to produce speech-like sounds with stress and intonation; and

(iv) smoothing means for selectively smoothing the transitions between the units of speech.

9. The system of claim 8 wherein the parameter encoding library means comprises a read-only-memory having storage addresses corresponding to the respective coded signals, the contents at each address including parameters of a unit of speech.

10. The system of claim 3 wherein the speech producing means comprises:

(d) (i) parameter encoding library means, responsive to the coded signals for providing digital signals representative of parameters of units of speech;

(ii) means for concatenating the digital signals and for designating stress and intonation patterns; and

(v) semiconductor integrated circuit, LPC speech synthesizing means, for receiving the digital signals and for providing analog signals, corresponding to the digital signals, to audible output means to produce speech-like sounds with stress and intonation.

11. The system of claim 10 wherein the parameter encoding library means comprises a read-only memory having storage addresses corresponding to the respective coded signals, the contents at each address including parameters of a unit of speech.

12. A text-to-speech synthesis system, having audible output means, for synthesizing speech from digital characters, comprising:

(a) means for receiving the digital characters;

(b) allophone rule means, for storing allophonic code signals corresponding to the digital characters;

(c) rules processor means, for searching the allophone rule means to provide an allophonic code corresponding to the digital characters; and

13. The system of claim 12 wherein the allophone rule means comprises digital storage means.

14. The system of claim 13 wherein the digital storage means comprises a read-only memory.

15. The system of claim 14 wherein the rules processor means comprises a rules microprocessor.

16. The system of claim 15 wherein the speech allophone rule means further comprises a plurality of rules stored in a common section of the read-only memory for each of the digital characters that may be input to the system.

17. The system of claim 16 wherein the rules comprise allophonic codes representative of the digital character sets that are each assigned an allophonic code determined by the character set.

18. The system of claim 16 wherein the rules microprocessor comprises means for searching a common section corresponding to an input digital character set until a match is found and for providing the assigned code for the matched character set.

19. The system of claim 14 wherein the speech producing means comprises:

(d) (i) allophone library means, responsive to the allophonic code for providing digital signals representative of allophones, corresponding to the allophonic code;

(ii) means for concatenating the digital signals, for designating stress and intonation patterns and for designating pitch to the allophone wherein the allophone comprises a plurality of frames and wherein pitch is designated for each frame;

(iii) speech synthesizing means, for receiving the digital signals and for providing analog signals corresponding to the digital signals to audible output means to produce speech-like sounds with stress and intonation; and

(iv) smoothing means for selectively smoothing the transition between the allophones.

20. The system of claim 19 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the contents at each address including parameters of an allophone.

21. The system of claim 14 wherein the speech producing means comprises:

(d) (i) allophone library means, responsive to the allophonic code for providing digital signals representative of allophones, corresponding to the allophonic code;

(v) semiconductor, integrated circuit speech synthesizing means, for receiving the digital signals and for providing analog signals corresponding to the digital signals to the audible output means to produce speech-like sounds with stress and intonation.

22. The system of claim 21 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the content in each address including parameters of an allophone.

23. The system of claim 14 wherein the speech producing means comprises:

(iii) semiconductor integrated circuit, LPC speech synthesizing means for receiving the digital signals and for providing analog signals corresponding to the allophones to the audible output means to produce speech-like sounds with stress and intonation.

24. The system of claim 23 further comprising smoothing means for selectively smoothing the transition between the allophones.

25. The system of claim 24 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the contents at each address including parameters of an allophone.

26. The system of claim 24 wherein the means for concatenating comprises means for designating pitch to the allophone.

27. The system of claim 26 wherein the means for designating pitch includes means for designating a base pitch as modified by an operator-inserted coded primary or secondary stress.

28. The system of claim 27 wherein the allophone comprises a plurality of frames and wherein pitch is designated for each frame.

29. The system of claim 28 wherein the base pitch comprises a descending gradient for a statement and an ascending gradient for a question.

30. The system of claim 29 wherein the means for designating pitch includes means for designating a delta pitch for limiting the amplitude of the primary or secondary stress.

31. The system of claim 30 wherein each frame comprises a signal indicating whether or not the frame is the end of the allophone.

32. The system of claim 31 wherein the smoothing means comprises means for selectively inserting an additional frame after the last frame in the allophone.

33. The system of claim 32 wherein the smoothing means further comprises means for identifying the current allophone and the subsequent allophone as voiced or unvoiced, or stop.

34. The system of claim 33 wherein the means for selectively inserting an additional frame is activated when no stop is present, and the current allophone and the subsequent allophone are both voiced or both unvoiced.

35. A method for producing speech from digital characters in a system having a semiconductor, integrated circuit, speech synthesizer and audible output means, comprising:

(a) storing rules for units of speech;

(b) searching the rules for a match to an input digital character set;

(c) providing a coded signal for the matched character set;

(d) storing digital signals representative of parameters of units of speech;

(e) reading out the digital signals at the addresses corresponding to the coded signals;

(f) concatenating the digital signals read out;

(g) digitally coding desired pitch and intonation to the concatenated digital signals; and

(h) transmitting the concatenated digital signals to the speech synthesizer for speech synthesis and application to the audible output means to produce speech.

36. The method of claim 35 further comprising the step of, following step (g), selectively smoothing transitions between units of speech.

37. The method of claim 36 wherein the units of speech are allophones.

38. An electronic, speech producing system for receiving parameter encoding, coded signals and for producing speech-like sounds corresponding to the coded signals, via audible output means, comprising:

(a) parameter encoding library means, responsive to the coded signals for providing digital signals representative of parameters of units of speech;

(b) means for concatenating the digital signals for designating stress and intonation patterns and for designating pitch to the unit of speech wherein the unit of speech comprises a plurality of frames and wherein pitch is designated for each frame;

(c) LPC speech synthesizing means, for receiving the digital signals and for providing analog signals, corresponding to the digital signals, to the audible output means to produce speech-like sounds with stress and intonation; and

(d) smoothing means for selectively smoothing the transitions between the units of speech.

39. The electronic speech producing system of claim 38 wherein the parameter encoding library means comprises a read-only memory having storage addresses corresponding to the respective coded signals, the contents at each address including parameters of a unit of speech.

40. An electronic, speech producing system for receiving parameter encoding, coded signals and for producing speech-like sounds corresponding to the coded signals, via audible output means, comprising:

(b) means for concatenating the digital signals and for designating stress and intonation patterns; and

(c) semiconductor integrated circuit, LPC speech synthesizing means, for receiving the digital signals and for providing analog signals, corresponding to the digital signals, to the audible output means to produce speech-like sounds with stress and intonation.

41. The system of claim 40 further comprising:

42. The electronic speech producing system of claim 41 wherein the parameter encoding library means comprises a read-only memory having storage addresses corresponding to the respective coded signals, the contents at each address including parameters of a unit of speech.

43. The system of claim 41 wherein the means for concatenating comprises means for designating pitch to the unit of speech.

44. The system of claim 43 wherein the unit of speech comprises a plurality of frames and wherein pitch is designated for each frame.

45. An electronic, speech producing system for receiving allophonic code and for producing speech-like sounds corresponding to the allophonic code via audible output means, comprising:

(a) allophone library means, responsive to the allophonic code for providing digital signals representative of allophones, corresponding to the allophonic code;

(b) means for concatenating the digital signals, for designating stress and intonation patterns and for designating pitch to the allophone wherein the allophone comprises a plurality of frames and wherein pitch is designated for each frame;

(c) speech synthesizing means, for receiving digital signals and for providing analog signals corresponding to the digital signals to the audible output means to produce speech-like sounds with stress and intonation; and

(d) smoothing means for selectively smoothing the transition between the allophones.

46. The system of claim 45 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the contents at each address including parameters of an allophone.

47. An electronic, speech producing system for receiving allophonic code and for producing speech-like sounds corresponding to the allophonic code via audible output means, comprising:

(c) semiconductor, integrated circuit speech synthesizing means, for receiving the digital signals and for providing analog signals corresponding to the digital signals to the audible output means to produce speech-like sounds with stress and intonation.

48. The system of claim 47 further comprising:

49. The system of claim 48 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the contents at each address including parameters of an allophone.

50. The system of claim 48 wherein the means for concatenating comprises means for designating pitch to the allophone.

51. The system of claim 46 wherein the allophone comprises a plurality of frames and wherein pitch is designated for each frame.

52. An electronic speech producing system for receiving allophonic code and for producing speech-like sounds corresponding to the allophonic code via audible output means, comprising:

(b) means for concatenating the digital signals, for designating stress and intonation patterns, and for designating pitch to the allophone, wherein the means for designating pitch includes means for designating a base pitch as modified by an operator-inserted, coded primary or secondary stress;

(c) LPC speech synthesizing means for receiving the digital signals and for providing analog signals representative of the allophone to the audible output means to produce speech-like sounds with stress and intonation; and

53. The system of claim 52 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the contents at each address including parameters of an allophone.

54. The system of claim 53 wherein the allophone comprises a plurality of frames and wherein pitch is designated for each frame.

55. The system of claim 54 wherein the base pitch comprises a descending gradient for a statement and an ascending gradient for a question.

56. The system of claim 55 wherein the means for designating pitch includes means for designating a delta pitch for limiting the amplitude of the primary or secondary stress modification.

57. An electronic speech producing system for receiving allophonic code and for producing speech-like sounds corresponding to the allophonic code via audible output means, comprising:

(c) semiconductor integrated circuit, LPC speech synthesizing means for receiving the digital signals and for providing analog signals corresponding to the allophones to the audible output means to produce speech-like sounds with stress and intonation.

58. The system of claim 57 further comprising:

59. The system of claim 58 wherein the allophone library means comprises a read-only memory having storage addresses corresponding to the respective allophonic code, the contents at each address including parameters of an allophone.

60. The system of claim 58 wherein the means for concatenating comprises means for designating pitch to the allophone.

61. The system of claim 60 wherein the means for designating pitch includes means for designating a base pitch as modified by an operator-inserted, coded primary or secondary stress.

62. The system of claim 61 wherein the allophone comprises a plurality of frames and wherein pitch is designated for each frame.

63. The system of claim 62 wherein the base pitch comprises a descending gradient for a statement and an ascending gradient for a question.

64. The system of claim 63 wherein the means for designating pitch includes means for designating a delta pitch for limiting the amplitude of the primary or secondary stress.

65. The system of claim 64 wherein each frame comprises a signal indicating whether or not the frame is the end of the allophone.

66. The system of claim 65 wherein the smoothing means comprises means for selectively inserting an additional frame after the last frame in an allophone.

67. The system of claim 66 wherein the smoothing means further comprises means for identifying the current allophone and the subsequent allophone as voiced or unvoiced, or stop.

68. The system of claim 66 wherein the means for selectively inserting an additional frame is activated when no stop is present, and the current allophone and the subsequent allophone are both voiced or both unvoiced.

69. A method of producing speech from coded signals in a system having a semiconductor, integrated circuit, LPC speech synthesizer and audible output means comprising the steps of:

(a) storing digital signals representative of parameters of units of speech;

(b) reading out the digital signals at the addresses corresponding to respective coded signals;

(c) concatenating the digital signals read out;

(d) digitally coding desired pitch and intonation to the concatenated digital signals; and

(e) transmitting the concatenated digital signals and the digital coding to the speech synthesizer for speech synthesis and application to the audible output means to produce speech.

70. The method of claim 69 further comprising the step of, following step (d), selectively smoothing transitions between units of speech.

71. The method of claim 70 wherein the units of speech are allophones.