A Review of The Cocktail Party Effect, Part 3

Barry Arons
MIT Media Lab
20 Ames Street, E15-353
Cambridge MA 02139
[email protected]
Schema-based segregation

Segregation that is learned, or that involves attention, is considered to be based on a higher level of central processing. Anything that is consciously "listened for" is part of a schema. Recall from the findings of the earlier studies that only a limited number of things can be attended to simultaneously, so there is a limit on our ability to process schemas.

Primitive segregation is symmetrical: when it separates sounds by frequency (or location), we can attend to either the high tones or the low tones (the left or the right) equally well. Schema-based recognition is not symmetrical. If your name is mixed with other sounds it may be easy to recognize it in the mixture, but this does not make it any easier to identify the other elements of the sound.

An example of the use of schema-based reasoning involves the simultaneous presentation of two synthetic vowels. The vowels were produced so that they had the same fundamental, the same start and stop times, and came from the same spatial location. All the primitive, preattentive clustering theories suggest that these complex sounds should be fused into a single stream. However, higher-level schemas are used to distinguish the vowels in this mixture. Bregman suspects that the schema for each vowel picks out what it needs from the total spectrum rather than requiring that a partitioning be done by the primitive processes.

There is also evidence that a scene that has been segregated by primitive processes can be regrouped by schemas. For example, a two-formant speech sound was synthesized with each formant constructed from harmonics related to a different fundamental. Listeners hear two sounds, one corresponding to each related group of harmonics, yet at the same time they perceive a single speech sound formed by the complete set of harmonics. The speech-recognition schemas can thus sometimes combine evidence that has been segregated by the primitive process.
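As an illustration (not part of the original studies), a minimal Python sketch of the double-vowel stimulus described above might look like the following. The vowel identities, formant centers, and the 100 Hz fundamental are assumptions chosen for the example, not parameters from the cited work:

    # Sketch: two synthetic vowels sharing the same fundamental, onset, and
    # duration, mixed into one (mono) channel.  Formant values are illustrative.
    import numpy as np

    SR = 16000          # sample rate (Hz)
    DUR = 0.5           # duration (s)
    F0 = 100.0          # shared fundamental (Hz)

    # Rough formant centers (Hz) for two vowels; approximate textbook values.
    FORMANTS = {"ah": (730, 1090, 2440), "ee": (270, 2290, 3010)}

    def vowel(formants, f0=F0, sr=SR, dur=DUR):
        """Additive synthesis: harmonics of f0 weighted by closeness to formants."""
        t = np.arange(int(sr * dur)) / sr
        out = np.zeros_like(t)
        for k in range(1, int(sr / 2 / f0)):      # all harmonics below Nyquist
            h = k * f0
            # Amplitude: simple resonance-like bumps centered on each formant.
            amp = sum(np.exp(-0.5 * ((h - f) / 80.0) ** 2) for f in formants)
            out += amp * np.sin(2 * np.pi * h * t)
        return out / np.max(np.abs(out))

    # Same fundamental, same start/stop times, same "location" (a single channel):
    mixture = vowel(FORMANTS["ah"]) + vowel(FORMANTS["ee"])
    mixture /= np.max(np.abs(mixture))

Every primitive grouping cue in such a mixture (common fundamental, common onset and offset, common location) points toward a single stream, so any separation a listener achieves must be credited to the vowel schemas.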
Speech Scene Analysis

In addition to the grouping processes already mentioned, there are extensions and ideas that are specific to the analysis of speech signals. Note that it is often difficult to separate primitive processes from schemas, and that speech schemas tend to obscure the contributions of primitive processes.

Considering the primitive segregation rules, it is somewhat surprising that voices hold together at all. Speech consists of sequences of low-frequency complex tones (vowels) intermixed with high-frequency noise (fricatives). With a production rate of roughly 10 phonemes/sec, speech should break up into two streams of alternating high and low tones.

Listeners are able to understand and repeat a rapid sequence of speech, but are not able to report the order of short unrelated sounds (e.g., a hiss, a buzz, etc.) played in sequence, even if they are played at a much slower rate than the corresponding phonemes. Warren argues that listeners to a cycle of unrelated events have to decompose the signal into its constituent parts, recognize each part, and then construct a mental representation of the sequence. Listeners to speech do not have to go through this process: they can do a global analysis of the speech event and match it to a stored representation of the holistic pattern. After all, Warren continues, children can recognize a word while often having no idea how to break it up into its constituent phonemes ([Bre90], page 534).

Pitch Trajectory. In general, the pitch of a speaker's voice changes slowly, and it follows melodies that are part of the grammar and meaning of a particular language. Listeners use both constraints to follow a voice over time. Two interesting results were shown in shadowing experiments. First, if the target sound and the rejected sound suddenly switched ears, the subjects could not prevent their attention from following the passage (rather than the ear) that they were shadowing. The author of the original research argued that "the tracking of voices in mixtures could be governed by the meaning content of the message." Second, if only the pitch contour was switched between ears, subjects often repeated words from the rejected ear, even if the semantic content did not follow. The continuity of the pitch contour was, to some degree, controlling the subjects' attention.

Spectral Continuity. Since the vocal tract does not move instantaneously from one articulatory position to another, the formants of successive sounds tend to be continuous. These coarticulatory features provide spectral continuity within and between utterances. Continuities of the fundamental and of the formant frequencies are important in keeping the speech signal integrated into a single stream.

Pitch-based Segregation. It is harder to separate two spoken stories if they both have the same pitch [BN82]. By digitally re-synthesizing speech using LPC analysis, it is possible to hold the pitch of an utterance perfectly constant. It was found that as the fundamentals of two passages were separated in frequency, the number of errors decreased (footnote 3). It was reported that at zero semitones of separation one hears a single auditory stream of garbled but speech-like sounds; at one half semitone one very clearly hears two voices, and it is possible to switch one's attention from one to the other. Note that a fundamental of 100 Hz was used, and that half a semitone (1/24 of an octave) corresponds to a factor of only about 1.03 in frequency. In another experiment, with a fundamental pitch difference of only 2 Hz for a synthesized syllable, virtually all subjects reported that two voices were heard. At a difference of 0 Hz, only one voice was reported.

Harmonics. On a log frequency scale, speech harmonics move up and down in parallel as the pitch of an utterance changes. Harmonics that maintain such a relationship are probably perceived as related to the same sound source. There is also evidence supporting the idea that changing harmonics can be used to help "trace out" the spectral envelope of the formant frequencies of speech. Two adjacent harmonic peaks can be connected by more than one spectral envelope; however, by analyzing the movement of the peaks as the fundamental changes, it is possible to define the formant envelope unambiguously.
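The pitch separations quoted above are tiny and easy to check. The short calculation below uses the 100 Hz fundamental reported in the study; everything else is just the standard equal-tempered semitone ratio:

    # Sketch: how small the pitch separations in the segregation experiments are.
    f0 = 100.0                                   # fundamental used in the study (Hz)

    semitone = 2 ** (1 / 12)                     # one equal-tempered semitone
    half_semitone = 2 ** (1 / 24)                # half a semitone (1/24 octave)

    print(f"half-semitone factor: {half_semitone:.3f}")                  # ~1.029
    print(f"100 Hz raised by half a semitone: {f0 * half_semitone:.1f} Hz")  # ~102.9 Hz

    # On a log-frequency axis the harmonics of a voice move in parallel as f0
    # changes: every harmonic k*f0 is scaled by the same factor.
    for k in (1, 2, 3, 4):
        print(f"harmonic {k}: {k * f0:.0f} Hz -> {k * f0 * half_semitone:.1f} Hz")

A separation of roughly 3 Hz in the fundamental is thus enough for listeners to report hearing two distinct voices, which is why pitch differences of only a few hertz are perceptually useful.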
Automatically Recognizing Streams

While this paper focuses on the attributes of the cocktail party effect that can be used to enhance user interfaces that present speech information to the user, it is worth considering the recognition problem briefly. It is generally difficult to find tractable and accurate computational solutions to recognition problems that humans find simple (e.g., speech or image comprehension).

We want to understand the segregation of speech sounds from one another and from other sounds for many practical as well as theoretical reasons. For example, current computer programs that recognize human speech are seriously disrupted if other speech or nonspeech sounds are mixed with the speech that must be recognized. Some attempts have been made to use an evidence-partitioning process that is modeled on the one used by the human auditory system. Although this approach is in its infancy and has not implemented all the heuristics that have been described in the earlier chapters of this book, it has met with some limited success. ([Bre90], page 532)

In 1971, researchers at Bell Labs reported on a signal-processing system for separating a speech signal originating at a known location from a background of other sounds [MRY71]. The system used an array of four microphones and simple computational elements to achieve 3-6 dB of noise suppression. The scheme was somewhat impractical, as the source had to remain exactly centered in the microphone array; it was proposed that the talker carry an ultrasonic transmitter so that the system could track the speaker. Recent work on beam-forming, signal-seeking microphone arrays appears promising, though much of the effort is geared toward teleconferencing and auditorium environments [FBE90]. With three microphones it is possible to reject interfering speech arriving from non-preferred directions [LM87].

Bregman discusses several systems, based primarily on tracking fundamentals, for computationally separating speakers (see also [Zis90]). This scheme is somewhat impractical because not all speech sounds are voiced, and the fundamental frequency becomes difficult to track as the number of speakers increases. Weintraub found improvements in speech recognition accuracy when separating a stronger voice from a weaker one [Wei86].

Keep in mind that much of the speech segregation task performed by humans is based in part on knowledge of the transition probabilities between words in a particular context. The use of this technique is feasible for limited-domain tasks, but it is unlikely to be computationally tractable for any large domain in the near future.
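The microphone-array systems cited above ([MRY71], [FBE90], [LM87]) all build on the same basic idea: delay each microphone's signal so that sound from the preferred direction adds coherently while sound from other directions does not. The following is a minimal delay-and-sum sketch; the array geometry, spacing, sample rate, and steering angle are illustrative assumptions and do not reproduce any of the cited systems:

    # Sketch: delay-and-sum beamforming for a small linear microphone array.
    import numpy as np

    SR = 16000            # sample rate (Hz)
    C = 343.0             # speed of sound (m/s)
    SPACING = 0.05        # microphone spacing (m)

    def delay_and_sum(mic_signals, steer_deg):
        """Steer a linear array toward steer_deg (0 = broadside).

        mic_signals: array of shape (num_mics, num_samples), one row per mic.
        Returns the beamformed (averaged) signal.
        """
        num_mics, num_samples = mic_signals.shape
        out = np.zeros(num_samples)
        for m in range(num_mics):
            # Time-of-arrival difference for microphone m relative to mic 0.
            tau = m * SPACING * np.sin(np.radians(steer_deg)) / C
            shift = int(round(tau * SR))        # integer-sample approximation
            # Advance/delay the channel so the target direction lines up in time
            # (wrap-around from np.roll is ignored for brevity).
            out += np.roll(mic_signals[m], -shift)
        return out / num_mics

    # Example: four microphones, target talker at broadside (0 degrees).
    mics = np.random.randn(4, SR)               # placeholder signals
    enhanced = delay_and_sum(mics, steer_deg=0.0)

Signals arriving from the steered direction add in phase; for uncorrelated noise, averaging N channels can yield up to 10*log10(N) dB of improvement, about 6 dB for four microphones, which is consistent with the 3-6 dB figure reported for the Bell Labs system.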
Stream Segregation Synthesis

There has been a recent surge of work in the area of real-time three-dimensional auditory display systems [Coh90]. This activity has been partially motivated by the availability of inexpensive digital signal processing hardware and by the great interest in "virtual environments" and teleoperator systems. A contributing factor has also been advances in the understanding of human spatial hearing and in the computational ability to synthesize head-related transfer functions (HRTFs: directionally sensitive models of the head, body, and pinna transfer functions) [Bla83].

These systems usually rely on stereo headphones and synthesize sounds that are localized outside of the head. The fundamental idea behind these binaural simulators is that, in addition to creating realistic cues such as reflections and amplitude differences, a computational model of the person-specific HRTF simulates an audio world [WWF88]. Multiple sound sources, for example, can be placed at virtual locations, allowing a user to move within a simulated acoustical environment. The user can translate, rotate, or tilt their head and receive the same auditory cues as if a physical sound source were present.

These systems provide a compelling and realistic experience and may be the basis for a new generation of advanced interfaces. Current research focuses on improving system latency, on reducing the time required to create user-specific HRTF models, and on the modeling of room acoustics.

A different approach to the synthesis of auditory streams has been developed by the Integrated Media Architecture Laboratory at Bellcore in the context of a multiperson multimedia teleconferencing system [LPC90]. This "audio windowing" system primarily uses off-the-shelf music processing equipment to synthesize, or enhance, many of the primitive segregation features mentioned in previous sections. Filters, pitch shifters, harmonic enhancers, distortions, reverberations, echoes, and so on were used to create "peer" and "hierarchical" relationships among several spoken channels. While the use of these "rock-n-roll" effects may seem extreme, a recent description of the work discusses the use of "just noticeable" effects that are barely over the edge of perceptibility [CL91]. Similar effects are used for "highlighting" a piece of audio to draw one's attention to it. Unfortunately, the combination of auditory effects needed to generate these relations appears to have been chosen in a somewhat ad hoc manner, and no formal perceptual studies were performed. The work is important, however, in that it has begun to stimulate awareness in the telecommunications and research communities regarding the feasibility of simultaneously presenting multiple streams of speech in a structured manner.
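To make the binaural-placement idea described above concrete, a very simplified sketch follows. Real binaural simulators convolve each source with measured, person-specific HRTFs; the version below substitutes only the two simplest spatial cues, an interaural time difference and an interaural level difference, so the positions, delays, and gains are illustrative assumptions rather than any published system's method:

    # Sketch: crude binaural placement of two speech streams at different
    # azimuths using only interaural time and level differences (a stand-in
    # for true HRTF convolution).
    import numpy as np

    SR = 16000
    HEAD_RADIUS = 0.09          # meters; used for a rough ITD estimate
    C = 343.0                   # speed of sound (m/s)

    def place(source, azimuth_deg):
        """Return a (2, N) stereo signal with the source panned to azimuth_deg.

        Positive azimuth is to the right; 0 is straight ahead.
        """
        az = np.radians(azimuth_deg)
        itd = HEAD_RADIUS * (abs(az) + abs(np.sin(az))) / C   # Woodworth-style ITD
        delay = int(round(itd * SR))                          # interaural delay (samples)
        gain_far = 10 ** (-6 * abs(np.sin(az)) / 20)          # up to ~6 dB level cue

        near = source
        far = np.concatenate([np.zeros(delay), gain_far * source])[: len(source)]
        # Rows are (left, right); for a source on the right, the right ear is near.
        return np.vstack([far, near]) if azimuth_deg > 0 else np.vstack([near, far])

    # Two competing speech streams placed at +/- 45 degrees (placeholder signals).
    talker_a = np.random.randn(SR)
    talker_b = np.random.randn(SR)
    stereo = place(talker_a, +45) + place(talker_b, -45)

Even these two cues are often enough to let a listener attend to one talker and ignore the other; full HRTF filtering adds the spectral (pinna) cues needed for convincing externalization and elevation.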