Caltech Music Lab

Our Audio Experiment
Gerald Jay Sussman, MIT
Keith Winstein, MIT
James Boyk, Caltech
February 12, 2002

Copyright © GJS, KW, JB 2002

Although modern audio circuitry achieves outstanding performance as defined by easily measured electrical criteria, experienced listeners often report that designs with much poorer measured specifications yield more "musically accurate" sound; or that two designs with almost identical "specs" nevertheless sound different. This anomaly is disturbing to those of us who believe that accurate reproduction of sound must be correlated with accurate reproduction of waveforms. At the same time, we recognize that the mapping of waveform error to perceptual degradation is not understood; so the "anomaly" may be more apparent than real.

Our ignorance of the physical correlates of auditory perception is profound. For example, in every human language, most of the meaning is conveyed in the consonants, but most of the acoustical energy is concentrated in the vowels. This is a suspicious feature to a communications engineer. One would expect that the energy would be concentrated in the part of the signal with the highest information content. We seem to be extremely sensitive to subtle sounds. This is apparent, for instance, in the fact that a human can walk slowly around a darkened room without fear of walking into a wall, because in the region near a wall, a room's background noise has a characteristic sound!

The purpose of our study is to find measurements that correlate with specific perceptual cues, and ways that these measurements are determined by circuit structures or other aspects of design. This knowledge should make possible better recording and playback. We hope that it will also yield important data about the nature of auditory perception.

Our approach is to compare two specially-prepared microphone preamplifiers--both technically excellent, but very different in circuitry--using conventional electrical criteria and also listening tests. We record identical microphone signals, obtained by splitting the output of a single microphone to feed both preamplifiers at once. We record the outputs of the two preamplifiers as two channels using a high-bit digital recorder. The recordings are then compared by careful, blind listening, and are annotated to describe the characteristic features of their sounds. They are also examined using computational tools, to determine the analytic differences between them that might cause the differing perceptions. We then prepare new synthetic recordings in which the sound of each preamplifier is modified, by incorporating specific differences found through computational analysis, in an attempt to make it sound like the other unit. In this way we hope to discover which differences matter to perception and, if possible, correlate particular differences with particular aspects of perception.

In more detail:

We take two mike preamplifiers of different design, both of which have good technical performance. We make two mono recordings by feeding both preamplifiers simultaneously from one microphone. We record these signals in 24-bit format with a 96kHz sampling frequency. One or more experienced listeners audition the recordings and verbally characterize the sonic differences. Notes are made of these comments. It is essential that the recordings--call them P and Q--be reliably distinguishable by the listeners. If this is not the case, either new recordings must be made with better equipment (such as better analog-digital converters); or better playback equipment must be obtained so that the existing recordings can be heard in more detail.
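The requirement that P and Q be reliably distinguishable can be made quantitative with a forced-choice listening protocol such as ABX. A minimal sketch of the statistics, assuming only a tally of correct identifications (the ABX framing is our illustration, not a procedure specified above):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial test: probability of getting at least
    `correct` right answers out of `trials` if the listener were
    merely guessing (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# Example: a listener identifies X correctly on 14 of 16 trials.
p = abx_p_value(14, 16)
# A small p (conventionally below 0.05) suggests the recordings
# are reliably distinguishable by this listener.
```

If the p-value stays large for all listeners, that is the signal (per the paragraph above) to improve the converters or the playback chain before proceeding.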

We then study the digital recordings by computer and characterize the differences between them analytically in terms of specific factors such as frequency response. We pay special attention to differences noted by the listeners, whether these apply overall or at specific spots. In this part of the work, the test of whether all relevant differences have been found is whether everything has been accounted for down to the level of the background noise.
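One such analytic characterization, the frequency response relating the two channels, can be estimated from their cross- and auto-spectra. A sketch using scipy; the synthetic stand-in signals, the window length, and the particular filter imposed on Q are illustrative assumptions, not the lab's actual data or procedure:

```python
import numpy as np
from scipy import signal

fs = 96_000  # sampling rate of the 24-bit / 96 kHz recordings

# Stand-ins for the two preamp channels: here Q differs from P by a
# known gain and a gentle first-order roll-off (purely illustrative).
rng = np.random.default_rng(0)
p = rng.standard_normal(fs)                 # one second of channel P
b, a = signal.butter(1, 20_000, fs=fs)      # mild low-pass at 20 kHz
q = 0.9 * signal.lfilter(b, a, p)           # channel Q

# H(f) = S_pq(f) / S_pp(f): the frequency response mapping P to Q.
f, s_pq = signal.csd(p, q, fs=fs, nperseg=4096)
_, s_pp = signal.welch(p, fs=fs, nperseg=4096)
h = s_pq / s_pp

# Magnitude in dB; level and roll-off differences between the
# preamps show up directly here.
mag_db = 20 * np.log10(np.abs(h))
```

The same cross-spectral estimate applied to the real P and Q recordings would expose overall level, frequency-response, and phase differences; localized differences at "specific spots" would need short-time analysis instead.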

We then synthesize various new recordings from the originals. One is a recording of the difference between the original P and Q recordings (appropriately time- and level-shifted to make them as close as possible). If P and Q were identical the difference would always be zero and the difference-recording would be totally silent. What one hears instead of silence is interesting! Last June, for instance, we recorded clarinetist Margaret Thornhill; and even the difference recording sounded like a clarinet. This implies that the redundant "clarinet-identifying information" in P, or the deficiency in such information in Q, or the two together, added up to enough information for the clarinet to be identified in the difference recording. As Q itself did not sound obviously *unlike* a clarinet, presumably P had a lot of redundancy. (On the other hand, a listener familiar with Ms. Thornhill's playing could not tell from the difference recording that she was the performer.)
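The time- and level-shifted difference described above can be sketched as follows. Integer-sample alignment and a single scalar gain are simplifying assumptions; a real alignment might require fractional-sample delay and frequency-dependent correction:

```python
import numpy as np

def aligned_difference(p: np.ndarray, q: np.ndarray, max_lag: int = 200):
    """Time- and level-align q to p (integer-sample lag, one scalar
    gain), then return the difference p - g * shifted(q)."""
    def shifted(x, lag):
        return np.roll(x, lag)  # circular shift; adequate for a sketch

    # Integer lag that maximizes the cross-correlation with p.
    best = max(range(-max_lag, max_lag + 1),
               key=lambda L: np.dot(p, shifted(q, L)))
    qs = shifted(q, best)
    # Least-squares gain g minimizing ||p - g * qs||.
    g = np.dot(p, qs) / np.dot(qs, qs)
    return p - g * qs, best, g
```

If P and Q were identical up to a delay and a gain, the returned difference would be all zeros; whatever remains is the "difference recording" described above, in which the clarinet was still recognizable.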

We also create two more new recordings where we use the analytic differences in an attempt to make each preamplifier sound like the other: We apply the individually-characterized differences to the P recording to make it sound like Q; and apply the opposite differences to Q to make it sound like P. This gives four recordings:

          P
          Q
          P masquerading as Q
          Q masquerading as P
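One way a masquerade might be synthesized: fit a short filter mapping one recording onto the other, then apply it. The least-squares FIR fit below is our illustration of the idea, not necessarily the method used; the tap count is an assumption:

```python
import numpy as np

def fit_fir(p: np.ndarray, q: np.ndarray, taps: int = 64) -> np.ndarray:
    """Least-squares FIR filter h such that convolve(p, h) ~ q,
    i.e. the 'analytic difference' turning P into Q."""
    n = len(p)
    # Columns are p delayed by 0, 1, ..., taps-1 samples.
    cols = [np.concatenate([np.zeros(k), p[:n - k]]) for k in range(taps)]
    A = np.stack(cols, axis=1)
    h, *_ = np.linalg.lstsq(A, q, rcond=None)
    return h

def masquerade(p: np.ndarray, h: np.ndarray) -> np.ndarray:
    """P masquerading as Q: apply the fitted difference filter to P."""
    return np.convolve(p, h)[:len(p)]
```

Applying `fit_fir(q, p)` in the opposite direction would give Q masquerading as P; nonlinear or noise-like differences between the preamps would, by construction, not be captured by such a linear filter.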

One or more experienced listeners audition all four to see if they can tell the originals from the masquerades. If P can be told from Q-masquerading-as-P, and Q can be told from P-masquerading-as-Q, that would mean that the masquerades are not good enough. In other words, something not identified in the analysis matters to the sound. Since the analysis was carried out down to the level of background noise, this would mean that perception is affected by something down in the noise. (There's nothing inherently impossible about this.)

If, on the other hand, the masquerades cannot be distinguished from the originals, it means that either the analytic differences do account fully for the subjective difference in sound or the resolution of the playback system isn't adequate for listeners to compare the recordings in enough detail. (Imagine that equipment of outright bad quality were used to play the recordings. Listeners might indeed be unable to tell them apart, but this would not mean they were identical. It would only mean that better playback equipment was needed.)

We also make two other recordings:

          P minus (Q-masquerading-as-P)
          Q minus (P-masquerading-as-Q)

If removing the analytically-characterized differences from the difference signal--which is what these recordings do--leaves only background noise, then in one sense everything's been accounted for; but the crucial problem remains of correlating those differences with the differences perceived by listeners and described verbally. When three listeners agree that one recording of a soprano has "more tonal depth" than the other, or is "rounder"--as happened in our most recent sessions--what does this mean in terms of measurement? Perhaps another "basis set" of measurements is needed, one which will equally account for the analytic difference while correlating better with perception.
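The criterion "accounted for down to the level of the background noise" can be made concrete by comparing the residual's RMS level with the noise floor measured in a silent stretch of the recording. A minimal sketch; the 3 dB margin is an arbitrary assumption, and where the silent stretch lies must be chosen by hand:

```python
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """RMS level in dB relative to full scale (samples in [-1, 1])."""
    return 20 * np.log10(np.sqrt(np.mean(x**2)))

def fully_accounted_for(residual: np.ndarray, noise: np.ndarray,
                        margin_db: float = 3.0) -> bool:
    """True if the residual, e.g. P minus (Q-masquerading-as-P), sits
    within `margin_db` of the noise floor, i.e. the characterized
    differences explain everything above the noise."""
    return rms_db(residual) <= rms_db(noise) + margin_db
```

Note that passing this test settles only the analytic bookkeeping; as the paragraph above says, it leaves open whether the characterized differences are the ones listeners actually hear.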

That we're comparing mike preamplifiers is almost accidental. Other components could be compared as easily. The real subject is correlating measurement with perception.