Timbre Personalization for Earphones

2025.11.11


At final, research aimed at achieving a natural timbre led to the development of a proprietary technology called Timbre Personalization. Once this technology matured to the point of practical implementation, it was first applied to the ZE8000 earphones and offered as the JDH (Jibun Dummy Head) service.

Although ZE8000 JDH received highly favorable evaluations, the service was available to only a limited number of customers. We therefore continued development with the aim of broader accessibility, which ultimately led to the integration of this technology into TONALITE, our new-generation true wireless stereo (TWS) earphones.

In this R&D article, we explain the core technology underlying this approach—Timbre Personalization for Earphones—and outline the technical framework that supports its realization.

Natural Listening

Figure 1 : Listening to a live violin performance

As illustrated in Figure 1, if we assume that the acoustic wave radiated into space by a live violin performance reaches every listener under physically identical conditions—that is, each listener receives the same sound waveform—then all listeners are, in essence, hearing the same sound. We define this condition as natural listening.

In this state, how each individual perceives the sound—whether one finds it pleasant or not—belongs to the domain of personal preference, which lies beyond the scope of this discussion. In this paper, we strictly distinguish natural listening, as a physical and perceptual condition, from subjective impression, which depends on individual preference. This distinction is crucial for understanding Timbre Personalization for Earphones.

Figure 2 : Listening to two-channel stereophonic playback

Similarly, as shown in Figure 2, if we construct a two-channel stereophonic playback system using a pair of loudspeakers, and if each listener sits—one by one—at the same optimal listening position with identical head and ear alignment, then a natural listening condition can again be assumed for all listeners.

However, even under such conditions, the physical characteristics of the sound waves reaching each listener's eardrums differ depending on the individual's body geometry, including the head, torso, and pinnae. Despite these differences, natural listening is still achieved in the situations illustrated in Figures 1 and 2.

This phenomenon can be explained by the model of the auditory system under natural conditions, as shown in Figure 3, initially presented by Guenther Theile [1]. The physical modification of the incoming sound wave by the outer ear is effectively compensated after directional auditory processing—that is, it is deconvolved through an inverse-filtering mechanism within the auditory system. Consequently, the outer ear's influence on arriving sound waves does not affect the Gestalt of auditory events such as timbre perception [1].
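
The inverse-filtering idea can be illustrated numerically. The following is only an analogy, not Theile's model and not any processing used by final: it assumes the outer ear behaves like a short FIR filter (the coefficients in `ear_ir` are made up and chosen so that the frequency response has no zeros) and shows that dividing by that filter's frequency response recovers the original signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source signal and a made-up "outer ear" impulse response.
source = rng.standard_normal(256)
ear_ir = np.array([1.0, 0.5, 0.25, 0.1])

# Sound arriving at the eardrum: the source convolved with the outer-ear filter.
at_eardrum = np.convolve(source, ear_ir)

# Inverse filtering (deconvolution) in the frequency domain.
n_fft = len(at_eardrum)
ear_tf = np.fft.rfft(ear_ir, n_fft)
recovered = np.fft.irfft(np.fft.rfft(at_eardrum, n_fft) / ear_tf, n_fft)[: len(source)]

# Largest sample-wise deviation between the recovered and original signal.
max_error = np.max(np.abs(recovered - source))
```

In this toy case the outer-ear coloration is removed almost exactly, which mirrors the claim that the outer ear's influence need not alter the perceived timbre.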

Figure 3 : Operation of the auditory system under natural conditions [1]

Now, if we listen through earphones to the same two-channel stereophonic audio signals that would normally be fed to the loudspeakers in Figure 2, can we still say that natural listening is achieved? The answer is no.

The Target Curve of an earphone can be understood as an engineering attempt to move this "no" closer to "yes." Furthermore, Timbre Personalization for Earphones aims ultimately to turn that "no" into a definitive "yes."

Target Curve

The Target Curve is the desired amplitude–frequency response specified during earphone design. It is also sometimes called the Target Response Curve, but in this article, we consistently use Target Curve.

The target curve serves as an engineering means to realize, as closely as possible, the condition of natural listening described earlier, and it is widely employed in the acoustic design of earphones.

Extensive research on target curves has produced several well-known reference curves. In this section, we introduce one of the fundamental concepts used to derive target curves—the concept of Insertion Gain—as well as two widely used reference models: the Free-Field Target Curve and the Diffuse-Field Target Curve.

Figure 4 : Concept of Insertion Gain [2]

Figure 4 illustrates the concept of Insertion Gain [2].

• Curve A represents the free-field measurement of an ideal loudspeaker having a flat amplitude response across frequency.
• Curve B shows the response measured at the eardrum position of a subject when the same loudspeaker is placed in front of the listener and emits the test signal.
• Curve C represents the response measured at the eardrum position when the same subject wears a given earphone that reproduces the identical test signal.
• Curve D, obtained by subtracting B from C, is called the Insertion Gain.

If the insertion gain (D) is flat across frequency, the sound reproduced by the earphone will produce at the eardrum the same physical waveform as the sound from the ideal loudspeaker. In other words, the Insertion-Gain Concept assumes that by designing an earphone whose insertion gain is constant with frequency, one can reproduce through earphones the same perceived sound as that produced by an ideal free-field loudspeaker.

In Figure 4, curve D is flat between 200 Hz and 8 kHz; therefore, the earphone used in this measurement can be considered to have a target curve that conforms to the insertion-gain concept, with curve C representing that target.
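
The relationship D = C - B and the flatness check over 200 Hz to 8 kHz can be sketched as follows. This is a minimal illustration, assuming both curves are given in dB on a shared frequency grid; the sample values, the function names, and the 1 dB tolerance are made up for the example and are not taken from Figure 4.

```python
import numpy as np

def insertion_gain_db(earphone_db, open_ear_db):
    """Curve D = Curve C - Curve B, both in dB on the same frequency grid."""
    return np.asarray(earphone_db, dtype=float) - np.asarray(open_ear_db, dtype=float)

def is_flat(freqs_hz, gain_db, f_lo=200.0, f_hi=8000.0, tol_db=1.0):
    """True if the gain stays within +/- tol_db of its mean inside [f_lo, f_hi]."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    band = (freqs_hz >= f_lo) & (freqs_hz <= f_hi)
    in_band = np.asarray(gain_db, dtype=float)[band]
    return bool(np.all(np.abs(in_band - in_band.mean()) <= tol_db))

# Made-up example data (dB), not measurements.
freqs = np.array([100, 200, 500, 1000, 3000, 8000, 12000], dtype=float)
curve_b = np.array([0.0, 0.5, 2.0, 3.0, 15.0, 5.0, -2.0])    # open ear, loudspeaker in front
curve_c = np.array([-4.0, 0.6, 2.2, 2.9, 15.3, 4.8, -9.0])   # same subject wearing the earphone
curve_d = insertion_gain_db(curve_c, curve_b)
flat_in_band = is_flat(freqs, curve_d)
```

With these sample values the insertion gain is flat inside the 200 Hz to 8 kHz band but not over the full range, which is the situation described for Figure 4.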

To implement the insertion-gain concept in an actual product, one must define the acoustic environment in which such measurements are conducted. The two most common reference environments are the free-field and diffuse-field environments. In practice, an anechoic chamber is typically used to approximate a free field, while a reverberation chamber serves as a model for a diffuse field.

However, it is essential to note that the actual acoustic characteristics of laboratory anechoic and reverberant rooms do not perfectly match the ideal definitions of free or diffuse fields. Figures 5 and 6 show examples of the Free-Field Target Curve and the Diffuse-Field Target Curve, respectively.

Figure 5 : Free-Field Target Curve

Figure 6 : Diffuse-Field Target Curve
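
As a rough illustration of how the two reference curves differ in principle, the sketch below uses the textbook definitions: a free-field response taken from a single frontal direction, and a diffuse-field response taken as the power average over all directions of incidence. It assumes a set of eardrum-referenced HRTF magnitudes on a direction grid; the function names and the uniform default weighting are assumptions, and this is not the procedure used to obtain Figures 5 and 6.

```python
import numpy as np

def diffuse_field_response_db(hrtf_mag, weights=None):
    """Power-average HRTF magnitudes over directions and return dB.

    hrtf_mag: array of shape (n_directions, n_freqs), linear magnitude.
    weights:  optional per-direction weights (e.g. solid angle); uniform if None.
    """
    hrtf_mag = np.asarray(hrtf_mag, dtype=float)
    if weights is None:
        weights = np.ones(hrtf_mag.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mean_power = weights @ (hrtf_mag ** 2)
    return 10.0 * np.log10(mean_power)

def free_field_response_db(hrtf_mag, frontal_index=0):
    """Response for one direction of incidence (assumed frontal), in dB."""
    return 20.0 * np.log10(np.asarray(hrtf_mag, dtype=float)[frontal_index])
```

On a real measurement grid the directions are rarely distributed uniformly over the sphere, so solid-angle weights would normally be passed instead of relying on the uniform default.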

General Target Curve and Timbre-Personalized Target Curve

Here, we define the target curves described above as General Target Curves. Designing earphone responses based on the Insertion-Gain Concept has been a standard practice among many manufacturers, both historically and today.

Various new target curves have also been proposed, reflecting different research approaches and psychoacoustic perspectives, and some have been adopted in commercial products [3]. A detailed discussion of these variations is beyond the scope of this article.

A General Target Curve represents a single, universal response applied identically to all units of a given earphone model. In contrast, the target curve derived through the method described in this paper—Timbre Personalization for Earphones—is individualized: even for the same earphone model, each user obtains a distinct target curve optimized for their own acoustic and morphological characteristics.

Therefore, we refer to this individualized response as the Timbre-Personalized Target Curve.

What Personalization Means

Personalization is the process of adapting the physical characteristics of devices such as earphones to each individual's anatomical features.

Differences in body shape cause variations in the physical influence exerted on incoming sound waves. This phenomenon is often described in terms of differences in the Head-Related Transfer Function (HRTF).

Figure 7 shows examples of measured HRTFs from several final employees, clearly demonstrating how the physical impact on incoming sound waves varies according to individual morphology.

Figure 7 : Differences in HRTFs among individuals

In recent years, home-theater environments have become increasingly popular, allowing users to enjoy content produced using techniques designed to recreate sound from three-dimensional directions—often referred to as spatial audio, 3D audio, or immersive audio. Such content is typically mixed for playback through multichannel loudspeaker configurations distributed in three-dimensional space.

By applying binaural rendering, the same multichannel signals can be converted into two-channel binaural signals, enabling listeners to experience three-dimensional sound fields even through earphones. In this paper, we refer to sound reproduced from such two-channel binaural signals as immersive binaural sound.
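
In its simplest form, binaural rendering filters each loudspeaker feed with a pair of head-related impulse responses (HRIRs) and sums the results per ear. The sketch below shows only that basic operation, assuming time-invariant HRIRs of equal length; the function name and the toy data are hypothetical, and commercial renderers typically add further processing.

```python
import numpy as np

def binaural_render(channels, hrirs_left, hrirs_right):
    """Mix multichannel loudspeaker feeds down to two binaural signals.

    channels:    (n_channels, n_samples) loudspeaker feeds
    hrirs_left:  (n_channels, n_taps) impulse responses, loudspeaker to left ear
    hrirs_right: (n_channels, n_taps) impulse responses, loudspeaker to right ear
    """
    channels = np.asarray(channels, dtype=float)
    n_out = channels.shape[1] + np.shape(hrirs_left)[1] - 1
    left = np.zeros(n_out)
    right = np.zeros(n_out)
    for feed, h_l, h_r in zip(channels, hrirs_left, hrirs_right):
        left += np.convolve(feed, h_l)    # contribution of this loudspeaker to the left ear
        right += np.convolve(feed, h_r)   # contribution of this loudspeaker to the right ear
    return np.stack([left, right])

# Toy data: two loudspeakers, each ear hears "its" side louder.
feeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
h_left = [[1.0, 0.5], [0.3, 0.0]]
h_right = [[0.3, 0.0], [1.0, 0.5]]
binaural = binaural_render(feeds, h_left, h_right)
```

Whether the result sounds convincing depends on how well the HRIRs match the listener, which is where the personalization discussed in this section comes in.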

To ensure that listeners perceive the spatial impression intended by the content creator, personalization techniques have been studied and implemented in various commercial products. We refer to this category of personalization as spatial-impression personalization.

However, the focus of this article—titled Timbre Personalization for Earphones—is not on spatial-impression personalization. Instead, it explains personalization in the timbral domain, that is, adapting the timbral characteristics of reproduced sound to individual listeners.

The Importance of Timbre

As described above, recent advances in spatial-impression personalization have greatly improved the accuracy of spatial audio reproduction in immersive binaural listening. Numerous studies have demonstrated that using an individual's HRTF enhances the precision of sound-image localization, and several commercial products have already adopted such approaches.

However, most of this research and product development has focused primarily on spatial attributes, while the aspect that is fundamentally more crucial for the reproduction of music and general content—the naturalness of timbre—has received far less attention. If timbral fidelity is lacking, even excellent spatial imaging or an expansive soundstage cannot fully convey the music's beauty and emotional impact.

Several studies have shown that, in the appreciation of audio content, natural timbre is often more important than spatial impression. For example, Figure 8 shows the number of verbal descriptions elicited during a VR (Virtual Reality) playback experiment; the term timbre appeared most frequently, followed by expressions related to the sense of space [4].

Figure 8 : Number of verbal descriptions elicited during VR playback experiments [4]

Traditionally, improving timbral reproduction has relied mainly on equalization, guided by user preferences rather than by the listener's auditory mechanisms. Guenther Theile pointed out that, under natural listening conditions, spatial cues and timbral cues are processed separately, and that timbre, in particular, is based on a Gestalt-oriented perceptual process [1]. From this perspective, timbre reproduction should be regarded as an independent and essential issue, distinct from spatial reproduction.

What Is Timbre?

In general, timbre is defined as the auditory impression of a sound excluding its loudness and pitch. In the context of music, if we remove dynamics (changes in loudness), melody, harmony, and rhythm, what remains is timbre. From a notational perspective, timbre can also be understood as instrumentation—the choice of instruments used to perform the music written in the score.

A composer determines which instruments will perform the written melodies, harmonies, rhythms, and dynamics, thereby shaping the music's timbre.

The music we hear through earphones results from this entire process: the composition created by the composer, performed by musicians on instruments, and converted into audio signals by recording technology. The challenge, therefore, lies in whether the timbre encoded in the recorded signal can be faithfully reproduced through earphones.

When listening to live instrumental performances, we rarely evaluate what we hear in terms of "audio quality" or "timbre character"; we simply engage with the music itself. By contrast, when listening through earphones, listeners often begin evaluating timbre or sound quality the moment they hear the sound—rather than immersing themselves in the music.

Why does this happen? One likely reason is that timbre is not being reproduced correctly, resulting in an unnatural sound. In such cases, timbral inconsistencies—or auditory artifacts—draw attention away from the music, preventing immediate engagement with the artistic content.

The goal of Timbre Personalization is to eliminate all such timbral issues or artifacts, restoring the naturalness of reproduced sound. When listening to music through earphones equipped with timbre personalization, evaluations of "sound quality" from an audiophile perspective become unnecessary—the listener can instead focus directly on the music itself.

Although we have discussed the importance of timbre, it remains a complex and elusive concept. Numerous professionals shape the music content we hear through earphones—each contributing specialized expertise in sound.

To explore how these experts understand timbre, final LAB conducted a special interview series featuring a violin maker, a composer, a violinist, and a recording engineer—professionals who engage with sound at the highest level. We encourage you to read those interviews together with this article.

Timbre Personalization for Earphones

At the outset of our research and development on timbre personalization for earphones, we deliberately restricted the listening target to two-channel stereophonic content—recordings produced in the environment shown in Figure 2, where two loudspeakers are symmetrically positioned in front of the listener. This decision was made for two reasons: first, most music content currently enjoyed through earphones is produced in this format; and second, timbre is the most critical element in stereophonic reproduction.

To realize timbre personalization, the first step is the precise measurement of the listener's body geometry and a detailed analysis of how this geometry affects incoming sound waves. Traditionally, this has been done by placing a subject in an anechoic chamber and measuring the physical effects of body morphology on sound waves arriving from all directions. In recent years, advances in measurement and computational acoustics have enabled these influences to be virtually calculated using high-resolution 3D body scans and acoustic simulations based on the resulting geometry.

For timbre personalization, it is essential to employ an auditory model capable of deriving parameters that achieve a natural timbre from the physical quantities describing the interaction between sound waves and body shape. An auditory model is a mathematical representation of how humans perceive auditory information from acoustic signals reaching the eardrum.

Numerous studies have been conducted on auditory modeling, gradually revealing the mechanisms of human auditory perception. However, these are general-purpose auditory models, intended to describe how the auditory system perceives sounds in the natural environment. They are not explicitly designed for earphone reproduction, and achieving a complete model remains a long-term challenge.

Therefore, to realize timbre personalization for earphones in the near term, we developed a dedicated auditory model optimized exclusively for natural listening through earphones, rather than relying on general frameworks.

During the development of this specialized model, we analyzed today's entertainment listening environment—where a wide variety of content is experienced through earphones—and identified two primary perceptual domains: spatial impression and timbre perception. Our auditory model was implemented as a dual-domain framework that separates physical information derived from body-shape influence into timbre-related and spatial-impression-related components, allowing each to yield parameters suited to its respective purpose.

As described earlier, our initial goal has been to faithfully reproduce the timbre intended by the content creator in two-channel stereo. We define this faithful reproduction of the creator's intended timbre as the realization of natural timbre.

Natural Timbre

Let us now revisit what we mean by "natural timbre" in the context of earphone listening. Because this concept is somewhat abstract, it may be easier to understand through an analogy with color perception.

Imagine drawing a picture with a black pen on black paper—it would be nearly impossible to perceive the artist's intended expression. In contrast, drawing the same picture with a black pen on pure white paper allows the viewer to recognize the artist's intent effortlessly. If the artwork were drawn using multiple colors, the difference between viewing it on black paper and on white paper would be even more striking.

Even in less extreme cases, such as gray, reddish, or greenish paper, the perceived visual impression, particularly color perception, would differ from that of the same image on white paper. In this analogy, Timbre Personalization for Earphones can be understood as the process of turning the listening "canvas" into a perfectly white canvas. This condition corresponds to the perceptual state in which the listener recognizes natural timbre.

When this state is achieved, the listener can experience the creator's intended sonic expression exactly as it was embedded in the content, just as one can perceive an artist's intention clearly when viewing a picture on white paper.

Achieving this state through conventional means—such as equalizers or generic target curves—is not theoretically impossible, but in practice exceedingly difficult, if not unattainable. Because the acoustic influence of individual anatomy varies from person to person, only the listener themselves can truly "create" their own natural timbre.

Traditional generic target curves are designed by engineers who assume an average auditory perception across many people and attempt to produce a response that most listeners would perceive as natural. While somewhat effective, this approach inevitably generalizes individual differences and cannot guarantee natural timbre for every listener.

Could a highly skilled designer achieve natural timbre by ear? The answer is no. Designing via target curves or equalization inherently requires listening tests, but such tests rely on content already imbued with specific sonic intentions. Even when using physically defined signals such as pink noise or white noise, there is no objective way to verify whether the designed result truly corresponds to a "white canvas."

Therefore, a completely neutral "white canvas," or natural timbre, cannot be achieved by ear, even by individuals with exceptional auditory training or extensive expertise.

To overcome this limitation, we developed a method to realize natural timbre mathematically, without relying on human auditory judgment. This approach employs our dedicated auditory model, designed to reproduce natural timbre purely computationally. This technology is what we call Timbre Personalization for Earphones.

Comparison Between General Target Curves and Timbre-Personalized Target Curves

A comparative subjective evaluation was conducted using the Semantic Differential (SD) method to assess the timbre of earphones designed with general target curves versus those employing timbre personalization.

Two representative general target curves widely used in commercial earphone products, namely the Diffuse-Field Target Curve (DFTC) and the Listener-Preferred Target Curve (LPTC), were compared against the Personalized Timbre Target Curve (PTTC), that is, the Timbre-Personalized Target Curve derived from the proposed timbre-personalization method.

The results, summarized in Table 1 and Figure 9, show that the PTTC achieved significantly higher ratings across most of the SD evaluation terms. This indicates that the timbre obtained through timbre personalization for earphones was consistently perceived as more natural and favorable compared with that produced by conventional general target curves [5].

Table 1 : Evaluation terms used in the SD-method subjective assessment [5]

Figure 9 : Mean rating scores and 95% confidence intervals for each target curve based on the SD-method evaluation [5]
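
For readers unfamiliar with how such plots are produced, a mean rating and its 95% confidence interval can be computed along the following lines. This is a generic sketch, not the analysis from [5]: the ratings are made up, the function name is hypothetical, and the interval uses a normal approximation (a t-quantile would give a somewhat wider interval for small listener panels).

```python
import numpy as np
from statistics import NormalDist

def mean_and_ci(ratings, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    # Standard error of the mean from the sample standard deviation.
    sem = ratings.std(ddof=1) / np.sqrt(ratings.size)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, mean - z * sem, mean + z * sem

# Made-up ratings for one evaluation term and one target curve.
ratings = [5, 6, 4, 6, 5, 7, 5, 6]
mean, lower, upper = mean_and_ci(ratings)
```

When the intervals of two target curves do not overlap for a given evaluation term, the difference in mean rating is unlikely to be due to chance alone.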

Timbre Personalization for Earphones in TONALITE

The first product to incorporate Timbre Personalization for Earphones was the ZE8000. In the ZE8000, this technology was introduced under the more accessible name JDH (Jibun Dummy Head). The next product to adopt this technology is TONALITE, which implements the timbre-personalization function within a framework called DTAS (Digital Twin Audio Simulation).

The process begins with scanning the listener's body geometry and performing acoustic measurements. In ZE8000 JDH, users had to visit final's facility to undergo precise 3D scanning and acoustic measurement. In TONALITE, however, these steps can be performed independently by the user using a smartphone.

To enable this, we developed new technologies that extract the essential morphological information required for timbre personalization directly from smartphone images, along with an original smartphone-and-earphone-based acoustic measurement method. The body-shape data captured via smartphone and the acoustic measurement data obtained from the earphones are transmitted over the internet to a dedicated server. On this server, final's proprietary acoustic simulation technology processes the data to extract the information necessary for timbre personalization.

Next, using the extracted information, final's proprietary auditory model—implemented on the same server—performs mathematical processing to calculate the parameters required to realize timbre personalization in TONALITE. In ZE8000, engineers had to perform these calculations manually on a workstation. For TONALITE, however, we developed a fully automated system that computes these parameters server-side using the auditory model, eliminating the need for manual operation.

Once calculated, the timbre-personalization parameters are transmitted from the server to the TONALITE earphones, where they are stored and applied. At this point, the timbre-personalization process for TONALITE is complete, allowing users to enjoy music and other audio content with natural timbre. Moreover, the new technologies developed for this product enable users to experience natural timbre more easily and seamlessly than ever before.

Timbre Personalization and Immersive Binaural Sound

As described earlier, the timbre personalization implemented in TONALITE is designed primarily to reproduce the natural timbre of two-channel stereo content. One might then ask: How does it affect immersive binaural sound?

The auditory model used in timbre personalization separates the physical information derived from body-shape influence on incoming sound waves into two components: timbre-related and spatial-impression-related. These two categories are processed independently within the model to derive the appropriate parameters for each perceptual domain.

Consequently, the timbre personalization implemented in TONALITE operates independently of spatial impression, focusing solely on the timbral domain. As a result, it has minimal impact on spatial perception. When listening to immersive binaural sound, the spatial impression remains intact, while timbre reproduction is improved—allowing the listener to enjoy a more natural timbre within the preserved three-dimensional sound field.

Future Development

Research and development on timbre personalization have already entered the next phase, aiming to advance beyond the current system. We plan to introduce these new developments once they reach the stage of practical implementation.

References

[1] G. Theile, “On the Standardization of the Frequency Response of High-Quality Studio Headphones,” J. Audio Eng. Soc., vol. 34, no. 12, pp. 956–969 (1986).
[2] C. J. Struck, “Free Plus Diffuse Sound Field Target Earphone Response Derived From Classical Room Acoustics Theory,” AES Convention Paper 8993, New York, USA (2013).
[3] S. E. Olive, T. Welti, and E. McMullin, “Listener Preference for In-Room Loudspeaker and Headphone Target Responses,” AES Convention Paper 8994, New York, USA (2013).
[4] F. Rumsey, “Perceptual evaluation - Listening strategies, methods, and VR,” J. Audio Eng. Soc., vol. 66, no. 4 (2018).
[5] K. Hamasaki, N. Tojo, A. Hara, H. Hirai, S. Saito, and M. Hosoo, “Personalized Timbre Optimization Based on a New Auditory Model for Stereophonic Sound Reproduction via Earphones,” AES International Conference on Headphone Technology, Paper 12, Helsinki, Finland (2025).
