There is a growing need for high-quality musical metadata (data characteristics) to support new ways of enjoying music, including advanced music search and recommendation. Conventional manual metadata assignment is costly and can lead to other problems, such as data inconsistency.
Sony has developed a unique 12 Tone Analysis system that automatically extracts a variety of metadata, including beat, chord progression, song structure, genre, instruments and mood, by using signal processing and statistical processing to analyze musical waveforms. This technology has been used in Sony's GIGA JUKE and Rolly, and also in VAIO software.

Figure 1: 12 Tone Analysis
Applications Based on 12 Tone Analysis
With 12 tone analysis, metadata can be applied to all songs automatically. The following are some examples of the types of applications that can be implemented using automatically extracted metadata.
- Searching for songs with specific characteristics (fast, bright, etc.)
- Searching for songs with similar metadata to find songs similar to one's favorites
- Continuous playback of just the chorus (main part of the song) sections of multiple songs
- Automatic creation of slideshows, etc., based on the mood of songs
- Automatic classification of radio shows into music and talk
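
As a rough illustration of the first two applications, the sketch below filters and ranks songs by their metadata. The field names (`speed`, `energy`, `brightness`), the 0 to 1 value range, the sample library and the use of Euclidean distance are all assumptions made for this example; the actual metadata format and matching method are not described here.

```python
import math

# Hypothetical metadata fields, each scaled to the range 0..1 (an assumption).
FIELDS = ("speed", "energy", "brightness")

# Hypothetical library; in practice these values would come from the analysis.
LIBRARY = {
    "song_a": {"speed": 0.90, "energy": 0.80, "brightness": 0.70},
    "song_b": {"speed": 0.85, "energy": 0.75, "brightness": 0.80},
    "song_c": {"speed": 0.10, "energy": 0.20, "brightness": 0.30},
    "song_d": {"speed": 0.20, "energy": 0.90, "brightness": 0.10},
}


def find_by_traits(library, **minimums):
    """Return titles whose metadata meets every given minimum value."""
    return sorted(
        title
        for title, meta in library.items()
        if all(meta[name] >= value for name, value in minimums.items())
    )


def find_similar(library, favorite, top_n=3):
    """Rank the other songs by Euclidean distance to the favorite's metadata."""
    target = [library[favorite][f] for f in FIELDS]
    ranked = sorted(
        (math.dist(target, [meta[f] for f in FIELDS]), title)
        for title, meta in library.items()
        if title != favorite
    )
    return [title for _, title in ranked[:top_n]]


fast_and_bright = find_by_traits(LIBRARY, speed=0.7, brightness=0.6)
like_song_a = find_similar(LIBRARY, "song_a")
```

Here `fast_and_bright` holds the songs that satisfy both thresholds, and `like_song_a` lists the remaining songs from nearest to farthest.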
How 12 Tone Analysis Works
With 12 tone analysis, music is analyzed through the following processes.
Time-Tone Analysis
The 12 tone analysis process begins with a two-dimensional analysis of the song based on time and tone. There are 12 tones per octave (the semitone steps that underlie the do-re-mi scale). Performing this analysis first makes it easier to extract the information needed by subsequent processes, including detection of the timing and strength of the start of each sound, timbre, and chord structures.
The filters developed for 12 tone analysis allow rapid high-precision analysis from bass to treble.
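
To make the idea of a time-tone map concrete, here is a minimal sketch that measures the energy at each semitone frequency for successive frames of a signal. It uses a naive single-frequency DFT correlation with a rectangular window, which is a stand-in and not the filters Sony developed. The A4 = 440 Hz tuning, the frame length, the note range and all function names are assumptions for this example.

```python
import math

A4_HZ = 440.0   # reference tuning (assumption)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note):
    """Frequency of a MIDI note in 12-tone equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def tone_energy(frame, freq_hz, sample_rate):
    """Energy of one frame at one frequency (single-bin DFT correlation)."""
    step = 2.0 * math.pi * freq_hz / sample_rate
    re = im = 0.0
    for n, x in enumerate(frame):
        re += x * math.cos(step * n)
        im -= x * math.sin(step * n)
    return (re * re + im * im) / len(frame) ** 2


def time_tone_map(samples, sample_rate, frame_len=2048,
                  low_note=48, high_note=84):
    """Return one row per frame; each row holds one energy per semitone.

    Rows are the time axis, columns are the tone axis (low_note up to,
    but not including, high_note). Frames do not overlap.
    """
    notes = range(low_note, high_note)
    rows = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rows.append(
            [tone_energy(frame, midi_to_hz(n), sample_rate) for n in notes]
        )
    return rows


# Usage: a pure 440 Hz tone should peak in the A4 column of every frame.
SAMPLE_RATE = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(4096)]
tt_map = time_tone_map(tone, SAMPLE_RATE)
peak_columns = [row.index(max(row)) for row in tt_map]
```

A practical system would likely use overlapping, windowed frames and filters whose resolution varies from bass to treble; this sketch keeps a fixed frame only for brevity.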
Analysis Based on Musical Theory
Using the two-dimensional image obtained through this analysis, a variety of signal processes and detection processes are then carried out to detect features based on musical theory, such as beat elements, including tempo, rhythm and bar, as well as chord progression, key, and song structure.
Previously, element technologies such as chord detection and song structure detection were treated separately. With 12 tone analysis, all detection processes are integrated, so that each estimate can draw on the results of the other detectors. This makes highly accurate detection possible.
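
One simple way to picture detectors sharing results is to let beat positions decide where chords are estimated. The sketch below assumes a 12-element tone-energy vector per frame (C through B) and a list of frame indices where beats start, averages the frames within each beat, and labels the beat by matching against major and minor triad templates. Template matching and all names here are illustrative assumptions, not the detection method used in 12 tone analysis.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
CHORD_SHAPES = {"": (0, 4, 7), "m": (0, 3, 7)}  # major and minor triads


def chord_templates():
    """Build a 12-element 0/1 template for each of the 24 triads."""
    templates = {}
    for root in range(12):
        for suffix, intervals in CHORD_SHAPES.items():
            vec = [0.0] * 12
            for interval in intervals:
                vec[(root + interval) % 12] = 1.0
            templates[NOTE_NAMES[root] + suffix] = vec
    return templates


TEMPLATES = chord_templates()


def detect_chord(tone_vector):
    """Return the triad whose template has the largest dot product."""
    best_name, best_score = None, float("-inf")
    for name, template in TEMPLATES.items():
        score = sum(c * t for c, t in zip(tone_vector, template))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def chords_per_beat(tone_frames, beat_starts):
    """Average the frames inside each beat, then label each beat."""
    bounds = list(beat_starts) + [len(tone_frames)]
    labels = []
    for lo, hi in zip(bounds, bounds[1:]):
        segment = tone_frames[lo:hi]
        if not segment:
            continue  # skip empty beats
        mean = [sum(column) / len(segment) for column in zip(*segment)]
        labels.append(detect_chord(mean))
    return labels


# Usage with hand-made tone vectors (hypothetical data).
c_major = [1.0, 0, 0, 0, 0.8, 0, 0, 0.9, 0, 0, 0, 0]  # energy at C, E, G
a_minor = [0.9, 0, 0, 0, 0.8, 0, 0, 0, 0, 1.0, 0, 0]  # energy at C, E, A
progression = chords_per_beat([c_major, c_major, a_minor, a_minor], [0, 2])
```

Averaging over a beat is the point of the design: a chord estimate made on beat-aligned segments is less sensitive to brief passing notes than one made frame by frame.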
Feature Extraction
The results of time-tone analysis and analyses based on musical theory are next used to extract features that can be used to classify songs. The 12 tone analysis system brought to market by Sony uses several dozen highly independent features to support the classification of music from various perspectives.
Metadata Estimation
Finally, the features obtained through these musical analysis processes are used to estimate metadata, examples of which are listed below. The resulting metadata can be used for song searching and other purposes.
| Metadata | Description |
|---|---|
| Perceived speed | The speed of the music as perceived by the human ear. This is distinguished from tempo, since the perceived speed of songs may vary because of rhythm patterns and other factors, even if the tempo is identical. |
| Perceived energy | The energy of the music as perceived by the human ear. A quiet song will seem to have less energy, while a bright and lively song will seem more energetic. |
| Genre | Whether or not the song fits a particular genre, such as rock, jazz or classical. |
| Instrumental sound | Whether or not the music includes particular instruments, such as piano, bass or guitar. |
| Mood | Whether or not the song fits particular mood keywords, such as "bright" or "refined." |
With 12 tone analysis, a large body of statistical data associated with each of several dozen metadata items can be subjected to statistical analysis and machine learning, resulting in highly accurate metadata estimation.
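
The learning method itself is not specified here, so the following is only a toy sketch of the general idea: labeled examples are used to learn what feature values are typical for a label, and a new song is assigned the label it most resembles. It uses a nearest-centroid classifier, chosen for brevity. The two features, the labels and the training values are hypothetical.

```python
import math


def train_centroids(examples):
    """Average the feature vectors of each label.

    examples: list of (feature_vector, label) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [value / counts[label] for value in acc]
        for label, acc in sums.items()
    }


def estimate_label(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical training data: two features per song, each in 0..1.
TRAINING = [
    ([0.9, 0.8], "bright"),
    ([0.8, 0.9], "bright"),
    ([0.2, 0.1], "not bright"),
    ([0.1, 0.3], "not bright"),
]

centroids = train_centroids(TRAINING)
label_for_new_song = estimate_label(centroids, [0.7, 0.75])
```

With several dozen features and many labeled songs per metadata item, a more capable statistical model would normally replace the centroid comparison, but the train-then-estimate flow stays the same.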
