Critical overview of automatic music transcription

Korzh R.A.

Krivoy Rog national university, Ukraine

Introduction

Sound information can be represented as a collection of sounding objects. The sounding objects of a specific musical instrument form a part. A part played on a single musical instrument is called a melody. A collection of parts of different musical instruments forms a score. The score of a piece of music contains a full description of the notes of all musical instruments, i.e. it is an object-oriented representation of the piece. A piece of music is also characterized by its tempo and by three degrees of polyphony: musical polyphony (the number of simultaneously sounding objects within the piece), object polyphony (the number of simultaneously sounding objects within a specific musical instrument) and instrumental polyphony (the number of simultaneously sounding musical instruments within the piece) [4].
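The three degrees of polyphony defined above can be illustrated with a short Python sketch. The `Note` record, the function name `polyphony_degrees` and the example score are hypothetical and serve only as an illustration; they are not taken from [4]. Each degree is computed as the maximum over all note onsets.

```python
from collections import Counter, namedtuple

# Hypothetical note record: onset and offset in seconds, MIDI pitch, instrument name.
Note = namedtuple("Note", "onset offset pitch instrument")

def polyphony_degrees(notes):
    """Return (musical, object per instrument, instrumental) polyphony degrees."""
    musical, instrumental, per_instrument = 0, 0, Counter()
    # The number of sounding notes can only increase at an onset,
    # so it is enough to inspect the onset times.
    for t in sorted({n.onset for n in notes}):
        active = [n for n in notes if n.onset <= t < n.offset]
        by_instrument = Counter(n.instrument for n in active)
        musical = max(musical, len(active))
        instrumental = max(instrumental, len(by_instrument))
        for name, count in by_instrument.items():
            per_instrument[name] = max(per_instrument[name], count)
    return musical, dict(per_instrument), instrumental

# Assumed example: a piano triad, a cello note entering later, one more piano note.
score = [
    Note(0.0, 1.0, 60, "piano"),
    Note(0.0, 1.0, 64, "piano"),
    Note(0.0, 1.0, 67, "piano"),
    Note(0.5, 1.5, 43, "cello"),
    Note(1.0, 2.0, 72, "piano"),
]
musical, object_polyphony, instrumental = polyphony_degrees(score)
```

In this example the musical polyphony is reached while the triad and the cello note sound together, whereas the object polyphony is reported separately for each instrument.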

A complete transcription would require that the pitch, timing and instrument of all the sound events be resolved. As this can be very hard or even theoretically impossible in some cases, the goal is usually redefined as being either to notate as many of the constituent sounds as possible (complete transcription) or to transcribe only some well-defined part of the music signal, for example the dominant melody or the most prominent drum sounds (partial transcription) [1].

The process of identifying musical objects splits into three directions [1, 5]:

1. multipitch estimation (note and chord identification);

2. musical metre estimation (measure, beat and tatum identification);

3. instrument identification (instrument timbre matching).

According to [1, 3, 5], pieces of music can additionally be classified by their object content (table 1). Figure 1 shows that the majority of developments target the "many notes – many instruments" class.

Table 1 – Object classification of musical pieces

| Notes \ Instruments | One | Many |
|---|---|---|
| One | very rarely | extremely rarely |
| Many | not often | very often |

Figure 1 – Object classification of musical pieces

Existing methods overview

One of the least efficient attempts to implement the conversion process is due to Robert Maher [6]. The input audio signal is subject to strict limitations (for example, only two separate voices are allowed). Thus, the approach has the following disadvantages:

1. the method is strictly limited and applies to an extremely narrow class of musical pieces;

2. the method cannot identify chords and other musical objects;

3. noise and residual reverberation in the input audio signal degrade the quality of instrumental separation;

4. separation of musical parts and timbre identification of musical instruments are not supported.

The systems of Emiya-Badeau-David [7] and Kashino-Murase [8] perform comparatively better. They match time segments against a large collection of patterns stored in special databases. In practice this approach has proved to be coarse and imprecise, and it has the following disadvantages:

1. coarse spectrum matching ignores small frequency components of the audio signal that correspond to fundamental frequencies;

2. the recognition quality of sounding objects is poor at high tempos;

3. a higher degree of musical polyphony leads to many recognition errors;

4. recognition errors occur in the bass and tremolo parts of musical pieces.

Another widespread trend among researchers is the use of discrete Fourier transform algorithms. A striking example is the project of Takuya Fujishima [9], shown in figure 2.

Figure 2 – Fragment of the algorithm of Takuya Fujishima

This approach has the following disadvantages:

1. high quality and performance are achieved only for ideal audio signals; otherwise a more efficient mathematical basis and recognition procedures are needed;

2. noise considerably degrades chord recognition;

3. there is a small overlap between the preceding and the starting notes at the octave edges;

4. separation of musical parts and timbre identification of musical instruments are not supported.
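As a rough illustration of DFT-based chord recognition, the sketch below folds the energy of DFT bins into twelve pitch classes and compares the result with major and minor triad templates. It is a simplified example under assumed parameters (sampling rate, frame length, frequency range, reference tuning, template set), not a reproduction of the algorithm in figure 2; all function names are hypothetical.

```python
import cmath
import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
C4_HZ = 261.63  # assumed reference frequency for pitch class 0 (A4 = 440 Hz tuning)

def pitch_class_profile(signal, fs, f_min=100.0, f_max=1000.0):
    """Fold the DFT energy between f_min and f_max into 12 pitch classes."""
    n = len(signal)
    # Hann window to limit spectral leakage between neighbouring semitones.
    frame = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
             for i, x in enumerate(signal)]
    profile = [0.0] * 12
    for k in range(int(f_min * n / fs), int(f_max * n / fs) + 1):
        # Direct evaluation of DFT bin k (slow, but free of dependencies).
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        pitch_class = round(12 * math.log2(k * fs / n / C4_HZ)) % 12
        profile[pitch_class] += abs(coeff) ** 2
    return profile

def recognize_chord(profile):
    """Pick the triad template that collects the most profile energy."""
    templates = {}
    for root in range(12):
        templates[NAMES[root] + " major"] = (root, root + 4, root + 7)
        templates[NAMES[root] + " minor"] = (root, root + 3, root + 7)
    return max(templates,
               key=lambda name: sum(profile[p % 12] for p in templates[name]))

# Assumed test signal: an ideal, noise-free C major triad (C4, E4, G4).
FS = 8000
chord = [sum(math.sin(2 * math.pi * f * i / FS) for f in (261.63, 329.63, 392.00))
         for i in range(2048)]
profile = pitch_class_profile(chord, FS)
chord_name = recognize_chord(profile)
```

The test signal is deliberately ideal; with noise or inharmonic content, energy spreads over neighbouring pitch classes and template matching becomes less reliable, which is consistent with the first two disadvantages listed above.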

Anssi Klapuri and Valentin Emiya convert the signal spectrum into a set from which fundamental frequencies are estimated. However, these approaches have significant disadvantages:

1. beat and tatum estimation is inaccurate in DFT analysis of audio signals that lack expression and amplitude accents;

2. multipitch estimation has low accuracy for long notes and vibrato;

3. a higher degree of musical polyphony leads to many recognition errors (due to partial overlapping in the frequency domain) and restricts application to real sound recognition tasks;

4. errors occur in the low and high frequency bands in polyphonic melody analysis.
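The general idea of estimating a fundamental frequency from the spectrum can be sketched with a simple harmonic summation: for every candidate fundamental, the spectral magnitudes at its harmonics are added with decreasing weights, and the candidate with the largest sum is selected. This is a deliberately simplified, single-pitch illustration and not the method of Klapuri or Emiya; the candidate range, the 1/h weighting, the number of harmonics and all names are assumptions.

```python
import cmath
import math

def magnitude_at(frame, fs, freq):
    """Magnitude of the DFT of an already windowed frame at one frequency (Hz)."""
    return abs(sum(x * cmath.exp(-2j * math.pi * freq * i / fs)
                   for i, x in enumerate(frame)))

def midi_to_hz(note):
    return 440.0 * 2 ** ((note - 69) / 12)

def estimate_f0(signal, fs, candidates=range(36, 85), harmonics=5):
    """Return the MIDI note whose harmonic sum (weights 1/h) is largest."""
    n = len(signal)
    frame = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
             for i, x in enumerate(signal)]  # Hann window
    best_note, best_salience = None, -1.0
    for note in candidates:
        f0 = midi_to_hz(note)
        salience = 0.0
        for h in range(1, harmonics + 1):
            if h * f0 < fs / 2:  # skip harmonics above the Nyquist frequency
                salience += magnitude_at(frame, fs, h * f0) / h
        if salience > best_salience:
            best_note, best_salience = note, salience
    return best_note

# Assumed test signal: a 220 Hz tone (A3) with five harmonics of amplitude 1/h.
FS = 8000
tone = [sum(math.sin(2 * math.pi * h * 220.0 * i / FS) / h for h in range(1, 6))
        for i in range(2048)]
estimated_note = estimate_f0(tone, FS)
```

When several notes sound together, their harmonics coincide in frequency (for example, the third harmonic of one note and the second harmonic of a note a fifth above it), so the sums of different candidates are no longer independent. This is the frequency-domain overlap mentioned in item 3 above.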

Another widespread approach to identifying the sound content of musical pieces is to use so-called blackboard systems [2]. A blackboard system usually consists of a central dataspace called the blackboard, a set of so-called knowledge sources, and a scheduler (fig. 3).

Figure 3 – The control structure of the blackboard system

Such structured systems can be found in the work of Keith Martin [2] and Bello-Sandler. It should be noted that these systems implement only a comparatively small part of the whole control structure, with a short list of knowledge sources (fig. 3). The reason is the level of detail that each knowledge source requires (a knowledge expert block needs a complicated structure inside the central workspace). In addition, the knowledge sources need semantic relations in order to cooperate with each other. As a result, the task grows rapidly and becomes much more complicated.
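The interaction of the blackboard, the knowledge sources and the scheduler can be sketched in a few lines of Python. The two knowledge sources, the dictionary keys and the trivial chord rule are hypothetical and far simpler than those of the cited systems; the sketch only shows the control flow.

```python
import math

class KnowledgeSource:
    """A named pair: a precondition on the blackboard and an action that updates it."""
    def __init__(self, name, precondition, action):
        self.name = name
        self.precondition = precondition
        self.action = action

def peaks_to_notes(bb):
    # Turn spectral peak frequencies (Hz) into MIDI note hypotheses.
    bb["notes"] = sorted({round(69 + 12 * math.log2(f / 440.0)) for f in bb["peaks"]})

def notes_to_chord(bb):
    # Toy rule: only the C major triad is known to this knowledge source.
    classes = {n % 12 for n in bb["notes"]}
    bb["chord"] = "C major" if classes == {0, 4, 7} else "unknown"

SOURCES = [
    KnowledgeSource("note hypotheses",
                    lambda bb: "peaks" in bb and "notes" not in bb, peaks_to_notes),
    KnowledgeSource("chord hypotheses",
                    lambda bb: "notes" in bb and "chord" not in bb, notes_to_chord),
]

def run_scheduler(blackboard, sources, max_steps=10):
    """Repeatedly run the first knowledge source whose precondition holds."""
    log = []
    for _ in range(max_steps):
        ready = [ks for ks in sources if ks.precondition(blackboard)]
        if not ready:
            break
        ready[0].action(blackboard)
        log.append(ready[0].name)
    return log

# Assumed input: three spectral peaks already placed on the blackboard.
blackboard = {"peaks": [261.63, 329.63, 392.00]}
log = run_scheduler(blackboard, SOURCES)
```

Even in this toy form, each knowledge source has to know which blackboard entries the others produce, which hints at why the semantic relations between sources become the main source of complexity as their number grows.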

Some developers apply linear regression models and various learning methods to separate overlapping harmonics.

Another effective approach is to combine wavelet transforms with neural network processing. This identification method is very similar to pattern recognition algorithms. Such methods have been implemented by Livingston-Shepard and Alexander Fadeev [4]. Their quality and performance are much higher than those of the previous approaches, but some significant problems remain:

1. the user sets the wavelet scale coefficients, i.e. there is no adaptation to the nature of the specific signal;

2. the basis functions are Morlet wavelets, which do not match the sound waves of the analyzed signal closely enough (they do not fully match the timbre of most musical instruments), which results in poor accuracy in the pattern recognition of musical instruments.
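To make the role of the wavelet scales concrete, the sketch below evaluates the response of a complex Morlet wavelet at a few analysis frequencies. It covers only the wavelet stage (no neural network) and is not the implementation of [4]; the list of analysis frequencies, the number of cycles per wavelet and the function name are assumptions. Note that the analysis frequencies are fixed by hand, which is exactly the kind of user-defined scale selection criticized in item 1.

```python
import cmath
import math

def morlet_response(signal, fs, freq, center, cycles=6.0):
    """Normalized magnitude of the Morlet wavelet response at one time and frequency."""
    sigma = cycles / (2 * math.pi * freq)  # Gaussian envelope width in seconds
    half = int(3 * sigma * fs)             # truncate the wavelet at +/- 3 sigma
    acc, norm = 0j, 0.0
    for i in range(max(0, center - half), min(len(signal), center + half + 1)):
        t = (i - center) / fs
        envelope = math.exp(-t * t / (2 * sigma * sigma))
        acc += signal[i] * envelope * cmath.exp(-2j * math.pi * freq * t)
        norm += envelope
    return abs(acc) / norm

# Assumed test signal: a pure 440 Hz sine of unit amplitude, 0.1 s long.
FS = 8000
sine = [math.sin(2 * math.pi * 440.0 * i / FS) for i in range(800)]

# User-chosen analysis frequencies (equivalently, user-chosen scales).
analysis_freqs = [220.0, 330.0, 440.0, 550.0, 660.0]
responses = {f: morlet_response(sine, FS, f, center=400) for f in analysis_freqs}
best_freq = max(responses, key=responses.get)
```

For a pure sine the matching wavelet gives a clear maximum, but neighbouring analysis frequencies also respond noticeably; for real instrument timbres, whose waveforms are far from a Gaussian-windowed sinusoid, the match is weaker, in line with item 2.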

Conclusions

The brief overview of automatic music transcription given above shows that many researchers are involved and that a large number of experiments have been carried out to improve results from year to year. The variety of developments demonstrates the need for automatic transcription in many industrial and scientific projects. Recent systems perform better and use the latest mathematical achievements.

The overview of existing automatic systems shows that further research in signal processing theory is needed to develop a new approach that combines the advantages of the previous ones. The choice of a more efficient mathematical model should also be considered in order to increase the performance and accuracy of music information conversion.

References:

1. Klapuri, A. Signal Processing Methods for Music Transcription / A. Klapuri, M. Davy. — Springer, New York, 2006.

2. Martin, K. D. Automatic Transcription of Simple Polyphonic Music / K. D. Martin // Computer Music Journal. — 2002. — No 1(7).

3. Ellis, D. Extracting Information from Music Audio / D. Ellis // LabROSA, Dept. of Electrical Engineering Columbia University, NY, March 15, 2006.

4. Fadeev, A. S. Identification of musical objects based on the continuous wavelet transform / A. S. Fadeev // Dissertation. — Tomsk Polytechnic University. — 2008. (in Russian)

5. Every, M. Separation of musical sources and structure from single-channel polyphonic recordings / M. Every // PhD thesis. — Department of Electronics. — University of York. — 2006.

6. Maher, R. C. Development and evaluation of a method for the separation of musical duet signals / R. C. Maher // Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, NY, Mohonk, October, 1989. — P. 1 — 3.

7. Emiya, V. Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches / V. Emiya, R. Badeau, B. David // Proc. Int. Conf. Audio, Speech and Signal Processing (ICASSP). — 2009.

8. Kashino, K. Music recognition using note transition context / K. Kashino, H. Murase // Proc. of the 1998 IEEE ICASSP. Seattle. 1998.

9. Fujishima, T. Realtime Chord Recognition of Musical Sound: a System Using Common Lisp Music / T. Fujishima // Proc. of the International Computer Music Conference, Beijing: International Computer Music Association, China, 1999. — P. 464 — 467.