{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
"<a href=\"http://www.music-processing.de/\"><img style=\"float:left;\" src=\"../data/FMP_Teaser_Cover.png\" width=40% alt=\"FMP\"></a>\n",
"<a href=\"https://www.audiolabs-erlangen.de\"><img src=\"../data/Logo_AudioLabs_Long.png\" width=59% style=\"float: right;\" alt=\"AudioLabs\"></a>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
"<a href=\"../C5/C5.html\"><img src=\"../data/C5_nav.png\" width=\"100\" style=\"float:right;\" alt=\"C5\"></a>\n",
"<h1>Template-Based Chord Recognition</h1> \n",
"</div>\n",
"\n",
"<br/>\n",
"\n",
"<p>\n",
"Following Section 5.2 of <a href=\"http://www.music-processing.de/\">[Müller, FMP, Springer 2015]</a>, we introduce in this notebook a basic approach for chord recognition using chord templates. For a discussion of the importance of various algorithmic components of such a system, we refer to the following two studies:\n",
"\n",
"<ul>\n",
" \n",
"<li><span style=\"color:black\">\n",
"Taemin Cho, Juan Pablo Bello: <strong>On the Relative Importance of Individual Components of Chord Recognition Systems.</strong> IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22 (2014), pp. 466&ndash;492. \n",
"<br> \n",
"<a type=\"button\" class=\"btn btn-default btn-xs\" target=\"_blank\" href=\"../data/bibtex/FMP_bibtex_ChoB14_Chord_IEEE-TASLP.txt\"> Bibtex </a>\n",
"</span></li>\n",
"\n",
"<li><span style=\"color:black\">\n",
"Nanzhu Jiang, Peter Grosche, Verena Konz, Meinard Müller: <strong>Analyzing Chroma Feature Types for Automated Chord Recognition.</strong> Proceedings of the AES Conference on Semantic Audio, Ilmenau, Germany, 2011. \n",
"<br> \n",
"<a type=\"button\" class=\"btn btn-default btn-xs\" target=\"_blank\" href=\"../data/bibtex/FMP_bibtex_JiangGKM11_Chord_AES.txt\"> Bibtex </a>\n",
"</span></li>\n",
"\n",
"</ul> \n",
"</p> "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"In music, **harmony** refers to the simultaneous sound of different notes that form a cohesive entity in the mind of the listener. The main constituent components of harmony, at least in the Western music tradition, are [**chords**](../C5/C5S1_Chords.html), which are musical constructs\n",
"that typically consist of three or more notes. Harmony analysis may be thought of as the study of the construction, interaction, and progression of chords. In this notebook, we discuss a subproblem of harmonic analysis referred to as **chord recognition**, where we consider only a small number of the most important chords as occurring in Western music. Furthermore, we assume that the piece of music is given in the form of an **audio recording**. The resulting chord recognition task consists in splitting up the recording into **segments** and assigning a **chord label** to each segment. The segmentation specifies the start and end time of a chord, and the chord label specifies which chord is played during this time period. A typical chord recognition system consists of **two main steps**. \n",
"\n",
"* In the first step, the given audio recording is cut into frames, and each frame is transformed into an appropriate **feature vector**. Most recognition systems rely on [**chroma-based audio features**](../C3/C3S1_SpecLogFreq-Chromagram.html), which correlate with the underlying tonal information contained in the audio signal. \n",
"* In the second step, **pattern matching** techniques are used to map each feature vector to a set of predefined **chord models**. The best fit determines the chord label assigned to the given frame. \n",
"\n",
"To improve the chord recognition results, additional enhancement techniques are applied either before the pattern matching step (referred to as [**prefiltering**](../C5/C5S3_ChordRec_HMM.html)) or after/within the pattern matching step \n",
"(referred to as [**postfiltering**](../C5/C5S3_ChordRec_HMM.html)). In this notebook, we introduce a first chord recognition procedure that employs a simple template-based matching strategy.\n",
"\n",
"<img src=\"../data/C5/FMP_C5_F13.png\" width=\"500px\" align=\"middle\" alt=\"FMP_C5_F13\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Beatles Example\n",
"\n",
"In the following example, the chord recognition task is illustrated by the first measures of the Beatles song \"Let It Be.\" The figure shows a score representation, the recording's waveform, a [chromagram](../C3/C3S1_SpecLogFreq-Chromagram.html), as well as chord annotations generated in a manual process. \n",
"\n",
"<img src=\"../data/C5/FMP_C5_F01.png\" width=\"400px\" align=\"left\" alt=\"FMP_C5_F01\">\n",
"\n",
"<br clear=\"all\" />\n",
"\n",
"<audio style=\"width: 400px;\" src=\"../data/C5/FMP_C5_F01_Beatles_LetItBe-mm1-4_Original.mp3\" type=\"audio/mpeg\" controls=\"controls\"></audio>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chroma-Based Feature Representation\n",
"\n",
"Given an audio recording of a piece of music, the first step is to transform the recording into a sequence $X=(x_1,x_2,\\ldots,x_N)$ of feature vectors $x_n\\in\\mathcal{F}$, $n\\in[1:N]$, where $\\mathcal{F}$ denotes a suitable feature space. As mentioned above, nearly all traditional chord recognition procedures rely on some type of [**chroma-based feature representation**](../C3/C3S1_SpecLogFreq-Chromagram.html). This is because chroma-based features capture a signal's short-time tonal content, which is closely correlated to the harmonic progression of the underlying piece. Assuming the [equal-tempered scale](../C1/C1S1_MusicalNotesPitches.html), the chroma values correspond to the set $\\{\\mathrm{C},\\mathrm{C}^\\sharp,\\mathrm{D},\\ldots,\\mathrm{B}\\}$, which we identify with the set $[0:11]$. A chroma feature can then be expressed as a $12$-dimensional vector \n",
"\n",
"$$\n",
"x=(x(0),x(1),\\ldots,x(11))^\\top\\in\\mathcal{F}=\\mathbb{R}^{12}\n",
"$$ \n",
"\n",
"As we also discuss in other FMP notebooks (e.g., in the context of [music synchronization](../C3/C3S1_SpecLogFreq-Chromagram.html) or [content-based music retrieval](../C7/C7S2_CENS.html)), there are many different ways of computing chroma features. Furthermore, their properties can be adjusted by applying suitable postprocessing steps such as [logarithmic compression](../C3/C3S1_LogCompression.html), [normalization](../C3/C3S1_FeatureNormalization.html), or [smoothing](../C3/C3S1_FeatureSmoothing.html). As examples, the following code cell computes three different chroma variants: \n",
"\n",
"* STFT-based chroma features (`librosa.feature.chroma_stft`). \n",
"* Filter-bank decomposition using IIR elliptic filters (`librosa.iirt`), logarithmic compression, and chroma binning.\n",
"* CQT-based chroma features (`librosa.feature.chroma_cqt`)\n",
"\n",
"For each variant, we use the same window length and hop size. The function defaults are `N=4096` and `H=2048`; the code cell below overrides them with `N=4096*32` and `H=2048*32`, which yields a much coarser feature rate. Furthermore, in each variant, we normalize the chroma vectors with respect to the [Euclidean norm](../C3/C3S1_FeatureNormalization.html) ($\\ell^2$-norm). In the following figure, the resulting chromagrams are visually superimposed with the manually generated chord annotations (in color). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/martin/code/musicmouse/espmusicmouse/venv/lib/python3.8/site-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.\n",
" warnings.warn(\"PySoundFile failed. Trying audioread instead.\")\n",
"/home/martin/code/musicmouse/espmusicmouse/venv/lib/python3.8/site-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.\n",
" warnings.warn(\"PySoundFile failed. Trying audioread instead.\")\n",
"/home/martin/code/musicmouse/espmusicmouse/venv/lib/python3.8/site-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.\n",
" warnings.warn(\"PySoundFile failed. Trying audioread instead.\")\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 504x576 with 10 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import os\n",
"import numpy as np\n",
"from matplotlib import pyplot as plt\n",
"import librosa\n",
"\n",
"import sys\n",
"import libfmp.b\n",
"import libfmp.c3\n",
"import libfmp.c4\n",
"%matplotlib inline\n",
"\n",
"test_file = \"/home/martin/Music/deemix Music/Simone Sommerland - Ki-Ka-Kinderturnen.mp3\"\n",
"\n",
"def compute_chromagram_from_filename(fn_wav, Fs=22050, N=4096, H=2048, gamma=None, version='STFT', norm='2'):\n",
"    \"\"\"Compute chromagram for WAV file specified by filename\n",
"\n",
"    Notebook: C5/C5S2_ChordRec_Templates.ipynb\n",
"\n",
"    Args:\n",
"        fn_wav (str): Filename of WAV\n",
"        Fs (scalar): Sampling rate (Default value = 22050)\n",
"        N (int): Window size (Default value = 4096)\n",
"        H (int): Hop size (Default value = 2048)\n",
"        gamma (float): Constant for logarithmic compression (Default value = None)\n",
"        version (str): Technique used for front-end decomposition ('STFT', 'IIR', 'CQT') (Default value = 'STFT')\n",
"        norm (str): If not None, chroma vectors are normalized by norm as specified ('1', '2', 'max')\n",
"            (Default value = '2')\n",
"\n",
"    Returns:\n",
"        X (np.ndarray): Chromagram\n",
"        Fs_X (scalar): Feature rate of chromagram\n",
"        x (np.ndarray): Audio signal (harmonic component after HPSS)\n",
"        Fs (scalar): Sampling rate of audio signal\n",
"        x_dur (float): Duration (seconds) of audio signal\n",
"    \"\"\"\n",
"    x, Fs = librosa.load(fn_wav, sr=Fs)\n",
"    # Harmonic-percussive separation; only the harmonic component is used below\n",
"    x_harmonic, x_percussive = librosa.effects.hpss(x)\n",
"    x = x_harmonic\n",
"    x_dur = x.shape[0] / Fs\n",
"    if version == 'STFT':\n",
"        # Compute chroma features with STFT\n",
"        X = librosa.stft(x, n_fft=N, hop_length=H, pad_mode='constant', center=True)\n",
"        if gamma is not None:\n",
"            X = np.log(1 + gamma * np.abs(X) ** 2)\n",
"        else:\n",
"            X = np.abs(X) ** 2\n",
"        X = librosa.feature.chroma_stft(S=X, sr=Fs, tuning=0, norm=None, hop_length=H, n_fft=N)\n",
"    if version == 'CQT':\n",
"        # Compute chroma features with CQT decomposition\n",
"        X = librosa.feature.chroma_cqt(y=x, sr=Fs, hop_length=H, norm=None)\n",
"    if version == 'IIR':\n",
"        # Compute chroma features with filter bank (using IIR elliptic filter)\n",
"        X = librosa.iirt(y=x, sr=Fs, win_length=N, hop_length=H, center=True, tuning=0.0)\n",
"        if gamma is not None:\n",
"            X = np.log(1.0 + gamma * X)\n",
"        X = librosa.feature.chroma_cqt(C=X, bins_per_octave=12, n_octaves=7,\n",
"                                       fmin=librosa.midi_to_hz(24), norm=None)\n",
"    if norm is not None:\n",
"        # Use the norm requested by the caller (previously hard-coded to '2')\n",
"        X = libfmp.c3.normalize_feature_sequence(X, norm=norm)\n",
"    Fs_X = Fs / H\n",
"    return X, Fs_X, x, Fs, x_dur\n",
"\n",
"def plot_chromagram_annotation(ax, X, Fs_X, ann, color_ann, x_dur, cmap='gray_r', title=''):\n",
"    \"\"\"Plot chromagram and annotation\n",
"\n",
"    Notebook: C5/C5S2_ChordRec_Templates.ipynb\n",
"\n",
"    Args:\n",
"        ax: Axes handle\n",
"        X: Feature representation\n",
"        Fs_X: Feature rate\n",
"        ann: Annotations\n",
"        color_ann: Color for annotations\n",
"        x_dur: Duration of feature representation\n",
"        cmap: Color map for imshow (Default value = 'gray_r')\n",
"        title: Title for figure (Default value = '')\n",
"    \"\"\"\n",
"    libfmp.b.plot_chromagram(X, Fs=Fs_X, ax=ax,\n",
"                             chroma_yticks=[0, 4, 7, 11], clim=[0, 1], cmap=cmap,\n",
"                             title=title, ylabel='Chroma', colorbar=True)\n",
"    libfmp.b.plot_segments_overlay(ann, ax=ax[0], time_max=x_dur,\n",
"                                   print_labels=False, colors=color_ann, alpha=0.1)\n",
"\n",
"# Compute chroma features\n",
"fn_wav = test_file # os.path.join('..', 'data', 'C5', 'FMP_C5_F01_Beatles_LetItBe-mm1-4_Original.wav')\n",
"N = 4096 * 32\n",
"H = 2048 * 32\n",
"X_STFT, Fs_X, x, Fs, x_dur = compute_chromagram_from_filename(fn_wav, N=N, H=H, gamma=0.1, version='STFT')\n",
"X_IIR, Fs_X, x, Fs, x_dur = compute_chromagram_from_filename(fn_wav, N=N, H=H, gamma=100, version='IIR')\n",
"X_CQT, Fs_X, x, Fs, x_dur = compute_chromagram_from_filename(fn_wav, N=N, H=H, version='CQT')\n",
"\n",
"# Annotations\n",
"fn_ann = test_file #os.path.join('..', 'data', 'C5', 'FMP_C5_F01_Beatles_LetItBe-mm1-4_Original_Chords_simplified.csv')\n",
"#ann, _ = libfmp.c4.read_structure_annotation(fn_ann)\n",
"color_ann = {'N': [1, 1, 1, 1], 'C': [1, 0.5, 0, 1], 'G': [0, 1, 0, 1], \n",
" 'Am': [1, 0, 0, 1], 'F': [0, 0, 1, 1]}\n",
"\n",
"# Plot\n",
"cmap = libfmp.b.compressed_gray_cmap(alpha=1, reverse=False)\n",
"fig, ax = plt.subplots(5, 2, gridspec_kw={'width_ratios': [1, 0.03], \n",
" 'height_ratios': [1, 2, 2, 2, 0.5]}, figsize=(7, 8))\n",
"libfmp.b.plot_signal(x, Fs, ax=ax[0,0], title='Waveform of audio signal')\n",
"#libfmp.b.plot_segments_overlay(ann, ax=ax[0,0], time_max=x_dur,\n",
"# print_labels=False, colors=color_ann, alpha=0.1)\n",
"ax[0,1].axis('off')\n",
"\n",
"#title = 'STFT-based chromagram (feature rate = %0.1f Hz, N = %d)'%(Fs_X, X_STFT.shape[1])\n",
"#plot_chromagram_annotation([ax[1, 0], ax[1, 1]], X_STFT, Fs_X, ann, color_ann, x_dur, title=title)\n",
"\n",
"#title = 'IIR-based chromagram (feature rate = %0.1f Hz, N = %d)'%(Fs_X, X_IIR.shape[1])\n",
"#plot_chromagram_annotation([ax[2, 0], ax[2, 1]], X_IIR, Fs_X, ann, color_ann, x_dur, title=title)\n",
"\n",
"#title = 'CQT-based chromagram (feature rate = %0.1f Hz, N = %d)'%(Fs_X, X_CQT.shape[1])\n",
"#plot_chromagram_annotation([ax[3, 0], ax[3, 1]], X_CQT, Fs_X, ann, color_ann, x_dur, title=title)\n",
"\n",
"#libfmp.b.plot_segments(ann, ax=ax[4, 0], time_max=x_dur, time_label='Time (seconds)',\n",
"# colors=color_ann, alpha=0.3)\n",
"ax[4,1].axis('off')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Template-Based Pattern Matching\n",
"\n",
"Given the chroma sequence $X=(x_1,x_2,\\ldots,x_N)$ and a set $\\Lambda$ of possible chord labels, the objective of the next step is to map each chroma vector $x_n\\in\\mathbb{R}^{12}$ to a chord label $\\lambda_{n} \\in \\Lambda$, $n\\in[1:N]$. For example, one may consider the set\n",
"\n",
"\\begin{equation}\n",
" \\Lambda = \\{\\mathbf{C},\\mathbf{C}^\\sharp,\\ldots,\\mathbf{B},\\mathbf{Cm},\\mathbf{C^\\sharp m},\\ldots,\\mathbf{Bm}\\}\n",
"\\end{equation}\n",
"\n",
"consisting of the [twelve major and twelve minor triads](../C5/C5S1_Chords.html). In this case, each frame $n\\in[1:N]$ is assigned to a [major chord](../C5/C5S1_Chords.html) or a [minor chord](../C5/C5S1_Chords.html) specified by $\\lambda_{n}$. For the pattern matching step, we now introduce a simple template-based approach. The idea is to precompute a set \n",
"\n",
"$$\n",
"\\mathcal{T}\\subset\\mathcal{F}=\\mathbb{R}^{12}\n",
"$$ \n",
"\n",
"of templates denoted by $\\mathbf{t}_\\lambda\\in\\mathcal{T}$, $\\lambda\\in\\Lambda$. Intuitively, each template can be thought of as a prototypical chroma vector that represents a specific musical chord. Furthermore, we fix a similarity measure \n",
"\n",
"\\begin{equation}\n",
"s:\\mathcal{F}\\times\\mathcal{F}\\to \\mathbb{R}\n",
"\\end{equation}\n",
"\n",
"that allows for comparing different chroma vectors. Then, the template-based procedure consists in assigning the chord label that maximizes the similarity between the corresponding template and the given feature vector $x_n$:\n",
"\n",
"\\begin{equation}\n",
" \\lambda_{n} := \\underset{\\lambda \\in \\Lambda}{\\mathrm{argmax}}\n",
" \\,\\, s( \\mathbf{t}_\\lambda , x_n ).\n",
"\\end{equation}\n",
"\n",
"In this procedure, there are many design choices that crucially influence the performance of a chord recognizer. \n",
"\n",
"* Which chords should be considered in $\\mathcal{T}$? \n",
"* How are the chord templates defined? \n",
"* What is a suitable similarity measure to compare the feature vectors with the chord templates? \n",
"\n",
"To obtain a first simple chord recognition system, we make the following design choices. For the chord label set $\\Lambda$, we choose the twelve major and twelve minor triads. This choice, even though problematic from a musical point of view, is convenient and instructive. Considering chords up to enharmonic equivalence and up to octave shifts, each triad can be encoded by a three-element subset of $[0:11]$. For example, the $\\mathrm{C}$ major chord $\\mathbf{C}$ corresponds to the subset $\\{0,4,7\\}$. Each subset, in turn, can be identified with a binary twelve-dimensional chroma vector $x=(x(0),x(1),\\ldots,x(11))^\\top$, where $x(i)=1$ if and only if the chroma value $i\\in[0:11]$ is contained in the chord. For example, in the case of the $\\mathrm{C}$-major chord $\\mathbf{C}$, the resulting chroma vector is\n",
"\n",
"\\begin{equation}\n",
"\\label{eq:ChordReco:Template:Basic:ChromaVectC}\n",
" \\mathbf{t}_{\\mathbf{C}}{} := x =(1,0,0,0,1,0,0,1,0,0,0,0)^\\top.\n",
"\\end{equation}\n",
"\n",
"Using a chroma-based encoding, the twelve major chords and twelve minor chords can be obtained by [cyclically shifting](../C3/C3S1_TranspositionTuning.html) the binary vectors for the $\\mathrm{C}$-major and the $\\mathrm{C}$-minor triads, respectively. The result is illustrated by the following figure.\n",
"\n",
"<img src=\"../data/C5/FMP_C5_F06.png\" width=\"700px\" align=\"middle\" alt=\"FMP_C5_F06\">\n",
"\n",
"For comparing chroma features and chord templates, we use in the following a simple similarity measure using the inner product of normalized vectors:\n",
"\n",
"\\begin{equation}\n",
" s(x,y)= \\frac{\\langle x,y\\rangle}{\\|x\\|\\cdot\\|y\\|}\n",
"\\end{equation}\n",
"\n",
"for $x,y\\in\\mathcal{F}$ with $\\|x\\|\\not= 0$ and $\\|y\\|\\not= 0$. In the case $\\|x\\|=0$ or $\\|y\\|=0$, we set $s(x,y)=0$. Note that this measure always yields a value $s(x,y)\\in[-1,1]$. If the vectors $x$ and $y$ have only nonnegative entries, as is the case for the chroma vectors and binary templates used here, one has $s(x,y)\\in[0,1]$."
]
},
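{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked example of this similarity measure, consider a hypothetical chroma vector (the numbers are made up for illustration and are not taken from any recording):\n",
"\n",
"$$\n",
"x=(0.8,0,0.1,0,0.6,0,0,0.5,0,0.2,0,0)^\\top, \\qquad \\|x\\|=\\sqrt{0.64+0.01+0.36+0.25+0.04}=\\sqrt{1.3}.\n",
"$$\n",
"\n",
"Every binary triad template has norm $\\sqrt{3}$, so the denominator is always $\\sqrt{3}\\cdot\\sqrt{1.3}=\\sqrt{3.9}\\approx 1.975$, and the inner product simply sums the three chroma entries that belong to the chord:\n",
"\n",
"\\begin{align}\n",
"s(\\mathbf{t}_{\\mathbf{C}},x) &= \\frac{0.8+0.6+0.5}{\\sqrt{3.9}} \\approx 0.962, \\\\\n",
"s(\\mathbf{t}_{\\mathbf{Am}},x) &= \\frac{0.2+0.8+0.6}{\\sqrt{3.9}} \\approx 0.810, \\\\\n",
"s(\\mathbf{t}_{\\mathbf{Cm}},x) &= \\frac{0.8+0+0.5}{\\sqrt{3.9}} \\approx 0.658, \\\\\n",
"s(\\mathbf{t}_{\\mathbf{Em}},x) &= \\frac{0.6+0.5+0}{\\sqrt{3.9}} \\approx 0.557.\n",
"\\end{align}\n",
"\n",
"The three largest entries of $x$ lie exactly on the chroma values $\\{0,4,7\\}$, so no other triad can reach a larger inner product, and the procedure assigns the label $\\mathbf{C}$ to this frame. The triads $\\mathbf{Am}$, $\\mathbf{Cm}$, and $\\mathbf{Em}$ each share two chroma values with $\\mathbf{C}$, which is why they appear here as the natural runners-up."
]
},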
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation\n",
"\n",
"In the following code cell, we provide an implementation for the template-based chord recognition procedure described before. To obtain a better understanding of this procedure, we continue our Beatles example from above. The following steps are performed and visualized:\n",
"\n",
"* First, the audio recording is converted into a chroma representation. As an example, we use the STFT-variant as computed before. \n",
"\n",
"* Second, each chroma vector is compared with each of the $24$ binary chord templates, which yields $24$ similarity values per frame. These similarity values are visualized in the form of a **time&ndash;chord representation**.\n",
"\n",
"* Third, we select for each frame the chord label $\\lambda_{n}$ of the template that maximizes the similarity value over all $24$ chord templates. This yields our final chord recognition result, which is shown in the form of a **binary time&ndash;chord representation**.\n",
"\n",
"* Fourth, the manually generated chord annotations are visualized.\n",
"\n",
"In the following figure, all visualizations are superimposed with the manually generated chord annotations. \n",
"\n",
"<!--<img src=\"../data/C5/FMP_C5_F15.png\" width=\"400px\" align=\"middle\" alt=\"FMP_C5_F15\">-->"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x720 with 8 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def get_chord_labels(ext_minor='m', nonchord=False):\n",
"    \"\"\"Generate chord labels for major and minor triads (and possibly nonchord label)\n",
"\n",
"    Notebook: C5/C5S2_ChordRec_Templates.ipynb\n",
"\n",
"    Args:\n",
"        ext_minor (str): Extension for minor chords (Default value = 'm')\n",
"        nonchord (bool): If \"True\" then add nonchord label (Default value = False)\n",
"\n",
"    Returns:\n",
"        chord_labels (list): List of chord labels\n",
"    \"\"\"\n",
"    chroma_labels = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']\n",
"    chord_labels_maj = chroma_labels\n",
"    chord_labels_min = [s + ext_minor for s in chroma_labels]\n",
"    chord_labels = chord_labels_maj + chord_labels_min\n",
"    if nonchord is True:\n",
"        chord_labels = chord_labels + ['N']\n",
"    return chord_labels\n",
"\n",
"def generate_chord_templates(nonchord=False):\n",
"    \"\"\"Generate chord templates of major and minor triads (and possibly nonchord)\n",
"\n",
"    Notebook: C5/C5S2_ChordRec_Templates.ipynb\n",
"\n",
"    Args:\n",
"        nonchord (bool): If \"True\" then add nonchord template (Default value = False)\n",
"\n",
"    Returns:\n",
"        chord_templates (np.ndarray): Matrix containing chord_templates as columns\n",
"    \"\"\"\n",
"    template_cmaj = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]).T\n",
"    template_cmin = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]).T\n",
"\n",
"    num_chord = 24\n",
"    if nonchord:\n",
"        num_chord = 25\n",
"    # Initialized with ones, so an optional 25th (nonchord) column stays all ones\n",
"    chord_templates = np.ones((12, num_chord))\n",
"    for shift in range(12):\n",
"        chord_templates[:, shift] = np.roll(template_cmaj, shift)\n",
"        chord_templates[:, shift+12] = np.roll(template_cmin, shift)\n",
"    return chord_templates\n",
"\n",
"def chord_recognition_template(X, norm_sim='1', nonchord=False):\n",
"    \"\"\"Conducts template-based chord recognition\n",
"    with major and minor triads (and possibly nonchord)\n",
"\n",
"    Notebook: C5/C5S2_ChordRec_Templates.ipynb\n",
"\n",
"    Args:\n",
"        X (np.ndarray): Chromagram\n",
"        norm_sim (str): Specifies norm used for normalizing chord similarity matrix (Default value = '1')\n",
"        nonchord (bool): If \"True\" then add nonchord template (Default value = False)\n",
"\n",
"    Returns:\n",
"        chord_sim (np.ndarray): Chord similarity matrix\n",
"        chord_max (np.ndarray): Binarized chord similarity matrix only containing maximizing chord\n",
"    \"\"\"\n",
"    chord_templates = generate_chord_templates(nonchord=nonchord)\n",
"    X_norm = libfmp.c3.normalize_feature_sequence(X, norm='2')\n",
"    chord_templates_norm = libfmp.c3.normalize_feature_sequence(chord_templates, norm='2')\n",
"    # Inner products of l2-normalized templates and chroma vectors\n",
"    chord_sim = np.matmul(chord_templates_norm.T, X_norm)\n",
"    if norm_sim is not None:\n",
"        chord_sim = libfmp.c3.normalize_feature_sequence(chord_sim, norm=norm_sim)\n",
"    # chord_max = (chord_sim == chord_sim.max(axis=0)).astype(int)\n",
"    chord_max_index = np.argmax(chord_sim, axis=0)\n",
"    chord_max = np.zeros(chord_sim.shape).astype(np.int32)\n",
"    for n in range(chord_sim.shape[1]):\n",
"        chord_max[chord_max_index[n], n] = 1\n",
"\n",
"    return chord_sim, chord_max\n",
"\n",
"# Chord recognition\n",
"X = X_STFT\n",
"chord_sim, chord_max = chord_recognition_template(X, norm_sim='max')\n",
"chord_labels = get_chord_labels(nonchord=False)\n",
"\n",
"# Plot\n",
"cmap = libfmp.b.compressed_gray_cmap(alpha=1, reverse=False)\n",
"fig, ax = plt.subplots(4, 2, gridspec_kw={'width_ratios': [1, 0.03], \n",
" 'height_ratios': [1.5, 3, 3, 0.3]}, figsize=(8, 10))\n",
"libfmp.b.plot_chromagram(X, ax=[ax[0,0], ax[0,1]], Fs=Fs_X, clim=[0, 1], xlabel='',\n",
" title='STFT-based chromagram (feature rate = %0.1f Hz)' % (Fs_X))\n",
"#libfmp.b.plot_segments_overlay(ann, ax=ax[0,0], time_max=x_dur,\n",
"# print_labels=False, colors=color_ann, alpha=0.1)\n",
"\n",
"libfmp.b.plot_matrix(chord_sim, ax=[ax[1, 0], ax[1, 1]], Fs=Fs_X, \n",
"                     title='Time–chord representation of chord similarity matrix',\n",
"                     ylabel='Chord', xlabel='')\n",
"ax[1, 0].set_yticks(np.arange( len(chord_labels) ))\n",
"ax[1, 0].set_yticklabels(chord_labels)\n",
"#libfmp.b.plot_segments_overlay(ann, ax=ax[1, 0], time_max=x_dur,\n",
"# print_labels=False, colors=color_ann, alpha=0.1)\n",
"\n",
"libfmp.b.plot_matrix(chord_max, ax=[ax[2, 0], ax[2, 1]], Fs=Fs_X, \n",
"                     title='Time–chord representation of chord recognition result',\n",
"                     ylabel='Chord', xlabel='')\n",
"ax[2, 0].set_yticks(np.arange( len(chord_labels) ))\n",
"ax[2, 0].set_yticklabels(chord_labels)\n",
"ax[2, 0].grid()\n",
"#libfmp.b.plot_segments_overlay(ann, ax=ax[2, 0], time_max=x_dur,\n",
"# print_labels=False, colors=color_ann, alpha=0.1)\n",
"\n",
"#libfmp.b.plot_segments(ann, ax=ax[3, 0], time_max=x_dur, time_label='Time (seconds)',\n",
"# colors=color_ann, alpha=0.3)\n",
"ax[3, 1].axis('off')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"total = np.sum(chord_max, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24, 34)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chord_max.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.1414141414141414"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"14 / (60+39)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(0, 'A#'),\n",
" (0, 'A#m'),\n",
" (0, 'Am'),\n",
" (0, 'Bm'),\n",
" (0, 'C'),\n",
" (0, 'C#'),\n",
" (0, 'Cm'),\n",
" (0, 'D#'),\n",
" (0, 'D#m'),\n",
" (0, 'Dm'),\n",
" (0, 'Em'),\n",
" (0, 'F'),\n",
" (0, 'Fm'),\n",
" (0, 'G#'),\n",
" (1, 'F#'),\n",
" (1, 'G'),\n",
" (1, 'Gm'),\n",
" (3, 'B'),\n",
" (3, 'D'),\n",
" (3, 'E'),\n",
" (4, 'C#m'),\n",
" (4, 'G#m'),\n",
" (5, 'F#m'),\n",
" (9, 'A')]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted([(e2, e1) for e1, e2 in zip(chord_labels, total)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chord similarity values shown in the form of a time&ndash;chord representation indicate for each chroma vector a kind of likelihood for the $24$ possible chords. For example, this visualization shows that the chroma vectors at the beginning of the Beatles song are most similar to the template for the $\\mathrm{C}$ major chord $\\mathbf{C}$. Furthermore, there is also a higher degree of similarity to the templates for $\\mathbf{Cm}$, $\\mathbf{Em}$, and $\\mathbf{Am}$, each of which shares two of its three chroma values with $\\mathbf{C}$. Comparing the final chord recognition results with the reference annotation, one can observe that the results obtained from the automated procedure agree with the reference labels for most of the frames. We continue with our discussion of the results in the [FMP notebook on chord recognition evaluation](../C5/C5S2_ChordRec_Eval.html), where we also consider further examples.\n",
"\n",
"We close our Beatles example by looking at the similarity-maximizing chord templates for each frame. This yields a time&ndash;chroma representation, which can be compared with the original input chromagram. In a way, the sequence of chord templates may be thought of as a musically informed quantization of the input chroma representation."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x360 with 6 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"chord_templates = generate_chord_templates() \n",
"X_chord = np.matmul(chord_templates, chord_max)\n",
"\n",
"# Plot\n",
"fig, ax = plt.subplots(3, 2, gridspec_kw={'width_ratios': [1, 0.03], \n",
" 'height_ratios': [1, 1, 0.2]}, figsize=(8, 5))\n",
"\n",
"libfmp.b.plot_chromagram(X, ax=[ax[0, 0], ax[0, 1]], Fs=Fs_X, clim=[0, 1], xlabel='',\n",
" title='STFT-based chromagram (feature rate = %0.1f Hz)' % (Fs_X))\n",
"#libfmp.b.plot_segments_overlay(ann, ax=ax[0, 0], time_max=x_dur,\n",
"# print_labels=False, colors=color_ann, alpha=0.1)\n",
"\n",
"libfmp.b.plot_chromagram(X_chord, ax=[ax[1, 0], ax[1, 1]], Fs=Fs_X, clim=[0, 1], xlabel='',\n",
" title='Binary templates of the chord recognition result')\n",
"#libfmp.b.plot_segments_overlay(ann, ax=ax[1, 0], time_max=x_dur,\n",
"# print_labels=False, colors=color_ann, alpha=0.1)\n",
"\n",
"#libfmp.b.plot_segments(ann, ax=ax[2, 0], time_max=x_dur, time_label='Time (seconds)',\n",
"# colors=color_ann, alpha=0.3)\n",
"ax[2,1].axis('off')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further Notes\n",
"\n",
"In this notebook, we have studied the problem of chord recognition with the objective of automatically extracting chord labels from a given music recording. Only considering the $24$ major and minor triads in a simplistic scenario, we introduced a first template-based matching procedure that compares chroma features of the music recording with prototypical chroma templates of the triads. Already with this simple baseline approach, there are many design choices that have a substantial influence on the final chord recognition results. \n",
"\n",
"* In this notebook, we have used **idealized binary chord templates** that indicate the presence or absence of notes in the given chord. For real music recordings, however, the presence of [harmonics](../C1/C1S3_HarmonicSeries.html) and other sound components leads to chroma features where the energy is spread over the chroma bands in a more unstructured, non-binary fashion. This motivates the usage of **chord templates with harmonics**, where the chroma patterns also account for the harmonics of the chord notes.\n",
"\n",
"* Instead of explicitly modeling the harmonics, a conceptually different approach is to **learn** chroma patterns from labeled training data. The input of such a **supervised learning** procedure consists of pairs of a chroma vector and a corresponding chord label. A simple way is then to derive chord templates by suitably averaging chroma vectors that are labeled with the same chord. \n",
"\n",
"* Taking the average templates is a simple way to adapt a template-based chord recognizer to given training data. More involved approaches are often based on statistical models that capture not only the **averages** but also the **variances** in the training data. In such approaches, the templates are replaced by chord models that are specified by, e.g., **Gaussian distributions** given in terms of a mean vector and a covariance matrix. The similarity of a given chroma vector to a chord model is then expressed by a Gaussian probability value and the assigned label is determined by the probability-maximizing chord model. The discussion of such statistical approaches is beyond the scope of these notebooks, and we refer the reader to the overview article by Cho and Bello.\n",
"\n",
"* Besides refining and adapting the chord templates, another general strategy is to modify and enhance the chroma features extracted from the audio recordings to be analyzed. We have already seen at the beginning of this notebook that there are many **chroma variants** with quite different properties. The chroma type used has a strong influence on the chord recognition results, as has been demonstrated by Jiang et al. in their article on **Analyzing Chroma Feature Types for Automated Chord Recognition**.\n",
"\n",
"Some of these components and more elaborate procedures for automated chord recognition are discussed in the subsequent notebooks.\n",
"\n",
"* In the [FMP notebook on chord recognition evaluation](../C5/C5S2_ChordRec_Eval.html), we introduce some evaluation measures and discuss further phenomena and examples.\n",
"* In the [FMP notebook on HMM-based chord recognition](../C5/C5S3_ChordRec_HMM.html), we discuss a chord recognition procedure that applies a context-aware postfiltering technique.\n",
"* In the [FMP notebook on the Beatles collection](../C5/C5S3_ChordRec_Beatles.html), we present a case study that indicates the relevance of various chord recognition components. "
]
},
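{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the first two refinements above more concrete, the following formulas give one possible formalization. They are a sketch under stated assumptions and are not implemented in this notebook. The decay parameter $\\alpha$, the restriction to eight harmonics, and the index sets $\\mathcal{N}_\\lambda$ are introduced here only for illustration.\n",
"\n",
"Assume that only the first eight harmonics of a note are considered and that the energy of the $k$-th harmonic decays as $\\alpha^{k-1}$ for some $\\alpha\\in[0,1]$. In the equal-tempered approximation, the first eight harmonics of a note with chroma $\\mathrm{C}$ fall onto the chroma values $0,0,7,0,4,7,10,0$, so that a single note $\\mathrm{C}$ yields the pattern\n",
"\n",
"$$\n",
"\\mathbf{n}^h_{\\mathrm{C}} = (1+\\alpha+\\alpha^3+\\alpha^7,\\,0,\\,0,\\,0,\\,\\alpha^4,\\,0,\\,0,\\,\\alpha^2+\\alpha^5,\\,0,\\,0,\\,\\alpha^6,\\,0)^\\top.\n",
"$$\n",
"\n",
"A chord template with harmonics is then obtained by summing the cyclically shifted note patterns of the chord notes, e.g., shifts by $0$, $4$, and $7$ for $\\mathbf{C}$. For $\\alpha=0$ (with the convention $0^0=1$), this reduces to the binary template $\\mathbf{t}_{\\mathbf{C}}$ used above.\n",
"\n",
"For the learning-based variant, let $\\mathcal{N}_\\lambda\\subseteq[1:N]$ denote the set of training frames whose reference label is $\\lambda\\in\\Lambda$. A simple averaged template is then\n",
"\n",
"$$\n",
"\\mathbf{t}_\\lambda = \\frac{1}{|\\mathcal{N}_\\lambda|}\\sum_{n\\in\\mathcal{N}_\\lambda} x_n,\n",
"$$\n",
"\n",
"which is defined whenever $\\mathcal{N}_\\lambda$ is nonempty and can be used in the matching procedure in place of the binary template."
]
},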
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert\" style=\"background-color:#F5F5F5; border-color:#C8C8C8\">\n",
"<strong>Acknowledgment:</strong> This notebook was created by <a href=\"https://www.audiolabs-erlangen.de/fau/professor/mueller\">Meinard Müller</a> and <a href=\"https://www.audiolabs-erlangen.de/fau/assistant/weiss\">Christof Weiß</a>.\n",
"</div> "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"border:none\">\n",
"<tr style=\"border:none\">\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C0/C0.html\"><img src=\"../data/C0_nav.png\" style=\"height:50px\" alt=\"C0\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C1/C1.html\"><img src=\"../data/C1_nav.png\" style=\"height:50px\" alt=\"C1\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C2/C2.html\"><img src=\"../data/C2_nav.png\" style=\"height:50px\" alt=\"C2\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C3/C3.html\"><img src=\"../data/C3_nav.png\" style=\"height:50px\" alt=\"C3\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C4/C4.html\"><img src=\"../data/C4_nav.png\" style=\"height:50px\" alt=\"C4\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C5/C5.html\"><img src=\"../data/C5_nav.png\" style=\"height:50px\" alt=\"C5\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C6/C6.html\"><img src=\"../data/C6_nav.png\" style=\"height:50px\" alt=\"C6\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C7/C7.html\"><img src=\"../data/C7_nav.png\" style=\"height:50px\" alt=\"C7\"></a></td>\n",
" <td style=\"min-width:50px; border:none\" bgcolor=\"white\"><a href=\"../C8/C8.html\"><img src=\"../data/C8_nav.png\" style=\"height:50px\" alt=\"C8\"></a></td>\n",
"</tr>\n",
"</table>"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}