We provide an executive summary of the paper "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation" by Michal Aharon, Michael Elad and Alfred Bruckstein \cite{aharon}.
Suppose a signal is denoted by a numeric vector \(\mathbf{y} \in \mathbb{R}^n\). Its sparse representation is \(\mathbf{y} \approx \mathbf{D} \mathbf{x}\), where \(\mathbf{D} \in \mathbb{R}^{n \times K}\) is an overcomplete dictionary matrix (assuming \(n < K\)) and \(\mathbf{x} \in \mathbb{R}^K\) is a sparse vector. In other words, the goal is to represent \(\mathbf{y}\) as a sparse linear combination of the \(K\) prototype signal atoms \(\{\mathbf{d}_j\}^K_{j=1}\) that form the columns of the dictionary \(\mathbf{D}\).
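As a toy illustration of this notation (the dimensions below are arbitrary), a signal can be synthesised in NumPy as a sparse combination of dictionary atoms:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 50                              # signal dimension and number of atoms (n < K)

D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)             # atoms are conventionally normalised to unit l2-norm

x = np.zeros(K)                            # sparse coefficient vector with 3 non-zero entries
x[rng.choice(K, size=3, replace=False)] = rng.standard_normal(3)

y = D @ x                                  # y is a sparse linear combination of the atoms
```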
In applications, the task is to train a suitable dictionary \(\mathbf{D}\) from a set of training signals \(\mathbf{Y} = \{\mathbf{y}_j\}_{j=1}^N\), so that new signals can be stored compactly as sparse coefficient vectors over this dictionary. This can be formulated as,
\[\begin{equation} \min_{\mathbf{D}, \mathbf{X}} \left\{ ||\mathbf{Y} - \mathbf{DX}||_F^2 \right\}, \text{ s.t. } \forall i, ||\mathbf{x}_i||_0 \leq T_0, \label{eq:formula} \end{equation}\]where \(\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^{N} \in \mathbb{R}^{K \times N}\), and \(T_0\) bounds the number of non-zero entries in each column \(\mathbf{x}_i\). The objective is to minimise the (squared) Frobenius norm of the reconstruction error by jointly optimising the dictionary \(\mathbf{D}\) and the sparse coefficient matrix \(\mathbf{X}\).
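For reference, the objective and the sparsity constraint in (\ref{eq:formula}) can be evaluated directly; the snippet below is a small helper we could use to monitor progress (the names are ours, not from the paper):

```python
import numpy as np

def objective(Y, D, X, T0):
    """Return the squared Frobenius reconstruction error of (Y - D X) and
    whether every column of X satisfies the sparsity constraint ||x_i||_0 <= T0."""
    err = np.linalg.norm(Y - D @ X, ord='fro') ** 2
    feasible = bool(np.all(np.count_nonzero(X, axis=0) <= T0))
    return err, feasible
```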
Algorithms for solving the sparse representation problem (\ref{eq:formula}) typically alternate between solving for \(\mathbf{X}\) and for \(\mathbf{D}\), each time treating the other variable as known. The literature refers to the two corresponding subproblems as follows; a high-level sketch of the alternation is given after the two definitions.
Sparse coding computes the representation coefficients \(\mathbf{x}\) for a given signal \(\mathbf{y}\) and a fixed dictionary \(\mathbf{D}\), i.e. it solves \eqref{eq:formula} with \(\mathbf{D}\) held fixed.
Dictionary update searches for a better dictionary \(\mathbf{D}\) given the signals \(\mathbf{Y}\) and the coefficient matrix \(\mathbf{X}\) obtained in the preceding sparse coding stage. By "better" we mean that it further decreases the objective in \eqref{eq:formula}.
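A high-level sketch of this alternation is given below. The two stage functions are passed in as arguments (for instance the OMP-based coder and the K-SVD atom update sketched later in this note), so the sketch itself makes no assumption about how each stage is solved.

```python
def alternate_minimisation(Y, D0, T0, sparse_coding, dictionary_update, n_iter=50):
    """Alternate between the two subproblems for a fixed number of iterations.

    Y: (n, N) training signals; D0: (n, K) initial dictionary;
    sparse_coding(Y, D, T0) -> X and dictionary_update(Y, D, X) -> (D, X)
    are the two stages described above.
    """
    D, X = D0.copy(), None
    for _ in range(n_iter):
        X = sparse_coding(Y, D, T0)         # stage 1: fix D, solve for sparse X
        D, X = dictionary_update(Y, D, X)   # stage 2: fix the supports of X, improve D
    return D, X
```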
The K-SVD algorithm introduced by Michal Aharon, Michael Elad and Alfred Bruckstein \cite{aharon} improves the dictionary update stage: it updates the dictionary one atom at a time, together with the coefficients that use that atom, via a singular value decomposition (SVD).
To see this, assume that \(\mathbf{X}\) and \(\mathbf{D}\) are fixed except for one dictionary column \(\mathbf{d}_k\) and its associated sparse coefficients \(\mathbf{x}^k_T\), namely the \(k\)-th row of \(\mathbf{X}\). The objective in (\ref{eq:formula}) can then be rewritten as,
\[\begin{align} ||\mathbf{Y} - \mathbf{D} \mathbf{X}||_F^2 & = \left\Vert \left( \mathbf{Y} - \sum_{j \neq k} \mathbf{d}_j \mathbf{x}^j_T \right) - \mathbf{d}_k \mathbf{x}_T^k \right\Vert_F^2 \\ & := \left\Vert \mathbf{E}_k - \mathbf{d}_k \mathbf{x}_T^k \right\Vert_F^2. \label{eq:errorUpdateD} \end{align}\]The matrix \(\mathbf{E}_k\) represents the error for all the \(N\) samples when the \(k\)-th atom is removed. We can use the SVD to find the closest rank-1 matrix that approximates \(\mathbf{E}_k\), which effectively minimises (\ref{eq:errorUpdateD}). However, the resulting \(\mathbf{x}_T^k\) is unlikely to fulfil the sparsity requirement. The remedy is to require the new \(\tilde{\mathbf{x}}_T^k\) to have the same support as the original \(\mathbf{x}_T^k\). To enforce this requirement, we revise the objective (\ref{eq:errorUpdateD}) to
\[\begin{equation} \left\Vert \mathbf{E}_k \mathbf{\Omega}_k - \mathbf{d}_k \mathbf{x}_T^k \mathbf{\Omega}_k \right\Vert_F^2, \end{equation}\]where
\[\begin{equation} (\mathbf{\Omega}_k)_{ij} = \begin{cases} 1 \text{, } i = \omega_k(j); \\ 0 \text{, otherwise;} \end{cases} \end{equation}\]defines the \(N \times |\omega_k|\) selection matrix \(\mathbf{\Omega}_k\), and the support of \(\mathbf{x}_T^k\) is defined as \(\omega_k := \{i: 1 \leq i \leq N, \text{ } \mathbf{x}_T^k(i) \neq 0 \}\). Multiplying by \(\mathbf{\Omega}_k\) shrinks \(\mathbf{x}_T^k\) by discarding its zero entries and restricts \(\mathbf{E}_k\) to the error columns corresponding to the samples that use the atom \(\mathbf{d}_k\).
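As a quick sanity check (with stand-in arrays; in code one would normally index the support directly rather than form \(\mathbf{\Omega}_k\) explicitly), the selection can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 20, 100
E_k = rng.standard_normal((n, N))          # stand-in for the error matrix E_k
x_k = np.zeros(N)                          # stand-in for the k-th row x_T^k of X
x_k[rng.choice(N, size=5, replace=False)] = rng.standard_normal(5)

omega_k = np.flatnonzero(x_k)              # support of x_T^k

Omega_k = np.zeros((N, omega_k.size))      # N x |omega_k| selection matrix
Omega_k[omega_k, np.arange(omega_k.size)] = 1.0

# multiplying by Omega_k keeps only the columns / entries on the support
assert np.allclose(E_k @ Omega_k, E_k[:, omega_k])
assert np.allclose(x_k @ Omega_k, x_k[omega_k])
```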
Thereafter, we can decompose \(\mathbf{E}_k \mathbf{\Omega}_k = \mathbf{U} \Delta \mathbf{V}^\top\), and update:
\(\mathbf{d}_k\) as the first column of \(\mathbf{U}\);
the non-zero elements of \(\mathbf{x}_T^k\) as the first column of \(\mathbf{V}\) multiplied by the largest singular value \(\Delta(1,1)\).
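In code, the update of a single atom might look like the following sketch (our own helper, not the paper's reference implementation; unused atoms are simply left untouched here, although one could instead replace such an atom, e.g. with a poorly represented training signal):

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Update atom D[:, k] and the non-zero entries of row X[k, :] via a rank-1 SVD.

    Y is (n, N), D is (n, K), X is (K, N); D and X are modified in place.
    """
    omega_k = np.flatnonzero(X[k, :])      # samples that currently use atom k
    if omega_k.size == 0:                  # atom unused: leave it unchanged (a simplification)
        return D, X

    # E_k Omega_k: reconstruction error without atom k, restricted to the support
    E_k = Y[:, omega_k] - D @ X[:, omega_k] + np.outer(D[:, k], X[k, omega_k])

    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]                      # new atom: first left singular vector
    X[k, omega_k] = s[0] * Vt[0, :]        # new coefficients on the same support
    return D, X
```

Cycling this update over all \(k\) constitutes the dictionary update stage of the alternation sketched earlier.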
At this point the K-SVD algorithm has completed the dictionary update, and the new dictionary is used for sparse coding in the next iteration, unless the convergence criterion is satisfied (or the maximum number of iterations is reached).
We’ve implemented the K-SVD algorithm in Python (see the Appendix). For the sparse coding stage we rely on the OrthogonalMatchingPursuit() implementation maintained by Scikit-learn. Together, the two pieces allow us to iteratively solve the sparse representation problem.
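For completeness, here is a minimal sketch of how the sparse coding stage can be wired to OrthogonalMatchingPursuit() (we assume fit_intercept=False since the signals are not centred; our Appendix code may differ in details):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sparse_coding(Y, D, T0):
    """Sparse coding stage: approximate each column of Y with at most T0 atoms of D.

    Y is (n, N), D is (n, K); returns the coefficient matrix X of shape (K, N).
    """
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=T0, fit_intercept=False)
    omp.fit(D, Y)            # D plays the role of the design matrix, Y the targets
    return omp.coef_.T       # coef_ has shape (N, K)
```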
To test the algorithm and our implementation, we follow the same procedure as described in \cite{aharon} to simulate synthetic data (Section V.A) and to measure performance (equation (25) in Section V.D). Moreover, we define a more general performance measure
\[\begin{equation} P(\epsilon) = \sum_{i=1}^K \mathbf{1}_{\{1-|\mathbf{d}_i^\top \tilde{\mathbf{d}}_i|< \epsilon \} }, \label{eq:performance} \end{equation}\]which counts the number of “successes” in recovering the dictionary atoms. Here \(\mathbf{d}_i\) is a generating dictionary atom and \(\tilde{\mathbf{d}}_i\) is its corresponding element (the closest column in \(\ell^2\) norm) in the recovered dictionary.
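A sketch of how \(P(\epsilon)\) might be computed (assuming both dictionaries have unit-norm columns, so that matching by the largest absolute inner product is equivalent to matching by the closest column in \(\ell^2\) norm up to a sign flip):

```python
import numpy as np

def count_successes(D_true, D_learned, eps):
    """Count generating atoms d_i for which 1 - |d_i^T d~_i| < eps, where d~_i is the
    closest learned atom (both dictionaries assumed to have unit-norm columns)."""
    successes = 0
    for i in range(D_true.shape[1]):
        d = D_true[:, i]
        j = np.argmax(np.abs(D_learned.T @ d))        # best-matching learned atom
        if 1.0 - np.abs(D_learned[:, j] @ d) < eps:
            successes += 1
    return successes
```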
We need to initialise a dictionary to feed into the sparse coding process, so it is reasonable to investigate how sensitive the result is to the dictionary initialisation. Before doing so, we show the result of a single trial. In Figure 1, we plot the relative error (\(\|\mathbf{Y}-\mathbf{DX}\|_F / \|\mathbf{Y}\|_F\)) over iterations on the left, and the histogram of column \(\ell^2\)-norm errors on the right. The error tends to converge to a positive constant. Moreover, the dictionary update by the K-SVD algorithm greatly reduces the error in each iteration; unfortunately, the orthogonal matching pursuit step in the next iteration always brings the error back up to a high level. We suspect this is because the Scikit-learn implementation does not allow us to initialise \(\mathbf{X}\) from the previous iteration, so the sparse coding stage repeatedly ends up in a worse local minimum. Looking at the histogram, we observe that 95% of the recovered signals (columns) deviate from the original signals by less than 20%.
Figure 1 - Error.
Next we look at the performance measure defined in (\ref{eq:performance}). In Figure 2, we show how the number of successes varies with the tolerance \(\epsilon\). With \(\epsilon \geq 0.06\), we obtain at least 45 successes out of 50 atoms in the dictionary recovery.
Figure 2 - Count of successes.
As mentioned before, it is reasonable to investigate the sensitivity of the result against the dictionary initialisation. We initialise the dictionary with i.i.d. uniformly distributed entries and then normalise each column to unit \(\ell^2\)-norm (a short sketch of this initialisation is given below Figure 3). Figure 3 shows how the error evolves over iterations for 50 different initialisations; the errors turn out to be reasonably consistent by the end of the iterations.
Figure 3 - Errors of multiple trials.
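The initialisation itself is a one-liner (the interval of the uniform distribution is our choice and not essential):

```python
import numpy as np

def init_dictionary(n, K, rng=None):
    """Random initial dictionary: i.i.d. uniform entries, columns normalised to unit l2-norm."""
    rng = np.random.default_rng(rng)
    D0 = rng.uniform(-1.0, 1.0, size=(n, K))   # the interval [-1, 1) is an arbitrary choice
    return D0 / np.linalg.norm(D0, axis=0)
```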