Author 1, Author 2, Author 3, Author 4
Abstract: We propose an optimization-based method for reconstructing a time-domain signal from a low-dimensional spectral representation such as a mel-spectrogram. Phase reconstruction has been studied to reconstruct a time-domain signal from the full-band short-time Fourier transform (STFT) magnitude. The Griffin-Lim algorithm (GLA) has been widely used because it relies only on the redundancy of STFT and is applicable to various audio signals. In this paper, we jointly reconstruct the full-band magnitude and phase by considering the bi-level relationships among the time-domain signal, its STFT coefficients, and its mel-spectrogram. The proposed method is formulated as a rigorous optimization problem and estimates the full-band magnitude based on the criterion used in GLA. Our experiments demonstrate the effectiveness of the proposed method on speech, music, and environmental signals.
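For readers unfamiliar with the baseline, the following is a minimal NumPy sketch of GLA and its accelerated variant (FGLA), not the proposed method. It alternates between imposing the given magnitude and projecting onto the set of consistent STFT coefficients. The function names, the Hann window, `n_fft=512`, `hop=128`, and the random phase initialization are illustrative assumptions and are not taken from the paper; `alpha` plays the role of the α listed for GLA and FGLA in the tables below.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """STFT with a periodic Hann window; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=512, hop=128):
    """Least-squares inverse STFT: windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = X.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop:i * hop + n_fft] += frames[i] * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=100, alpha=0.0, n_fft=512, hop=128, seed=0):
    """GLA for alpha=0; the accelerated variant (FGLA) for alpha>0.

    `magnitude` is a full-band STFT magnitude of shape (frames, bins).
    All default parameter values are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    c = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    t_prev = None
    for _ in range(n_iter):
        # Projection onto the magnitude constraint: keep the phase, replace the magnitude.
        proj = magnitude * np.exp(1j * np.angle(c))
        # Projection onto consistent spectrograms: inverse STFT followed by STFT.
        t = stft(istft(proj, n_fft, hop), n_fft, hop)
        # Extrapolation step; with alpha=0 this reduces to plain GLA.
        c = t if t_prev is None else t + alpha * (t - t_prev)
        t_prev = t
    return istft(magnitude * np.exp(1j * np.angle(c)), n_fft, hop)
```

A call such as `griffin_lim(np.abs(stft(x)), n_iter=100, alpha=0.9)` returns a time-domain signal whose STFT magnitude approximates the given one. Note that this sketch requires the full-band magnitude as input, which is exactly what is unavailable when only a mel-spectrogram is given.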
Evaluation on Speech Signals:
This section presents examples of speech signals reconstructed from 80-bin mel-spectrograms. The original speech signals are taken from a subset of the TIMIT dataset [1] and resampled to 16 kHz. For more details, please refer to our paper.
| | Speech 1 | Speech 2 | Speech 3 |
| Original | | | |
| GLA (α=0.0) [2] | | | |
| FGLA (α=0.9) [3] | | | |
| ADMMGLA [4] | | | |
| Gradient descent | | | |
| Prop-mel (α=0.0) | | | |
| Prop-mel (α=0.9) | | | |
| Prop-full (α=0.0) | | | |
| Prop-full (α=0.9) | | | |
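To illustrate why reconstruction from a mel-spectrogram is harder than from a full-band magnitude, the sketch below computes an 80-bin mel-spectrogram from a magnitude spectrogram with NumPy. The triangular filterbank on the HTK-style mel scale, `n_fft=1024`, `hop=256`, and applying the filterbank to the magnitude (rather than the power) are all assumptions made for illustration; the exact configuration is described in the paper, not on this page.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel definitions exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def magnitude_spectrogram(x, n_fft=1024, hop=256):
    """Full-band STFT magnitude of shape (bins, frames), using a periodic Hann window."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # one second of a 440 Hz tone
mag = magnitude_spectrogram(x)                    # 513 frequency bins per frame
fb = mel_filterbank(sr, 1024, 80)
mel = fb @ mag                                    # 80 mel bins per frame
print(mag.shape, mel.shape)
```

Under these assumed settings, each frame is compressed from 513 frequency bins to 80 mel bins, so many different full-band magnitudes map to the same mel-spectrogram. This is why both the full-band magnitude and the phase have to be estimated.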
Evaluation on Music and Environmental Signals:
This section presents examples of reconstructed music signals and foley sounds. The original music signals are taken from the MASS dataset [5] and sampled at 44.1 kHz. The original foley sounds are taken from the development sets of DCASE2023 Task 7 [6] and sampled at 22.05 kHz. For more details, please refer to our paper.
| | Music | DogBark | Footstep |
| Original | | | |
| FGLA (α=0.9) [3] | | | |
| Prop-full (α=0.9) | | | |
References:
[1] P. Mowlaee, J. Kulmer, J. Stahl, and F. Mayer, “Single Channel Phase-Aware Signal Processing in Speech Communication: Theory and Practice,” Wiley, 2016.
[2] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.
[3] N. Perraudin, P. Balazs, and P. L. Søndergaard, “A fast Griffin-Lim algorithm,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Oct. 2013, pp. 1-4.
[4] Y. Masuyama, K. Yatabe, and Y. Oikawa, “Griffin-Lim like phase recovery via alternating direction method of multipliers,” IEEE Signal Process. Lett., vol. 26, pp. 184-188, Jan. 2019.
[5] M. Vinyes, “MTG MASS database,” 2008.
[6] K. Choi, J. Im, L. Heller, B. McFee, K. Imoto, Y. Okamoto, M. Lagrange, and S. Takamichi, “Foley sound synthesis at the DCASE 2023 challenge,” 2023.