A Frequency‐Uniform and Pitch‐Invariant Time‐Frequency Representation

We introduce the terms frequency‐uniformity and pitch‐invariance in order to characterize time‐frequency representations. A frequency‐uniform representation has the property that it displays Dirac transients as a straight line in the spectrogram, while a pitch‐invariant representation translates pitch change into shifts, which is adequate for melodic instruments. We propose a novel representation that fulfills both criteria.


Analysis of existing representations
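One building block for time-frequency representations is the short-time Fourier transform (STFT), given by (cf. [1]):
$$V_w X(t, f) = \mathcal{F}(X \cdot w(\cdot - t))(f), \quad X \in \mathcal{S}'(\mathbb{R}), \; w \in \mathcal{S}(\mathbb{R}),$$
where $\mathcal{S}(\mathbb{R})$ is the Schwartz space and $\mathcal{S}'(\mathbb{R})$ is its dual, the space of tempered distributions. It can be shown via standard arguments that $V_w X \in C^\infty(\mathbb{R} \times \mathbb{R}) \cap \mathcal{S}'(\mathbb{R} \times \mathbb{R})$.
Definition 2.1 A time-frequency representation $U$ is called frequency-uniform if its response $U\delta$ to the Dirac distribution $\delta$ is constant in $f$.
Proposition 2.2 For any $w \in \mathcal{S}(\mathbb{R})$, the time-frequency representation given by $U^{\mathrm{stft}}_w X = |V_w X|^2$ is frequency-uniform.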
This condition ensures that a signal with a constant frequency spectrum has a frequency-independent footprint in the spectrogram, which is important when representing short, percussive sounds. Voice and more melodic musical instruments, such as woodwind and string instruments, produce sounds that consist mainly of sinusoids. These instruments are often capable of producing similar sounds at different pitches. When the fundamental frequency is modulated, the frequencies of all sinusoids are multiplied by a constant factor, resulting in a dilation along the frequency axis of the STFT spectrogram.
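The frequency-uniformity of the STFT spectrogram can be illustrated in a discrete setting. The following is a minimal sketch, not the authors' implementation: the helper `stft_spectrogram`, the signal length `n`, the window length, and the Gaussian width `sigma` are all assumptions chosen for illustration, and a unit impulse stands in for the Dirac distribution.

```python
import numpy as np


def stft_spectrogram(x, window):
    """Squared-magnitude STFT with hop size 1 and zero-padded edges."""
    half = len(window) // 2
    padded = np.concatenate([np.zeros(half), x, np.zeros(half)])
    # Frame k is centered on sample k of the original signal.
    frames = np.stack([
        padded[k:k + len(window)] * window for k in range(len(x))
    ])
    return np.abs(np.fft.fft(frames, axis=1)) ** 2  # shape (time, frequency)


# Assumed parameters, chosen only for illustration.
n = 64
sigma = 4.0
taps = np.arange(-16, 17)
window = np.exp(-taps**2 / (2 * sigma**2))  # sampled Gaussian window

impulse = np.zeros(n)
impulse[n // 2] = 1.0  # discrete stand-in for the Dirac distribution

spec = stft_spectrogram(impulse, window)

# Largest variation across frequency bins within any single time frame.
freq_spread = (spec.max(axis=1) - spec.min(axis=1)).max()
print(f"max variation across frequency: {freq_spread:.2e}")
```

Each frame that contains the impulse holds a single nonzero sample, so its discrete Fourier transform has constant magnitude and the spectrogram shows a straight line along the frequency axis, up to floating-point error.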
Since dilations are hard to handle computationally, more "musically meaningful" representations have been developed:
• The mel spectrogram $U^{\mathrm{mel}}_w$ is derived from the STFT spectrogram by applying smoothing and a logarithmic transform along the frequency axis.
• The constant-Q transform $U^{\mathrm{cq}}_w$ [2] is a wavelet transform.
A representation $U$ is called pitch-invariant if, for the complex exponential $x_\nu$ of frequency $\nu$ with arbitrary $\nu > 0$, the spectrogram $U x_\nu(t, \log f)$ only depends on the ratio $\nu/f$.
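To see why dependence on the ratio $\nu/f$ alone turns a pitch change into a shift, one can write $U x_\nu(t, \log f) = g(t, \nu/f)$, where the function $g$ is introduced here only for illustration. Scaling the pitch by an assumed factor $c > 0$ then gives:

```latex
\begin{align*}
U x_{c\nu}(t, \log f)
  &= g\!\left(t, \frac{c\nu}{f}\right)
   = g\!\left(t, \frac{\nu}{f/c}\right) \\
  &= U x_\nu\bigl(t, \log(f/c)\bigr)
   = U x_\nu(t, \log f - \log c).
\end{align*}
```

The spectrogram of the transposed signal is thus the original spectrogram translated by $\log c$ along the logarithmic frequency axis, rather than dilated.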
The constant-Q transform is pitch-invariant for all w ∈ S(R).
Proof. The latter claim is shown by substitution. For the first statement, we note that $|V_w x_\nu(t, f)|^2 = 2\pi\sigma^2 e^{-(2\pi\sigma(f-\nu))^2/2}$. When convolving two Gaussians, their means and variances add and their $L^1$-norms multiply, which yields a closed form for $U^{\mathrm{mel}}_w x_\nu$.
Regarding the response to a Dirac distribution, we get $U^{\mathrm{cq}}_w \delta(t, \log f) = |w_{1/f}(-t)|^2$, which is only constant in $f$ for the trivial $w = 0$. Thus, the constant-Q transform is not frequency-uniform.
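The substitution argument for the constant-Q transform can be checked numerically. The sketch below rests on assumptions that are not taken from the text: the window is dilated as $w_{1/f}(s) = f\,w(f s)$, $w$ is a Gaussian of width `sigma`, the response is evaluated at $t = 0$, and the integral is approximated by a Riemann sum. The function name `cq_response` is hypothetical.

```python
import numpy as np


def cq_response(nu, f, sigma=1.0, num=20001):
    """Squared constant-Q coefficient of exp(2j*pi*nu*s) at t = 0.

    Assumption: the Gaussian window is dilated as w_{1/f}(s) = f * w(f * s).
    """
    # Integration grid covering +-10 standard deviations of the dilated window.
    s = np.linspace(-10.0 * sigma / f, 10.0 * sigma / f, num)
    ds = s[1] - s[0]
    window = f * np.exp(-(f * s) ** 2 / (2.0 * sigma**2))
    x_nu = np.exp(2j * np.pi * nu * s)             # complex exponential
    kernel = window * np.exp(-2j * np.pi * f * s)  # conjugated analysis atom
    return np.abs(np.sum(x_nu * kernel) * ds) ** 2


base = cq_response(nu=440.0, f=400.0)
octave_up = cq_response(nu=880.0, f=800.0)    # same ratio nu / f
other_ratio = cq_response(nu=440.0, f=440.0)  # different ratio nu / f
print(base, octave_up, other_ratio)
```

Under these assumptions, the first two values agree because the substitution $u = f s$ removes every dependence on $\nu$ and $f$ except through $\nu/f$, while the third value differs because the ratio has changed.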
In the mel spectrogram, the response $U^{\mathrm{mel}}_w \delta$ depends on $f$ only through the integral $\int_{-\infty}^{\infty} \Lambda_f$, so it can be frequency-uniform if this integral is independent of $f$. However, for Gaussian $w$, it is not possible to have frequency-uniformity and pitch-invariance at the same time.

Proposed representation
Proposition 3.1 The time-frequency representation given by
is both frequency-uniform and pitch-invariant for $f > f_0 > 0$.
Proof. We first check for frequency-uniformity, which follows by adding the variances of the Gaussians. Now we apply the representation to a complex exponential and use the result from the constant-Q transform. Thus, our representation is also pitch-invariant.
The representation generalizes the first layer of the scattering transform [3]. Independently, a related transform was developed by Dörfler et al. [4], which is equivalent to the mel spectrogram and therefore not frequency-uniform and pitch-invariant at the same time.

Conclusion
Our representation combines favorable properties from existing transforms and is thus particularly suitable for the analysis of audio signals with both percussive and melodic components (cf. Fig. 1). However, due to the Heisenberg uncertainty principle, it only reaches down to a certain frequency $f_0$, and the convolution makes its inversion an ill-posed problem.