Spectrogram and STFT

Almost every MIR feature in libsonare is computed from a time-frequency representation of the audio. That includes chroma, mel, MFCC, onset strength, and even BPM and key.

Understanding this one foundation explains why so many functions share the same nFft and hopLength parameters. This page goes one level deeper than the map in MIR Overview.

From waveform to frequencies: the FFT

A waveform is amplitude over time; it does not tell you which frequencies are present. The Fourier transform answers that: it decomposes a signal into the sine waves that sum to make it, like a prism splitting light into colors. The FFT (Fast Fourier Transform) is an efficient algorithm for computing the discrete form of that transform (the DFT) on a block of samples.
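To make the decomposition concrete, here is a minimal sketch that computes a DFT directly from its definition, which is the slow way an FFT speeds up. The 8000 Hz sample rate, the block length, and the two test tones are illustrative assumptions, and the function name is hypothetical rather than part of libsonare.

```python
import cmath
import math


def dft_magnitudes(block):
    """Naive O(N^2) DFT of a real block; returns magnitudes for bins 0..N/2."""
    n = len(block)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(block):
            # Correlate the block with a complex sinusoid of k cycles per block.
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


SAMPLE_RATE = 8000  # Hz; assumed for illustration
N = 256             # block length, so bins are 8000 / 256 = 31.25 Hz apart

# Two sines chosen to land exactly on bins 8 (250 Hz) and 32 (1000 Hz).
block = [
    math.sin(2 * math.pi * 250 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    for t in range(N)
]

mags = dft_magnitudes(block)

# Pick the two strongest bins and convert their indices back to Hz.
top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs_hz = sorted(k * SAMPLE_RATE / N for k in top_bins)
print(peak_freqs_hz)
```

The waveform alone does not reveal the two tones, but the two strongest bins map straight back to their frequencies.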

The catch: one FFT describes a whole block as if its frequency content never changed. Music changes constantly, so a single FFT of a whole song shows which frequencies occur somewhere in it, but not when.

Tracking change over time: the STFT

The Short-Time Fourier Transform (STFT) fixes this by chopping the audio into short, overlapping windows and running an FFT on each. The result is a 2-D grid — frequency content for each moment in time. A spectrogram is that grid drawn as an image: time on one axis, frequency on the other, energy as brightness.

Figure: STFT, seeing time and frequency at once. A tone sweeping from 220 Hz to 4 kHz. Each column is one short-time spectrum; brighter means more energy at that frequency.
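The chop, window, and transform loop can be sketched in a few lines. This is a teaching sketch under stated assumptions, not libsonare's implementation: it uses a slow direct DFT instead of an FFT, does no edge padding, and the function names, sample rate, and test signal are hypothetical.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitudes(signal, n_fft, hop_length):
    """Return one magnitude spectrum (n_fft // 2 + 1 bins) per frame."""
    window = hann(n_fft)
    # Precompute the complex sinusoid samples shared by every bin and frame.
    twiddle = [cmath.exp(-2j * math.pi * m / n_fft) for m in range(n_fft)]
    frames = []
    # Only full frames are used here; no padding at the signal edges.
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        block = [signal[start + i] * window[i] for i in range(n_fft)]
        spectrum = []
        for k in range(n_fft // 2 + 1):
            acc = sum(block[t] * twiddle[(k * t) % n_fft] for t in range(n_fft))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


SAMPLE_RATE = 8000          # Hz; assumed for illustration
N_FFT, HOP_LENGTH = 128, 64  # small values keep the naive DFT fast

# A signal that changes over time: 500 Hz for 512 samples, then 2000 Hz.
signal = [
    math.sin(2 * math.pi * (500 if t < 512 else 2000) * t / SAMPLE_RATE)
    for t in range(1024)
]

spec = stft_magnitudes(signal, N_FFT, HOP_LENGTH)

# Strongest frequency in each frame, in Hz.
peak_hz = [
    max(range(len(frame)), key=frame.__getitem__) * SAMPLE_RATE / N_FFT
    for frame in spec
]
print(len(spec), "frames; first peak:", peak_hz[0], "Hz; last peak:", peak_hz[-1], "Hz")
```

A single DFT of the whole signal would report both tones at once. The per-frame peaks show that one tone comes first and the other second.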

Two parameters appear everywhere because they control this grid:

| Parameter | What it sets | Trade-off |
| --- | --- | --- |
| nFft | Window size in samples (e.g. 2048) | Bigger = finer frequency detail, but blurrier timing |
| hopLength | Step between windows (e.g. 512) | Smaller = more frames per second (smoother motion), more CPU |

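This is the time-frequency resolution trade-off: you cannot have perfect frequency and perfect time resolution at once. Frequency bins sit sampleRate / nFft apart, while each window spans nFft / sampleRate seconds, so narrowing one widens the other. It is a property of Fourier analysis itself, not a libsonare limitation.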
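The trade-off is easy to put numbers on. The sketch below derives the grid spacing from nFft and hopLength; the 44100 Hz sample rate is an assumption for illustration (this page does not state one), and `stft_grid` is a hypothetical helper, not a libsonare function.

```python
def stft_grid(sample_rate, n_fft, hop_length):
    """Derive the time and frequency step sizes of an STFT grid."""
    return {
        "bin_spacing_hz": sample_rate / n_fft,         # distance between frequency bins
        "window_ms": 1000 * n_fft / sample_rate,       # audio covered by one frame
        "hop_ms": 1000 * hop_length / sample_rate,     # time step between frames
        "frames_per_second": sample_rate / hop_length,
        "freq_bins": n_fft // 2 + 1,                   # bins from a real-input FFT
    }


SAMPLE_RATE = 44100  # Hz; assumed for illustration, not stated on this page

default_grid = stft_grid(SAMPLE_RATE, n_fft=2048, hop_length=512)
short_grid = stft_grid(SAMPLE_RATE, n_fft=512, hop_length=512)

for name, grid in (("nFft=2048", default_grid), ("nFft=512", short_grid)):
    print(name, {key: round(value, 2) for key, value in grid.items()})
```

Under that assumption, nFft = 2048 gives bins about 21.5 Hz apart with each window spanning about 46 ms. Dropping to nFft = 512 shortens the window to about 11.6 ms but widens the bins to about 86 Hz. The product of bin spacing and window duration stays constant.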

Windowing

Each block is multiplied by a window function (such as a Hann window) that tapers its edges to zero before the FFT. Without it, the abrupt block boundaries leak energy across frequencies (spectral leakage) and smear the result. Tapering also suppresses the samples near each block's edges, which is why frames overlap: neighboring frames cover the samples that one frame's taper attenuated.
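Leakage can be measured directly. The sketch below analyzes a tone that does not complete a whole number of cycles in the block, once with no taper and once with a Hann window, and reports how much spectral energy lands far from the tone's true frequency. The block length, tone, and the "more than 3 bins away" threshold are illustrative assumptions; the helper names are hypothetical.

```python
import cmath
import math


def magnitudes(block):
    """Naive DFT magnitudes for bins 0..N/2 of a real block."""
    n = len(block)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(block)))
        for k in range(n // 2 + 1)
    ]


def leakage_fraction(mags, true_bin, half_width=3.0):
    """Share of spectral energy more than half_width bins from the true frequency."""
    total = sum(m * m for m in mags)
    far = sum(m * m for k, m in enumerate(mags) if abs(k - true_bin) > half_width)
    return far / total


N = 64
CYCLES = 10.5  # not a whole number of cycles, so the block edges do not line up

tone = [math.sin(2 * math.pi * CYCLES * t / N) for t in range(N)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / N) for t in range(N)]

rect_leak = leakage_fraction(magnitudes(tone), CYCLES)
hann_leak = leakage_fraction(magnitudes([x * w for x, w in zip(tone, hann)]), CYCLES)

print(f"no window:   {rect_leak:.6f}")
print(f"Hann window: {hann_leak:.6f}")
```

The Hann figure should come out far smaller than the unwindowed one. The cost is a slightly wider main peak, which is part of the same resolution trade-off.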

Musical spacing: CQT and VQT

A standard STFT spaces its frequency bins evenly in Hz. Musical notes are different: they are spaced logarithmically, because each octave doubles in frequency.

That means an even-Hz grid can waste resolution high up and lack resolution down low.
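The mismatch between the two spacings can be quantified. This sketch counts how many evenly spaced STFT bins fit between a note and the next semitone up, and generates log-spaced center frequencies of the kind a CQT uses. The 44100 Hz sample rate and nFft of 2048 are assumptions for illustration, and the helper names are hypothetical rather than libsonare functions.

```python
def note_hz(midi):
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi - 69) / 12)


def bins_per_semitone(midi, sample_rate, n_fft):
    """How many STFT bins fit between a note and the next semitone up."""
    bin_spacing = sample_rate / n_fft
    return (note_hz(midi + 1) - note_hz(midi)) / bin_spacing


def log_spaced_centers(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced center frequencies: each octave doubles."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]


SAMPLE_RATE, N_FFT = 44100, 2048  # assumed for illustration

for name, midi in (("A1", 33), ("A4", 69), ("A7", 105)):
    ratio = bins_per_semitone(midi, SAMPLE_RATE, N_FFT)
    print(f"{name}: {note_hz(midi):8.1f} Hz, {ratio:.2f} STFT bins per semitone")

# Two octaves of semitone-spaced centers starting at A1 (55 Hz).
centers = log_spaced_centers(55.0, 25)
print([round(f, 1) for f in centers[::12]])
```

Under these assumptions the bass gets well under one bin per semitone, so adjacent low notes share a bin, while the top of the range gets several bins per semitone. Log-spaced centers give every semitone the same share.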

| Transform | How it spaces frequency | Use when |
| --- | --- | --- |
| STFT | Evenly in Hz | You want a general time-frequency view |
| CQT | By musical interval, often one bin per semitone | Pitch relationships matter |
| VQT | Like CQT, but it relaxes the constant-Q constraint in the low range (shorter analysis windows there) so low notes do not smear in time | You want CQT-style pitch-aware bins but cleaner timing on bass and low percussion |

How libsonare computes the STFT

libsonare's STFT and framing utilities apply a window (Hann by default), advance by hopLength, and run a real FFT per frame, producing the magnitude/power spectra that mel, chroma, onset, and tempogram stages reuse. Because the intermediate spectrogram is shared, asking for several features on one source does not recompute the FFT each time. The nFft/hopLength defaults (2048/512) mirror common librosa usage so reference tests can compare outputs. CQT/VQT use log-frequency bins layered on top of the same framing conventions.
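The reuse idea can be illustrated with a small cache. Everything here is a hypothetical sketch of the concept, not libsonare's API or internals: the class, the two toy features, and the small nFft (chosen so the naive DFT stays fast) are all assumptions.

```python
import cmath
import math


class SharedSpectrogram:
    """Hypothetical cache: the power spectrogram is computed on first use only."""

    def __init__(self, signal, n_fft, hop_length):
        self.signal, self.n_fft, self.hop_length = signal, n_fft, hop_length
        self.compute_count = 0  # how many times the transform actually ran
        self._power = None

    def power(self):
        if self._power is None:
            self.compute_count += 1
            n = self.n_fft
            window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
            twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
            self._power = []
            for start in range(0, len(self.signal) - n + 1, self.hop_length):
                block = [self.signal[start + i] * window[i] for i in range(n)]
                self._power.append([
                    abs(sum(block[t] * twiddle[(k * t) % n] for t in range(n))) ** 2
                    for k in range(n // 2 + 1)
                ])
        return self._power


def frame_energy(spec):
    """Total power per frame."""
    return [sum(frame) for frame in spec.power()]


def spectral_flux(spec):
    """Summed power increases between consecutive frames (a crude onset cue)."""
    frames = spec.power()
    return [
        sum(max(b - a, 0.0) for a, b in zip(prev, cur))
        for prev, cur in zip(frames, frames[1:])
    ]


signal = [math.sin(2 * math.pi * 0.125 * t) for t in range(256)]
spec = SharedSpectrogram(signal, n_fft=64, hop_length=32)

energy = frame_energy(spec)   # triggers the one and only transform
flux = spectral_flux(spec)    # reuses the cached frames
print(len(energy), len(flux), spec.compute_count)
```

Both features read the same cached frames, so the transform runs once no matter how many features are requested.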

Related: MIR Overview; Chroma Features; Mel, MFCC, and Timbre; Audio Basics