Benchmarks
Performance comparison of libsonare against librosa (Python) for audio analysis tasks.
Methodology
All numbers below are measured standalone from raw audio — every call rebuilds whatever intermediate state it needs (STFT, Mel, etc.) from the original samples. This is the same code path a one-shot user of either API exercises, so the comparison is apples-to-apples. The full benchmark source and results JSON live in benchmarks/ inside the libsonare repo.
Hardware
Measured on Apple M5 Max (16 cores, 128 GB unified memory). Absolute times scale with hardware; relative differences are stable.
Full Pipeline Analysis
Complete music analysis: BPM + key + beats + chords + sections + timbre + dynamics + rhythm + melody.
Test audio: synthetic WAV, 73 seconds, 44100 Hz stereo, generated by benchmarks/generate_audio.py (the script is committed; the WAV is not).
Full Analysis Latency (lower is better)
| Library | Language | Time | Relative |
|---|---|---|---|
| libsonare | C++ | 0.67s | 1x |
| bpm-detector --comprehensive (librosa-based) | Python | 36.4s | ~54x slower |
The full-pipeline figure is where libsonare's design pays off most: shared spectrograms, parallel feature paths, automatic 44.1 → 22.05 kHz downsampling, and no Python boundary inside the pipeline.
Per-Feature Comparison
Individual feature extraction on the same 73-second audio (resampled to 22050 Hz). librosa timings use time.perf_counter; libsonare timings use std::chrono::steady_clock inside the sonare_bench binary.
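The Python-side measurement pattern is a standard best-of-N wall-clock loop. A minimal sketch (the `bench` helper is illustrative, not part of the benchmark harness; the NumPy FFT stands in for a librosa call):

```python
import time
import numpy as np

def bench(fn, *args, repeats=20):
    """Return the best wall-clock time in seconds over `repeats` runs.

    Taking the minimum suppresses scheduler and cache noise; the same
    pattern applies to timing any per-feature call in the table above.
    """
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Example: time one FFT frame as a stand-in for a real feature call.
frame = np.random.default_rng(0).standard_normal(2048)
t = bench(np.fft.rfft, frame)
```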
Per-Feature Latency (lower is better)
| Feature | librosa | libsonare | Speedup |
|---|---|---|---|
| STFT (2048, hop 512) | 13.3ms | 13.9ms | 0.96x (tied) |
| Mel Spectrogram (128 bands) | 20.4ms | 23.0ms | 0.89x (tied) |
| HPSS (kernel 31) | 1,762ms | 89ms | 19.7x |
| Onset Strength | 21.5ms | 23.5ms | 0.91x (tied) |
| Chroma (STFT-based) | 44.6ms | 15.4ms | 2.9x |
| Beat Track | 35.6ms | 23.8ms | 1.5x |
| MFCC (13 coefficients) | 21.6ms | 23.8ms | 0.91x (tied) |
| pYIN | 5,825ms | 474ms | 12.3x |
| Spectral Centroid | 24.8ms | 16.5ms | 1.5x |
Honest take
For the cheap STFT-dominated features (STFT, Mel, Onset, MFCC) libsonare is roughly tied with librosa. librosa delegates the FFT to scipy.fft (heavily optimized C/Fortran), so once you account for the FFT cost there isn't much room to win on a single call.
The clear wins are in (1) compute-heavy operations where libsonare's multithreaded median filter (HPSS) and parallelized Viterbi decoding (pYIN) leave the Python pipeline far behind, and (2) pipeline-level analysis where shared intermediates and the elimination of the Python boundary dominate.
Where the Big Wins Come From
Full pipeline (54x): shared intermediates + no Python
libsonare's analyze() computes the STFT and Mel spectrogram once and reuses them across every downstream analyzer — chord detection reads the same chromagram the key analyzer uses; the beat tracker reads the same onset envelope the section detector consumes. Independent paths run in parallel across CPU cores. None of this crosses the Python boundary, so per-call dispatch overhead is gone.
bpm-detector (and any other librosa-based pipeline) rebuilds these intermediates per analyzer and orchestrates everything from Python — the cost adds up.
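The recomputation cost is easy to see in a minimal NumPy sketch (hypothetical helper names, not the libsonare API): two features consuming one shared magnitude STFT instead of each rebuilding it from the samples.

```python
import numpy as np

def stft(y, n_fft=2048, hop=512):
    # Naive magnitude STFT: frame, window, rfft. Illustrative only.
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win for i in range(0, len(y) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (bins, frames)

def spectral_centroid(S, sr=22050):
    freqs = np.linspace(0, sr / 2, S.shape[0])
    return (freqs[:, None] * S).sum(0) / (S.sum(0) + 1e-12)

def onset_strength(S):
    # Positive spectral flux, averaged over frequency bins.
    return np.maximum(0.0, np.diff(S, axis=1)).mean(0)

y = np.random.default_rng(0).standard_normal(22050)

# Pipeline style: one STFT shared by every downstream feature.
S = stft(y)
centroid = spectral_centroid(S)
onsets = onset_strength(S)

# Standalone style would call stft(y) again inside each feature,
# paying the dominant cost once per feature instead of once total.
```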
HPSS (19.7x): cache-friendly multithreaded median filter
librosa's HPSS calls scipy.ndimage.median_filter once horizontally and once vertically — a general-purpose C implementation that walks the array sequentially, one element at a time.
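The two-pass structure can be sketched in NumPy-only terms (a hedged stand-in for scipy.ndimage.median_filter, not librosa's exact code): median-filter the magnitude spectrogram along time and along frequency, then form soft masks.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(x, k, axis):
    """Running median of odd length k along `axis`, edge-padded."""
    pad = [(0, 0)] * x.ndim
    pad[axis] = (k // 2, k // 2)
    xp = np.pad(x, pad, mode="edge")
    return np.median(sliding_window_view(xp, k, axis=axis), axis=-1)

def hpss_masks(S, kernel=31):
    """Two median passes plus soft masks: the shape of the HPSS split."""
    H = median_filter_1d(S, kernel, axis=1)  # horizontal: harmonic
    P = median_filter_1d(S, kernel, axis=0)  # vertical: percussive
    total = H + P + 1e-12
    return H / total, P / total  # masks to apply to S

S = np.abs(np.random.default_rng(0).standard_normal((128, 200)))
mask_h, mask_p = hpss_masks(S)
```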
libsonare replaces this with a custom sliding median:
- Sorted flat array with O(log k) binary search + O(k) memmove instead of a tree, which fits in L1 cache for typical kernel sizes
- Multi-threaded execution — rows and columns processed in parallel across all cores
- Result: ~20x faster end-to-end than the scipy version on this hardware
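The sorted-flat-array idea is the classic sliding-median-by-insertion scheme. A minimal Python sketch of the same strategy (binary search to locate positions, linear shifts on insert and evict, which is what the C++ memmove does on a contiguous buffer):

```python
import bisect

def sliding_median(x, k):
    """Running median over an odd window k, keeping the window as one
    sorted list: O(log k) binary search plus O(k) element shifts per
    step, mirroring the flat-array approach described above."""
    assert k % 2 == 1 and k <= len(x)
    window = sorted(x[:k])
    out = [window[k // 2]]
    for i in range(k, len(x)):
        window.pop(bisect.bisect_left(window, x[i - k]))  # evict oldest
        bisect.insort(window, x[i])                        # admit newest
        out.append(window[k // 2])
    return out

sliding_median([5, 1, 3, 2, 8, 7, 4], 3)  # → [3, 2, 3, 7, 7]
```

For the kernel sizes HPSS uses (k ≈ 31), the whole window fits in L1 cache, so the O(k) shift is cheaper in practice than pointer-chasing through a balanced tree.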
pYIN (12.3x): native Viterbi + parallelized candidates
pYIN's bottleneck is per-frame candidate evaluation and the Viterbi decoding step. libsonare implements both in C++ with parallelized frame processing, replacing librosa's Numba-JIT'd inner loop. The candidate evaluation in particular benefits from C++'s tighter loops and SIMD vectorization through Eigen3.
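For reference, the decoding step is ordinary max-product Viterbi over pitch-candidate states. A minimal NumPy sketch (generic Viterbi, not libsonare's or librosa's exact implementation):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-probable state path. Shapes: log_obs (T, S) per-frame
    candidate scores, log_trans (S, S), log_init (S,)."""
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # (from_state, to_state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

path = viterbi(np.log([[0.9, 0.1], [0.1, 0.9]]),
               np.log(np.full((2, 2), 0.5)),
               np.log(np.full(2, 0.5)))  # → [0, 1]
```

The backward pass is inherently sequential; the parallelism pYIN benefits from is in the per-frame candidate scoring that fills `log_obs`.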
Chroma (2.9x): tighter STFT → filterbank path
Chroma derives a 12-pitch-class representation from the spectrogram via a constant-Q-like filterbank. libsonare's STFT and the filterbank multiplication run as Eigen3-vectorized matrix operations on a single contiguous buffer, avoiding the dispatch overhead of librosa's stack of NumPy operations.
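Conceptually the filterbank step is one matrix multiply. A simplified NumPy sketch (hard bin-to-pitch-class assignment; real chroma filterbanks use smoothed weights, and this is not libsonare's code):

```python
import numpy as np

def chroma_filterbank(sr=22050, n_fft=2048, tuning_a4=440.0):
    """Assign each FFT bin to one of 12 pitch classes."""
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((12, len(freqs)))
    for i, f in enumerate(freqs[1:], start=1):  # skip the DC bin
        midi = 69 + 12 * np.log2(f / tuning_a4)
        fb[int(round(midi)) % 12, i] = 1.0
    return fb

# Chroma is then a single matrix multiply over the magnitude STFT:
fb = chroma_filterbank()
S = np.abs(np.random.default_rng(0).standard_normal((1025, 100)))
C = fb @ S  # (12 pitch classes, 100 frames)
```

Expressed this way, the whole path is one contiguous GEMM, which is exactly the shape of computation Eigen3 vectorizes well; librosa reaches the same result through a longer chain of intermediate NumPy arrays.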
What's Not Faster (And Why)
- STFT itself: librosa delegates to scipy.fft, which is implemented in C/Fortran. The two are within noise of each other.
- Mel / MFCC / Onset Strength: dominated by their underlying STFT cost — once that's paid, the per-frame Mel filterbank multiplication and DCT are too cheap for a different language to matter.
- Pipeline use cases: inside analyze() these same features run in <1 ms apiece because the STFT/Mel is computed once and shared. The standalone numbers above represent the "what does it cost to call this in isolation" view, not the in-pipeline cost.
Reproduce These Numbers Yourself
The benchmark lives under libsonare/benchmarks/ and is fully reproducible:
```sh
# in your local libsonare checkout
cmake -B build-bench -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_BENCH=ON -DBUILD_TESTING=OFF -DBUILD_CLI=OFF
cmake --build build-bench -j
rye sync --pyproject benchmarks/pyproject.toml
rye run --pyproject benchmarks/pyproject.toml python benchmarks/generate_audio.py
./build-bench/bin/sonare_bench \
    benchmarks/fixtures/bench_73s_44100.wav \
    benchmarks/results_cpp.json
rye run --pyproject benchmarks/pyproject.toml python benchmarks/run_bench.py
```

The merged benchmarks/results.json contains both the C++ libsonare numbers and the librosa numbers, plus the bpm-detector full-pipeline timing if bpm-detector is on PATH.
Calling libsonare from Python
The numbers above measure libsonare's native C++ performance. If you call individual feature functions through the Python cffi binding (e.g. libsonare.stft(samples, sr)), every call marshals the sample buffer across the FFI boundary, which dominates the runtime for cheap features. The full-pipeline analyze() is unaffected — it runs end-to-end in C++ and only the small result struct crosses the boundary.
Notes
- Numbers are hardware-dependent. Apple M5 Max here; relative gaps are stable across machines.
- Synthetic test audio (deterministic chord progression + percussive bursts) is generated locally rather than committed to the repo.
- WASM builds are single-threaded — the HPSS and pYIN speedups shrink there but the pipeline-level win remains.