Introduction

What is libsonare?

libsonare is an audio analysis library for Music Information Retrieval (MIR). Given an audio file or stream, it can automatically extract musical information such as tempo (BPM), key, chords, beats, song structure, and more.

It is written in C++17 for performance and can be compiled to WebAssembly, making it possible to run the same analysis directly in web browsers — no server required.

What is WebAssembly?

WebAssembly (WASM) is a binary instruction format that runs in web browsers at near-native speed. It allows code written in languages like C++ to be compiled and executed in the browser, without needing a server or installing anything. libsonare uses this to bring C++-level audio analysis performance directly to web applications.

What is Music Information Retrieval?

Music Information Retrieval (MIR) is a field of research that deals with extracting meaningful information from music. When you listen to a song, your brain effortlessly recognizes the tempo, identifies chord changes, and feels the song structure. MIR aims to do the same computationally.

Common MIR tasks include:

| Task | Question it answers |
| --- | --- |
| Tempo estimation | How fast is this song? (e.g., 120 BPM) |
| Key detection | What key is this song in? (e.g., C major) |
| Beat tracking | Where exactly are the beats? |
| Chord recognition | What chords are being played and when? |
| Structural analysis | Where are the verse, chorus, and bridge? |
| Pitch tracking | What notes are being sung or played? |

libsonare provides all of these capabilities in a single library.

Who is it for?

libsonare is designed for developers building music-related applications:

  • Web app developers — Add audio analysis to your app without a backend. Detect BPM for a DJ tool, visualize chords for a practice app, or build an automatic song structure viewer, all running client-side in the browser via WebAssembly.

  • Music tool creators — Build DAW plugins, chord detectors, auto-transcription tools, or music education software. libsonare provides the analysis engine so you can focus on the user experience.

What is a DAW?

A DAW (Digital Audio Workstation) is software for recording, editing, and producing audio. Examples include Ableton Live, Logic Pro, FL Studio, and GarageBand. DAW plugins extend these applications with additional functionality like effects, instruments, or analysis tools.

  • Audio researchers — If you use librosa in Python, libsonare provides a familiar API with compatible parameters and algorithms, but runs at native C++ speed (tens of times faster, with parallel analysis and automatic CPU detection) and works in the browser.

  • Game / interactive media developers — Analyze music in real-time for rhythm games, music visualizers, or adaptive audio systems using the streaming API.

How Audio Analysis Works

Audio analysis is not a single step — it's a pipeline where each stage builds on the previous one. Understanding this flow helps you see what libsonare is doing under the hood and why certain features depend on others.

Audio Waveform
↓
Spectral Analysis (STFT)
↓
Feature Extraction (chroma, onset envelope, mel spectrogram, ...)
↓
Musical Analysis (key, chords, BPM, beats, sections, ...)

Stage 1: Spectral Analysis

Raw audio is a sequence of amplitude values over time — a waveform. To extract musical meaning, the first step is to convert this into a spectrogram using the Short-Time Fourier Transform (STFT). The STFT breaks the audio into short overlapping windows and computes the frequency content of each, producing a 2D map of "which frequencies are present at each moment."

What is a Fourier Transform?

A Fourier Transform decomposes a signal into its constituent frequencies — like splitting white light through a prism into a rainbow. Audio is a sum of many frequencies (sine waves) at different amplitudes. The Fourier Transform reveals which frequencies are present and how strong they are. The Short-Time variant (STFT) applies this repeatedly to overlapping windows of the audio, so you can see how the frequency content changes over time.

This spectrogram is the foundation for everything that follows.
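To make the windowing idea concrete, here is a minimal pure-Python sketch of a magnitude STFT. It is an illustration only, not libsonare's implementation: the function names are made up for this example, the DFT is the naive O(n²) form, and the frames are not centre-padded the way librosa pads them by default.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2))."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def stft_magnitude(signal, n_fft=64, hop=16):
    """Slide a window over the signal and take the DFT of each windowed frame."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(n_fft)]
        frames.append(dft_magnitudes(frame))
    return frames  # frames[t][k]: magnitude of bin k at frame t

# A 128 Hz sine sampled at 1024 Hz (toy values chosen so the tone sits on a bin).
sr = 1024
tone = [math.sin(2 * math.pi * 128 * t / sr) for t in range(256)]
spec = stft_magnitude(tone, n_fft=64, hop=16)

peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sr / 64  # bin index times bin spacing (sr / n_fft)
```

Each row of `spec` is one column of the spectrogram; the strongest bin in every frame corresponds to the frequency of the test tone.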

Stage 2: Feature Extraction

From the spectrogram, libsonare computes various features — each designed to capture a specific aspect of the audio:

  • Chroma — Folds all frequencies into 12 pitch classes (C, C#, D, ..., B), showing which notes are prominent at each moment. This is the basis for key and chord analysis.
What are pitch classes?

In music, a pitch class groups all octaves of the same note together. For example, every C note (C2, C3, C4, ...) belongs to the pitch class "C." There are 12 pitch classes in Western music: C, C#, D, D#, E, F, F#, G, G#, A, A#, B. A chroma feature represents audio as the energy distribution across these 12 classes at each moment, ignoring which octave the notes are in.

  • Onset Envelope — Measures how much the spectral energy changes between frames, highlighting moments where new notes or hits begin. This drives beat and rhythm analysis.
What is an onset?

An onset is the beginning of a musical event — the moment a note is struck, a drum is hit, or a new sound begins. The onset envelope is a continuous curve that peaks at these moments. By finding the peaks, we can detect individual note attacks, which in turn helps estimate tempo and find beat positions.

  • Mel Spectrogram — Re-maps the spectrogram's frequency axis to the mel scale, which approximates human pitch perception: roughly linear below about 1 kHz and logarithmic above it, so a fixed step in hertz matters less the higher you go. Widely used in audio ML.
What is the mel scale?

The mel scale is a perceptual scale of pitch. Humans don't perceive frequency linearly — the jump from 100 Hz to 200 Hz sounds like a big change, but 5000 Hz to 5100 Hz sounds almost the same. The mel scale compresses higher frequencies to match this perception. A mel spectrogram uses this scale for its frequency axis, making it better suited for tasks that relate to how we actually hear sound.

  • MFCC (Mel-Frequency Cepstral Coefficients) — A compact representation of spectral shape derived from the mel spectrogram. Useful for classifying timbre and musical texture.
What is timbre?

Timbre (pronounced "TAM-ber") is what makes a piano and a guitar sound different even when playing the same note at the same volume. It's the "color" or "texture" of a sound, determined by the relative strengths of its harmonic frequencies. MFCC features capture this spectral shape in a compact form, making them useful for distinguishing between different instruments, voices, or sound types.

  • CQT / VQT — Transforms with frequency resolution that matches musical pitch, unlike the standard FFT where resolution is uniform across frequencies.
Why does musical pitch need special frequency resolution?

Musical notes are spaced logarithmically — the frequency doubles with each octave (A3 = 220 Hz, A4 = 440 Hz, A5 = 880 Hz). A standard FFT uses evenly spaced frequency bins, so it can't distinguish low notes well while wasting resolution on high notes. The Constant-Q Transform (CQT) spaces its bins logarithmically to match musical pitch, giving equal resolution per octave. The Variable-Q Transform (VQT) extends this with adjustable resolution.
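The spacing argument can be checked with a few lines of arithmetic. The sketch below compares evenly spaced FFT bins with logarithmically spaced bin centres at 12 bins per octave; the helper names and the choice of A1 (55 Hz) as the lowest bin are assumptions made for this example, not libsonare defaults.

```python
def cqt_frequencies(n_bins, fmin, bins_per_octave=12):
    """Geometrically spaced centre frequencies: each bin is 2**(1/B) times the last."""
    return [fmin * 2 ** (k / bins_per_octave) for k in range(n_bins)]

def fft_frequencies(sr, n_fft):
    """Evenly spaced centre frequencies of the non-negative FFT bins."""
    return [k * sr / n_fft for k in range(n_fft // 2 + 1)]

cqt = cqt_frequencies(n_bins=49, fmin=55.0)   # A1 up to A5, one bin per semitone
fft = fft_frequencies(sr=22050, n_fft=2048)

fft_spacing = fft[1] - fft[0]    # constant, about 10.77 Hz
low_gap = cqt[1] - cqt[0]        # A1 to A#1, about 3.27 Hz
high_gap = cqt[48] - cqt[47]     # G#5 to A5, about 49.4 Hz
```

With these settings the lowest semitone is narrower than a single FFT bin, so neighbouring low notes land in the same bin, while near A5 one semitone spans several FFT bins.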

These features are useful on their own (e.g., feeding a mel spectrogram to a machine learning model), but they also serve as input to higher-level analysis.
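As one concrete example of such a feature, the chroma folding described above can be sketched as follows. This is a simplified illustration with hypothetical helper names: each frequency is assigned to its nearest pitch class (assuming A4 = 440 Hz), whereas production chroma implementations typically use a weighted filterbank.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4=440.0):
    """Index 0..11 of the nearest equal-tempered pitch class (0 = C)."""
    midi = round(69 + 12 * math.log2(freq_hz / a4))  # MIDI note 69 is A4
    return midi % 12

def fold_to_chroma(freqs, magnitudes):
    """Sum the energy of every frequency component into its pitch class."""
    chroma = [0.0] * 12
    for f, m in zip(freqs, magnitudes):
        if f > 0:  # skip the DC component, which has no pitch
            chroma[pitch_class(f)] += m * m
    return chroma

# Three A notes in different octaves plus a middle C, as (frequency, magnitude).
partials = [(220.0, 1.0), (440.0, 0.5), (880.0, 0.25), (261.63, 0.8)]
freqs, mags = zip(*partials)
chroma = fold_to_chroma(freqs, mags)
```

All three A components end up in the same chroma bin regardless of octave, which is exactly the property key and chord analysis rely on.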

Stage 3: Musical Analysis

The high-level results that most users care about are built on top of the features from Stage 2:

| Analysis | Built from | What it does |
| --- | --- | --- |
| Key Detection | Chroma → Krumhansl-Schmuckler algorithm | Determines the musical key (e.g., "A minor") by comparing chroma profiles against key templates |
| Chord Recognition | Chroma → template matching | Identifies chords over time by matching chroma frames against 108 chord type templates |
| BPM Detection | Onset envelope → tempogram + autocorrelation | Estimates tempo by finding periodic patterns in onset strength |
| Beat Tracking | Onset envelope → dynamic programming | Finds exact beat timestamps by optimizing for rhythmic regularity |
| Section Detection | Chroma + spectral features → self-similarity | Segments the song into Intro, Verse, Chorus, etc. by detecting boundaries where musical character changes |
| Pitch Tracking | Waveform → YIN / pYIN algorithm | Estimates the fundamental frequency (F0) of monophonic audio for melody extraction |
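To show what "comparing chroma profiles against key templates" means in practice, here is a minimal sketch of Krumhansl-Schmuckler key finding. It uses the published Krumhansl-Kessler major and minor profiles and Pearson correlation; the function names are hypothetical, and libsonare's actual profiles, weighting, and smoothing may differ.

```python
import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Krumhansl-Kessler key profiles, listed relative to the tonic.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def detect_key(chroma):
    """Return (key name, correlation) of the best of the 24 candidate keys."""
    best_score, best_name = -2.0, None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so its tonic weight lands on pitch class `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            score = pearson(chroma, rotated)
            if score > best_score:
                best_score, best_name = score, f"{NAMES[tonic]} {mode}"
    return best_name, best_score

# Idealised chroma vectors: energy only on the notes of a single triad.
c_major_triad = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]   # C, E, G
a_minor_triad = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]   # A, C, E
key_1, _ = detect_key(c_major_triad)
key_2, _ = detect_key(a_minor_triad)
```

In a real pipeline the input would be a chroma vector averaged over the whole track (or a section of it) rather than a hand-written triad.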
What is a tempogram?

A tempogram is a time-tempo representation — it shows the strength of different tempo candidates at each moment. It is computed by analyzing the onset envelope for periodic patterns using autocorrelation (measuring how similar a signal is to a time-shifted version of itself). Peaks in the tempogram reveal the dominant tempo.
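The autocorrelation step can be sketched on a synthetic onset envelope. This example computes one global autocorrelation rather than a full tempogram (which repeats the calculation in sliding windows), and the function names, frame rate, and BPM search range are assumptions chosen for illustration.

```python
def autocorrelate(x, max_lag):
    """Autocorrelation of the mean-removed sequence for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [sum(c[i] * c[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]

def estimate_bpm(onset_env, frame_rate, bpm_min=60.0, bpm_max=200.0):
    """Pick the lag with the strongest self-similarity and convert it to BPM."""
    lag_min = int(round(60.0 * frame_rate / bpm_max))  # shortest beat period
    lag_max = int(round(60.0 * frame_rate / bpm_min))  # longest beat period
    ac = autocorrelate(onset_env, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: ac[lag])
    return 60.0 * frame_rate / best_lag

# Synthetic onset envelope: 10 seconds at 100 frames per second,
# with a unit spike every 50 frames (one beat every 0.5 s).
frame_rate = 100.0
onset_env = [1.0 if i % 50 == 0 else 0.0 for i in range(1000)]
bpm = estimate_bpm(onset_env, frame_rate)
```

Restricting the search to a plausible BPM range is what keeps the estimate from locking onto half or double the true tempo in this simple form; real estimators usually add a tempo prior for the same reason.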

What is self-similarity analysis?

Self-similarity analysis compares every part of a song against every other part, building a matrix that shows how similar any two moments are. In a pop song, choruses tend to sound similar to each other but different from verses. By finding block-like patterns in this matrix, libsonare can identify section boundaries — where the music transitions from one structural part to another.
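A toy version of this idea, using cosine similarity between per-frame feature vectors, might look like the sketch below. The feature vectors and the adjacent-frame threshold rule are simplifications invented for this example; practical systems usually slide a checkerboard kernel along the matrix diagonal to get a smoother novelty curve.

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def self_similarity(frames):
    """ssm[i][j] is the similarity between frame i and frame j."""
    return [[cosine(a, b) for b in frames] for a in frames]

def boundaries(ssm, threshold=0.5):
    """Frame indices where a frame stops resembling its predecessor."""
    return [i for i in range(1, len(ssm)) if ssm[i - 1][i] < threshold]

# Hypothetical 3-dimensional features for a verse / chorus / verse layout.
verse = [1.0, 0.0, 0.2]
chorus = [0.0, 1.0, 0.2]
frames = [verse] * 3 + [chorus] * 3 + [verse] * 2

ssm = self_similarity(frames)
cuts = boundaries(ssm)
```

Here `ssm[0][6]` is high because the opening verse and the returning verse match, which is the off-diagonal block that reveals a repeated section.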

What is the fundamental frequency (F0)?

The fundamental frequency (F0) is the lowest frequency of a periodic sound — the frequency that determines the perceived pitch. When a singer sings an A4 note, the F0 is 440 Hz, even though many higher harmonics (880 Hz, 1320 Hz, ...) are also present. Pitch tracking algorithms like YIN and pYIN estimate this F0 over time to extract the melody line from audio.
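For intuition, the core of YIN fits in a short function. The sketch below is a simplified version (difference function, cumulative mean normalisation, absolute threshold) with integer-lag resolution; it omits YIN's parabolic interpolation and pYIN's probabilistic stage, and the function name and defaults are assumptions for this example rather than libsonare's API.

```python
import math

def yin_f0(frame, sr, fmin=80.0, fmax=500.0, threshold=0.1):
    """Estimate F0 in Hz for one frame, or None if no clear period is found."""
    tau_max = int(sr / fmin)           # longest period considered, in samples
    tau_min = max(1, int(sr / fmax))   # shortest period considered
    w = len(frame) - tau_max           # number of samples compared at each lag
    if w <= 0:
        raise ValueError("frame too short for the requested fmin")

    # Step 1: squared difference between the frame and a copy shifted by tau.
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((frame[t] - frame[t + tau]) ** 2 for t in range(w))

    # Step 2: divide by the running mean so small lags are not favoured.
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0 else 1.0

    # Step 3: take the first dip below the threshold and follow it to its minimum.
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sr / tau
    return None

# A 200 Hz tone with two harmonics, sampled at 8 kHz (period = 40 samples).
sr = 8000
frame = [
    math.sin(2 * math.pi * 200 * t / sr)
    + 0.5 * math.sin(2 * math.pi * 400 * t / sr)
    + 0.25 * math.sin(2 * math.pi * 600 * t / sr)
    for t in range(800)
]
f0 = yin_f0(frame, sr)
```

Even though 400 Hz and 600 Hz components are present, the estimate settles on the 40-sample period of the fundamental; a silent frame returns `None`.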

This layered design means libsonare doesn't just give you answers — it exposes each stage, so you can use low-level features for your own analysis or plug them into ML pipelines.

Audio Effects

libsonare also provides audio processing capabilities that operate on the spectral representation:

  • HPSS (Harmonic-Percussive Source Separation) — Separates audio into harmonic (tonal) and percussive (rhythmic) components using median filtering on the spectrogram. Often used as a preprocessing step to improve analysis accuracy.
What are harmonic and percussive components?

In a spectrogram, harmonic sounds (vocals, strings, sustained notes) appear as horizontal lines — they maintain a stable frequency over time. Percussive sounds (drums, clicks, transients) appear as vertical lines — they contain many frequencies but only for a brief moment. HPSS exploits this difference using median filters: a horizontal median filter extracts the harmonic part, and a vertical median filter extracts the percussive part.

  • Time Stretch — Changes tempo without affecting pitch by manipulating the STFT phase.
  • Pitch Shift — Transposes pitch without affecting tempo.
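The median-filtering idea behind HPSS can be shown on a tiny hand-made spectrogram. This is a minimal sketch with hypothetical function names and a hard binary mask; real implementations use longer kernels and often soft masks, and nothing here reflects libsonare's actual parameters.

```python
from statistics import median

def median_filter(values, size):
    """1-D median filter; the window is truncated at the edges."""
    half = size // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def hpss(spec, kernel=3):
    """Split spec[t][k] (time by frequency) into harmonic and percussive parts."""
    n_frames, n_bins = len(spec), len(spec[0])
    # Filter along time for each bin: keeps horizontal (sustained) structure.
    harm = [[0.0] * n_bins for _ in range(n_frames)]
    for k in range(n_bins):
        row = median_filter([spec[t][k] for t in range(n_frames)], kernel)
        for t in range(n_frames):
            harm[t][k] = row[t]
    # Filter along frequency for each frame: keeps vertical (transient) structure.
    perc = [median_filter(spec[t], kernel) for t in range(n_frames)]
    # Binary mask: each cell goes to whichever filtered version is stronger.
    harmonic = [[spec[t][k] if harm[t][k] >= perc[t][k] else 0.0
                 for k in range(n_bins)] for t in range(n_frames)]
    percussive = [[spec[t][k] if harm[t][k] < perc[t][k] else 0.0
                   for k in range(n_bins)] for t in range(n_frames)]
    return harmonic, percussive

# 7 frames x 7 bins: a sustained tone in bin 2 and a click at frame 3.
spec = [[1.0 if (k == 2 or t == 3) else 0.0 for k in range(7)] for t in range(7)]
harmonic, percussive = hpss(spec)
```

The sustained tone (a horizontal line) ends up in `harmonic` and the click (a vertical line) in `percussive`, and the two parts add back up to the original spectrogram.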

Real-Time Streaming

The StreamAnalyzer processes audio chunk by chunk through the same pipeline with low latency, producing per-frame features suitable for real-time visualization or live analysis. It integrates with the Web Audio API's AudioWorklet for in-browser real-time processing.
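The general pattern behind chunked analysis is a buffer that accepts blocks of any size and emits fixed-size overlapping frames as soon as enough samples have arrived. The sketch below illustrates that pattern only; `FrameBuffer` and `rms` are hypothetical names for this example and are not part of the StreamAnalyzer API.

```python
import math

class FrameBuffer:
    """Hypothetical helper: turns arbitrary-sized chunks into overlapping frames."""

    def __init__(self, frame_size=2048, hop=512):
        self.frame_size = frame_size
        self.hop = hop
        self._pending = []  # samples received but not yet fully consumed

    def push(self, chunk):
        """Add a chunk and return every complete frame that is now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[:self.frame_size])
            del self._pending[:self.hop]  # advance by one hop, keep the overlap
        return frames

def rms(frame):
    """Root-mean-square level of one frame, as a stand-in per-frame feature."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

# Feed a 30-sample ramp in 5-sample chunks, as an audio callback might.
signal = [float(i) for i in range(30)]
buf = FrameBuffer(frame_size=8, hop=4)
frames = []
for start in range(0, len(signal), 5):
    frames.extend(buf.push(signal[start:start + 5]))

starts = [f[0] for f in frames]
levels = [rms(f) for f in frames]
```

The chunk size (5) and the hop (4) are deliberately mismatched to show that frame boundaries are independent of how the audio is delivered.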

Platform Support

| Platform | Interface | Use case |
| --- | --- | --- |
| Browser | JavaScript/TypeScript (WebAssembly) | Web apps, client-side analysis |
| Node.js | JavaScript/TypeScript (WebAssembly) | Server-side processing |
| Node.js | JavaScript/TypeScript (N-API native addon) | High-performance server-side processing |
| Python | Python (cffi) | Data science, scripting, librosa migration |
| Linux / macOS | C++ | Native applications, CLI tools |
| Any (via C API) | C | FFI integration with other languages |

The WebAssembly build is ~508 KB (~196 KB gzipped) with no external dependencies.

Relationship with librosa

librosa is the de facto standard Python library for audio analysis in the MIR community. libsonare is designed as a compatible alternative for environments where Python isn't available or performance is critical:

What are sample rate, FFT size, and hop length?

These are the fundamental parameters of audio analysis. Sample rate is how many amplitude measurements per second the audio contains (e.g., 44,100 Hz for CD quality). FFT size (n_fft) is the number of samples in each analysis window — larger windows give better frequency resolution but worse time resolution. Hop length is how many samples the window moves between frames — smaller hops give more time detail but more computation. libsonare defaults to 22,050 Hz sample rate, 2,048 FFT size, and 512 hop length, matching librosa.

  • Default parameters (sample rate, FFT size, hop length, etc.) match librosa exactly
  • Core algorithms (STFT, mel spectrogram, MFCC, chroma) produce numerically compatible results
  • Function names and API patterns are intentionally similar for easy migration
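The default parameters translate into concrete resolutions with a little arithmetic. The sketch below works them out; the `frame_count` helper is hypothetical, and its `centered` branch assumes librosa-style centre padding (one frame per hop, plus one), which is an assumption about framing rather than a documented libsonare behaviour.

```python
sr, n_fft, hop = 22050, 2048, 512   # the defaults quoted above

freq_resolution_hz = sr / n_fft     # spacing between FFT bins, about 10.77 Hz
n_bins = n_fft // 2 + 1             # non-negative frequency bins per frame
window_ms = 1000 * n_fft / sr       # audio covered by one frame, about 92.9 ms
hop_ms = 1000 * hop / sr            # time step between frames, about 23.2 ms
frames_per_second = sr / hop        # about 43 frames per second

def frame_count(n_samples, n_fft, hop, centered=True):
    """Number of analysis frames for a signal of n_samples samples."""
    if centered:
        # Signal padded by n_fft // 2 on each side, librosa-style.
        return 1 + n_samples // hop
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop

three_minutes = 180 * sr
n_frames = frame_count(three_minutes, n_fft, hop)
```

Doubling `n_fft` halves `freq_resolution_hz` but doubles `window_ms`, which is the frequency-versus-time trade-off described above.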

libsonare also goes beyond librosa's scope with features like chord recognition, section detection, timbre analysis, and real-time streaming — capabilities that are typically separate tools in the Python ecosystem.

See librosa Compatibility for a detailed comparison and migration guides.

Next Steps