
MIR Overview

MIR means Music Information Retrieval: the part of audio analysis that turns sound into musical answers — tempo, beat positions, key, chords, pitch, timbre, and structure. This page is a map. It groups the terms you will meet across the docs and shows how they build on one another, so you know which feature to reach for and where it is computed.

These terms are grouped on purpose. They are not isolated functions — almost every MIR task is built on the same time–frequency foundation. Understanding that shared foundation once means the individual features stop looking like a long, unrelated list.

New here? Read this as orientation, not reference

This page explains how the pieces relate. For call signatures, go to the JavaScript API or Python API; for how each one is computed, see DSP Implementation Notes.

The shared pipeline

Most MIR features are derived from a small set of intermediate representations. You rarely build these by hand (libsonare computes them internally), but knowing the flow from waveform to shared intermediates (STFT, mel spectrogram, chroma, onset strength) to features explains why so many features share parameters like nFft and hopLength.

Because these intermediates are shared, asking for BPM, key, chord, and section results back-to-back on the same source does not repeat the heavy work — the STFT and friends are computed once and reused.
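The compute-once, reuse-many idea can be sketched in a few lines. This is a minimal illustration, not libsonare's implementation: the `AnalysisSession` class, its methods, and the `stft_runs` counter are hypothetical names made up for this example, and the STFT is a bare NumPy version without padding.

```python
from functools import cached_property

import numpy as np


class AnalysisSession:
    """Hypothetical holder for one audio source (not libsonare's API)."""

    def __init__(self, samples, n_fft=1024, hop_length=256):
        self.samples = np.asarray(samples, dtype=float)
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.stft_runs = 0  # counts how often the heavy step actually executes

    @cached_property
    def magnitude(self):
        """Magnitude STFT, shape (frames, bins). Computed on first access only."""
        self.stft_runs += 1
        window = np.hanning(self.n_fft)
        n_frames = 1 + (len(self.samples) - self.n_fft) // self.hop_length
        frames = np.stack([
            self.samples[i * self.hop_length : i * self.hop_length + self.n_fft] * window
            for i in range(n_frames)
        ])
        return np.abs(np.fft.rfft(frames, axis=1))

    def frame_energy(self):
        """Per-frame spectral energy, derived from the cached magnitude."""
        return (self.magnitude ** 2).sum(axis=1)

    def spectral_flux(self):
        """Per-frame positive magnitude change, derived from the same cache."""
        return np.maximum(np.diff(self.magnitude, axis=0), 0.0).sum(axis=1)


session = AnalysisSession(np.random.default_rng(0).standard_normal(8192))
energy = session.frame_energy()
flux = session.spectral_flux()
print(session.stft_runs)  # prints 1: two features, one STFT
```

Both features read the same cached array, so the second request costs only the cheap derivation step.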

Which question, which feature

| You want to answer… | Reach for | Built on |
| --- | --- | --- |
| How fast is it? Where are the beats? | BPM / beat tracking | onset strength |
| What key is it in? | key detection | chroma |
| What chord is playing? | chord recognition | chroma |
| Where does the chorus start? | section analysis | rhythm + harmony + timbre |
| What note is the melody? | pitch / melody tracking | pitch tracking (see Separation and pitch below) |
| What does it sound like (timbre)? | MFCC | mel spectrogram |
| Can I separate drums from the rest? | HPSS | spectrogram structure |
| What is the raw frequency content over time? | STFT / spectrogram | the waveform |
| What does the recording space sound like? | room-acoustic analysis | impulse-response decay or blind free-decay estimates |

Timing: BPM, beat, onset, section

The timing family builds up in layers:

| Feature | What it answers |
| --- | --- |
| Onset detection | Where notes, drums, or consonants begin: the spikes in an onset-strength envelope. |
| BPM | How periodic those onsets are. |
| Beat tracking | Where pulses land on the timeline. |
| Section analysis | Where longer spans such as intros, verse-like sections, chorus-like sections, and breaks begin and end. |

Onset is the root of the rhythm family

BPM, beats, and tempograms all start from the same onset-strength envelope. If you want the time × tempo picture behind a BPM estimate, see the tempogram family in Realtime and Streaming.
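To make the layering concrete, here is a minimal sketch of the first two layers: an onset-strength envelope computed as spectral flux, then a BPM estimate from the envelope's autocorrelation. Both are textbook techniques chosen for illustration; whether libsonare uses them is not stated here. The function names, parameter defaults, and the synthetic click track are assumptions of this example.

```python
import numpy as np


def onset_strength(y, n_fft=1024, hop_length=256):
    """Spectral flux: summed positive magnitude change between adjacent frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([
        y[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.maximum(np.diff(mag, axis=0), 0.0).sum(axis=1)


def estimate_bpm(envelope, sr, hop_length, bpm_min=60.0, bpm_max=200.0):
    """Pick the autocorrelation lag (within the BPM range) with the most energy."""
    env = envelope - envelope.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    frame_rate = sr / hop_length                      # envelope frames per second
    lag_min = int(np.floor(60.0 * frame_rate / bpm_max))
    lag_max = int(np.ceil(60.0 * frame_rate / bpm_min))
    lag = lag_min + int(np.argmax(ac[lag_min : lag_max + 1]))
    return 60.0 * frame_rate / lag


# Synthetic click track: one short pulse every 0.5 s, i.e. 120 BPM.
sr, hop = 8000, 200
y = np.zeros(sr * 8)
for start in range(0, len(y), sr // 2):
    y[start : start + 64] = 1.0

envelope = onset_strength(y, n_fft=1024, hop_length=hop)
bpm = estimate_bpm(envelope, sr, hop)
print(round(bpm))
```

Beat tracking would go one step further and choose which frames carry the pulse; this sketch stops at the tempo estimate.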

Demo: Onsets vs beats, from attacks to a pulse

Onset detection marks every attack in the audio; beat tracking distils those into the steady pulse you would tap along to.

Harmony: key, chord, chroma

Chroma compresses frequency content into 12 pitch-class bins (C, C♯, … B), folding every octave of the same note together. That makes it the natural substrate for harmony: key detection estimates the tonal center from the overall chroma distribution, and chord recognition estimates local harmony frame by frame.

Chroma trades octave and timbre detail for harmonic clarity

Folding octaves together is exactly what makes chroma good for key/chord work — and exactly what makes it the wrong tool for melody or timbre, where octave and spectral shape matter. Match the representation to the question.
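The folding step itself is short enough to sketch. The version below assigns each FFT bin to its nearest equal-tempered semitone and sums energy per pitch class. It is a simplified illustration: the `fmin` cutoff, the A4 = 440 Hz tuning reference, and the hard nearest-semitone assignment are assumptions of this example, and production chroma implementations typically use smoother bin-to-class weighting.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_from_spectrum(mag, sr, n_fft, fmin=55.0):
    """Fold one rfft magnitude spectrum into 12 normalized pitch-class energies."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    valid = freqs >= fmin                              # skip DC and sub-bass bins
    midi = 69.0 + 12.0 * np.log2(freqs[valid] / 440.0)  # MIDI 69 is A4 = 440 Hz
    pitch_class = np.round(midi).astype(int) % 12      # octave is discarded here
    chroma = np.zeros(12)
    np.add.at(chroma, pitch_class, mag[valid] ** 2)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma


# Three octaves of A (A3, A4, A5) should collapse onto a single pitch class.
sr, n_fft = 22050, 4096
t = np.arange(n_fft) / sr
y = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 880.0))
mag = np.abs(np.fft.rfft(y * np.hanning(n_fft)))

chroma = chroma_from_spectrum(mag, sr, n_fft)
print(PITCH_CLASSES[int(np.argmax(chroma))])  # prints A
```

The `% 12` is where the octave information is thrown away, which is why the same vector cannot tell a melody's register apart.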

Demo: Chromagram, harmony folded into 12 bins

Every frequency is folded onto one of twelve pitch classes, so octave is forgotten and only the harmony remains. The demo clip walks a C–Am–F–G turnaround, and the lit rows shift as each chord changes.

Spectrum: FFT, STFT, spectrogram

The FFT is an efficient algorithm for the DFT (Discrete Fourier Transform), which converts a block of samples into frequency content.

The STFT repeats that over many short, overlapping windows so frequency content can be tracked over time.

A spectrogram is the visual result: time on one axis, frequency on another, intensity as brightness.

Two parameters recur everywhere: nFft (window size — bigger means finer frequency resolution but blurrier timing) and hopLength (step between windows — smaller means more frames and smoother motion). The trade-off between frequency and time resolution is fundamental, not a libsonare quirk.
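The trade-off is easy to see in numbers. The sketch below is a bare NumPy STFT (Hann window, no padding or centering, which is a simplification and not necessarily how libsonare frames audio) that prints the bin spacing, window duration, and frame count for a few parameter choices. The snake_case names `n_fft` and `hop_length` mirror nFft and hopLength.

```python
import numpy as np


def stft(y, n_fft=2048, hop_length=512):
    """Complex STFT with a Hann window, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([
        y[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1).T


sr = 22050
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # one second of A4

for n_fft, hop in [(512, 128), (2048, 512), (4096, 512)]:
    S = stft(y, n_fft, hop)
    print(
        f"n_fft={n_fft:4d} hop={hop:3d} shape={S.shape} "
        f"bin spacing={sr / n_fft:6.2f} Hz "
        f"window={1000 * n_fft / sr:6.1f} ms "
        f"hop={1000 * hop / sr:5.1f} ms"
    )
```

Bin spacing is `sr / n_fft` and window duration is `n_fft / sr`, so their product is always 1: halving one doubles the other, whichever library computes it.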

Demo: STFT, seeing time and frequency at once

A tone sweeping from 220 Hz to 4 kHz. Each column is one short-time spectrum; brighter means more energy at that frequency.

Perceptual features: mel, MFCC, CQT, VQT

These perceptual features answer different questions:

| Feature | What it emphasizes |
| --- | --- |
| Mel spectrogram | Frequency resolution shaped toward human hearing: fine detail low, coarser detail high. |
| MFCCs | A compact "timbre fingerprint". It is computed by taking the mel spectrogram, compressing its loudness with a logarithm, then summarizing each frame into a handful of numbers that capture overall spectral shape rather than exact pitch. |
| CQT / VQT | Musically spaced bins, useful when pitch relationships matter more than equal-Hz spacing. |

You can also run these transforms backwards for previews and debugging — see Inverse Features.
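The MFCC recipe described above (mel filterbank, logarithm, then a summary of spectral shape) can be sketched for a single frame as follows. This is one common textbook variant, not a statement about libsonare's internals: the HTK-style mel formula, the triangular filters, the 40-band and 13-coefficient sizes, and the unnormalized DCT-II are all assumptions of this example, and other libraries make different choices.

```python
import numpy as np


def hz_to_mel(f):
    """HTK-style mel scale (one of several conventions in use)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels=40):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def mfcc(power_spectrum, sr, n_fft, n_mels=40, n_mfcc=13):
    """Mel energies -> log compression -> DCT-II, keeping the first n_mfcc terms."""
    mel_energy = mel_filterbank(sr, n_fft, n_mels) @ power_spectrum
    log_mel = 10.0 * np.log10(np.maximum(mel_energy, 1e-10))
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return dct_basis @ log_mel


sr, n_fft = 22050, 2048
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 440.0 * t) * np.hanning(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2

coeffs = mfcc(power, sr, n_fft)
print(coeffs.shape)  # (13,)
```

The filters widen with frequency, which is the "fine detail low, coarser detail high" behavior, and the DCT keeps only the slowly varying part of the log-mel curve, which is why MFCCs describe spectral shape rather than exact pitch.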

Separation and pitch: HPSS and pitch

HPSS means Harmonic/Percussive Source Separation. It splits sustained pitched material from transient hits by using their spectrogram shapes.

| Component | Spectrogram shape |
| --- | --- |
| Harmonic content | Mostly horizontal lines. |
| Percussive content | Mostly vertical lines. |

Separating them first often improves downstream tasks because drums and pitched instruments otherwise confuse each other.
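One well-known way to exploit those shapes is median filtering: a median taken along time suppresses short vertical streaks, and a median taken along frequency suppresses thin horizontal lines. The sketch below applies that idea to a toy magnitude spectrogram with soft masks. The function names, the kernel length of 17, and the soft-mask formula are assumptions of this example, not a description of libsonare's exact algorithm.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_1d(S, size, axis):
    """Running median of odd length `size` along `axis`, with edge padding."""
    pad = size // 2
    pad_width = [(0, 0)] * S.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(S, pad_width, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)


def hpss(S, kernel=17, eps=1e-10):
    """Split a magnitude spectrogram (freq x time) into harmonic and percussive parts."""
    harm_est = median_filter_1d(S, kernel, axis=1)   # smooth along time
    perc_est = median_filter_1d(S, kernel, axis=0)   # smooth along frequency
    total = harm_est + perc_est + eps
    return S * (harm_est / total), S * (perc_est / total)


# Toy spectrogram: one sustained tone (row 20) and one drum hit (column 30).
S = np.zeros((64, 64))
S[20, :] = 1.0
S[:, 30] = 1.0

harmonic, percussive = hpss(S)
print(harmonic[20, 5], percussive[20, 5])   # the tone lands in the harmonic part
print(harmonic[5, 30], percussive[5, 30])   # the hit lands in the percussive part
```

Where the line and the streak cross, both estimates are equally strong and the soft masks split the energy evenly between the two layers.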

Demo: HPSS, splitting the tune from the drums

On a spectrogram, sustained pitched notes draw horizontal ridges while drum hits draw vertical streaks. HPSS exploits exactly that: median-filtering along time keeps the horizontal (harmonic) content, and median-filtering along frequency keeps the vertical (percussive) content. The demo offers three views: Full shows both, Harmonic keeps the ridges (the chords and bass, drums gone), and Percussive keeps the streaks (the kit, tune gone).

Pitch estimation tracks the fundamental frequency: the lowest partial of a note's harmonic series, and the one we usually hear as its pitch even when it is not the strongest component. It is useful for melody, vocals, monophonic instruments, tuning checks, and transcription-style workflows.
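A minimal sketch of one classic approach, autocorrelation, shows why pitch tracking looks for periodicity instead of the loudest spectral peak. The test tone deliberately makes the second harmonic louder than the fundamental. The function name, the search range, and the single-frame setup are assumptions of this example; it is not a description of libsonare's pitch tracker.

```python
import numpy as np


def estimate_f0(frame, sr, fmin=80.0, fmax=1000.0):
    """Return the frequency whose period maximizes the frame's autocorrelation."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                 # shortest period considered
    lag_max = int(sr / fmin)                 # longest period considered
    lag = lag_min + int(np.argmax(ac[lag_min : lag_max + 1]))
    return sr / lag


sr = 22050
t = np.arange(2048) / sr
# A3 with a weak fundamental (220 Hz) and a louder second harmonic (440 Hz).
frame = (
    0.3 * np.sin(2 * np.pi * 220.0 * t)
    + 1.0 * np.sin(2 * np.pi * 440.0 * t)
    + 0.5 * np.sin(2 * np.pi * 660.0 * t)
)

loudest_hz = np.fft.rfftfreq(len(frame), 1.0 / sr)[
    int(np.argmax(np.abs(np.fft.rfft(frame * np.hanning(len(frame))))))
]
f0 = estimate_f0(frame, sr, fmin=150.0, fmax=500.0)
print(f"loudest partial near {loudest_hz:.0f} Hz, estimated f0 near {f0:.0f} Hz")
```

The search range matters: a lag of two periods also correlates well, so bracketing the expected pitch (here 150 to 500 Hz) is a simple guard against octave errors. Integer lags also limit precision, which real trackers refine with interpolation.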

Adjacent: room acoustics

Room-acoustic analysis is adjacent to MIR. It describes the space captured by the recording rather than the notes, rhythm, or form of the music.

Use direct IR analysis when you have a clean impulse response. That path measures RT60, EDT, C50, C80, D50, and band decay.
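For orientation, two of those metrics can be sketched from an impulse response with standard textbook definitions: Schroeder backward integration for the decay curve, a line fit extrapolated to -60 dB for RT60, and an early-to-late energy ratio for C50. The function names, the -5 to -25 dB fit range, and the synthetic impulse response are assumptions of this example; libsonare's own fit ranges and band handling may differ.

```python
import numpy as np


def schroeder_decay_db(ir):
    """Energy decay curve in dB: backward-integrated squared impulse response."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])


def rt60_from_decay(decay_db, sr, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay between start_db and end_db, extrapolate to -60 dB."""
    idx = np.where((decay_db <= start_db) & (decay_db >= end_db))[0]
    slope, _intercept = np.polyfit(idx / sr, decay_db[idx], 1)   # dB per second
    return -60.0 / slope


def clarity_db(ir, sr, split_ms=50.0):
    """Early-to-late energy ratio in dB; assumes the IR starts at the direct sound."""
    k = int(sr * split_ms / 1000.0)
    return 10.0 * np.log10(np.sum(ir[:k] ** 2) / np.sum(ir[k:] ** 2))


# Synthetic impulse response: noise with an exponential envelope, true RT60 = 0.5 s.
sr, true_rt60 = 16000, 0.5
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
ir = rng.standard_normal(sr) * np.exp(-np.log(1000.0) * t / true_rt60)

decay = schroeder_decay_db(ir)
rt60 = rt60_from_decay(decay, sr)
c50 = clarity_db(ir, sr)
print(f"RT60 ~ {rt60:.2f} s, C50 ~ {c50:.1f} dB")
```

Changing `split_ms` to 80 gives the C80 variant of the same ratio, and fitting only the first 10 dB of the decay is the usual basis for EDT.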

Use blind acoustic estimation when you only have a normal recording. That path reports room-decay cues with a confidence value because the free-decay evidence may be weak or missing. See Room Acoustics.

Implementation notes

libsonare exposes MIR functions across browser/WASM, JavaScript, Python, native bindings, CLI, and C++ APIs.

Many features share intermediate representations such as STFT, chroma, and spectral energy curves. These intermediates are computed once per source and reused, so requesting several analyses back-to-back does not repeat the heavy work.

The browser demos are built for interactive use, but each one emphasizes a different part of the library:

| Demo | Main role |
| --- | --- |
| Music Analysis Studio | Full-file MIR: BPM, key, chords, sections, and related analysis. |
| Realtime views | Progressive BPM, key, and chord estimates through StreamAnalyzer. |
| Mastering Studio | Measurement-style APIs such as loudness measurement, reference comparison, and report export. |

Seeing those demos side by side shows which pieces are reusable across analysis work and finishing work.

Related: Introduction, Audio Basics, JavaScript API, Room Acoustics, DSP Implementation Notes, librosa Compatibility