Architecture

How HumanVAD Works

HumanVAD takes a hybrid approach: it combines prosodic analysis with German-specific semantic analysis to detect voice activity and the end of a speaker's turn.

Prosodic Analysis

Weight: 40-50%

Analyzes the acoustic properties of speech that carry meaning beyond words: pitch contours, energy patterns, speaking rate, and rhythm.

Pitch Contour (F0)

Terminal falling pitch indicates statement completion in German

Energy Envelope

RMS energy decline signals end of utterance

Speaking Rate

Pre-boundary lengthening in German final syllables

Spectral Analysis

Frequency distribution changes at turn boundaries
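Two of the cues above, RMS energy decline and terminal falling pitch, can be illustrated with a minimal standard-library sketch. This is not HumanVAD's implementation: the function names, the 0.5 energy ratio, the 1 Hz-per-frame pitch threshold, and the convention that an F0 value of 0 marks an unvoiced frame are all assumptions made for illustration.

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def energy_is_declining(frames, ratio=0.5):
    """True if the last frame's RMS is below `ratio` times the first frame's RMS.

    `ratio` is a hypothetical tuning parameter, not a documented HumanVAD setting.
    """
    energies = [rms(f) for f in frames]
    if len(energies) < 2 or energies[0] == 0:
        return False
    return energies[-1] < ratio * energies[0]

def pitch_is_falling(f0_hz, min_drop_hz_per_frame=1.0):
    """True if voiced F0 estimates fall by at least the given rate per frame.

    Assumes f0_hz holds one estimate per frame, with 0 for unvoiced frames.
    """
    voiced = [f for f in f0_hz if f > 0]
    return slope(voiced) <= -min_drop_hz_per_frame
```

A real system would estimate F0 with a dedicated pitch tracker and smooth both contours over a window; the sketch only shows the shape of the decision.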

Semantic Analysis

Weight: 50-60%

Analyzes the linguistic content of the transcript to determine whether a German sentence is grammatically and semantically complete.

Verb Position (V2 Rule)

Checks if the finite verb is in second position (main clause)

Subclause Completion

Detects open subordinate clauses (weil, dass, obwohl...)

Separable Prefix

Tracks separable verb prefixes (an|rufen, auf|machen)

Politeness Markers

Detects turn-yielding cues ("bitte", "danke", "oder?")
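Two of the semantic checks above can be approximated with plain string matching. The sketch below is an assumption-laden illustration, not the HumanVAD parser: the word lists contain only the examples named on this page, the function names are hypothetical, and the "oder?" check assumes the transcript carries punctuation. Checking V2 order or whether a subordinate clause has reached its clause-final verb would need a real German parser.

```python
import re

# Illustrative word lists taken from the examples above; real coverage
# would need to be far broader.
SUBORDINATORS = {"weil", "dass", "obwohl"}
YIELD_WORDS = {"bitte", "danke"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (including umlauts and ß)."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def ends_with_yield_cue(text):
    """True if the utterance ends with a turn-yielding cue such as 'danke' or 'oder?'."""
    stripped = text.strip().lower()
    if stripped.endswith("oder?"):
        return True
    tokens = tokenize(stripped)
    return bool(tokens) and tokens[-1] in YIELD_WORDS

def has_dangling_subordinator(text):
    """True if the last token is a subordinating conjunction.

    In that case the clause it opens has no content yet, so the
    speaker is very likely not finished.
    """
    tokens = tokenize(text)
    return bool(tokens) and tokens[-1] in SUBORDINATORS
```

For example, "Ich komme morgen nicht, weil" ends on a dangling conjunction and should map to "wait", while "Das ist gut, oder?" ends on a yield cue.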

Pipeline

Processing Pipeline

Audio Input

16 kHz PCM

FFT

Spectral Analysis

Prosody Extraction

Pitch, Energy, Rate, Rhythm

Semantic Parser

German NLP, Completeness

Decision Engine

Weighted Fusion

Output

interrupt / wait + confidence
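The weighted-fusion step can be sketched as a weighted average compared against a threshold. This is a guess at the general shape, not the documented algorithm: the function name `fuse`, the normalization by the weight sum, and the `>=` comparison are assumptions. The default values are the ones used in the Python integration example on this page, and the result keys mirror the `process_audio` return value.

```python
def fuse(prosody_score, semantic_score,
         prosody_weight=0.45, semantic_weight=0.55,
         turn_end_threshold=0.60):
    """Combine two scores in [0, 1] into an action and a fused score."""
    total = prosody_weight + semantic_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the fused score stays in [0, 1] even if the
    # weights do not sum to exactly 1.
    score = (prosody_weight * prosody_score
             + semantic_weight * semantic_score) / total
    action = "interrupt" if score >= turn_end_threshold else "wait"
    return {"action": action, "score": score}
```

With the default weights, a prosody score of 0.8 and a semantic score of 0.9 fuse to about 0.855, which is above the 0.60 threshold and therefore yields "interrupt".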

Reference

API Reference

detect(audio_frame)

Basic voice activity detection on a single audio frame.

result = vad.detect(audio_frame)
# Returns:
# {'is_speech': True, 'confidence': 0.94}
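The page does not specify the frame format `detect` expects. As a sketch only, assuming 16-bit mono PCM at 16 kHz and a hypothetical 30 ms frame length, a raw byte buffer could be cut into fixed-size frames like this:

```python
def iter_frames(pcm_bytes, sample_rate=16000, frame_ms=30, sample_width=2):
    """Yield consecutive fixed-length frames from raw PCM bytes.

    frame_ms and sample_width are assumptions for illustration; a trailing
    partial frame is dropped.
    """
    frame_bytes = (sample_rate * frame_ms // 1000) * sample_width
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        yield pcm_bytes[start:start + frame_bytes]
```

Each yielded frame could then be passed to `vad.detect(...)`, provided the library accepts raw bytes; check the actual input type before relying on this.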

process_audio(chunk, transcript)

Full hybrid analysis with semantic turn-end detection.

result = vad.process_audio(chunk, transcript="...")
# Returns:
# {'action': 'interrupt', 'score': 0.87}   (action is 'interrupt' or 'wait')

stream(audio_source)

Real-time streaming: iterate over events as they are emitted.

for event in vad.stream(audio_source):
    if event.turn_ended:
        handle_turn_end(event)

POST /api/detect

Hosted REST API endpoint for server-side detection.

curl -X POST https://zedigital-humanvad.fly.dev/api/detect \
  -F "file=@audio.wav"
# Returns JSON: speech_detected, segment_count, speech_percentage,
# avg_confidence, processing_time_ms

Examples

Integration Examples

Python

from humanvad import ExcellenceVADGerman

vad = ExcellenceVADGerman(
    turn_end_threshold=0.60,
    prosody_weight=0.45,
    semantic_weight=0.55
)

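# Process streaming audio. audio_stream, transcriber, and agent are
# placeholders for objects supplied by the calling application.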
for chunk in audio_stream:
    result = vad.process_audio(
        chunk,
        transcript=transcriber.get_text()
    )
    if result['action'] == 'interrupt':
        print(f"Turn ended: {result['score']:.1%}")
        agent.respond()

REST API (cURL)

# Health check
curl https://zedigital-humanvad.fly.dev/api/health

# Detect speech in audio file
curl -X POST \
  https://zedigital-humanvad.fly.dev/api/detect \
  -F "file=@recording.wav"

# Response:
# {
#   "speech_detected": true,
#   "segment_count": 2,
#   "speech_percentage": 74,
#   "avg_confidence": 0.967,
#   "processing_time_ms": 0.077
# }
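The same upload can be prepared from Python using only the standard library. The sketch below builds the multipart request but does not send it. The form field name `file` and the URL come from the cURL example; the helper name `build_detect_request` and the `audio/wav` part content type are assumptions.

```python
import uuid
import urllib.request

def build_detect_request(wav_bytes, filename="audio.wav",
                         url="https://zedigital-humanvad.fly.dev/api/detect"):
    """Build a multipart/form-data POST carrying one WAV file in the 'file' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: audio/wav\r\n\r\n",  # assumed part content type
        wav_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

To send it, pass the request to `urllib.request.urlopen` and decode the body with `json.load`; the response fields should then match the example response above.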

Comparison

HumanVAD vs. Competitors

Feature	HumanVAD	WebRTC VAD	Silero VAD
Accuracy (German)	96.7%	82.1%	89.3%
Latency	0.077ms	1.2ms	8.5ms
Memory	34KB	128KB	2.1MB
German Support	Native	None	Basic
Semantic Analysis	Yes	No	No
Prosodic Analysis	Yes	No	No
Turn-End Detection	Yes	No	No
Throughput	13,061/sec	1,200/sec	340/sec
Open Source	MIT	BSD	MIT

Research

Research Foundation

Prosodic Turn-Taking Cues in German

Research on how pitch declination and pre-boundary lengthening signal turn completion in German dialogue.

V2 Word Order and Sentence Completeness

German's V2 constraint as a reliable indicator of main-clause completeness in spoken language.

Hybrid VAD Systems for Conversational AI

Multi-modal voice activity detection combining acoustic and linguistic features for improved turn-taking.

Energy-Based Voice Activity Detection

Efficient RMS-based detection methods optimized for real-time telephony applications.