v2.0 -- Production Ready

German Voice Detection. 96.7% Accurate.

The most accurate voice activity detection for German speech. Built on neuroscience research, optimized for real-time telephone conversations.

96.7% Accuracy
0.077ms Latency
34KB Memory
Open Source
Technology Stack

Built With Leading Technology

Enterprise-grade audio processing infrastructure

Python
PyTorch
NumPy
pip (package manager)
GitHub
WebRTC
The Problem

Generic VAD fails on German

German's complex sentence structure -- with verb-final clauses, separable prefixes, and compound words -- confuses standard voice activity detectors. They trigger too early, interrupt mid-sentence, or miss turn endings entirely.

WebRTC VAD: 82.1% accuracy on German calls
No semantic understanding of sentence completeness
Interrupts during pauses within German subclauses
The Solution

Prosody + Semantic Analysis

HumanVAD combines prosodic analysis (pitch, energy, rhythm) with German-specific semantic analysis to determine when a speaker has truly finished their thought -- not just paused mid-sentence.

96.7% accuracy on German telephone calls
Understands verb-final and subclause structures
130x faster than real-time processing
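
To make "sentence completeness" concrete, here is a toy heuristic in Python. It is an illustrative assumption, not HumanVAD's actual semantic model: the word lists and the function name `looks_incomplete` are hypothetical. It treats a transcript as unfinished when it ends on a comma, a subordinating conjunction, or a word that still needs a following phrase.

```python
# Toy heuristic, NOT HumanVAD's semantic model. The word lists are
# small illustrative samples chosen for this sketch.
SUBORDINATORS = {"dass", "weil", "obwohl", "wenn", "ob", "damit", "während"}
DANGLING = {"der", "die", "das", "den", "dem", "ein", "eine", "einen",
            "mit", "für", "und", "oder", "aber"}


def looks_incomplete(transcript: str) -> bool:
    """Return True if the transcript probably stops mid-sentence."""
    text = transcript.strip()
    if not text:
        return True  # nothing has been said yet
    if text.endswith(","):
        return True  # a comma usually opens a further clause
    if text[-1] in ".!?":
        return False  # terminal punctuation closes the sentence
    last_word = text.split()[-1].lower()
    return last_word in SUBORDINATORS or last_word in DANGLING


print(looks_incomplete("Ich rufe an, weil"))              # True
print(looks_incomplete("Das Hotel hat fünfzig Zimmer."))  # False
```

A pause after "weil" is exactly the kind of mid-clause silence that an energy-only detector would mistake for a turn ending.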
Core Features

Built for German Voice AI

Hybrid Detection

Prosodic analysis (40-50%) combined with semantic analysis (50-60%) for maximum accuracy.
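
One plausible reading of these percentages is a weighted sum of two scores. The sketch below assumes that reading; `fuse_scores`, `is_turn_end`, and the 0.45 default weight are hypothetical names and values, and the 0.60 threshold simply mirrors the `turn_end_threshold` used in the Quick Start.

```python
def fuse_scores(prosody: float, semantic: float,
                prosody_weight: float = 0.45) -> float:
    """Weighted sum of two scores in [0, 1]; the semantic score gets the rest."""
    if not 0.0 <= prosody_weight <= 1.0:
        raise ValueError("prosody_weight must be between 0 and 1")
    return prosody_weight * prosody + (1.0 - prosody_weight) * semantic


def is_turn_end(prosody: float, semantic: float,
                threshold: float = 0.60,
                prosody_weight: float = 0.45) -> bool:
    """Declare a turn end when the fused score reaches the threshold."""
    return fuse_scores(prosody, semantic, prosody_weight) >= threshold


# Falling pitch but an unfinished clause: the fused score stays low.
print(is_turn_end(prosody=0.9, semantic=0.2))  # False
# Moderate prosodic cue and a complete sentence: the turn ends.
print(is_turn_end(prosody=0.5, semantic=0.9))  # True
```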

Domain-Specific

100% accuracy on Banking, Medical, and Restaurant calls. Optimized per domain.

Real-Time Streaming

0.077ms latency with 13,061 chunks/sec throughput. Process audio as it arrives.
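
The latency and throughput figures can be cross-checked with a few lines of arithmetic. The 10ms chunk size is an assumption taken from the smallest chunk size mentioned in the FAQ; with it, the numbers also line up with the "130x faster than real-time" claim.

```python
CHUNKS_PER_SECOND = 13_061  # throughput quoted above
CHUNK_MS = 10.0             # assumption: 10 ms of audio per chunk

# Average processing time per chunk, in milliseconds.
latency_ms = 1000.0 / CHUNKS_PER_SECOND

# Seconds of audio processed per second of wall-clock time.
realtime_factor = CHUNKS_PER_SECOND * CHUNK_MS / 1000.0

print(f"{latency_ms:.3f} ms per chunk")      # 0.077 ms per chunk
print(f"{realtime_factor:.1f}x real time")   # 130.6x real time
```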

Lightweight

34KB memory footprint. No GPU required. Runs on embedded devices and edge servers.

Benchmarks

HumanVAD vs. Competitors

System       Accuracy (German telephone)   Latency (lower is better)   Memory usage (lower is better)
HumanVAD     96.7%                         0.077ms                     34KB
Silero VAD   89.3%                         8.5ms                       2.1MB
WebRTC VAD   82.1%                         1.2ms                       128KB
Domain Testing

Accuracy by Domain

Banking: 100%
Medical: 100%
Restaurant: 100%
Customer Service: 100%
Retail: 100%
General: 90.9%
Travel: 87.5%
Quick Start

Get started in 30 seconds

Install from PyPI, initialize the detector, and start processing audio. Three lines of code to production-ready German VAD.

1. Install the package from PyPI
2. Initialize with domain-specific settings
3. Process audio chunks in real-time
main.py
from humanvad import ExcellenceVADGerman

# Initialize with default settings
vad = ExcellenceVADGerman(turn_end_threshold=0.60)

# Process audio with transcript.
# audio_chunk (16kHz, 16-bit mono PCM) and agent are supplied by your application.
result = vad.process_audio(
    audio_chunk,
    transcript="Das Hotel hat fünfzig Zimmer.",  # "The hotel has fifty rooms."
)

# Start speaking when the detector returns the 'interrupt' action
if result['action'] == 'interrupt':
    agent.start_speaking()
Live Preview

See HumanVAD in Action

Processing German audio: banking_call_042.wav
Turn end detected: "Ich möchte gerne mein Konto überprüfen, bitte." ("I would like to check my account, please.")
Confidence: 96.7%

Open Source

Free to Use. Professional Support Available.

HumanVAD is MIT licensed and free forever. Professional support plans fund continued development and research.

Enterprise Support
€199/mo

Dedicated channel, 4hr/mo consulting, SLA, on-call support

API Access
€0.002/req

Pay-per-use hosted API at zedigital-humanvad.fly.dev

FAQ

Frequently Asked Questions

Does HumanVAD work with languages other than German?

HumanVAD is optimized specifically for German. Its prosodic and semantic analysis models are trained on German telephone conversations. The basic energy-based VAD works on any language, but the accuracy advantage of the hybrid approach is German-specific.
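
For readers unfamiliar with the term, a language-independent energy-based VAD can be sketched in a few lines of Python. This is the generic textbook technique, not HumanVAD's code; the function names, the 160-sample frame (10ms at 16kHz), and the threshold of 500 are illustrative assumptions.

```python
import math


def frame_rms(samples):
    """Root-mean-square amplitude of one frame of integer PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_vad(samples, frame_len=160, threshold=500.0):
    """Return one bool per full frame: True where RMS exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) > threshold)
    return flags


# One frame of silence followed by one frame of a 200 Hz tone at 16 kHz.
silence = [0] * 160
tone = [int(8000 * math.sin(2 * math.pi * 200 * n / 16000)) for n in range(160)]
print(energy_vad(silence + tone))  # [False, True]
```

Because it looks only at loudness, a detector like this cannot tell a mid-sentence pause from the end of a turn, whatever the language.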

What audio formats does HumanVAD support?

HumanVAD accepts 16kHz, 16-bit mono PCM audio. The library includes utilities for converting from other formats. The hosted API accepts WAV, WebM, and MP3 files and converts them automatically.
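
Before feeding a file to the detector, it can help to confirm that it already matches the expected format. The sketch below uses only Python's standard `wave` module and does not rely on the library's own conversion utilities; `is_supported_wav` and `make_silent_wav` are hypothetical helper names introduced for this example.

```python
import io
import wave


def is_supported_wav(source) -> bool:
    """True if the WAV (path or file object) is 16 kHz, 16-bit, mono PCM."""
    with wave.open(source, "rb") as wav:
        return (wav.getframerate() == 16000
                and wav.getsampwidth() == 2      # 2 bytes = 16 bits
                and wav.getnchannels() == 1)


def make_silent_wav(rate: int, channels: int = 1,
                    seconds: float = 0.1) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)
    buf.seek(0)
    return buf


print(is_supported_wav(make_silent_wav(16000)))  # True
print(is_supported_wav(make_silent_wav(8000)))   # False
```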

Can I use HumanVAD in commercial projects?

Yes. HumanVAD is released under the MIT license. You can use it in any project, commercial or otherwise, subject only to the terms of that license. Attribution is appreciated but not required.

How does the hosted API work?

The hosted API at zedigital-humanvad.fly.dev provides the same detection capabilities through REST endpoints. It handles audio format conversion, scales automatically, and requires no local installation. Pricing is €0.002 per request.

What is the minimum audio chunk size?

HumanVAD processes audio in chunks as small as 10ms. For the hybrid prosodic and semantic analysis, at least 200ms of audio context is recommended for optimal accuracy. The streaming API accumulates this context automatically.
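
To see what those numbers mean in bytes, here is a minimal sketch of 10ms chunking with a rolling 200ms context window. It is not the streaming API itself, which accumulates context for you; `iter_chunks` and `iter_context_windows` are hypothetical names, and the input is assumed to be raw 16kHz, 16-bit mono PCM.

```python
from collections import deque

SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 10
CONTEXT_MS = 200

CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 320 bytes
CONTEXT_CHUNKS = CONTEXT_MS // CHUNK_MS                          # 20 chunks


def iter_chunks(pcm: bytes):
    """Yield consecutive full 10 ms chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(pcm) - CHUNK_BYTES + 1, CHUNK_BYTES):
        yield pcm[start:start + CHUNK_BYTES]


def iter_context_windows(pcm: bytes):
    """Yield the latest 200 ms of audio once 20 chunks have arrived."""
    window = deque(maxlen=CONTEXT_CHUNKS)  # oldest chunk drops off automatically
    for chunk in iter_chunks(pcm):
        window.append(chunk)
        if len(window) == CONTEXT_CHUNKS:
            yield b"".join(window)


one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # one second of silence
print(len(list(iter_chunks(one_second))))           # 100
print(len(list(iter_context_windows(one_second))))  # 81
```

The first window becomes available after 200ms of audio, and each later 10ms chunk produces a new one, which is why one second of audio yields 81 windows.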