AI Engineering · Production

AI Speech-to-Text Journaling API

Turning spoken words into structured, speaker-attributed intelligence

A RESTful API that transforms raw audio into speaker-attributed transcripts using a multi-model pipeline: Whisper for transcription, PyAnnote for diarization, and Gemma-3 for insight extraction.

Faster-Whisper · PyAnnote 3.1 · Gemma-3 · FastAPI · Celery · Redis · pydub · Docker
Problem Statement

Manual note-taking loses context and accuracy, offers no speaker attribution, and turns long recordings into hours of work while inconsistent formats hinder downstream search and analytics.

  • Manual note-taking loses context and accuracy.
  • No speaker attribution: who said what is unclear.
  • Long recordings take hours to process by hand.
  • Inconsistent formats hinder searchability.
Headline Outcomes
  • Cost per hour of audio: ~$0.10, versus ~$90 for a typist (99% saved)
  • Time-to-transcript (60 min): 8–12 min, versus 3–4 hours (20× faster)
  • Speaker attribution accuracy: >90% (PyAnnote 3.1)
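The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above and treats the 3–4 hour figure as the manual baseline, which is an assumption; the variable names are illustrative.

```python
# Figures quoted in the headline outcomes (per hour of audio).
api_cost_usd = 0.10      # approximate pipeline cost
typist_cost_usd = 90.0   # approximate cost of a human typist

saving = 1 - api_cost_usd / typist_cost_usd
print(f"cost saving: {saving:.1%}")

# Time-to-transcript for a 60-minute recording, in minutes.
manual_min, manual_max = 3 * 60, 4 * 60   # 3-4 hours (assumed manual baseline)
api_min, api_max = 8, 12                  # 8-12 minutes via the API

slowest_speedup = manual_min / api_max    # least favourable pairing
fastest_speedup = manual_max / api_min    # most favourable pairing
print(f"speed-up range: {slowest_speedup:.0f}x to {fastest_speedup:.0f}x")
```

The quoted 20× sits inside the resulting range, and the cost saving rounds to the quoted 99% or slightly better.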

The Solution

A fully asynchronous, queue-backed API that accepts audio uploads and returns structured, speaker-attributed transcripts. It is modular, scalable and format-agnostic, with three processing modes behind a single endpoint.
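One way to put several processing modes behind a single endpoint is to let the request carry a mode field that selects which pipeline stages run. The sketch below is an assumption about how that could look: the mode names (`transcribe`, `diarize`, `insights`), the stage names and the `plan_stages` helper are all hypothetical, not taken from the project.

```python
from enum import Enum


class Mode(str, Enum):
    # Hypothetical mode names; the project does not name its three modes here.
    TRANSCRIBE = "transcribe"   # text only
    DIARIZE = "diarize"         # text + speaker labels
    INSIGHTS = "insights"       # text + speaker labels + extracted insights


# Model stages each mode adds on top of the shared normalize/chunk steps.
_STAGES = {
    Mode.TRANSCRIBE: ["whisper"],
    Mode.DIARIZE: ["whisper", "pyannote"],
    Mode.INSIGHTS: ["whisper", "pyannote", "gemma"],
}


def plan_stages(mode: str) -> list[str]:
    """Return the ordered pipeline stages for a requested mode."""
    selected = Mode(mode)  # raises ValueError for an unknown mode
    return ["normalize", "chunk", *_STAGES[selected]]


print(plan_stages("diarize"))
```

Keeping the mode-to-stage mapping in one table means the endpoint only validates the mode and enqueues a job; the worker decides what to run.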

Faster-Whisper (large, int8) handles transcription; int8 quantization halves CPU memory vs fp32.

PyAnnote 3.1 assigns per-speaker labels with millisecond timestamps.
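Combining transcription with diarization amounts to matching each transcribed segment to the speaker turn it overlaps most. The sketch below assumes both model outputs have already been reduced to plain tuples with times in seconds; the `Turn` and `Segment` types, the `assign_speakers` helper and the sample data are illustrative, not the project's code.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    start: float
    end: float
    speaker: str


class Segment(NamedTuple):
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[Segment], turns: list[Turn]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labelled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        has_overlap = best is not None and overlap(seg.start, seg.end, best.start, best.end) > 0
        labelled.append({
            "start": seg.start,
            "end": seg.end,
            "speaker": best.speaker if has_overlap else "UNKNOWN",
            "text": seg.text,
        })
    return labelled


turns = [Turn(0.0, 4.2, "SPEAKER_00"), Turn(4.2, 9.0, "SPEAKER_01")]
segments = [Segment(0.3, 3.9, "Shall we start?"), Segment(4.0, 8.5, "Yes, go ahead.")]
for row in assign_speakers(segments, turns):
    print(row["speaker"], row["text"])
```

Segments that overlap no turn at all are labelled `UNKNOWN` rather than guessed, which keeps attribution errors visible downstream.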

Gemma-3-27B extracts structured insights on-device, eliminating cloud API costs and producing RAG-ready output.

pydub splits long audio into overlapping 3-min chunks to prevent boundary-cut errors; Celery scales workers.
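The chunk boundaries can be computed independently of the audio library. The sketch below works in milliseconds, the unit pydub uses for slicing, and assumes a 5-second overlap because the overlap length is not stated here; `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(
    total_ms: int, chunk_ms: int = 180_000, overlap_ms: int = 5_000
) -> list[tuple[int, int]]:
    """Return (start, end) pairs covering [0, total_ms) with overlapping chunks."""
    if not 0 <= overlap_ms < chunk_ms:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    step = chunk_ms - overlap_ms  # distance between consecutive chunk starts
    bounds = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
        if end == total_ms:
            break  # the final chunk reached the end of the recording
        start += step
    return bounds


# A 7-minute recording split into 3-minute chunks with 5 s of overlap.
for start, end in chunk_bounds(7 * 60_000):
    print(start, end)
```

With pydub, each pair would then slice the loaded audio as `audio[start:end]`, and each slice could be handed to a separate Celery task.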

System Architecture

How the data flows

01 Audio Upload: 6+ formats
02 Format Normalize: pydub conversion
03 Chunk & Queue: Celery + Redis
04 Whisper + PyAnnote: Transcribe + diarize
05 Structured Output: JSON + Markdown
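As an illustration of the final stage, the sketch below renders a list of speaker-attributed segments as both JSON and Markdown. The field names (`start`, `end`, `speaker`, `text`) and the timestamp layout are assumptions; the project's actual schema is not shown here.

```python
import json


def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def to_markdown(segments: list[dict]) -> str:
    """One Markdown line per segment: bold speaker, timestamp range, text."""
    lines = [
        f"**{s['speaker']}** [{fmt_ts(s['start'])} - {fmt_ts(s['end'])}]: {s['text']}"
        for s in segments
    ]
    return "\n".join(lines)


segments = [
    {"start": 0.3, "end": 3.9, "speaker": "SPEAKER_00", "text": "Shall we start?"},
    {"start": 64.25, "end": 68.5, "speaker": "SPEAKER_01", "text": "Yes, go ahead."},
]

print(json.dumps({"segments": segments}, indent=2))
print(to_markdown(segments))
```

The JSON form suits downstream summarisation or retrieval pipelines, while the Markdown form is meant for people reading the transcript directly.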

Result 01

Auto-captures meeting minutes, call records and interview logs at scale.

Result 02

Zero vendor lock-in: a fully open-source stack, cloud-deployable with no infra changes.

Result 03

Plug-and-play JSON API feeds LLM summarisation and RAG pipelines downstream.
