Building Aurel-AI: A Personal AI Assistant with Ollama, TTS, STT, and a 3D Model

Voice-driven assistants are no longer a luxury feature; voice is becoming a common way to interact with AI. When we started designing Aurel-AI, our goal was a personal, private, uncensored AI assistant that runs locally, integrates with Ollama, and supports speech-to-text (STT) and text-to-speech (TTS) for natural, conversational interaction.

In this post, we’ll walk through:

The architecture of Aurel-AI
How we wired up frontend ↔ backend ↔ Ollama ↔ TTS/STT
Key development challenges and solutions
A diagram of the system workflow

Why Ollama?

We chose Ollama because it’s:

Local-first (your data stays private)
Lightweight and easy to integrate with custom APIs
Flexible with multiple open-source models

This makes it a strong backbone for an uncensored personal assistant that you fully control.

Core Workflow

The assistant works in a loop of voice and text:

User speaks a query
→ STT converts speech into text.
Backend API sends the text to Ollama
→ Ollama generates a response.
Backend transforms Ollama’s text output into speech
→ TTS generates audio for playback.
Frontend delivers both text + audio to the user
→ User can continue by speaking again (continuous conversation).
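
The loop above can be sketched as a single function that takes the three engines as parameters. This is a minimal illustration, not Aurel-AI's actual code: the type names (`SpeechToText`, `ChatModel`, `TextToSpeech`), the `runTurn` function, and the stub engines are all hypothetical and exist only to show the order of the steps.

```typescript
// Hypothetical engine signatures; the real interfaces may differ.
type SpeechToText = (audio: Uint8Array) => Promise<string>;
type ChatModel = (prompt: string) => Promise<string>;
type TextToSpeech = (text: string) => Promise<Uint8Array>;

interface Engines {
  stt: SpeechToText;
  llm: ChatModel;
  tts: TextToSpeech;
}

interface TurnResult {
  userText: string;
  replyText: string;
  replyAudio: Uint8Array;
}

// Run one voice turn: audio in -> text -> model reply -> audio out.
async function runTurn(audio: Uint8Array, engines: Engines): Promise<TurnResult> {
  const userText = (await engines.stt(audio)).trim();
  if (userText.length === 0) {
    throw new Error("STT returned no text; nothing to send to the model");
  }
  const replyText = await engines.llm(userText);
  const replyAudio = await engines.tts(replyText);
  return { userText, replyText, replyAudio };
}

// Stub engines so the flow can be exercised without real models.
const stubEngines: Engines = {
  stt: async () => " what time is it ",
  llm: async (prompt) => `You asked: ${prompt}`,
  // Fake "audio": one zero byte per character of the reply.
  tts: async (text) => new Uint8Array(text.length),
};
```

Passing the engines in as plain functions means a local engine and a cloud one can be swapped without touching the turn logic.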

System Architecture Diagram

Here’s a simplified view of the workflow:

flowchart TD
    A[User Voice Input 🎤] --> B[Frontend STT API Call]
    B --> C[Backend STT Engine]
    C --> D[Text Query]
    D --> E[Backend API → Ollama]
    E --> F[Ollama LLM Response]
    F --> G[Backend TTS Engine]
    G --> H[Audio Response 🔊 + Text Display]
    H --> A

This loop creates the real-time voice assistant experience.

Frontend

Built with Angular for a lightweight UI.
Microphone access for voice capture.
Displays chat history (like a messenger).
Streams audio responses from the backend in near real-time.

Backend

Node.js/Express server that connects everything together.
Routes for:
- /stt → handles user voice input, passes to STT engine.
- /chat → sends text queries to Ollama and streams back results.
- /tts → converts Ollama’s responses into playable audio.
Manages session context so conversations feel natural.
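
To make the /chat route more concrete: with `stream: true`, Ollama's `/api/chat` endpoint replies with newline-delimited JSON, one object per line, each carrying a `message.content` fragment and a `done` flag. The sketch below shows one way a backend could build the request body and reassemble fragments from network chunks that may split a line in half. The class and function names are hypothetical, and the model name in the usage note is only a placeholder.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Body for POST /api/chat on a local Ollama server (default http://localhost:11434).
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return { model, messages, stream: true };
}

// Incremental parser for Ollama's streaming reply (newline-delimited JSON).
class OllamaStreamParser {
  private buffer = "";
  done = false;

  // Feed one network chunk; returns the text fragments completed by it.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""); keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const fragments: string[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      const event = JSON.parse(line) as {
        message?: { content?: string };
        done?: boolean;
      };
      if (event.message?.content) fragments.push(event.message.content);
      if (event.done) this.done = true;
    }
    return fragments;
  }
}
```

In an Express handler, the body from `buildChatRequest("llama3", history)` would be sent with `fetch`, and each decoded chunk of the response passed to `push`, forwarding the returned fragments to the client as they arrive.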

Speech-to-Text (STT)

For STT, we used:

Whisper.cpp (fast, runs locally, good accuracy)
Alternatives: Google Cloud Speech-to-Text or the OpenAI Whisper API (for cloud setups).

STT converts raw audio into the text query that is sent to Ollama.
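
As a rough illustration of the local path, the sketch below builds the argument list for the whisper.cpp command-line tool and tidies its output. It uses the CLI's `-m` (model), `-f` (input file), `-nt` (no timestamps), and `-l` (language) flags. The file paths, option names, and helper functions are assumptions for the example, not details of Aurel-AI's backend.

```typescript
interface WhisperOptions {
  modelPath: string; // a ggml model file, downloaded separately
  audioPath: string; // whisper.cpp has traditionally expected 16 kHz WAV input
  language?: string; // spoken language code, such as "en"
}

// Build CLI arguments for the whisper.cpp binary.
function buildWhisperArgs(opts: WhisperOptions): string[] {
  const args = ["-m", opts.modelPath, "-f", opts.audioPath, "-nt"];
  if (opts.language) args.push("-l", opts.language);
  return args;
}

// The transcript arrives on stdout, possibly spread over several lines;
// collapse all whitespace so the model receives a single-line query.
function cleanTranscript(stdout: string): string {
  return stdout.replace(/\s+/g, " ").trim();
}
```

The arguments can then be passed to `execFile` from `node:child_process` along with the path to the whisper.cpp binary, which is named `main` in older builds and `whisper-cli` in newer ones.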

Text-to-Speech (TTS)

For TTS, we integrated:

Local TTS engines such as Piper or Coqui TTS (lightweight; they run alongside Ollama as separate components).
Cloud alternatives (Amazon Polly, Azure TTS, ElevenLabs) for higher-quality voices.

The backend streams TTS audio chunks so the user hears responses quickly, even before the full text is generated.
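
One common way to start speaking before the full reply exists is to cut the streamed text into sentences and synthesize each sentence as soon as it is complete. The sketch below shows that idea; `SentenceChunker` is a hypothetical helper, and whether Aurel-AI splits on sentences or on some other unit is an assumption.

```typescript
// Buffers streamed text and releases it one complete sentence at a time.
class SentenceChunker {
  private buffer = "";

  // Add a streamed fragment; returns any sentences it completed.
  push(fragment: string): string[] {
    this.buffer += fragment;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call once the model has finished to release any remaining text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Each returned sentence can be sent to the TTS engine right away while the model keeps generating. The boundary rule is deliberately simple and will also split after abbreviations such as "e.g.", which a production version would need to handle.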

Lessons Learned

Latency matters: Real-time assistants feel broken if TTS or STT lags behind. Streaming both text and audio keeps the perceived delay low.
Privacy vs. quality: Local models (Whisper, Piper) are great for privacy, but cloud APIs often sound better. We made both options pluggable.
Context handling: Keeping track of the conversation is key. Ollama's configurable context window (the `num_ctx` option) and concise prompts made this manageable.
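
For the context-handling point, one simple approach is to keep the system prompt and drop the oldest turns once the history exceeds a token budget. The sketch below illustrates that approach and is not necessarily what Aurel-AI does. The `Turn` type and function names are hypothetical, and the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer.

```typescript
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough size estimate: about 4 characters per token (heuristic only).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep all system messages plus the most recent turns that fit in the budget.
function trimHistory(history: Turn[], maxTokens: number): Turn[] {
  const system = history.filter((t) => t.role === "system");
  const dialogue = history.filter((t) => t.role !== "system");
  let used = system.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  const kept: Turn[] = [];
  // Walk from newest to oldest and stop at the first turn that does not fit.
  for (const turn of [...dialogue].reverse()) {
    const cost = estimateTokens(turn.content);
    if (used + cost > maxTokens) break;
    kept.unshift(turn);
    used += cost;
  }
  return [...system, ...kept];
}
```

The budget should sit below the model's context window so that the reply still has room to be generated.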

What’s Next?

We’re working on:

Adding wake word detection (hands-free use).
Custom personalities and voices.
Offline mobile version using on-device models.