Building Aurel-AI: A Personal AI Assistant with Ollama, TTS, STT, and a 3D Model

Voice-driven assistants are no longer a luxury feature; they are becoming a default way to interact with AI. When we started designing Aurel-AI, our goal was a personal, private, uncensored AI assistant that runs locally, integrates with Ollama, and supports speech-to-text (STT) and text-to-speech (TTS) for natural, conversational interaction.

In this post, we’ll walk through:

  • The architecture of Aurel-AI
  • How we wired up frontend ↔ backend ↔ Ollama ↔ TTS/STT
  • Key development challenges and solutions
  • A diagram of the system workflow

Why Ollama?

We chose Ollama because it’s:

  • Local-first (your data stays private)
  • Lightweight and easy to integrate with custom APIs
  • Flexible with multiple open-source models

This makes it the perfect backbone for an uncensored personal assistant that you fully control.


Core Workflow

The assistant works in a loop of voice and text:

  1. User speaks a query
    → STT converts speech into text.
  2. Backend API sends the text to Ollama
    → Ollama generates a response.
  3. Backend transforms Ollama’s text output into speech
    → TTS generates audio for playback.
  4. Frontend delivers both text + audio to the user
    → User can continue by speaking again (continuous conversation).
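
The four steps above can be sketched as a single function that receives the three engines as parameters. This is a minimal illustration under our own assumptions, not the Aurel-AI source: the interface names (SpeechToText, ChatModel, TextToSpeech) and the handleTurn function are hypothetical, and real engines would stream rather than return whole results.

```typescript
// Hypothetical engine interfaces; the names are illustrative only.
interface SpeechToText {
  transcribe(audio: Uint8Array): Promise<string>;
}
interface ChatModel {
  reply(prompt: string): Promise<string>;
}
interface TextToSpeech {
  synthesize(text: string): Promise<Uint8Array>;
}

interface TurnResult {
  userText: string;
  replyText: string;
  replyAudio: Uint8Array;
}

// One pass through the loop: voice in, text and audio out.
async function handleTurn(
  audio: Uint8Array,
  stt: SpeechToText,
  llm: ChatModel,
  tts: TextToSpeech
): Promise<TurnResult> {
  // Step 1: speech to text.
  const userText = (await stt.transcribe(audio)).trim();
  if (userText.length === 0) {
    throw new Error("STT returned an empty transcript");
  }
  // Step 2: text query to the language model (Ollama in this post).
  const replyText = await llm.reply(userText);
  // Step 3: model reply to speech.
  const replyAudio = await tts.synthesize(replyText);
  // Step 4: the frontend receives both text and audio.
  return { userText, replyText, replyAudio };
}
```

Passing the engines in as arguments keeps the loop independent of any particular STT or TTS backend, which is one way to get the pluggability discussed later in the post.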

System Architecture Diagram

Here’s a simplified view of the workflow:

flowchart TD
    A[User Voice Input 🎤] --> B[Frontend STT API Call]
    B --> C[Backend STT Engine]
    C --> D[Text Query]
    D --> E[Backend API → Ollama]
    E --> F[Ollama LLM Response]
    F --> G[Backend TTS Engine]
    G --> H[Audio Response 🔊 + Text Display]
    H --> A

This loop creates the real-time voice assistant experience.


Frontend

  • Built with Angular for a lightweight UI.
  • Microphone access for voice capture.
  • Displays chat history (like a messenger).
  • Streams audio responses from the backend in near real-time.

Backend

  • Node.js/Express server that connects everything together.
  • Routes for:
    • /stt → handles user voice input, passes to STT engine.
    • /chat → sends text queries to Ollama and streams back results.
    • /tts → converts Ollama’s responses into playable audio.
  • Manages session context so conversations feel natural.
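
For the /chat route, Ollama's REST API accepts a POST to /api/chat (port 11434 by default) with model, messages, and stream fields. With streaming enabled it answers with newline-delimited JSON objects, each carrying a message.content fragment and a done flag. The sketch below shows only the two pure pieces of that exchange; the helper names and the "llama3" model name are placeholders of ours, and the HTTP call and Express wiring are left out.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Body for POST http://localhost:11434/api/chat (Ollama's default address).
// The model name is a placeholder; use whichever model you have pulled.
function buildChatRequest(messages: ChatMessage[], model = "llama3") {
  return { model, messages, stream: true };
}

// Incrementally parses the newline-delimited JSON that Ollama streams back.
// A network chunk can end in the middle of a JSON line, so the unfinished
// tail is kept in `pending` until its newline arrives.
class OllamaStreamParser {
  private pending = "";
  done = false;

  // Feed one decoded chunk; returns the text fragments it completed.
  push(chunk: string): string[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    this.pending = lines.pop() ?? "";
    const fragments: string[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      const parsed = JSON.parse(line) as {
        message?: { content?: string };
        done?: boolean;
      };
      const text = parsed.message?.content ?? "";
      if (text !== "") fragments.push(text);
      if (parsed.done === true) this.done = true;
    }
    return fragments;
  }
}
```

Each fragment returned by push can be forwarded to the frontend immediately, which is what makes the text appear as it is generated.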

Speech-to-Text (STT)

For STT, we used:

  • Whisper.cpp (fast, runs locally, good accuracy)
  • Alternative: Google Cloud Speech-to-Text or the OpenAI Whisper API (for cloud setups).

STT is crucial for converting raw audio into a text query Ollama can understand.
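
One practical detail: whisper.cpp's example tools expect 16-bit WAV input (its README converts audio to 16 kHz mono first), while browser capture usually yields floating-point samples. Below is a hedged sketch of encoding float samples as a 16-bit PCM WAV file. The function name encodeWav16 is ours, and the sketch assumes the samples are already mono and resampled to 16 kHz; it is not necessarily how Aurel-AI prepares audio.

```typescript
// Encodes mono float samples (range -1..1) as a 16-bit PCM WAV file.
// Assumes the samples are already at `sampleRate` (16 kHz by default).
function encodeWav16(samples: Float32Array, sampleRate = 16000): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // 44-byte RIFF/WAVE header, little-endian.
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample and scale it to the signed 16-bit range.
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  });

  return new Uint8Array(buffer);
}
```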


Text-to-Speech (TTS)

For TTS, we integrated:

  • Local TTS engines like Piper or Coqui TTS (lightweight; they run alongside Ollama, which only produces text).
  • Cloud alternatives (Amazon Polly, Azure TTS, ElevenLabs) for higher-quality voices.

The backend synthesizes and streams audio in chunks as text arrives, so the user starts hearing a response before the full text has been generated.
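
One common way to start audio early is to cut the streamed text at sentence boundaries and hand each finished sentence to the TTS engine while the model keeps generating. The post does not say how Aurel-AI chunks its text, so treat the following as an assumption. SentenceBuffer is a hypothetical name, and the boundary rule is deliberately naive (it will also split after abbreviations such as "e.g.").

```typescript
// Collects streamed text fragments and releases complete sentences.
class SentenceBuffer {
  private pending = "";

  // Add a fragment; returns any sentences that are now complete.
  push(fragment: string): string[] {
    this.pending += fragment;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace. Requiring the
    // whitespace avoids cutting "3.14" or a fragment that ends mid-number.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call once the model is done to release whatever text is left.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest === "" ? [] : [rest];
  }
}
```

Each sentence returned by push can go straight to the TTS engine, so playback of the first sentence overlaps with generation of the rest.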


Lessons Learned

  • Latency matters: Real-time assistants feel broken if TTS/STT lag behind. Streaming both text and audio hides most of that delay.
  • Privacy vs. quality: Local models (Whisper, Piper) are great for privacy, but cloud APIs often sound better. We made both options pluggable.
  • Context handling: Keeping track of the conversation is key. Staying within the model’s context window and keeping prompts compact made this manageable.
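
To make the context-handling point concrete, here is a minimal sketch of trimming conversation history to a token budget before each request. It is our own illustration, not the Aurel-AI implementation: trimHistory and estimateTokens are hypothetical names, and the "about 4 characters per token" estimate is a rough assumption rather than a real tokenizer.

```typescript
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Very rough estimate (about 4 characters per token); an assumption,
// not a tokenizer. Real counts depend on the model.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps every system message plus the most recent turns that fit
// within `budget` estimated tokens.
function trimHistory(turns: Turn[], budget: number): Turn[] {
  const system = turns.filter((t) => t.role === "system");
  const rest = turns.filter((t) => t.role !== "system");
  let used = system.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  const kept: Turn[] = [];
  // Walk from newest to oldest so recent turns are kept first.
  for (const turn of [...rest].reverse()) {
    const cost = estimateTokens(turn.content);
    if (used + cost > budget) break;
    used += cost;
    kept.unshift(turn);
  }
  return [...system, ...kept];
}
```

Dropping the oldest turns first is the simplest policy; summarizing them instead is a common alternative when older context still matters.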

What’s Next?

We’re working on:

  • Adding wake word detection (hands-free use).
  • Custom personalities and voices.
  • Offline mobile version using on-device models.