Building Aurel-AI: A Personal AI Assistant with Ollama, TTS, STT, and a 3D Model
Voice-driven assistants are no longer a luxury feature; they’re becoming the default way we interact with AI. When we started designing Aurel-AI, our goal was to create a personal, private, uncensored AI assistant that runs locally, integrates with Ollama, and supports speech-to-text (STT) and text-to-speech (TTS) for natural, conversational interaction.
In this post, we’ll walk through:
- The architecture of Aurel-AI
- How we wired up frontend ↔ backend ↔ Ollama ↔ TTS/STT
- Key development challenges and solutions
- A diagram of the system workflow
Why Ollama?
We chose Ollama because it’s:
- Local-first (your data stays private)
- Lightweight and easy to integrate with custom APIs
- Flexible with multiple open-source models
This makes it the perfect backbone for an uncensored personal assistant that you fully control.
Core Workflow
The assistant works in a loop of voice and text:
- User speaks a query
  → STT converts speech into text.
- Backend API sends the text to Ollama
  → Ollama generates a response.
- Backend transforms Ollama’s text output into speech
  → TTS generates audio for playback.
- Frontend delivers both text + audio to the user
  → User can continue by speaking again (continuous conversation).
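One pass through this loop can be sketched as a single async function. This is a minimal illustration, not the actual Aurel-AI code: the `SttEngine`, `ChatEngine`, and `TtsEngine` interfaces, the `runTurn` function, and the stub engines are all hypothetical names introduced here so the example runs without whisper.cpp, Ollama, or a TTS engine installed.

```typescript
// Hypothetical engine interfaces; the real project may shape these differently.
interface SttEngine {
  transcribe(audio: Uint8Array): Promise<string>;
}
interface ChatEngine {
  reply(prompt: string): Promise<string>;
}
interface TtsEngine {
  synthesize(text: string): Promise<Uint8Array>;
}

interface TurnResult {
  userText: string;
  replyText: string;
  replyAudio: Uint8Array;
}

// Runs one pass of the loop: speech -> text -> LLM reply -> speech.
async function runTurn(
  audio: Uint8Array,
  stt: SttEngine,
  chat: ChatEngine,
  tts: TtsEngine,
): Promise<TurnResult> {
  const userText = (await stt.transcribe(audio)).trim();
  if (userText.length === 0) {
    // Nothing was recognized, so there is no query to send to the model.
    throw new Error("STT returned an empty transcript");
  }
  const replyText = await chat.reply(userText);
  const replyAudio = await tts.synthesize(replyText);
  return { userText, replyText, replyAudio };
}

// Stub engines standing in for the real STT, LLM, and TTS backends.
const stubStt: SttEngine = { transcribe: async () => " what time is it? " };
const stubChat: ChatEngine = { reply: async (p) => `You asked: ${p}` };
const stubTts: TtsEngine = {
  // One fake byte per character, just to produce some "audio".
  synthesize: async (t) => new Uint8Array(t.length),
};
```

Passing the engines in as arguments keeps each stage swappable, so a local engine and a cloud engine can sit behind the same interface.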
System Architecture Diagram
Here’s a simplified view of the workflow:
flowchart TD
A[User Voice Input 🎤] --> B[Frontend STT API Call]
B --> C[Backend STT Engine]
C --> D[Text Query]
D --> E[Backend API → Ollama]
E --> F[Ollama LLM Response]
F --> G[Backend TTS Engine]
G --> H[Audio Response 🔊 + Text Display]
H --> A
This loop creates the real-time voice assistant experience.
Frontend
- Built with Angular for a lightweight UI.
- Microphone access for voice capture.
- Displays chat history (like a messenger).
- Streams audio responses from the backend in near real-time.
Backend
- Node.js/Express server that connects everything together.
- Routes for:
  - /stt → handles user voice input, passes to STT engine.
  - /chat → sends text queries to Ollama and streams back results.
  - /tts → converts Ollama’s responses into playable audio.
- Manages session context so conversations feel natural.
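To make the /chat step more concrete, here is a rough sketch of the two pieces a route like that needs: a request body for Ollama's chat endpoint and a parser for its streamed reply. Ollama's `/api/chat` accepts a `model`, a `messages` array, and a `stream` flag, and streams back newline-delimited JSON objects. The helper names (`buildChatRequest`, `NdjsonChatParser`) and the model name are assumptions for illustration, not taken from the Aurel-AI codebase.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  model: string;
  stream: boolean;
  messages: ChatMessage[];
}

// Placeholder model name; use whichever model has been pulled locally.
const MODEL = "llama3";

// Builds the JSON body to POST to Ollama's chat endpoint
// (by default http://localhost:11434/api/chat).
function buildChatRequest(history: ChatMessage[], userText: string): ChatRequest {
  return {
    model: MODEL,
    stream: true,
    messages: [...history, { role: "user", content: userText }],
  };
}

// Incrementally parses the newline-delimited JSON stream Ollama sends back.
class NdjsonChatParser {
  private buffer = "";
  done = false;

  // Feed a raw text chunk; returns the reply fragments completed by it.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is a partial line (or ""), so keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const pieces: string[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      const obj = JSON.parse(line) as {
        message?: { content?: string };
        done?: boolean;
      };
      if (obj.message?.content) pieces.push(obj.message.content);
      if (obj.done) this.done = true;
    }
    return pieces;
  }
}
```

Network chunks do not have to line up with JSON lines, which is why the parser holds back the trailing partial line instead of parsing each chunk directly.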
Speech-to-Text (STT)
For STT, we used:
- whisper.cpp (fast, runs locally, good accuracy)
- Alternatives: Google Cloud Speech-to-Text or the OpenAI Whisper API (for cloud setups).
STT is crucial for converting raw audio into a text query Ollama can understand.
Text-to-Speech (TTS)
For TTS, we integrated:
- Local TTS engines that run alongside Ollama, like Piper or Coqui TTS (lightweight).
- Cloud alternatives (Amazon Polly, Azure TTS, ElevenLabs) for higher-quality voices.
The backend streams TTS audio chunks so the user hears responses quickly, even before the full text is generated.
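One common way to get audio started before the full reply exists is to cut the streamed text into sentences and hand each one to TTS as soon as it is complete. The sketch below shows that idea only. `SentenceChunker` is a hypothetical helper, and its sentence boundary rule (`.`, `!`, or `?` followed by whitespace) is deliberately naive: it would split on abbreviations like "e.g.", for instance.

```typescript
// Buffers streamed LLM text and emits complete sentences for TTS.
class SentenceChunker {
  private buffer = "";

  // Add newly streamed text; returns any sentences that are now complete.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Treat ., !, or ? followed by whitespace as the end of a sentence.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next call.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call once the LLM stream ends to get whatever text is left over.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}

// Usage: feed text fragments as they arrive from the model.
const chunker = new SentenceChunker();
const ready = chunker.push("Hello there. How ");
// `ready` now holds the first full sentence, which could be sent to TTS
// while the model is still generating the rest of the reply.
```

Requiring whitespace after the punctuation means a fragment ending in "3." is held back until the next piece arrives, so numbers like "3.14" are not cut in half.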
Lessons Learned
- Latency matters: Real-time assistants feel broken if TTS/STT lag behind. Streaming both text and audio solves this.
- Privacy vs. quality: Local models (Whisper, Piper) are great for privacy, but cloud APIs often sound better. We made both options pluggable.
- Context handling: Keeping track of the conversation is key. Staying within the model’s context window in Ollama and keeping prompts compact made this manageable.
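As an illustration of the context-handling point, one simple approach is to keep the system prompt plus as many of the most recent turns as fit a token budget. The sketch below assumes a crude estimate of about 4 characters per token, which is a heuristic and not the model's real tokenizer. `Turn`, `estimateTokens`, and `trimHistory` are hypothetical names, and the budget would be chosen somewhere below the model's configured context length.

```typescript
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Very rough token estimate (about 4 characters per token).
// This is a heuristic, not the model's actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps a leading system prompt (if any) plus the newest turns that fit the budget.
function trimHistory(history: Turn[], maxTokens: number): Turn[] {
  const system: Turn[] =
    history.length > 0 && history[0].role === "system" ? [history[0]] : [];
  const rest = history.slice(system.length);
  let used = system.reduce((n, t) => n + estimateTokens(t.content), 0);
  const kept: Turn[] = [];
  // Walk backwards from the newest turn and stop at the first one that does not fit.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(rest[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```

Dropping the oldest turns first is the simplest policy. Summarizing older turns instead is another option, at the cost of an extra model call.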
What’s Next?
We’re working on:
- Adding wake word detection (hands-free use).
- Custom personalities and voices.
- Offline mobile version using on-device models.