
Photo by Marsumilae
My Own Voice Assistant for free
A couple of days ago, I was thinking about creating my own voice assistant.
I wanted to practice creating a voice assistant and exploring its capabilities by myself, step by step.
My main goal was to discover the challenges and difficulties of building one and to learn how to overcome them.
I sat with these thoughts for a few days, and one evening I opened my laptop and started coding. Well, no: I opened Gemini instead.
Big Picture
Before proceeding to the actual app creation, I want to outline the components the AI voice assistant will consist of and the tech stack I will use for each part.
First step: user speech.
When you ask the voice assistant a question, it needs to hear you first and convert your speech to text.
Voice consists of soundwaves, so we have to capture those waves and convert them into a mathematical representation called a spectrogram.
For this part, we will use the Whisper model to transcribe voice into text.
Next, we take this text and feed it into a language model to get the answer to your question.
And finally, we have to convert this text answer back to voice so you can hear it.
At first glance, it seems like a simple, clear process.
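To make the "spectrogram" step less abstract, here is a minimal sketch of turning a waveform into a time-frequency image with NumPy. It is not Whisper's actual preprocessing (Whisper works on a log-Mel spectrogram). The function name power_spectrogram and the synthetic 440 Hz tone are illustrative assumptions; the 25 ms window and 10 ms hop mirror the framing Whisper uses at 16 kHz.

```python
import numpy as np

SAMPLE_RATE = 16000   # samples per second
FRAME_SIZE = 400      # 25 ms analysis window
HOP = 160             # 10 ms step between windows


def power_spectrogram(signal: np.ndarray) -> np.ndarray:
    """Return an array of shape (frames, frequency_bins) holding the power per bin."""
    window = np.hanning(FRAME_SIZE)
    n_frames = 1 + (len(signal) - FRAME_SIZE) // HOP
    frames = np.stack(
        [signal[i * HOP : i * HOP + FRAME_SIZE] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


# Stand-in for one second of microphone input: a pure 440 Hz tone
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

spec = power_spectrogram(tone)
loudest_bin = int(spec.mean(axis=0).argmax())
loudest_hz = loudest_bin * SAMPLE_RATE / FRAME_SIZE
print(spec.shape, loudest_hz)
```

Each row is one 25 ms slice of time and each column is a frequency band. For the test tone, the loudest column sits at 440 Hz, which is exactly the kind of structure the model reads from the "image".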
Process
I decided to use Python for this project since it has a lot of libraries and frameworks that can help with natural language processing and machine learning. I wanted to have a personal assistant that could potentially help me with various tasks. I started researching different technologies and platforms.
As I mentioned before, I started with the Whisper model for speech recognition.
Initially, I ran it as a CLI command:
whisper ./audio/Catching_Up_With_Friends.mp3 --model medium
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:08.400] Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie?
[00:08.400 --> 00:12.640] Yeah, good thanks. And your new baby, George, isn't it?
[00:12.640 --> 00:18.640] Ha, you've got a good memory. Yes, he's two now. What about you? Are you still working in the Health Centre?
[00:18.640 --> 00:27.640] Yes, for the time being, but we're moving in a couple of months. Anyhow, I'd better go, I'm late for work. Lovely to see you again.
[00:27.640 --> 00:29.800] Yeah, likewise. Keep in touch.
By default, you receive the output in the terminal, and additionally, five files are created: .json, .srt, .tsv, .txt, and .vtt.
The most informative format is JSON, which holds detailed information about each transcribed segment. Here is an example of the JSON output:
{
  "text": " Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie? Yeah, good thanks. And your new baby, George, isn't it? Ha, you've got a good memory. Yes, he's two now. What about you? Are you still working in the Health Centre? Yes, for the time being, but we're moving in a couple of months. Anyhow, I'd better go, I'm late for work. Lovely to see you again. Yeah, likewise. Keep in touch.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 8.4,
      "text": " Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie?",
      "tokens": [
        50364, 13048, 30, 3934, 30, 2421, 11, 309, 311, 668, 12357, 1670,
        286, 1036, 1866, 291, 13, 1012, 366, 291, 293, 23402, 30, 50784
      ],
      "temperature": 0.0,
      "avg_logprob": -0.20693382463957133,
      "compression_ratio": 1.5040322580645162,
      "no_speech_prob": 0.6112962365150452
    }
  ]
}
In this block, you can find info such as:
- start and end: The exact times, in seconds, at which the segment starts and ends.
- avg_logprob: The average log-probability of the segment's tokens, which describes the model's confidence in the transcription. Values closer to 0 mean higher confidence.
- tokens: How the AI actually reads the text. AI doesn't read letters; it reads "tokens" (chunks of words). For example, 13048 and 30 are the ID numbers in Whisper's dictionary that translate to the word " Jane" and the question mark. The first and last IDs in the list (50364 and 50784) are not words; they appear to be timestamp markers for the segment's start and end.
- no_speech_prob: The probability that this audio is actually just silence or background noise.
- compression_ratio: This measures how repetitive the text is. If this number gets very high (usually over 2.4), it means the model got stuck in a loop and started repeating the same words over and over (a common AI glitch).
- text: The final, human-readable words the model heard.
- seek: This is internal tracking for the model. Whisper processes audio in 30-second windows. seek: 0 means the model is currently analyzing the very first 30-second window of your audio file.
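These fields can also be used programmatically. The sketch below is an illustration, not Whisper's own code: the helper is_reliable is a hypothetical name, the 2.4 limit is the one mentioned above, and the 0.6 and -1.0 cutoffs are assumed starting values that you would tune on your own recordings.

```python
import json

# 2.4 comes from the description above; 0.6 and -1.0 are assumed starting values
NO_SPEECH_THRESHOLD = 0.6
LOGPROB_THRESHOLD = -1.0
COMPRESSION_RATIO_THRESHOLD = 2.4


def is_reliable(segment: dict) -> bool:
    """Heuristic check of one segment from Whisper's JSON output (illustrative)."""
    if segment["compression_ratio"] > COMPRESSION_RATIO_THRESHOLD:
        return False  # very repetitive text: the model probably looped
    probably_silence = segment["no_speech_prob"] > NO_SPEECH_THRESHOLD
    low_confidence = segment["avg_logprob"] < LOGPROB_THRESHOLD
    # Treat the segment as silence only when both signals agree
    return not (probably_silence and low_confidence)


# The segment from the JSON example above, trimmed to the fields we need
raw = """
{"segments": [{"id": 0, "start": 0.0, "end": 8.4,
  "text": " Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie?",
  "avg_logprob": -0.20693382463957133,
  "compression_ratio": 1.5040322580645162,
  "no_speech_prob": 0.6112962365150452}]}
"""
segments = json.loads(raw)["segments"]
kept = [s["text"].strip() for s in segments if is_reliable(s)]
print(kept)
```

Note that the sample segment has a no_speech_prob of about 0.61, which on its own would look like silence. It is kept because avg_logprob is high, which is why combining the two signals is safer than trusting either one alone.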
Briefly, how does it work? As I said before, it converts the soundwaves into a mathematical image called a spectrogram. It takes that image, figures out the tokens (which are numbers), and translates those numbers into text.
This describes the process of transcribing a static file. In our case, we need to catch a live voice stream and transcribe it chunk by chunk.
To do this, we will use the sounddevice library, which captures audio from the microphone in real time. We then feed that audio into the Whisper model as a one-dimensional float32 array sampled at 16,000 Hz.
Here is the step-by-step process:
- Record 3-5 seconds of audio from the microphone using sounddevice.
- Send this chunk to Whisper for transcription.
- Show the text.
- Repeat the process until the user stops recording.
This is the code for this process:
import sounddevice as sd
import numpy as np
import mlx_whisper

# 1. Settings (Whisper strictly requires these parameters)
SAMPLE_RATE = 16000  # Sample rate 16 kHz
CHUNK_DURATION = 4  # Record for 4 seconds
MODEL_PATH = "mlx-community/whisper-large-v3-turbo"

def record_and_transcribe():
    print(f"Speak... (Listening for {CHUNK_DURATION} seconds)")
    print("-" * 30)
    try:
        while True:
            # 2. Record audio from the microphone
            # Get data array
            audio_data = sd.rec(
                int(CHUNK_DURATION * SAMPLE_RATE),
                samplerate=SAMPLE_RATE,
                channels=1,
                dtype='float32'
            )
            sd.wait()  # Wait until these 4 seconds are recorded
            # 3. Convert 2D array to 1D (Whisper requirement)
            audio_data = audio_data.flatten()
            # 4. Send the array directly to the model (without saving to a file!)
            result = mlx_whisper.transcribe(
                audio_data,
                path_or_hf_repo=MODEL_PATH
            )
            # Get the text and strip extra spaces
            text = result["text"].strip()
            # Print the text if the model heard something (not just silence)
            if text:
                print(f"You said: {text}")
    except KeyboardInterrupt:
        print("\nStopped by user.")

if __name__ == "__main__":
    record_and_transcribe()
But we have a problem: the Whisper model is quite heavy, and it takes about 3-4 seconds to transcribe 4 seconds of audio. This means we have to wait for the model to finish before we can record the next chunk, and anything said during that wait is never recorded.
Therefore, we need to change the architecture of the app to make it continuous without blocking the recording process.
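A quick back-of-the-envelope calculation shows how much the linear loop misses. The 3.5 s transcription time is an assumed midpoint of the 3-4 s range above, and listening_share is just an illustrative helper.

```python
CHUNK_DURATION = 4.0    # seconds of audio recorded per pass
TRANSCRIBE_TIME = 3.5   # assumed midpoint of the 3-4 s measured above


def listening_share(record_s: float, transcribe_s: float) -> float:
    """Fraction of wall-clock time during which the microphone is recording."""
    return record_s / (record_s + transcribe_s)


share = listening_share(CHUNK_DURATION, TRANSCRIBE_TIME)
print(f"Listening {share:.0%} of the time, not recording for the other {1 - share:.0%}")
```

Under these assumptions the assistant hears only a little more than half of what is said, which is why recording and transcription have to be decoupled.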
How do we make it work?
Instead of keeping the process linear, we use the Producer-Consumer pattern, where one part of the code (the Producer) is responsible for recording audio and putting it into a queue, while another part (the Consumer) is responsible for taking audio from the queue and transcribing it.
- Recording stream (Producer): Continuously records audio and puts it into a queue.
- Transcription stream (Consumer): Waits until there is audio data in the queue, takes it, feeds it into the Whisper model for transcription, and then shows the text.
- Queue: Acts as a bridge between the producer and consumer, allowing them to work independently without blocking each other.
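The pattern can be tried in isolation before any audio is involved. The following is a minimal sketch with stand-ins: the producer pretends to be the microphone, the consumer pretends to be a slow Whisper call, and the chunk names, sleep times, and the None sentinel are all illustrative assumptions.

```python
import queue
import threading
import time

chunk_queue = queue.Queue()
transcripts = []


def producer(n_chunks):
    """Stand-in for the microphone: emits a chunk every 10 ms and never waits for the consumer."""
    for i in range(n_chunks):
        chunk_queue.put(f"chunk-{i}")
        time.sleep(0.01)
    chunk_queue.put(None)  # sentinel: no more audio


def consumer():
    """Stand-in for Whisper: slower than the producer, works through the backlog in order."""
    while True:
        chunk = chunk_queue.get()  # blocks until something is available
        if chunk is None:
            break
        time.sleep(0.03)  # pretend transcription takes 3x longer than recording
        transcripts.append(f"text for {chunk}")


worker = threading.Thread(target=consumer)
worker.start()
producer(5)
worker.join()
print(transcripts)
```

Nothing is dropped even though the consumer is slower, because the queue absorbs the backlog. The flip side is that an unbounded queue keeps growing if the consumer never catches up, so the real assistant still needs transcription to be faster than real time on average.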
To implement this, we will use standard Python libraries: queue and threading.
The code will look like this:
import sounddevice as sd
import numpy as np
import mlx_whisper
import queue
import threading

# Settings
SAMPLE_RATE = 16000
CHUNK_DURATION = 3  # Chunk size in seconds for one pass
MODEL_PATH = "mlx-community/whisper-large-v3-turbo"

# Create a queue for audio data
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    """This function is called automatically every time the microphone receives a portion of sound"""
    if status:
        print(f"Microphone error: {status}")
    # Add a copy of the data to the queue
    audio_queue.put(indata.copy())

def transcription_worker():
    """Thread that handles only speech recognition"""
    print(f"Model {MODEL_PATH} is ready. Start speaking...")
    # Accumulator for audio
    buffer = np.zeros(0, dtype='float32')
    required_samples = int(CHUNK_DURATION * SAMPLE_RATE)
    while True:
        # Get data from the queue
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        # If enough data for one segment is accumulated
        if len(buffer) >= required_samples:
            # Send to Whisper
            result = mlx_whisper.transcribe(buffer, path_or_hf_repo=MODEL_PATH)
            text = result["text"].strip()
            if text:
                print(f"{text}")
            # Clear the buffer (or leave a small overlap for better accuracy)
            buffer = np.zeros(0, dtype='float32')

def main():
    # 1. Create and start the recognition thread
    worker_thread = threading.Thread(target=transcription_worker, daemon=True)
    worker_thread.start()
    # 2. Open the input microphone stream
    # Use InputStream for continuous data receiving
    try:
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback):
            print("Microphone is on. Press Ctrl+C to exit.")
            while True:
                sd.sleep(1000)  # The main thread just "sleeps" while others are working
    except KeyboardInterrupt:
        print("\nProgram stopped.")

if __name__ == "__main__":
    main()
Now we have prepared the first part of our voice assistant: the part that transcribes our voice into text in real time without blocking the recording process. The next step is to feed this text into the language model to get an answer to our question.
I will use the local Llama 3.2 model, served through Ollama, for this purpose: it is very fast to launch and doesn't require any additional hassle with API keys, usage limits, and so on.
We have to add a new function, ask_llm, which will be responsible for taking text as an input, feeding it into the Llama 3.2 model, and getting the answer.
import sounddevice as sd
import numpy as np
import mlx_whisper
import queue
import threading
import ollama  # Add Ollama

SAMPLE_RATE = 16000
CHUNK_DURATION = 4
MODEL_PATH = "mlx-community/whisper-large-v3-turbo"
LLM_MODEL = "llama3.2"  # Specify the Ollama model

audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    if status:
        print(f"Error: {status}")
    audio_queue.put(indata.copy())

def ask_llm(text):
    """Separate function to communicate with the model"""
    print(f"\nAI is thinking about: '{text}'...")
    # Call Ollama
    response = ollama.chat(model=LLM_MODEL, messages=[
        {
            'role': 'system',
            'content': 'You are a smart and concise assistant. Answer briefly and in English.'
        },
        {
            'role': 'user',
            'content': text
        }
    ])
    # Output the model's response
    print(f"\nAnswer: {response['message']['content']}\n")

def transcription_worker():
    print(f"Whisper ({MODEL_PATH}) ready.")
    print(f"LLM ({LLM_MODEL}) ready. Speak...")
    buffer = np.zeros(0, dtype='float32')
    required_samples = int(CHUNK_DURATION * SAMPLE_RATE)
    while True:
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        if len(buffer) >= required_samples:
            result = mlx_whisper.transcribe(buffer, path_or_hf_repo=MODEL_PATH)
            text = result["text"].strip()
            # If we recognized text and it's longer than a couple of characters (to ignore noise)
            if text and len(text) > 3:
                print(f"\nYou said: {text}")
                # Start LLM in a separate thread to not block microphone recording!
                llm_thread = threading.Thread(target=ask_llm, args=(text,))
                llm_thread.start()
            buffer = np.zeros(0, dtype='float32')

def main():
    worker_thread = threading.Thread(target=transcription_worker, daemon=True)
    worker_thread.start()
    try:
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback):
            while True:
                sd.sleep(1000)
    except KeyboardInterrupt:
        print("\nStopped.")

if __name__ == "__main__":
    main()
And happy days, it works now!
Or, it doesn't fully work. During testing, I found a significant issue: even if you stop talking, the model continues to transcribe the silence, producing random tokens. These tokens get fed into the language model, resulting in random answers! For example, last time it "heard" "Thank you, Thank you..." on loop.
To solve this problem, we can take two steps:
- Programmatically tune our mlx_whisper settings.
- Add a silence filter to skip sound samples that are too quiet.
We will use both options.
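In the transcribe function in mlx_whisper, we can set no_speech_threshold, a cutoff for the no_speech_prob output: segments that score above it can be discarded as silence. We also pass condition_on_previous_text=False, so that one hallucinated phrase is not fed back to the model as context for the next chunk.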
Additionally, we will use RMS (Root Mean Square) to measure the loudness of the audio and set a threshold to filter out silent segments. If the sound we catch is too quiet, we will skip it.
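To get a feel for the numbers, here is a small sketch that computes RMS for two synthetic signals. The signals are made-up stand-ins (a 200 Hz tone at amplitude 0.1 for a voice, a tiny alternating signal for room noise), and 0.01 is only the starting threshold used in this post, not a universal value.

```python
import numpy as np

VOLUME_THRESHOLD = 0.01  # starting value; tune it for your microphone


def rms(samples: np.ndarray) -> float:
    """Root mean square: square each sample, average, take the square root."""
    return float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))


t = np.arange(16000) / 16000
# Stand-in for a voice: a 200 Hz tone with amplitude 0.1
speech_like = (0.1 * np.sin(2 * np.pi * 200 * t)).astype(np.float32)
# Stand-in for a quiet room: samples alternating between +0.002 and -0.002
room_noise = np.tile(np.array([0.002, -0.002], dtype=np.float32), 8000)

for name, signal in [("speech_like", speech_like), ("room_noise", room_noise)]:
    level = rms(signal)
    print(f"{name}: rms={level:.4f} keep={level >= VOLUME_THRESHOLD}")
```

The tone comes out at roughly 0.07 (amplitude divided by the square root of 2) and passes the filter, while the noise stays at 0.002 and is skipped. Real microphones vary a lot, so it is worth printing the RMS of your own silence and speech before picking a threshold.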
After these changes, our code will look like this:
def transcription_worker():
    print(f"Whisper ({MODEL_PATH}) ready.")
    print(f"LLM ({LLM_MODEL}) ready. Speak...")
    buffer = np.zeros(0, dtype='float32')
    required_samples = int(CHUNK_DURATION * SAMPLE_RATE)
    # VOLUME THRESHOLD
    # Increase this number if the microphone picks up breathing or coolers (e.g., 0.015)
    # Decrease it if it ignores your quiet voice (e.g., 0.005)
    VOLUME_THRESHOLD = 0.01
    while True:
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        if len(buffer) >= required_samples:
            # 1. Calculate the volume level (RMS)
            volume_level = np.sqrt(np.mean(buffer**2))
            # 2. Check if anyone is speaking at all
            if volume_level < VOLUME_THRESHOLD:
                # It's just silence. Clear the buffer and continue without touching Whisper!
                buffer = np.zeros(0, dtype='float32')
                continue
            # 3. If the sound is loud, send it to Whisper with additional parameters
            result = mlx_whisper.transcribe(
                buffer,
                path_or_hf_repo=MODEL_PATH,
                condition_on_previous_text=False,
                no_speech_threshold=0.7  # same value as in the final listing
            )
            text = result["text"].strip()
            # Additional "crutch filter" for the most popular hallucinations
            stop_words = ["Thank you.", "Thank you", "Thanks.", "Subscribe"]
            if text and text not in stop_words and len(text) > 3:
                print(f"\nYou said: {text} (Volume: {volume_level:.3f})")
                llm_thread = threading.Thread(target=ask_llm, args=(text,))
                llm_thread.start()
            buffer = np.zeros(0, dtype='float32')
Final Part
The last important part of our voice assistant is the text-to-speech component, which will allow us to hear the answer from the language model.
This last step presents one more challenge: echo. When our model starts speaking to us, our microphone starts listening to it and hears the echo, putting the whole system into an infinite loop.
Since I use my Mac to run this app, I will use a zero-cost solution: macOS's native speech capabilities. macOS has built-in, high-quality speech synthesis. Because it is built-in, we can avoid running yet another memory-consuming model to convert text to speech. It works instantly, has good voice quality, and doesn't require any additional configuration.
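As a rough sketch, the built-in synthesizer can be driven from Python through the say command-line tool. The helper names build_say_command and speak_once, and the fallback for machines without say, are assumptions made for illustration; the -v flag selects a voice and -r sets the speaking rate in words per minute.

```python
import shutil
import subprocess


def build_say_command(text, voice=None, rate_wpm=None):
    """Assemble the argument list for macOS's `say` (nothing is executed here)."""
    command = ["say"]
    if voice:
        command += ["-v", voice]            # voice name, as listed by `say -v '?'`
    if rate_wpm:
        command += ["-r", str(rate_wpm)]    # speaking rate in words per minute
    command.append(text)
    return command


def speak_once(text, voice=None):
    """Speak the text and block until done; returns False where `say` is unavailable."""
    if shutil.which("say") is None:
        print(f"[no `say` on this system] {text}")
        return False
    subprocess.run(build_say_command(text, voice), check=True)
    return True


print(build_say_command("Hello from your assistant", voice="Serena"))
```

Passing the arguments as a list avoids shell quoting problems with whatever the language model returns. Because subprocess.run blocks until the speech has finished, the caller knows exactly when it is safe to start listening again, which is what the echo problem requires.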
The speak function itself appears in the full listing further down. Before that, here is the overlap variant of transcription_worker hinted at earlier: instead of clearing the buffer after each pass, it keeps the last second of audio as context for the next one.
def transcription_worker():
    # ... settings ...
    buffer = np.zeros(0, dtype='float32')
    # How many seconds we leave for context (overlap)
    OVERLAP_SAMPLES = int(1.0 * SAMPLE_RATE)  # 1 second
    REQUIRED_SAMPLES = int(CHUNK_DURATION * SAMPLE_RATE)
    while True:
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        if len(buffer) >= REQUIRED_SAMPLES:
            # Recognize the entire accumulated buffer
            result = mlx_whisper.transcribe(buffer, path_or_hf_repo=MODEL_PATH)
            text = result["text"].strip()
            if text:
                print(f"{text}")
            # KEY MOMENT:
            # Instead of clearing to zero, leave the last second for the next time
            buffer = buffer[-OVERLAP_SAMPLES:]
Polishing
During testing, I found that we are limited by the buffer size. Because we cut the audio into fixed-size chunks, anything we say for longer than 4 seconds is split mid-sentence, so the model never sees the whole question and part of our speech is effectively lost. To solve this, we should listen to the user until they take a pause (e.g., 0.5s), and only then send the audio to the Whisper model for transcription. To achieve this, we will use VAD (Voice Activity Detection), specifically Silero VAD.
How will it work?
- We will continuously listen to the user and analyze the audio in chunks of about 30 ms (512 samples at 16 kHz, which is 32 ms).
- If the neural network detects that the user is speaking, we will start accumulating the audio in the buffer.
- If the neural network detects that the user has stopped speaking for more than roughly half a second (more than 20 consecutive silent chunks, about 0.65 s, in the code that follows), we can assume the user has finished their question.
- We will send the accumulated audio to the Whisper model for transcription and then clear the buffer for the next question.
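The pause-detection logic in these steps can be tested without a microphone or a neural network. The sketch below is a simplified illustration: find_utterances is a hypothetical helper that takes a list of per-chunk speech probabilities (the kind of number a VAD model returns) and reports where each utterance starts and ends. The probabilities are made up.

```python
CHUNK_SECONDS = 512 / 16000   # one VAD window: 512 samples at 16 kHz = 0.032 s
SPEECH_THRESHOLD = 0.5        # probability above which a chunk counts as speech
SILENCE_LIMIT = 20            # silent chunks tolerated before closing an utterance


def find_utterances(probabilities):
    """Return (first_chunk, last_voiced_chunk) index pairs, one per detected utterance."""
    utterances = []
    start = None
    last_voiced = None
    silent_run = 0
    for i, p in enumerate(probabilities):
        if p > SPEECH_THRESHOLD:
            if start is None:
                start = i              # speech begins: open a new utterance
            last_voiced = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > SILENCE_LIMIT:
                utterances.append((start, last_voiced))   # the pause was long enough
                start, silent_run = None, 0
    if start is not None:
        utterances.append((start, last_voiced))           # stream ended mid-utterance
    return utterances


# Hypothetical VAD output: 30 voiced chunks, a 5-chunk breath, 30 voiced, long silence
probs = [0.9] * 30 + [0.1] * 5 + [0.9] * 30 + [0.1] * 40
print(find_utterances(probs))
print(f"pause needed to end an utterance: {(SILENCE_LIMIT + 1) * CHUNK_SECONDS:.3f} s")
```

The short breath in the middle does not split the question, because five silent chunks are well under the limit. Only the long silence at the end closes the utterance, so the whole question reaches Whisper as one piece regardless of its length.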
Here is the updated code with the VAD addition:
import sounddevice as sd
import numpy as np
import mlx_whisper
import queue
import threading
import ollama
import subprocess
import torch

# Load Silero VAD
model_vad, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model='silero_vad',
                                  force_reload=False)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLE_RATE: int = 16000
VAD_WINDOW: int = 512  # Samples per VAD chunk (32 ms at 16 kHz)
SILENCE_LIMIT = 20  # Silent chunks after which the utterance is considered finished
MODEL_PATH: str = "mlx-community/whisper-large-v3-turbo"
LLM_MODEL: str = "glm-5.1:cloud"

# Two queues: one for mic audio, another for Whisper text
audio_queue = queue.Queue()
text_queue = queue.Queue()  # Transcribed text waiting for the LLM
is_ai_speaking = threading.Event()

def drain_audio_queue():
    """Throw away everything currently waiting in the audio queue"""
    try:
        while True:
            audio_queue.get_nowait()
    except queue.Empty:
        pass

def audio_callback(indata, frames, time, status):
    if status:
        print(f"Microphone error: {status}")
    # If AI is speaking, just drop the sound
    if is_ai_speaking.is_set():
        return
    audio_queue.put(indata.copy())

def speak(text):
    print("Assistant is speaking...")
    # Block the microphone
    is_ai_speaking.set()
    try:
        subprocess.run(["say", "-v", "Serena", text])
    finally:
        # Clear leftover sound that might have entered the queue in the last millisecond
        drain_audio_queue()
        is_ai_speaking.clear()
        print("Microphone is listening again...")

def ask_llm(text):
    print(f"\nAI is thinking about: '{text}'...")
    response = ollama.chat(LLM_MODEL, messages=[
        {
            'role': 'system',
            'content': 'You are a helpful assistant that answers questions based on user input. Please provide concise and informative responses. Make sure to address the user\'s query directly and clearly. Keep your answers short. Don\'t use any formatting, just plain text. If you don\'t know the answer, say you don\'t know'
        },
        {
            'role': 'user',
            'content': text
        }
    ])
    # Read the answer via attribute access on the response object
    content_answer: str = response.message.content.strip()
    print(f"LLM Answer: {content_answer}")
    speak(content_answer)

def ai_assistant_worker():
    """Ensures the AI answers strictly in turn, without voices overlapping"""
    while True:
        # Get text from the queue. If empty, the thread just waits
        text = text_queue.get()
        ask_llm(text)

def transcription_worker():
    print(f"Whisper ({MODEL_PATH}) ready.")
    print(f"LLM ({LLM_MODEL}) ready. Speak...")
    full_audio_buffer = []
    silence_counter = 0
    is_speaking = False
    # For the VAD neural network, the confidence threshold is a probability. 0.5 means 50%
    VAD_THRESHOLD = 0.5
    while True:
        # If AI started speaking, we "forget" everything we started listening to before
        if is_ai_speaking.is_set():
            full_audio_buffer = []
            is_speaking = False
            silence_counter = 0
            # Empty the audio queue, then wait briefly instead of spinning at full speed
            drain_audio_queue()
            sd.sleep(50)
            continue
        chunk = audio_queue.get()
        chunk_flat = chunk.flatten()
        new_confidence = model_vad(torch.from_numpy(chunk_flat), SAMPLE_RATE).item()
        if new_confidence > VAD_THRESHOLD:
            if not is_speaking:
                is_speaking = True
                print("Recording started...")
            full_audio_buffer.append(chunk_flat)
            silence_counter = 0
        else:
            if is_speaking:
                full_audio_buffer.append(chunk_flat)
                silence_counter += 1
                if silence_counter > SILENCE_LIMIT:
                    print("Recording stopped. Processing...")
                    audio_to_process = np.concatenate(full_audio_buffer)
                    result = mlx_whisper.transcribe(
                        audio_to_process,
                        path_or_hf_repo=MODEL_PATH,
                        language='en',
                        condition_on_previous_text=False,
                        no_speech_threshold=0.7
                    )
                    text = result['text'].strip()
                    stop_words = ["Thank you.", "Thank you", "Thanks.", "Subscribe"]
                    if text and text not in stop_words and len(text) > 3:
                        print(f"\nYou said: {text}")
                        # Instead of creating a new thread, just put text in the queue
                        text_queue.put(text)
                    full_audio_buffer = []
                    is_speaking = False
                    silence_counter = 0

def main():
    # 1. Start the Assistant thread (LLM + Voice)
    assistant_thread = threading.Thread(target=ai_assistant_worker, daemon=True)
    assistant_thread.start()
    # 2. Start the Recognition thread (Whisper)
    worker_thread = threading.Thread(target=transcription_worker, daemon=True)
    worker_thread.start()
    # 3. Turn on the microphone
    try:
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback, blocksize=VAD_WINDOW):
            print("Microphone is on. Press Ctrl+C to exit.")
            while True:
                sd.sleep(1000)
    except KeyboardInterrupt:
        print("\nProgram stopped.")

if __name__ == "__main__":
    main()
Conclusion
In this post, I shared my path to creating a ready-to-use voice assistant. You would be right to say that there are many things to improve and add to make it faster and more useful, but I think the main goal was achieved. That goal was to practice creating a real-world application and learn something new along the way.
Nowadays, we've significantly lowered the barrier for real challenges: the searching, debugging, and figuring out why something doesn't work. Now, you just enter your question and instantly get an answer. We no longer have to do everything by ourselves, nor do we have to build incredibly complex things from scratch just to feel the satisfaction of solving a challenge.
The days of coding just because you love the process of crafting something with your own hands are changing. Now, you are moving more into a managerial role over the processes; you become more like an operator of a machine. Of course, engineering expertise is still highly required because you need to know what you are creating (at least at a high level) and how you will extend and maintain it tomorrow.
But to ride the wave of how deeply AI is currently integrated into the engineering process and the day-to-day work of an engineer, you should use it yourself as much as you can (until all your tokens are used up, ha ha, just joking). It's the best way to understand how it works, what its limitations are, and how to use it most efficiently.