
My Own Voice Assistant for free
A couple of days ago, I was thinking about creating my own voice assistant.
I wanted to practice creating a voice assistant and exploring its capabilities by myself, step by step.
My main goal was to discover the challenges and difficulties of building one and to learn how to overcome them.
I sat with these thoughts for a few days, and one evening I opened my laptop and started coding. Scratch that: I opened Gemini.
Big Picture
Before proceeding to the actual app creation, I want to outline the components the AI voice assistant will consist of and the tech stack I will use for each part.
First step: user speech.
When you ask the voice assistant a question, it needs to hear you first and convert your speech to text.
Voice consists of soundwaves, so we have to capture those waves and convert them into a mathematical representation called a spectrogram.
For this part, we will use the Whisper model to transcribe voice into text.
Next, we take this text and feed it into a language model to get the answer to your question.
And finally, we have to convert this text answer back to voice so you can hear it.
At first glance, it seems like a simple, clear process.
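The whole loop can be sketched in a few lines of Python. Everything here is a placeholder: the function names and canned strings are made up for illustration, and the rest of this post fills in each stage with a real model.

```python
def speech_to_text(audio_chunk: bytes) -> str:
    """Placeholder for Whisper: audio in, transcript out."""
    return "what time is it"

def ask_llm(question: str) -> str:
    """Placeholder for the language model: question in, answer out."""
    return f"You asked: {question}"

def text_to_speech(answer: str) -> None:
    """Placeholder for the TTS engine: here we just print."""
    print(f"🔊 {answer}")

def assistant_loop(audio_chunk: bytes) -> str:
    text = speech_to_text(audio_chunk)   # 1. hear the user
    answer = ask_llm(text)               # 2. think about the answer
    text_to_speech(answer)               # 3. say it back
    return answer

assistant_loop(b"\x00\x01")  # prints "🔊 You asked: what time is it"
```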
Process
I decided to use Python for this project, since it has a rich ecosystem of libraries and frameworks for natural language processing and machine learning. With a personal assistant that could eventually help me with various tasks as the end goal, I started researching the available technologies and platforms.
As I mentioned before, I started with the Whisper model for speech recognition.
Initially, I ran it as a CLI command:
whisper ./audio/Catching_Up_With_Friends.mp3 --model medium
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:08.400] Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie?
[00:08.400 --> 00:12.640] Yeah, good thanks. And your new baby, George, isn't it?
[00:12.640 --> 00:18.640] Ha, you've got a good memory. Yes, he's two now. What about you? Are you still working in the Health Centre?
[00:18.640 --> 00:27.640] Yes, for the time being, but we're moving in a couple of months. Anyhow, I'd better go, I'm late for work. Lovely to see you again.
[00:27.640 --> 00:29.800] Yeah, likewise. Keep in touch.

By default, you receive the output in the terminal, and additionally, five files are created: .json, .srt, .tsv, .txt, and .vtt.
The most informative format, where you can find detailed information about each sample, is JSON. Here is an example of the JSON output:
{
  "text": " Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie? Yeah, good thanks. And your new baby, George, isn't it? Ha, you've got a good memory. Yes, he's two now. What about you? Are you still working in the Health Centre? Yes, for the time being, but we're moving in a couple of months. Anyhow, I'd better go, I'm late for work. Lovely to see you again. Yeah, likewise. Keep in touch.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 8.4,
      "text": " Jane? Mark? Hi, it's been ages since I last saw you. How are you and Jackie?",
      "tokens": [50364, 13048, 30, 3934, 30, 2421, 11, 309, 311, 668, 12357, 1670,
                 286, 1036, 1866, 291, 13, 1012, 366, 291, 293, 23402, 30, 50784],
      "temperature": 0.0,
      "avg_logprob": -0.20693382463957133,
      "compression_ratio": 1.5040322580645162,
      "no_speech_prob": 0.6112962365150452
    }
  ]
}

In this block, you can find info such as:
- start and end: The exact time the segment started and ended.
- avg_logprob: Describes the model's confidence in its transcription. It is an average log-probability, so values closer to 0 mean higher confidence.
- tokens: How the AI actually reads the text. AI doesn't read letters; it reads "tokens" (chunks of words). In the segment above, 13048 and 30 are the ID numbers in Whisper's dictionary for the word " Jane" and the question mark, while the first and last tokens (50364 and 50784) encode the timestamps 0.00 and 8.40 seconds.
- no_speech_prob: The probability that this audio is actually just silence or background noise.
- compression_ratio: This measures how repetitive the text is. If this number gets very high (usually over 2.4), it means the model got stuck in a loop and started repeating the same words over and over (a common AI glitch).
- text: The final, human-readable words the model heard.
- seek: Internal tracking for the model. Whisper processes audio in 30-second windows; seek: 0 means the model is currently analyzing the very first 30-second window of your audio file.
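The compression_ratio check is easy to reproduce yourself. It is (approximately) the length of the text divided by the length of its zlib-compressed form, so a looping hallucination stands out immediately:

```python
import zlib

def compression_ratio(text: str) -> float:
    # Repetitive text compresses very well, which pushes this ratio up.
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))

normal = "Jane? Mark? Hi, it's been ages since I last saw you."
looped = "Thank you. " * 40  # a typical hallucination loop

print(round(compression_ratio(normal), 2))  # a healthy value, well under 2.4
print(round(compression_ratio(looped), 2))  # far above the 2.4 danger line
```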
Briefly, how does it work? As I said before, it converts the soundwaves into a mathematical image called a spectrogram. It takes that image, figures out the tokens (which are numbers), and translates those numbers into text.
This describes the process of transcribing a static file. In our case, we need to catch a live voice stream and transcribe it chunk by chunk.
To do this, we will use the sounddevice library, which allows us to capture audio from the microphone in real time and feed it into the Whisper model for transcription as a one-dimensional array of numbers in float32 format with a frequency of 16,000 Hz.
Here is the step-by-step process:
- Record 3-5 seconds of audio from the microphone using sounddevice.
- Send this chunk to Whisper for transcription.
- Show the text.
- Repeat the process until the user stops recording.
This is the code for this process:
import sounddevice as sd
import numpy as np
import mlx_whisper

# 1. Settings (Whisper strictly requires exactly these parameters)
SAMPLE_RATE = 16000   # 16 kHz sampling rate
CHUNK_DURATION = 4    # record in 4-second chunks
MODEL_PATH = "mlx-community/whisper-large-v3-turbo"

def record_and_transcribe():
    print(f"🎙️ Speak... (listening in {CHUNK_DURATION}-second chunks)")
    print("-" * 30)
    try:
        while True:
            # 2. Record sound from the microphone and get the data array
            audio_data = sd.rec(
                int(CHUNK_DURATION * SAMPLE_RATE),
                samplerate=SAMPLE_RATE,
                channels=1,
                dtype='float32'
            )
            sd.wait()  # wait until these 4 seconds are recorded

            # 3. Flatten the two-dimensional array to one dimension (Whisper requires this)
            audio_data = audio_data.flatten()

            # 4. Send the array straight to the model (no saving to a file!)
            result = mlx_whisper.transcribe(
                audio_data,
                path_or_hf_repo=MODEL_PATH
            )

            # Get the text and strip extra whitespace
            text = result["text"].strip()

            # Print the text only if the model heard something (not just silence)
            if text:
                print(f"You said: {text}")
    except KeyboardInterrupt:
        print("\n🛑 Stopped by user.")

if __name__ == "__main__":
    record_and_transcribe()

But we have a problem: the Whisper model is quite heavy, and it takes about 3-4 seconds to transcribe 4 seconds of audio. This means we have to wait for the model to finish transcribing before we can record the next chunk.
Therefore, we need to change the architecture of the app to make it continuous without blocking the recording process.
How do we make it work?
Instead of keeping the process linear, we use the Producer-Consumer pattern, where one part of the code (the Producer) is responsible for recording audio and putting it into a queue, while another part (the Consumer) is responsible for taking audio from the queue and transcribing it.
- Recording stream (Producer): Continuously records audio and puts it into a queue.
- Transcription stream (Consumer): Waits until there is audio data in the queue, takes it, feeds it into the Whisper model for transcription, and then shows the text.
- Queue: Acts as a bridge between the producer and consumer, allowing them to work independently without blocking each other.
To implement this, we will use standard Python libraries: queue and threading.
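Before wiring in the real microphone and model, the pattern itself can be demonstrated with nothing but the standard library. In this toy sketch the producer stands in for sounddevice and the consumer stands in for Whisper; the chunk strings and sleeps are made up to simulate timing:

```python
import queue
import threading
import time

audio_queue = queue.Queue()
results = []

def producer():
    # "Recording" never waits for the model; it just drops chunks into the queue.
    for i in range(3):
        audio_queue.put(f"chunk-{i}")
        time.sleep(0.01)
    audio_queue.put(None)  # sentinel: no more audio

def consumer():
    while True:
        chunk = audio_queue.get()  # blocks until a chunk is available
        if chunk is None:
            break
        time.sleep(0.05)           # simulate slow transcription
        results.append(f"transcribed {chunk}")

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(results)  # ['transcribed chunk-0', 'transcribed chunk-1', 'transcribed chunk-2']
```

Note that the producer finishes long before the consumer does; the queue absorbs the difference in speed, which is exactly what we need for the microphone.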
The code will look like this:
import sounddevice as sd
import numpy as np
import mlx_whisper
import queue
import threading

# Settings
SAMPLE_RATE = 16000
CHUNK_DURATION = 3  # chunk size in seconds for a single pass
MODEL_PATH = "mlx-community/whisper-large-v3-turbo"

# Queue for the audio data
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    """Called automatically every time the microphone delivers a portion of sound."""
    if status:
        print(f"Microphone error: {status}")
    # Put a copy of the data into the queue
    audio_queue.put(indata.copy())

def transcription_worker():
    """The thread that does nothing but transcription."""
    print(f"🚀 Model {MODEL_PATH} is ready. Start speaking...")
    # Accumulator for audio
    buffer = np.zeros(0, dtype='float32')
    required_samples = int(CHUNK_DURATION * SAMPLE_RATE)
    while True:
        # Take data from the queue
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        # Once we have accumulated enough for one segment
        if len(buffer) >= required_samples:
            # Send it to Whisper
            result = mlx_whisper.transcribe(buffer, path_or_hf_repo=MODEL_PATH)
            text = result["text"].strip()
            if text:
                print(f"📝 {text}")
            # Clear the buffer (or keep a small overlap for better accuracy)
            buffer = np.zeros(0, dtype='float32')

def main():
    # 1. Create and start the transcription thread
    worker_thread = threading.Thread(target=transcription_worker, daemon=True)
    worker_thread.start()
    # 2. Open the microphone input stream
    # InputStream delivers data continuously
    try:
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback):
            print("🎙️ Microphone is on. Press Ctrl+C to exit.")
            while True:
                sd.sleep(1000)  # the main thread just sleeps while the others work
    except KeyboardInterrupt:
        print("\n🛑 Stopped by user.")

if __name__ == "__main__":
    main()

Now we have the first part of our voice assistant: real-time voice-to-text transcription that doesn't block the recording process. The next step is to feed this text into the language model to get an answer to our question.
I will use the local model Llama 3.2 for this purpose, as it is very fast to launch and doesn't require any additional hassle with API keys, usage limits, and so on.
We have to add a new function, ask_llm, which will be responsible for taking text as an input, feeding it into the Llama 3.2 model, and getting the answer.
import sounddevice as sd
import numpy as np
import mlx_whisper
import queue
import threading
import ollama  # add Ollama

SAMPLE_RATE = 16000
CHUNK_DURATION = 4
MODEL_PATH = "mlx-community/whisper-large-v3-turbo"
LLM_MODEL = "llama3.2"  # the Ollama model to use

audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    if status:
        print(f"Error: {status}")
    audio_queue.put(indata.copy())

def ask_llm(text):
    """A separate function for talking to the model."""
    print(f"\n🤔 The AI is thinking about: '{text}'...")
    # Call Ollama
    response = ollama.chat(model=LLM_MODEL, messages=[
        {
            'role': 'system',
            'content': 'You are a smart and concise assistant. Answer briefly.'
        },
        {
            'role': 'user',
            'content': text
        }
    ])
    # Print the model's answer in green for visibility
    print(f"\n🟢 Answer: {response['message']['content']}\n")

def transcription_worker():
    print(f"🚀 Whisper ({MODEL_PATH}) is ready.")
    print(f"🧠 LLM ({LLM_MODEL}) is ready. Speak...")
    buffer = np.zeros(0, dtype='float32')
    required_samples = int(CHUNK_DURATION * SAMPLE_RATE)
    while True:
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        if len(buffer) >= required_samples:
            result = mlx_whisper.transcribe(buffer, path_or_hf_repo=MODEL_PATH)
            text = result["text"].strip()
            # Only react if we recognized text longer than a couple of characters
            # (this ignores noise)
            if text and len(text) > 3:
                print(f"\n🎙️ You said: {text}")
                # Run the LLM in a separate thread so it doesn't block the microphone!
                llm_thread = threading.Thread(target=ask_llm, args=(text,))
                llm_thread.start()
            buffer = np.zeros(0, dtype='float32')

def main():
    worker_thread = threading.Thread(target=transcription_worker, daemon=True)
    worker_thread.start()
    try:
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback):
            while True:
                sd.sleep(1000)
    except KeyboardInterrupt:
        print("\n🛑 Stopped.")

if __name__ == "__main__":
    main()

And happy days—it works now!
Or, it doesn't fully work. During testing, I found a significant issue: even if you stop talking, the model continues to transcribe the silence, producing random tokens. These tokens get fed into the language model, resulting in random answers! For example, last time it "heard" "Thank you, Thank you..." on loop.
To solve this problem, we can take two steps:
- Programmatically tune our mlx_whisper settings.
- Add a silence filter to skip sound samples that are too quiet.
We will use both options.
In the transcribe function in mlx_whisper, we can configure no_speech_threshold to set a limit for the no_speech_prob output.
Additionally, we will use RMS (Root Mean Square) to measure the loudness of the audio and set a threshold to filter out silent segments. If the sound we catch is too quiet, we will skip it.
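RMS itself is a one-liner. Here is a dependency-free sketch (the sample values are invented) showing how the threshold splits near-silence from speech-level audio:

```python
import math

VOLUME_THRESHOLD = 0.01  # the same kind of tunable threshold used below

def rms(samples: list[float]) -> float:
    # Root Mean Square: square each sample, average them, take the square root.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

near_silence = [0.001, -0.002, 0.001, 0.0]
speech_like = [0.2, -0.3, 0.25, -0.1]

print(rms(near_silence) < VOLUME_THRESHOLD)  # True  -> skip this chunk
print(rms(speech_like) < VOLUME_THRESHOLD)   # False -> send it to Whisper
```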
After these changes, our code will look like this:
def transcription_worker():
    print(f"🚀 Whisper ({MODEL_PATH}) is ready.")
    print(f"🧠 LLM ({LLM_MODEL}) is ready. Speak...")
    buffer = np.zeros(0, dtype='float32')
    required_samples = int(CHUNK_DURATION * SAMPLE_RATE)

    # VOLUME THRESHOLD
    # Increase this number if the microphone picks up breathing or fans (e.g. 0.015)
    # Decrease it if it ignores your quiet voice (e.g. 0.005)
    VOLUME_THRESHOLD = 0.01

    while True:
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        if len(buffer) >= required_samples:
            # 1. Measure the volume level (RMS)
            volume_level = np.sqrt(np.mean(buffer**2))

            # 2. Check whether anyone is speaking at all
            if volume_level < VOLUME_THRESHOLD:
                # It's just silence. Clear the buffer and move on without touching Whisper!
                buffer = np.zeros(0, dtype='float32')
                continue

            # 3. If the sound is loud enough, send it to Whisper with extra parameters
            result = mlx_whisper.transcribe(
                buffer,
                path_or_hf_repo=MODEL_PATH,
                condition_on_previous_text=False
            )
            text = result["text"].strip()

            # An extra "crutch filter" against the most common hallucinations
            stop_words = ["Thank you.", "Thank you", "Thanks.", "Subscribe"]
            if text and text not in stop_words and len(text) > 3:
                print(f"\n🎙️ You said: {text} (Volume: {volume_level:.3f})")
                llm_thread = threading.Thread(target=ask_llm, args=(text,))
                llm_thread.start()
            buffer = np.zeros(0, dtype='float32')

Final Part
The last important part of our voice assistant is the text-to-speech component, which will allow us to hear the answer from the language model.
This last step presents one more challenge: echo. When our model starts speaking to us, our microphone starts listening to it and hears the echo, putting the whole system into an infinite loop.
Since I use my Mac to run this app, I will use a zero-cost solution: macOS's native speech capabilities. macOS has built-in, high-quality speech synthesis. Because it is built-in, we can avoid running yet another memory-consuming model to convert text to speech. It works instantly, has good voice quality, and doesn't require any additional configuration.
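A minimal sketch of that idea: wrap the built-in say command with subprocess, guarded so it only actually speaks on macOS. The helper function name and the fallback behavior are my own additions; the voice flag matches the one used in the final script.

```python
import subprocess
import sys

def build_say_command(text: str, voice: str = "Serena") -> list[str]:
    # `say` is the built-in macOS speech synthesizer; -v picks the voice.
    return ["say", "-v", voice, text]

def speak(text: str) -> None:
    if sys.platform != "darwin":
        # Not on macOS: fall back to printing instead of speaking.
        print(f"(no TTS on this platform) {text}")
        return
    subprocess.run(build_say_command(text))  # blocks until speech finishes

print(build_say_command("Hello!"))  # ['say', '-v', 'Serena', 'Hello!']
```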
def transcription_worker():
    # ... setup ...
    buffer = np.zeros(0, dtype='float32')
    # How many seconds we keep for context (overlap)
    OVERLAP_SAMPLES = int(1.0 * SAMPLE_RATE)  # 1 second
    REQUIRED_SAMPLES = int(CHUNK_DURATION * SAMPLE_RATE)
    while True:
        new_data = audio_queue.get()
        buffer = np.append(buffer, new_data.flatten())
        if len(buffer) >= REQUIRED_SAMPLES:
            # Transcribe the entire accumulated buffer
            result = mlx_whisper.transcribe(buffer, path_or_hf_repo=MODEL_PATH)
            text = result["text"].strip()
            if text:
                print(f"📝 {text}")
            # THE KEY MOMENT:
            # Instead of clearing the buffer to zero, keep the last second for next time
            buffer = buffer[-OVERLAP_SAMPLES:]

Polishing
During testing, I found that we are limited by the fixed buffer size: if you speak for longer than 4 seconds, your sentence gets cut in the middle and transcribed in pieces. To solve this, we should listen to the user until they take a pause (e.g., 0.5s), and only then send the audio to the Whisper model for transcription. To achieve this, we will use VAD (Voice Activity Detection), specifically Silero VAD.
How will it work?
- We will continuously listen to the user and analyze the audio in 30ms chunks.
- If the neural network detects that the user is speaking, we will start accumulating the audio in the buffer.
- If the neural network detects that the user has stopped speaking for more than 0.5s, we can assume the user has finished their question.
- We will send the accumulated audio to the Whisper model for transcription and then clear the buffer for the next question.
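A quick back-of-the-envelope check of those numbers, using the VAD constants from the script that follows (512-sample windows at 16 kHz, 20 silent windows to cut):

```python
SAMPLE_RATE = 16000   # Hz
VAD_WINDOW = 512      # samples per VAD decision
SILENCE_LIMIT = 20    # consecutive "silent" windows before we cut

window_seconds = VAD_WINDOW / SAMPLE_RATE
pause_seconds = SILENCE_LIMIT * window_seconds

print(round(window_seconds * 1000))  # 32   -> each VAD decision covers 32 ms
print(round(pause_seconds, 2))       # 0.64 -> roughly 0.6 s of silence ends the utterance
```

So with these constants the actual cut-off pause is about 0.64 seconds, slightly above the 0.5 seconds mentioned above; lower SILENCE_LIMIT to make the assistant react faster.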
Here is the updated code with the VAD addition:
import sounddevice as sd
import numpy as np
import mlx_whisper
import queue
import threading
import ollama
import subprocess
import torch

# Load the VAD model
model_vad, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model='silero_vad',
                                  force_reload=False)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLE_RATE: int = 16000
VAD_WINDOW: int = 512
SILENCE_LIMIT = 20
MODEL_PATH: str = "mlx-community/whisper-large-v3-turbo"
LLM_MODEL: str = "glm-5.1:cloud"

# Two queues: one for microphone audio, one for text from Whisper
audio_queue = queue.Queue()
text_queue = queue.Queue()  # <--- NEW QUEUE FOR TEXT
is_ai_speaking = threading.Event()

def audio_callback(indata, frames, time, status):
    if status:
        print(f"Microphone error: {status}")
    # If the AI is speaking, simply discard the sound
    if is_ai_speaking.is_set():
        return
    audio_queue.put(indata.copy())

def speak(text):
    print("🔊 The assistant is speaking...")
    # Mute the microphone
    is_ai_speaking.set()
    try:
        subprocess.run(["say", "-v", "Serena", text])
    finally:
        # Drain any leftover audio that slipped into the queue in the last millisecond
        while not audio_queue.empty():
            audio_queue.get_nowait()
        is_ai_speaking.clear()
        print("🎙️ The microphone is listening again...")

def ask_llm(text):
    print(f"\n🤔 The AI is thinking about: '{text}'...")
    response = ollama.chat(LLM_MODEL, messages=[
        {
            'role': 'system',
            'content': "You are a helpful assistant that answers questions based on user input. "
                       "Please provide concise and informative responses. Make sure to address "
                       "the user's query directly and clearly. Keep your answers short. "
                       "Don't use any formatting, just plain text. "
                       "If you don't know the answer, say you don't know."
        },
        {
            'role': 'user',
            'content': text
        }
    ])
    # The response is an object, so we access its attributes directly
    content_answer: str = response.message.content.strip()
    print(f"🤖 LLM answer: {content_answer}")
    speak(content_answer)

def ai_assistant_worker():
    """NEW THREAD: guarantees the AI answers strictly in order, without overlapping voices."""
    while True:
        # Take text from the queue; if the queue is empty, the thread simply waits
        text = text_queue.get()
        ask_llm(text)

def transcription_worker():
    print(f"Whisper ({MODEL_PATH}) is ready.")
    print(f"🧠 LLM ({LLM_MODEL}) is ready. Speak...")
    full_audio_buffer = []
    silence_counter = 0
    is_speaking = False
    # For the VAD network the confidence threshold is a probability; 0.5 means 50%
    VAD_THRESHOLD = 0.5
    while True:
        # If the AI has started speaking, "forget" everything we started hearing before
        if is_ai_speaking.is_set():
            full_audio_buffer = []
            is_speaking = False
            silence_counter = 0
            # Drain the audio queue without processing it
            try:
                audio_queue.get_nowait()
            except queue.Empty:
                pass
            continue
        chunk = audio_queue.get()
        chunk_flat = chunk.flatten()
        new_confidence = model_vad(torch.from_numpy(chunk_flat), SAMPLE_RATE).item()
        if new_confidence > VAD_THRESHOLD:
            if not is_speaking:
                is_speaking = True
                print("🎙️ Recording started...")
            full_audio_buffer.append(chunk_flat)
            silence_counter = 0
        else:
            if is_speaking:
                full_audio_buffer.append(chunk_flat)
                silence_counter += 1
                if silence_counter > SILENCE_LIMIT:
                    print("🎙️ Recording stopped. Processing...")
                    audio_to_process = np.concatenate(full_audio_buffer)
                    result = mlx_whisper.transcribe(
                        audio_to_process,
                        path_or_hf_repo=MODEL_PATH,
                        language='en',
                        condition_on_previous_text=False,
                        no_speech_threshold=0.7
                    )
                    text = result['text'].strip()
                    stop_words = ["Thank you.", "Thank you", "Thanks.", "Subscribe"]
                    if text and text not in stop_words and len(text) > 3:
                        print(f"\n🎙️ You said: {text}")
                        # Instead of spawning a new thread, just put the text into the queue!
                        text_queue.put(text)
                    full_audio_buffer = []
                    is_speaking = False
                    silence_counter = 0

def main():
    # 1. Start the assistant thread (LLM + voice)
    assistant_thread = threading.Thread(target=ai_assistant_worker, daemon=True)
    assistant_thread.start()
    # 2. Start the transcription thread (Whisper)
    worker_thread = threading.Thread(target=transcription_worker, daemon=True)
    worker_thread.start()
    # 3. Turn on the microphone
    try:
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback, blocksize=VAD_WINDOW):
            print("🎙️ Microphone is on. Press Ctrl+C to exit.")
            while True:
                sd.sleep(1000)
    except KeyboardInterrupt:
        print("\nProgram stopped.")

if __name__ == "__main__":
    main()

Conclusion
In this post, I shared my path to creating a ready-to-use voice assistant. You would be right to say that there are many things to improve and add to make it faster and more useful, but I think the main goal was achieved. That goal was to practice creating a real-world application and learn something new along the way.
Nowadays, the barrier of real challenges (the searching, the debugging, the figuring out why something doesn't work) has dropped dramatically. You just type your question and instantly get an answer. We no longer have to do everything by ourselves, nor do we have to build incredibly complex things from scratch just to feel the satisfaction of solving a challenge.
The days of coding just because you love the process of crafting something with your own hands are changing. You are moving into more of a managerial role over the process; you become more like an operator of a machine. Of course, engineering expertise is still essential, because you need to know what you are creating (at least at a high level) and how you will extend and maintain it tomorrow.
But to ride the wave of how deeply AI is currently integrated into the engineering process and the day-to-day work of an engineer, you should use it yourself as much as you can (until all your tokens are used up—ha ha, just joking). It’s the best way to understand how it works, what its limitations are, and how to use it most efficiently.