Last Update: Jul 17, 2024



So you want to transcribe some audio into text. There are a lot of great services out there where you upload a video or MP3, and they’ve been around for years. But what if you have a TON of audio to transcribe? Or you want to save money? You can do both with an open-source tool called Whisper and a tiny bit of Python, running entirely on your local machine. Here’s how.

If you prefer a video version of this tutorial, I made one!

OpenAI Whisper

For this tutorial, we’re going to use a local version of OpenAI Whisper. Whisper is a neural network for speech recognition that, according to OpenAI, approaches human-level accuracy on English speech. It’s awesome.

OpenAI also offers a hosted Whisper API, which can be faster and more convenient than running the model yourself, especially on modest hardware. But there’s a cost associated with it, and heck, we like just running things locally to see if we can, right? Let’s do it.

Set up Your Environment

For this demonstration, I’m running Ubuntu under WSL in Windows. The instructions for setting it up in Ubuntu proper are the same. I have yet to try this on a Mac, but I will.

The first thing you do, of course, is update the system.

sudo apt update
sudo apt upgrade

Next, you’ll need a few base packages installed on the system for this to work, mainly FFmpeg, which Whisper uses to read audio files. Install it with:

sudo apt install ffmpeg
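
If you’d like to confirm that FFmpeg actually landed on your PATH before moving on, a quick check along these lines should do it. This is only a sketch: the function name `ffmpeg_available` is my own and not part of Whisper.

```python
import shutil
import subprocess


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    # Show the first line of `ffmpeg -version` as a sanity check.
    out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    lines = out.stdout.splitlines()
    print(lines[0] if lines else "ffmpeg found")
else:
    print("ffmpeg not found; try: sudo apt install ffmpeg")
```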

You should be good to go. Let’s create a Python environment:

mkdir whispertest && cd whispertest
python3 -m venv whispertest
source whispertest/bin/activate

Remember, you should see the environment name to the left of your prompt:

[Screenshot: shell prompt with the (whispertest) environment name on the left]
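
If the prompt doesn’t make it obvious, you can also ask Python itself whether it’s running inside a virtual environment. A minimal sketch (the helper name `in_virtualenv` is made up for illustration):

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the system Python.
    return sys.prefix != sys.base_prefix


print("venv active:", in_virtualenv())
print("prefix:", sys.prefix)
```

Run it with the same `python3` you plan to use for Whisper. If it reports that no venv is active, run the `source` command above again.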

Then, we’ll need to install the Rust setup tools:

pip install setuptools-rust

Note: If You Have an NVIDIA GPU

If you have an NVIDIA GPU, you must install the NVIDIA drivers for this to work properly.

You can verify they’re installed correctly by typing:

nvidia-smi

And you should see something like this:

[Screenshot: nvidia-smi output listing the GPU and driver version]
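
You can run a similar check from Python, but only after the Whisper install in the next section, since that is what pulls in PyTorch. Here’s a hedged sketch; `pick_device` is a name I made up, and it falls back to the CPU if PyTorch isn’t importable yet.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of openai-whisper
    except ImportError:
        return "cpu"  # PyTorch not installed yet, so assume CPU
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Whisper would run on:", pick_device())
```

Once Whisper is installed, you can pass the result along when loading a model, for example `whisper.load_model("base", device=pick_device())`.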

Install Whisper

Whisper runs as an executable within your Python environment. It’s pretty cool.

The best way to install it is:

pip install -U openai-whisper

But you can also pull the latest version straight from the repository if you like:

pip install git+https://github.com/openai/whisper.git

Either way, it will install a bunch of packages, so go get some ice water. When it’s done, the whisper executable will be installed.

I recorded a sample file, and here’s how we can run it.

whisper [audio.flac audio.mp3 audio.wav] --model [model size]

I will start with the tiny model just to see how it performs. Here’s a list of available models:

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x
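
If you’d rather pick a model based on how much GPU memory you have, here’s a small sketch that encodes the VRAM column from the table. The numbers are approximate, and the helper `largest_model_that_fits` is hypothetical, not something Whisper provides.

```python
# Approximate VRAM needs in GB, copied from the table above (largest first).
MODEL_VRAM_GB = [
    ("large", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def largest_model_that_fits(vram_gb: float) -> str:
    """Pick the biggest model whose listed VRAM need fits in vram_gb."""
    for name, needed in MODEL_VRAM_GB:
        if vram_gb >= needed:
            return name
    return "tiny"  # fall back to the smallest model


print(largest_model_that_fits(6))  # medium
```

Bigger isn’t always necessary, though. If a small model transcribes your audio well, it will finish a lot sooner.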

I’ll start with the smallest model and see its accuracy, then work my way up if needed.

Here’s the command I ran to transcribe my sample file:

whisper sample-audio.wav --model tiny

And lucky for me, it was transcribed perfectly:

[Screenshot: transcription output in the terminal]

Your results will vary. If you don’t like the output, you can always step up to a larger model, which will take more memory and more time.

So, what else can you do with this tool?

Building a Cool Python Script

Whisper has a bunch of cool features that I don’t use, like translation! But what if we want to script this, say to process 100 audio files? Building a Python script to run it is easy.

Here’s a script straight from the GitHub page:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

And when I run it, it shows clean text output.

[Screenshot: script output showing the transcribed text]

You can, of course, write this to its own text file:

# Save the transcript as UTF-8 text
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(result["text"])
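
To handle a whole folder of recordings instead of one file, you could wrap the same calls in a loop. The sketch below is one way to do it, not an official recipe: the function names, the extension list, and the choice to write a `.txt` next to each audio file are all my assumptions. It loads the model once and reuses it for every file.

```python
from pathlib import Path

# Extensions to treat as audio (an assumption; adjust to your files).
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def find_audio_files(folder: str) -> list:
    """Return the audio files in `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def transcribe_folder(folder: str, model_name: str = "base") -> None:
    """Transcribe every audio file in `folder` to a .txt file beside it."""
    import whisper  # imported here so find_audio_files works without it

    model = whisper.load_model(model_name)  # load once, reuse for each file
    for audio_path in find_audio_files(folder):
        result = model.transcribe(str(audio_path))
        out_path = audio_path.with_suffix(".txt")
        out_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
        print(f"{audio_path.name} -> {out_path.name}")
```

Call it with something like `transcribe_folder("recordings")`, where `recordings` is whatever directory holds your audio.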

There are tons of options available. It can also transcribe other languages.
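
As a rough illustration of those options, here’s a sketch that translates non-English speech into English and prints it segment by segment with timestamps. It assumes a multilingual model (so not one of the `.en` variants). The `"ja"` language code is just a placeholder, and `format_timestamp` and `translate_with_timestamps` are my own helper names.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm (hours are folded into minutes)."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def translate_with_timestamps(path: str, language: str = "ja") -> None:
    """Translate speech in `path` to English and print it per segment."""
    import whisper  # imported here so format_timestamp works without it

    model = whisper.load_model("small")  # multilingual model
    result = model.transcribe(path, language=language, task="translate")
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        print(f"[{start} -> {end}] {seg['text'].strip()}")
```

Leave out `task="translate"` if you want the transcript in the original language instead.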

Summary

In this tutorial, we installed Whisper and played around with it. It’s super easy to use and very performant. I have yet to do a lot of thorough testing with it, but so far, it’s been very accurate.

I’d love to hear from you if you’re doing something cool with this!

Be sure to bookmark this blog for more cool stuff like this.

– Jeremy







Published: Jul 16, 2024 by Jeremy Morgan. Contact me before republishing this content.