Adding speech to text in your application with Spring AI
Use Spring AI and OpenAI to easily add speech-to-text functionality to Spring apps and transform audio files into transcriptions, translations, and VTT formats.
Jan 9, 2025 • 6 Minute Read
I always like it when something “just works”. For example, it’s quite a treat when the documentation matches reality, when entry-level use cases are simple, and when APIs have nice, reasonable defaults.
What we’re going to build together today is a speech-to-text application using some basic Spring Boot starters, including Spring AI. By the end, we’ll have an endpoint that can upload an audio file, hand it off to OpenAI, and return its transcription in a number of formats, including VTT.
Spring Initializr
Like many Spring applications, this one starts with selecting the right values from Spring Initializr. In this case, I’ll select Spring Web and Spring AI only:
You can either go to https://start.spring.io and fill out the same values, or you can simply create it with this link.
Having done that, unzip the file wherever you run your code and open the project in the IDE of your choice.
Dependencies
Taking a look at the `build.gradle` file, you’ll see the needed dependencies for this Spring AI application:
ext {
    set('springAiVersion', "1.0.0-M3")
}

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.springframework.ai:spring-ai-openai-spring-boot-starter'
    // …
}

dependencyManagement {
    imports {
        mavenBom "org.springframework.ai:spring-ai-bom:${springAiVersion}"
    }
}
Notice that Spring AI ships with a Bill of Materials, which is valuable for ensuring the Spring AI dependencies you use are compatible together.
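If you chose Maven instead of Gradle on Spring Initializr, the equivalent import of the same BOM (assuming the same milestone version) looks like this in `pom.xml`:

```xml
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springframework.ai</groupId>
      <artifactId>spring-ai-bom</artifactId>
      <version>1.0.0-M3</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```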
Create the Endpoint
The next step is to create our Spring MVC endpoint for uploading a file. This isn’t always necessary; perhaps you will be loading audio files from a filesystem and don’t need an endpoint. If so, just consider this a convenient way to try out a few of Spring AI’s features.
I’ll create one like this:
@RestController
public class TranscriptionController {

    @PostMapping("/transcribe")
    public String transcribe(@RequestParam("audio") MultipartFile audio) {
        return audio.getOriginalFilename();
    }
}
This is a good start to ensure that we’ve got the plumbing working. It’s especially a good idea if you are new to Spring or Spring MVC. Just one thing at a time, Josh!
Then, we can run the application from the `SpringToTextApplication` file or from the command-line like so:
./gradlew :bootRun
It will start up on port 8080, and then you can use HTTPie to upload an audio file like so:
http -f POST :8080/transcribe [email protected]
It should output the response:
my-audio.mp3
You can use any file you want at this point; it doesn’t have to be an audio file. If you’d like, use this audio file, since it’s the one I’ll be using shortly.
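Of course, nothing stops a caller from posting a PDF to this endpoint. If you’d like to fail fast before spending an API call, one lightweight option is a guard on the filename extension. This is a hypothetical helper, not part of Spring AI; the extension list reflects formats OpenAI’s audio API commonly accepts, so check the current docs before relying on it:

```java
import java.util.Set;

public class AudioFileCheck {

    // Extensions commonly accepted by OpenAI's audio endpoints (verify against current docs).
    private static final Set<String> SUPPORTED =
            Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");

    // Returns true if the filename has a supported audio extension.
    public static boolean looksLikeAudio(String filename) {
        if (filename == null) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false;
        }
        return SUPPORTED.contains(filename.substring(dot + 1).toLowerCase());
    }
}
```

In the controller, you could call `looksLikeAudio(audio.getOriginalFilename())` and return a 400 response when it fails.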
Activate OpenAI Integration
We’re going to be integrating with OpenAI’s speech-to-text support. To activate Spring AI’s OpenAI API, we need to add an OpenAI API key. This is something you obtain from your OpenAI account at https://platform.openai.com.
You add the key in Spring’s `application.properties` file like so:
spring.ai.openai.api-key={{ your API key }}
With that, the next time we run the application, Spring will publish the OpenAI beans that we need to perform audio transcription.
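As an aside, it’s best not to commit the key to source control. Thanks to Spring Boot’s relaxed binding, you can supply the same property as an environment variable instead of putting it in `application.properties`:

```shell
export SPRING_AI_OPENAI_API_KEY={{ your API key }}
```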
Call the Transcription API
Now, we’re ready to cook! We’ll need Spring AI’s `OpenAiAudioTranscriptionModel` component, so let’s update `TranscriptionController` to depend on it:
@RestController
public class TranscriptionController {

    private final OpenAiAudioTranscriptionModel model;

    public TranscriptionController(OpenAiAudioTranscriptionModel model) {
        this.model = model;
    }

    // …
}
Then, we formulate the transcription request and parse the transcription response inside the `transcribe` method.
// …
@PostMapping("/transcribe")
public String transcribe(@RequestParam("audio") MultipartFile audio) {
    Resource audioResource = audio.getResource();
    OpenAiAudioTranscriptionOptions options =
            OpenAiAudioTranscriptionOptions.builder().build();
    AudioTranscriptionPrompt request =
            new AudioTranscriptionPrompt(audioResource, options);
    AudioTranscriptionResponse response = this.model.call(request);
    return response.getResult().getOutput();
}
And that’s it! (Though stay tuned for a bit more from `OpenAiAudioTranscriptionOptions`.)
Now, if you restart the application, you can upload a new file and see it transcribed:
http -f POST :8080/transcribe [email protected]
I saw an angel in the marble, and I carved until I set him free.
Transcribe and Translate
One of the cooler features, I think, is OpenAI’s ability to return the transcription in a different language than the one the audio was originally recorded in.
Let’s add this feature to our web application by allowing the application to accept `language` as a parameter as well:
public String transcribe(@RequestParam("audio") MultipartFile audio,
        @RequestParam(name = "language", defaultValue = "en") String language) {
    // …
    OpenAiAudioTranscriptionOptions options =
            OpenAiAudioTranscriptionOptions.builder()
                    .withLanguage(language)
                    .build();
    // …
After restarting the application, you can now do:
http -f POST :8080/transcribe language=es [email protected]
Viste a un ángel en el mar, y lo puse a cartar hasta liberarlo.
Which isn’t perfect (Michelangelo didn’t see an angel in the sea!), but it goes a long way toward recording your content in one language and making it available in others.
Sync Audio and Transcription (VTT)
The last thing we’ll demonstrate is Spring AI’s support for VTT, another one of the audio transcription options. To see this one in action, we’ll need a slightly longer audio file.
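One practical note before uploading longer recordings: Spring Boot caps multipart uploads at 1MB by default, so a larger audio file will be rejected before it ever reaches OpenAI. You can raise the limits in `application.properties`; 25MB is a reasonable ceiling, since that matches the upload limit OpenAI documents for its audio API at the time of writing:

```properties
spring.servlet.multipart.max-file-size=25MB
spring.servlet.multipart.max-request-size=25MB
```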
To support this, add another `@RequestParam` to the method like so:
public String transcribe(@RequestParam("audio") MultipartFile audio,
        @RequestParam(name = "language", defaultValue = "en") String language,
        @RequestParam(name = "format", defaultValue = "vtt") TranscriptResponseFormat format) {
    // …
    OpenAiAudioTranscriptionOptions options =
            OpenAiAudioTranscriptionOptions.builder()
                    .withResponseFormat(format)
                    .withLanguage(language)
                    .build();
    // …
}
Then we’ll upload a file with a longer audio:
http -f POST :8080/transcribe format=vtt [email protected]
WEBVTT
00:00:00.000 --> 00:00:02.400
Yeah, well, I learned a very important lesson this week.
00:00:02.400 --> 00:00:06.600
Sometimes you fall in love and you think it feels that way forever.
00:00:06.600 --> 00:00:10.900
You change your life and ignore your friends cause you think you can't get any better.
00:00:10.900 --> 00:00:15.200
But then love goes away, no matter what it does it stays strong.
// …
Holy smokes! Spring AI just made it possible in a few lines to add transcription, time codes, and translation to my application.
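If you want to work with those cues programmatically, say, to highlight the current caption in a player, the VTT timing lines are simple enough to parse by hand. Here’s a minimal sketch in plain Java (no Spring involved); `VttCues` is a hypothetical helper of my own, not part of Spring AI:

```java
import java.util.ArrayList;
import java.util.List;

public class VttCues {

    // Converts a VTT timestamp like "00:00:02.400" into milliseconds.
    public static long toMillis(String timestamp) {
        String[] parts = timestamp.split("[:.]"); // hours, minutes, seconds, millis
        return Long.parseLong(parts[0]) * 3_600_000L
                + Long.parseLong(parts[1]) * 60_000L
                + Long.parseLong(parts[2]) * 1_000L
                + Long.parseLong(parts[3]);
    }

    // Extracts each cue's [startMillis, endMillis] pair from a VTT document.
    public static List<long[]> cueRanges(String vtt) {
        List<long[]> ranges = new ArrayList<>();
        for (String line : vtt.split("\\R")) {
            if (line.contains("-->")) {
                String[] ends = line.split("-->");
                ranges.add(new long[] { toMillis(ends[0].trim()), toMillis(ends[1].trim()) });
            }
        }
        return ranges;
    }
}
```

Feeding it the response above, `cueRanges` would hand back the four time ranges, ready to sync against audio playback.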
Transcriptions Made Simple
In this article, you saw a preview of the amazing things that Spring AI can make available to you and your AI backends through its transcription abstraction. Not only can it take a number of file types, but it can also add timecodes and translate, all with just a few parameter changes. Want to learn more? Explore the Core Spring learning path to deepen your Spring expertise, or view other Pluralsight AI courses and paths to discover more ways to integrate AI into your applications.
Note that at the time of this writing, Spring AI is nearing its 1.0 release! So now is an awesome time to give them feedback for what changes you’d like to see.
Check out the GitHub repo for this sample in action.