Nari Text-to-Speech Synthesis
Nari Labs has released an AI conversation generator that's not only impressive but pushes the boundaries of what's possible with open-source audio AI. With my background in media development, I was intrigued when asked to test and review Dia from Nari Labs. Despite being in what I consider early beta (it was released just a week ago), the tool shows remarkable potential for creating natural, emotionally expressive conversations between multiple speakers, without requiring an internet connection.

The Open Source Audio AI Revolution
It's not only in the open-source visual AI community that something is brewing; the open-source audio world is also working on new releases. As mentioned above, I was asked to test and review Dia from Nari Labs and decided to give it a try, keeping in mind that a tool released just a week ago is bound to be in its early beta stages.
Nari Labs has released Dia, which can generate a conversation between two speakers from a transcript. You could say it works like a local podcast tool, similar to the Audio Overviews feature Google later added to NotebookLM (the service itself launched on July 12, 2023). Google NotebookLM is limited to two speakers and only supports English, but it can turn any PDF document into a podcast, which I have to say is quite impressive. You can't feed PDF documents to Dia, but its Gradio interface provides an easy start, letting you get a dialogue going quickly.
But Dia goes beyond the typical podcast: it will literally act on the emotions and instructions you give it. This is still very new in the AI audio world, where audio generators are quickly becoming mainstream. Nari Labs is pushing the boundaries here, and I am sure many people have been waiting for this, as getting an AI TTS to change the pitch of its voice has long been a struggle for anyone generating voices from text.
Installing Nari Locally
I nearly always clone open-source repositories from GitHub using git clone. Early releases are often not yet available in applications like Pinokio, a free tool that makes installing these open-source models much easier. When there is no easy way, the only option is the hard way: getting the application up and running yourself. Fortunately, the developers are usually very helpful in providing guides to get you started. The Dia repository is hosted on Nari Labs' GitHub.
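As a sketch of that manual route, assuming the repository lives at github.com/nari-labs/dia and ships a standard Python package layout (check the repository's README for the exact steps, which may differ):

git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .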
Hardware & Environment
For this installation, I'm using Windows 11 with Python 3.12 and an NVIDIA RTX 4070 Ti SUPER with 16 GB of VRAM. While the developers state that Dia works with PyTorch 2.0+, I had to downgrade from PyTorch 2.7 to 2.6 to get it working properly with CUDA 12.6. This is a common challenge with early-stage open-source AI tools, which may require specific version combinations.
To set up Nari Labs Dia on your local machine, follow these steps.
Check & Uninstall Current PyTorch (If Needed)
If you encounter compatibility issues, you may need to uninstall your current PyTorch installation:
pip uninstall -y torch torchaudio
Install Correct PyTorch Version
Install PyTorch 2.6.0 with CUDA 12.6 support:
pip install torch==2.6.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
Note: PyTorch download requires about 2.5 GB of space, and CUDA may require an additional 3 GB if you need to install it.
Verify PyTorch and CUDA Installation
Create a verification script to check your PyTorch and CUDA configuration:
# Save as verify_cuda.py
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available. Check your installation.")
Run the script to verify your installation:
python verify_cuda.py
You should see output similar to:
PyTorch version: 2.6.0+cu126
CUDA available: True
CUDA version: 12.6
Device name: NVIDIA GeForce RTX 4070 Ti SUPER
Run Dia
From your installation directory, launch Dia with CUDA support:
python app.py --device cuda
Adding the --device cuda parameter forces Dia to use your GPU instead of CPU, which significantly improves performance.
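To illustrate what a flag like this typically boils down to, here is a minimal sketch of the common PyTorch device-selection pattern. This is illustrative only; the function name and fallback behavior are my own, not Dia's actual internals:

# Illustrative device selection, not Dia's actual code
import torch

def pick_device(requested="cuda"):
    # Fall back to CPU if CUDA was requested but isn't available
    if requested == "cuda" and not torch.cuda.is_available():
        print("CUDA is not available, falling back to CPU")
        return torch.device("cpu")
    return torch.device(requested)

device = pick_device("cuda")
print(f"Using device: {device}")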
The installation process isn't always smooth, as I discovered during my own setup. Here's an example of an error I encountered and how I resolved it:
python app.py --device cuda
Using device: cuda
Loading Nari model...
Error loading Nari model: Torch not compiled with CUDA enabled
After reinstalling PyTorch with the correct CUDA support, I was able to run Dia successfully:
python app.py --device cuda
Using device: cuda
Loading Nari model...
AppData\Roaming\Python\Python312\site-packages\torch\nn\utils\weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Launching Gradio interface...
* Running on local URL: http://127.0.0.1:7860
When developers write "2.0+", it doesn't guarantee that everything runs without issues, because it can't account for future updates or releases from third-party dependencies. But that's how the open-source world works: sometimes you need to make small adjustments to get things running locally.
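One habit that helps here: once you find a combination that works, pin it. Below is a minimal sketch of a requirements file reflecting my working setup; your versions may differ.

# requirements.txt - pins that worked on my setup (Windows 11, CUDA 12.6)
torch==2.6.0
torchaudio==2.6.0

Install against the matching CUDA wheel index like so:

pip install -r requirements.txt --index-url https://download.pytorch.org/whl/cu126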
Using Dia Online vs. Locally
Nari Labs does offer an online version of Dia, but you have to sign up for a waiting list to access it. I can understand if you are more comfortable with the online tool, as running open-source software in its early stages can be a hassle due to bugs and workarounds you sometimes have to fix yourself. However, be aware that every prompt and audio file you share online is stored there, where it can be processed for further training. And if a website is hacked, intruders will most likely gain access to that data. So there is definitely something to consider before sharing your data online. To join the waiting list, follow this URL: https://tally.so/r/meokbo
Online vs. Local: Privacy Considerations
While the online version of Dia may be easier to use, running the tool locally ensures complete privacy of your data and generated audio. No data leaves your computer, giving you full control over your work.
Creating Conversational Audio with Dia
Dia generates a dialogue between two speakers. To separate the speakers, you use the reference [S1] for Speaker one and [S2] for Speaker two. Here is a quick example you can copy and paste into the Input Text area:
[S1] What a great day it is today. [S2] Yes, it's great to be outdoors and just enjoy the great weather. [S1] I couldn't agree more. [S2] Well, see you around.
Talking about the weather is usually a great kick-starter for any conversation, and it works great here as well. An amazing thing I immediately noticed was the emotional pitch in the voices, which I have to say takes it a step further than commercial paid services such as ElevenLabs, which is still struggling to fully incorporate emotional voice pitch into its generated speech.

You will have to figure out the token limits for short and longer conversations manually; there doesn't seem to be a rule of thumb for the maximum number of seconds you can get. The lowest token value is 860 and the highest is 3072. The more you increase the token value, the longer generation takes, but you also gain a few extra seconds of conversation. Keep in mind that this is a 1.6B-parameter model, which is small by LLM standards. Still, it's quite impressive, and on my local setup it takes about 1 to 2 minutes to generate speech at the highest token value.
When testing the Speed Factor, I noticed that slowing the speech down also changed the pitch of the speakers' voices. I hope to see a fix for this in the future.
Supported Non-Verbal Cues
A stroke of genius is that Dia supports generating non-verbal cues. The following tags are recognized:
(laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
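To see the tags in action, here is a short illustrative transcript of my own that you can paste into the Input Text area:

[S1] Did you hear they released the model after only a week? (laughs) [S2] (sighs) And it actually runs locally? [S1] It does. (clears throat) First try, even. [S2] (gasps) Now that I have to hear. (laughs)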
Conversational Quality
The following audio sample demonstrates the conversational capabilities of Dia, including its ability to convey emotional nuances and natural-sounding dialogues between multiple speakers.
It outputs the audio file with a sample rate of 44.1 kHz (44,100 Hz), which is high-quality output.
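If you want to verify the sample rate yourself, and assuming you save the result as a WAV file (the filename below is just a placeholder), Python's standard library can read the header:

# Inspect a generated audio file; "output.wav" is a placeholder name
import wave

with wave.open("output.wav", "rb") as f:
    print(f"Sample rate: {f.getframerate()} Hz")  # expect 44100
    print(f"Channels: {f.getnchannels()}")
    print(f"Duration: {f.getnframes() / f.getframerate():.1f} s")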
Prompt Tip: Generate Conversations with AI
You can also let any Chatbot help you build conversations to use inside Dia. To try this out, just copy the instructions below:
I want you to create a conversational joke between two speakers in a podcast. We will divide the speakers into references: [S1] is reference one for the first speaker, and [S2] is reference two for the second speaker. If you use any non-verbal tags, you are limited to choosing between the tags below: (laughs) (clears throat) (sighs) (gasps) (coughs) (singing) (sings) (mumbles) (beep) (groans) (sniffs) (claps) (screams) (inhales) (exhales) (applause) (burps) (humming) (sneezes) (chuckle) (whistles)
High-Quality Output
Dia generates audio at 44.1kHz sample rate, providing professional-grade sound quality for your conversations.
Emotional Expression
Unlike many commercial TTS systems, Dia naturally conveys emotional nuances through voice pitch and tone variations.
Privacy-Focused
Running completely offline after initial setup ensures your data and generated content remain private.
As always, I like to test whether a tool only works online or also supports working offline. This matters because privacy is a concern for quite a few people and companies. Dia works like a charm offline, so I consider it safe from a privacy standpoint. Compare that with a tool like ElevenLabs, where you immediately share your prompts and audio files, which their models can then be trained on (they do offer an opt-out from their training program, though whether that fully works is something I can't verify). What I am sure about is that when you run your AI tools locally and offline, your privacy is guaranteed.
The Conversation Outcome
Being an AI tool, you never fully know the outcome, but the conversational style does sound very good. However, there are sometimes breaks and sudden stops that shouldn't be there, and sometimes it will not follow the whole transcript you have shared. When this happens, consider tweaking the token amount based on the conversation length to see if this can improve the outcome.
| Feature | Nari Dia | ElevenLabs | Google NotebookLM |
|---|---|---|---|
| Emotional Expression | Strong | Limited | Limited |
| Offline Operation | Yes | No | No |
| Voice Variety | Limited | Extensive | Limited |
| Non-verbal Cues | Yes | Limited | No |
| Languages | English only | Multiple | English only |
| PDF Processing | No | No | Yes |
If you are in a marketing department with access to the full Adobe Creative Cloud environment, you can make quick fixes to your audio files with small edits in Audition, which I have to say is my preferred audio editor. There are also freeware tools such as Audacity, which works great for editing audio and even supports AI features.
Licensing and Usage Rights
The Apache License 2.0 only covers the software and explicitly does not govern the content generated by the user. There are no other licenses, terms, or conditions from the developers that apply to the generated audio output. However, this does not mean the license cannot be updated in the future.
Commercial Usage Rights
You are free to use the generated audio commercially because:
- The audio is a derivative work of your copyrighted input text.
- The Nari Labs Dia software license (Apache 2.0) allows you to run the tool.
- The tool's developers have not imposed any separate restrictions or licenses on the output generated by the tool.
It's always good to follow up on license updates, especially if you want to use the output commercially, as developers reserve the right to change their terms at any time. If you are not fully sure what you are allowed to do, I always recommend contacting the developers directly: Nari Labs.
This license check was made April 29, 2025.
Applications and Future Potential
The potential applications for Nari's Dia technology extend well beyond simple conversational audio generation. Here are just a few possibilities:
Game Development
Generate dynamic NPC conversations with emotional depth for immersive gaming experiences, reducing the need for extensive voice actor recording sessions.
Podcast Production
Create podcast demos, test different script approaches, or generate complete shows with natural-sounding conversations between multiple speakers.
Accessibility Tools
Enhance reading applications for visually impaired users with more engaging and natural-sounding conversational voices instead of monotone narration.
Interactive AI Assistants
Create more engaging AI assistants with emotional intelligence expressed through their vocal patterns and non-verbal cues.
Animatic Voiceovers
Generate placeholder dialogue for animation pre-visualization to better communicate timing and emotional tone before hiring voice actors.
Language Learning
Create interactive dialogue scenarios with emotional context for more effective language learning applications.

Nari vs. Commercial Solutions
When comparing Nari Dia to commercial solutions like ElevenLabs or Google's offerings, several key differences emerge that may influence which tool is right for your specific needs.
Nari Dia Advantages
- Superior emotional expression with natural pitch changes
- Completely offline operation after initial setup for total privacy
- One-time setup with no recurring subscription costs
- Advanced non-verbal cue support (laughs, sighs, etc.)
- Freedom to use generated audio for commercial purposes
- Open-source architecture allows for community improvements
Current Limitations
- English language support only (no multilingual capabilities)
- Occasional timing issues with breaks and sudden stops
- Technical setup required compared to cloud-based services
- Limited voice variety compared to commercial alternatives
- Speed adjustments affect voice pitch
- Small model size (1.6B parameters) limits complexity handling
Editor's Recommendation
If privacy and emotional expressiveness are your primary concerns, Nari Dia provides capabilities that currently surpass commercial alternatives. For production environments requiring multilingual support or extensive voice options, you may need to combine Dia with commercial solutions until the open-source ecosystem matures further.
Conclusion
I always welcome bright minds who dare to think outside the box, and Nari Labs is contributing something that could easily find its way into other software such as games, help functions, and accessibility services. The opportunities are huge, and as an open-source model it challenges big players like ElevenLabs and Google, though I have to say these two companies are still miles ahead. But at the speed AI is moving, you never really know who the top contender will be tomorrow. Just like Google NotebookLM, Nari Labs Dia only supports English audio. There are no community notes on whether other languages will be supported, but that will undoubtedly require new models and a lot more training.
The tool is very easy to work with, and as an early release, they have done an amazing job. Of course, we as users can always point out areas for improvement, but in cases like this, I would rather put on my cape of encouragement and hope to see many more updates from these developers. I will certainly look forward to following them and seeing what is next on their agenda. I would also like to thank them for contributing to the Open Source community, giving people like me and many others the opportunity to examine and study what we can expect from the future.
Keep up the good work.
