At my job, we still use PowerPoint avatars and cartoon animations to walk people through software and business processes. They get the job done, but they feel outdated, robotic, and a little hard to take seriously. So I started thinking about how to create walkthroughs that feel more natural and easier to follow, without turning the process into something overly complicated. That’s when I decided to try building a quick tutorial using HeyGen, paired with an avatar I designed in Midjourney and voice narration from ElevenLabs. No actors, no studio, no overdone animations. Just a clean, approachable prototype that feels a little more human. For this experiment, I focused on a simple but very common question: how to change your video background in Microsoft Teams when you’re not in a scheduled meeting.
I didn’t want to just create something that looked nice. I wanted to see if switching to an AI-based walkthrough could actually make things clearer. So I picked a real-world task that comes up all the time, at least for me, since it’s a question my co-workers regularly message me about. It’s one of those little things that most people don’t think about until it’s too late, especially when they’re trying to join a call quickly and suddenly realize they’re about to broadcast their laundry pile. It’s a small moment, but a great example of where a fast, friendly tutorial can save time and a little bit of dignity.
The Process
To pull everything together, I used Midjourney to design the avatar, ElevenLabs for the voice, and HeyGen’s new Avatar IV for the movement and lip-sync. I recorded the screen share using OBS Studio, which gave me clean, high-quality footage to work with. Then I brought everything into CapCut for the final edit. It was simple and fast, and didn’t require a deep dive into layers or keyframes just to sync the visuals with the voice. I was able to line up the audio, drop in my video clip, and fine-tune the timing with a few easy clicks. CapCut also made it easy to add captions and transitions without overcomplicating the process. It gave me just enough control to make things look polished, without slowing me down.
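If you’d rather script the narration step than click around a web UI, ElevenLabs exposes its text-to-speech as a REST endpoint. Here’s a minimal sketch in Python. The voice ID and script text are placeholders, and it assumes you have an API key in your environment; I generated my audio through the site itself, but the output is the same MP3 you’d hand off to HeyGen.

```python
# Minimal sketch: generate narration audio via ElevenLabs' text-to-speech
# REST endpoint. VOICE_ID and the script text are placeholders.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]  # assumes your key is set in the environment
VOICE_ID = "YOUR_VOICE_ID"                  # placeholder: the voice picked from the ElevenLabs library

script = (
    "Hi, I'm Cynthia. Today I'll show you how to change your video "
    "background in Microsoft Teams before you ever join a meeting."
)

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": script,
        "model_id": "eleven_multilingual_v2",  # one of ElevenLabs' current models
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
    },
)
resp.raise_for_status()

# The endpoint returns MP3 audio; save it for upload into HeyGen.
with open("narration.mp3", "wb") as f:
    f.write(resp.content)
```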
Cynthia
Let’s talk about the avatar. I chose a female-presenting character with a mixed-race appearance. That wasn’t random. In professional settings, especially here in the U.S., people tend to respond better to faces that feel approachable and relatable. Female-presenting avatars are often seen as more trustworthy in training roles, and choosing someone who reflects a broader range of backgrounds helps more people feel like the content is for them. Research from Nielsen, Pew, and Stanford’s Social Media Lab backs this up, showing that viewers are more likely to stay engaged when the face on screen feels representative or neutral in a multicultural setting. We process faces quickly, often before we hear a single word, so I wanted someone who looked like they belonged in the room and not just another stock character in a suit.
Meet Cynthia, created with the help of Midjourney.


Lip-Sync
Kling AI had me… almost. I thought I’d be able to use it for both the avatar and the lip-sync, but after a few test runs, the mouth movements felt like they were stuck in slow motion. It wasn’t terrible, but it also wasn’t good enough to keep. That said, when it comes to lifelike motion, Kling is still one of my top three image-to-video tools. The facial movement and realism are impressive; the lip-syncing just isn’t there yet. So I rerouted. I built the avatar in Midjourney, then used HeyGen to animate her and sync the voice from ElevenLabs. HeyGen and ElevenLabs pair up nicely, since there’s a built-in integration between the two. To be transparent, that pivot cost me a little time since I hadn’t used HeyGen before, but I found it surprisingly easy to learn. The result felt smoother, more lifelike, and way less robotic.
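For what it’s worth, this handoff can also be scripted. HeyGen publishes a REST API alongside the web app, and the sketch below shows roughly what driving a photo-based avatar with the ElevenLabs audio could look like. Treat it as an outline under assumptions rather than the exact flow I used: the endpoint shapes follow HeyGen’s v2 API docs as I understand them, and every ID and URL here is a placeholder.

```python
# Rough sketch of HeyGen's v2 video-generate endpoint, assuming the
# Midjourney image has already been uploaded as a "photo avatar" in your
# HeyGen account. All IDs and the audio URL are placeholders.
import os
import requests

API_KEY = os.environ["HEYGEN_API_KEY"]

resp = requests.post(
    "https://api.heygen.com/v2/video/generate",
    headers={"X-Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "video_inputs": [{
            # A "talking_photo" character animates a still image; the ID
            # comes from uploading the Midjourney render in the dashboard.
            "character": {
                "type": "talking_photo",
                "talking_photo_id": "YOUR_PHOTO_AVATAR_ID",  # placeholder
            },
            # Feed in the ElevenLabs narration instead of HeyGen's own TTS.
            "voice": {
                "type": "audio",
                "audio_url": "https://example.com/narration.mp3",  # placeholder
            },
        }],
        "dimension": {"width": 1920, "height": 1080},
    },
)
resp.raise_for_status()
video_id = resp.json()["data"]["video_id"]

# Rendering is asynchronous; poll the status endpoint until it finishes.
status = requests.get(
    "https://api.heygen.com/v1/video_status.get",
    headers={"X-Api-Key": API_KEY},
    params={"video_id": video_id},
).json()
print(status["data"]["status"])  # e.g. "processing" or "completed"
```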
The Uncanny Valley
One thing I can’t ignore is the Uncanny Valley effect. It’s that odd feeling you get when something looks almost human but doesn’t quite move or talk like one. Your brain expects it to behave like a real person, and when it doesn’t, like when the mouth is just a beat off or the expressions feel a little robotic, it creates this quiet little mental glitch. That’s a bit of cognitive dissonance in action. You know it’s not real, but it’s trying so hard to be, and that mismatch pulls you out of the moment. Tools like HeyGen are getting better, especially with their latest Avatar IV, but the lip-sync, overall, still has some catching up to do. It’s not unsettling enough to ditch the whole thing, but it’s a reminder that we’re not fooling anyone just yet.
Future Improvements
There are definitely a few things I’d love to improve in future versions. For one, I really want to add a pulsing or animated trail to the cursor to help guide the viewer’s attention during screen recordings. I used OBS for the screen capture, and since it was my first time working with it, there’s still a lot I need to learn. I also had plans to add some sort of effect around the face bubble during the screen share, but trying to pull that off in CapCut felt more complicated than it was worth for this project. It’s something I’ll save for another tutorial when I have time to really dig into it. While CapCut is great for short-form content like TikTok videos, I’ve found that for longer tutorials or anything that needs more polish and control, Adobe After Effects or Premiere will probably be my go-to moving forward. Perhaps it’s just familiarity, but I feel there’s more room to fine-tune things the way I want.
Conclusion
The entire process had a pretty steep learning curve, especially since it was my first time using both HeyGen and OBS Studio, and my first time using CapCut for something like this. But now that I’m familiar with the workflow, the speed has definitely improved. Once the avatar is set, I can comfortably create a 2-to-3-minute tutorial in about 2 to 3 hours. At that pace, producing 3 to 4 videos in a full workday is feasible, depending on the complexity of the topic. Not bad for something that started as an experiment.
This whole project was a test, not just of tools, but of process. I wanted to see if I could create something that felt more human, more helpful, and still easy to build without needing a production crew. Between Midjourney, HeyGen, ElevenLabs, and CapCut, I found a workflow that’s fast, flexible, and way more engaging than a PowerPoint avatar with jazz hands. It’s not perfect, but it’s a step toward better tutorials that don’t just show people what to do, but actually work for the people watching them.



