Introduction
Realistic voice and facial animation are vital for MetaHumans, Epic Games’ hyper-realistic digital humans, to become believable characters. Lifelike speech conveys emotion and story, while realistic facial animation underpins compelling characters and immersive storytelling. Without accurate lip-sync and expressive movement, even photorealistic MetaHumans appear lifeless, and in games, films, and VR, convincing dialogue delivery is essential for audience belief. This article explains what MetaHumans are and offers a step-by-step guide to animating their speech using real-time techniques (for game engines and live applications) and pre-rendered cinematic workflows. It covers how to achieve high-quality, voice-linked facial animation with Unreal Engine tools and AI-driven solutions, along with expert tips, troubleshooting advice, and future trends in this evolving field.
Understanding MetaHuman Technology
MetaHumans are Epic Games’ framework for crafting realistic digital characters in Unreal Engine using the cloud-based MetaHuman Creator. Artists select presets and tweak facial features, skin, and hair to create custom characters, which are fully rigged with sophisticated skeletons and control rigs for immediate animation. Downloaded via Quixel Bridge, MetaHumans import as assets (body, face, materials) into Unreal Engine, featuring advanced facial rigs with controls and blendshapes derived from real scans, including 52 Apple ARKit-standard morph targets for expressions and speech visemes.
In Unreal Engine, MetaHumans integrate with the animation system via blueprints and control assets, like the Face Animation Blueprint, which applies facial motion capture data (e.g., ARKit, Live Link) to the face. They include eight Levels of Detail (LODs), from high-detail close-ups with full rigs to simplified distant versions, enabling both cinematic rendering and real-time gameplay scalability.
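To make the curve-driven rig concrete, here is a minimal standalone Python sketch, not the Unreal API, of what a facial pose looks like as ARKit-style blendshape weights; the curve names and values are illustrative and can differ slightly depending on the capture source.

```python
# Illustrative only: facial poses expressed as ARKit-style blendshape weights (0.0-1.0).
# Curve names follow the ARKit convention mentioned above; exact names vary by capture source.
AH_VISEME = {
    "jawOpen": 0.55,      # drop the jaw for an open "Ah"
    "mouthFunnel": 0.10,  # slight rounding of the lips
    "mouthClose": 0.0,    # lips not pressed together
}

SMILE = {
    "mouthSmileLeft": 0.7,
    "mouthSmileRight": 0.7,
    "cheekSquintLeft": 0.3,
    "cheekSquintRight": 0.3,
}

def blend_poses(base: dict, layer: dict, alpha: float) -> dict:
    """Additively blend an expression layer over a viseme pose, clamped to [0, 1]."""
    out = dict(base)
    for curve, weight in layer.items():
        out[curve] = min(1.0, out.get(curve, 0.0) + alpha * weight)
    return out

print(blend_poses(AH_VISEME, SMILE, alpha=0.5))  # an open "Ah" with a half-strength smile
```

Capture tools such as Live Link Face stream exactly this kind of per-frame weight set, which the Face Animation Blueprint then applies to the rig.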
MetaHumans excel as digital characters due to their user-friendly creation process and photorealistic results, requiring no custom rigging. They capture subtle expressions, like smirks or frowns, with detail, making them ideal for storytelling. Proper animation, especially for dialogue, is key to their potential, with tools and methods explored next.

Preparing a MetaHuman for Voice Animation
To prepare a MetaHuman for talking in Unreal Engine, follow these steps:
- Create or Import a MetaHuman: Design a character in MetaHuman Creator or select a preset, then download it into Unreal Engine via Quixel Bridge, ensuring the MetaHuman plugin is enabled and Unreal Engine 5+ is used. The downloaded MetaHuman Blueprint includes the body mesh, face mesh, and materials.
- Place the MetaHuman in a Level: Add the MetaHuman Blueprint to a level or scene. It includes a face skeletal mesh with a Face Animation Blueprint (Face_AnimBP) linked to control curves (e.g., ARKit blendshapes) for facial animation via Live Link or animation sequences. Confirm the Face_AnimBP is active. For cinematics, create a Level Sequence in the Sequencer to animate the MetaHuman.
- Prepare Facial Animation Controls: For manual or refined animations, enable the MetaHuman Control Rig for the face, which provides controllers for eyes, jaw, lips, and tongue. Access it through the face skeletal mesh in Control Rig mode or the MetaHuman Face Control Board in the Animation Editor for keyframing facial movements.
- Enable Live Link (for real-time capture): If using live facial motion capture (e.g., iPhone), enable the Live Link and Live Link Face plugins (or others like Faceware Live). In the MetaHuman Blueprint, ensure a Live Link Face Subject (e.g., “iPhone” or “iPhoneBlack” for ARKit) is assigned. For non-real-time methods, confirm the face blueprint supports the chosen animation input (e.g., Audio2Face or Sequencer tracks).
- Audio Setup: Import high-quality audio (WAV, 16 kHz or higher, minimal noise) into the Content Browser. Add the SoundWave asset to the Sequencer timeline to sync facial animations with the audio. This is crucial for automated tools like MetaHuman Animator or NVIDIA’s Audio2Face.
Once the MetaHuman is set up with rigging and audio, you’re ready to proceed with lip-sync and facial animation methods, which will be detailed later.
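If you prefer scripting the audio import rather than dragging files into the Content Browser, the Unreal Editor's Python scripting can handle it. This is a minimal sketch, assuming the Python Editor Script Plugin is enabled; the file path and destination folder are placeholders for your own project.

```python
# Run inside the Unreal Editor (Python Editor Script Plugin enabled).
# The paths below are placeholders; adjust them to your project.
import unreal

task = unreal.AssetImportTask()
task.set_editor_property("filename", "C:/Audio/hello_line.wav")       # source WAV on disk
task.set_editor_property("destination_path", "/Game/Audio/Dialogue")  # Content Browser folder
task.set_editor_property("automated", True)  # suppress import dialogs
task.set_editor_property("save", True)       # save the new SoundWave asset

unreal.AssetToolsHelpers.get_asset_tools().import_asset_tasks([task])
```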
Tools for Lip Sync and Facial Animation
There are multiple tools and techniques available to animate a MetaHuman’s face and achieve lip sync. We will cover the key options, from Unreal’s built-in tools to third-party software. Each tool has its strengths, and they can be used individually or even combined for the best results. Here’s an overview:
Unreal Engine’s Built-In Tools (Control Rig, Sequencer, Live Link Face)
Unreal Engine provides built-in tools for MetaHuman facial animation, which include:
- Control Rig: A node-based system controlling the MetaHuman’s facial bones and blendshapes, with a pre-made asset offering controls for brows, eyelids, jaw, lips, and more. Animators can keyframe these in Sequencer to create poses, like opening the jaw for an “Ah” sound or raising cheeks for a smile. Though manual keyframing is time-consuming, it’s useful for refining mocap or automated animations, adjusting lip positions, or crafting phoneme shapes (visemes).
- Sequencer: Unreal’s timeline editor for cinematics and scripted animations, enabling keyframing of facial expressions and lip movements or playback of recorded animations. It handles Animation Sequences or Level Sequences from recorded performances, imported animation data, or offline bakes. Sequencer is vital for syncing dialogue, adjusting timing, and blending animations, such as combining lip-sync with additional expressions.
- Live Link Face (ARKit Facial Capture): A real-time tool using the Live Link Face app on iPhone/iPad with Apple’s ARKit to track and stream facial movements to Unreal Engine. MetaHumans are compatible out-of-the-box, using ARKit blendshape names (e.g., “mouthSmile_L”, “jawOpen”). An actor’s performance, captured via a head-mounted or handheld iPhone, drives the MetaHuman’s expressions instantly. Data streams as a Live Link Subject to the MetaHuman’s animation blueprint, with recordings savable via Take Recorder for playback or tweaks. It’s ideal for quick previews, final animations, virtual production, or interactive characters. Other devices, like OpenXR face trackers or VR headsets, can also feed data to Live Link, though they’re less common.
These tools allow realistic results within Unreal: Live Link Face captures performances, which Sequencer and Control Rig or the Animation Editor can refine. MetaHuman Animator (introduced in UE 5.2+) enhances fidelity using iPhone video and depth data, to be covered later. MetaHumans can speak via live-captured performances or keyframed animations using these native tools.

Faceware (Professional Facial Motion Capture)
Faceware is a professional facial motion capture solution used in VFX and games, with Faceware Studio analyzing facial movements from video (live or pre-recorded) using machine learning. It’s camera-agnostic, working with webcams or high-quality cameras, though better cameras improve results. For Unreal Engine integration, Faceware uses a Live Link plugin (e.g., Glassbox Live Client). The workflow involves recording or streaming an actor’s facial performance, processing it in Faceware Studio to generate animation data, and streaming it to Unreal via Live Link to drive a MetaHuman’s face rig, similar to ARKit data.
Faceware is employed by studios like Scanline VFX and Hazelight for its robust single-camera tracking and nuanced expression capture. It’s ideal for video-based capture without depth sensors, and Epic partnered with Faceware for a MetaHuman workflow course, showing strong support. The process in Unreal includes:
- Enable the Faceware Live Link plugin (e.g., Glassbox Live Client).
- In Faceware Studio, load video or live camera feed to track the performance, producing a live animation stream.
- In Unreal’s Live Link panel, connect to Faceware Studio as a source and assign the MetaHuman’s Face animation blueprint to the Faceware Live Link subject.
The MetaHuman animates in real time with the actor’s performance, which can be recorded using Take Recorder for later editing. Post-recording, Unreal’s animation tools, like the Animation Sequence Editor or control curves, allow cleanup of Faceware’s high-quality data to perfect lip-sync or nuances like smiles and eyebrow raises.
As a markerless system, Faceware requires no facial dots, enhancing ease of use. It’s a paid solution needing its software alongside Unreal but is an industry standard, used in films like Dungeons & Dragons: Honor Among Thieves and games like Hogwarts Legacy. For MetaHumans, Faceware offers reliable, realistic talking animations, especially in professional pipelines using filmed actor performances.
Adobe Character Animator
Adobe Character Animator is a 2D-focused tool for facial capture and animation, primarily for cartoon avatars or live streaming, using a standard webcam to track head position, eyebrows, eyes, and mouth (including lip-sync from voice) in real time. While not designed for 3D MetaHumans, it can be used to prototype facial performances or drive cartoon-style characters. For MetaHumans, it’s relevant for budget-conscious users without access to tools like iPhone or Faceware. Character Animator can output face tracking data or MIDI parameters for blendshapes, which could be manually mapped to a MetaHuman’s controls to transfer performance timings. Alternatively, its automatic lip-sync feature generates viseme sequences from audio, which can serve as a reference for keyframing a MetaHuman’s mouth movements.
Though not a standard or well-documented approach, Character Animator shows that even a webcam can indirectly drive MetaHuman animation. Its simplicity (real-time face tracking and lip-sync without complex setup) makes it a beginner-friendly introduction to facial motion capture. However, most MetaHuman animators prefer Unreal’s built-in tools or specialized 3D solutions over this niche 2D tool.

NVIDIA Omniverse Audio2Face (AI-Driven Lip Sync)
NVIDIA Omniverse Audio2Face is an AI-powered tool that generates facial animation from audio, enabling MetaHumans to talk without recording an actor’s facial performance. It uses a neural network to analyze speech and predict lip-sync and some expressions, producing synchronized 3D face animation in seconds or minutes. Part of NVIDIA’s Omniverse platform, Audio2Face works by loading a 3D head model (e.g., NVIDIA’s sample head or a MetaHuman head), inputting an audio file, and generating blendshape weights for lip-sync, with optional emotional variance (e.g., happy or angry tones). It integrates with Unreal Engine via a Live Link plugin, mapping Audio2Face’s blendshape data to a MetaHuman’s ARKit-based face rig for real-time animation streaming, runnable on a separate machine or NVIDIA’s Avatar Cloud Engine.
The tool’s lip-sync quality is high, often rivaling human-keyframed animation, though it may miss subtle micro-expressions. Recent updates improve emotion handling, add tongue animation, and support multiple languages like Mandarin. The workflow for MetaHumans includes:
- Install NVIDIA Omniverse and Audio2Face (requiring a capable NVIDIA GPU).
- For real-time streaming, install and enable the Audio2Face Live Link plugin in Unreal, add a Live Link source for Audio2Face, and include a Live Link Skeletal Animation component for the MetaHuman to receive blendshape streams.
- In Audio2Face, use the sample head or configure a MetaHuman head, leveraging the plugin’s mapping for MetaHuman compatibility.
- Play audio in Audio2Face to generate animation, streaming to Unreal in real time, animating the MetaHuman’s mouth, jaw, and cheeks, with adjustable settings like delay for sync.
- Alternatively, export animation as a cache or USD from Audio2Face and import it into Unreal for the MetaHuman.
Audio2Face enables fast lip-sync without manual work or motion capture, ideal for prototyping or final use after refinement, such as generating NPC dialogue lip-sync. While it may occasionally produce minor timing or articulation errors, users often refine the initial 80% result in Unreal using Control Rig or corrective shapes. This serves as an introduction to Audio2Face, a leading AI-driven lip-sync solution for MetaHumans.
Other Third-Party Plugins and Software
Additional tools and plugins for animating MetaHuman speech include:
- Reallusion iClone & Live Link: iClone, a 3D animation software, offers AccuLips for automatic lip-sync from text or audio and a Face Key editor for fine-tuning. Its Unreal Live Link syncs animations to MetaHumans. Users import a MetaHuman into iClone, generate lip-sync automatically or via puppeteering, and stream animations to Unreal in real time. It’s user-friendly for non-programmers with visual lipsync and face puppet tools but involves commercial costs and an extra step outside Unreal.
- Oculus OVR Lip Sync: A free Unreal plugin analyzing audio waveforms to output viseme weights in real time, designed for VR avatars. It can drive MetaHumans by mapping visemes to blendshape curves, though it may require performance tweaks and be less accurate than offline solutions. It suits runtime lip-sync in games but uses a simpler algorithm compared to advanced AI like Audio2Face.
- MetaHuman SDK by Convai: Convai’s plugin provides text-to-speech and lip-sync for MetaHumans, generating voice audio and matching facial animations for virtual AI characters. It’s suited for interactive experiences like real-time NPC dialogue, with broader AI implications to be discussed later.
- Daz 3D / Face Mojo: Daz Studio’s Face Mojo plugin enables lip-sync from audio for its characters. While not for direct MetaHuman animation, it can create reference animations or animate other characters, retargetable to MetaHumans in Unreal if using ARKit blendshapes, though retargeting is complex due to rig differences.
- Custom Blueprints or Plugins: Developers create lip-sync solutions in Unreal using phoneme dictionaries, breaking dialogue into phonetic sounds and driving MetaHuman face curves via phoneme-to-viseme data tables (a short phoneme-to-viseme sketch follows this list). Marketplace plugins like “Text to Lip Sync” generate animations from text or subtitles using viseme pose assets. These require technical setup but process audio (sometimes via speech-to-phoneme libraries) to play viseme poses in sync with audio via Animation Blueprints.
- Other Mocap Systems: Systems like Dynamixyz, Xsens Face (using iPhone), FaceX, and Rokoko (using iPhone or webcam via Studio software) capture facial data, often leveraging ARKit or proprietary algorithms. MetaHumans can ingest this data via FBX import or real-time streaming, suitable for high-end motion capture setups.
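As a concrete illustration of the phoneme-to-viseme table mentioned in the custom plugins item above, here is a standalone Python sketch; the phoneme symbols, viseme names, and timings are illustrative placeholders rather than any specific plugin's format.

```python
# Standalone sketch of a phoneme-to-viseme data table, independent of any specific plugin.
# Phonemes use ARPAbet-style symbols; the viseme names are illustrative placeholders.
PHONEME_TO_VISEME = {
    "AA": "V_Open",  "AE": "V_Open",  "EH": "V_Wide",  "IY": "V_Wide",
    "UW": "V_Round", "OW": "V_Round", "HH": "V_Open",
    "P": "V_Closed", "B": "V_Closed", "M": "V_Closed",
    "F": "V_DentalLip", "V": "V_DentalLip",
    "TH": "V_Tongue", "DH": "V_Tongue", "L": "V_Tongue",
    "S": "V_Narrow", "Z": "V_Narrow",
}

def phonemes_to_viseme_track(phonemes):
    """Turn (phoneme, start_sec, end_sec) tuples into a viseme track,
    merging consecutive identical visemes so the face isn't re-keyed needlessly."""
    track = []
    for phoneme, start, end in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "V_Neutral")
        if track and track[-1][0] == viseme:
            track[-1] = (viseme, track[-1][1], end)  # extend the previous segment
        else:
            track.append((viseme, start, end))
    return track

# "Hello" roughly decomposes to HH EH L OW; the timings here are made up for illustration.
hello = [("HH", 0.00, 0.06), ("EH", 0.06, 0.16), ("L", 0.16, 0.24), ("OW", 0.24, 0.40)]
print(phonemes_to_viseme_track(hello))
```

A runtime plugin or Animation Blueprint would consume a track like this to trigger viseme poses in sync with the audio.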
These tools complement Unreal’s native options and ARKit capture (free with an iPhone). Faceware provides industry-grade quality, Audio2Face offers AI automation, and iClone simplifies control; emerging AI services may streamline these processes further. The right choice depends on project needs (real-time performance, batch processing, or artistic control), which sets the stage for the step-by-step guide to animating a MetaHuman’s speech from a voice recording.

Step-by-Step Guide: How to Make a MetaHuman Talk
This guide outlines a workflow for making a MetaHuman speak in Unreal Engine, using a prepared MetaHuman and a voice line, covering audio setup, lip-sync creation, and polishing for realism:
- Step 1: Prepare a High-Quality Voice Recording
Obtain a clear voice recording (from a voice actor, text-to-speech, or existing audio) with at least a 16 kHz sample rate, minimal noise, and no echo for optimal lip-sync results. Import the audio as a SoundWave asset into Unreal Engine, add it to a Sequencer track or play it in the level for animation reference. For long dialogue, split it into smaller phrases for easier syncing.
- Step 2: Choose a Lip-Sync Method and Generate Initial Animation
Select a method to create the initial lip-sync animation:
- Option A: Live Performance Capture
Use an iPhone or supported device (e.g., Live Link Face, Faceware Studio, Rokoko) to capture an actor’s performance. Connect the device to Unreal via Live Link, ensuring the MetaHuman’s Animation Blueprint listens to the correct subject. Play the audio for the actor to mimic, record the facial motion using Take Recorder, and save it as an Animation Sequence to align with the audio in Sequencer. Multiple takes may be needed for timing accuracy. Optionally, record audio and facial data simultaneously via Live Link Face.
- Option B: Automated Lip-Sync from Audio
Use MetaHuman Animator’s audio-driven animation (added in UE 5.5) to generate facial animation from audio by selecting the MetaHuman and audio in a MetaHuman Performance asset and processing it. Alternatively, use NVIDIA Audio2Face to stream or import animation, or plugins like OVR LipSync or “Text to Lip Sync” for phoneme-to-viseme conversion. Apply the resulting Animation Sequence in Sequencer or record live-streamed data; this provides a foundational lip-sync that may need refinement.
- Option C: Manual Keyframe Animation
Hand-animate for full control, using a video reference or waveform to identify phoneme timing. In Sequencer, set a neutral face pose at frame 0, then keyframe visemes (e.g., “O” for round lips, closed lips for “M”/“B”, wide lips for “EE”) on the Control Rig or facial curves, focusing on major mouth shapes, jaw opening, lip pressing, and tongue positioning. Adjust timing to align visuals with audio, adding emotions like smiles or frowns. This method is time-intensive but precise for short lines.
After applying one option, verify the initial lip-sync in Sequencer to ensure the MetaHuman’s mouth aligns with the audio.
- Step 3: Adjust Phonemes and Visemes for Realism
Refine raw lip-sync data:
- Scrub syllables to confirm lip shapes match sounds (e.g., for “Hello”: slightly open for “H”, wide smile for “e”, narrow for “ll”, round for “o”). Adjust mouth blendshapes (e.g., “JawOpen”, “MouthWide”) in the Animation Curve Editor or Control Rig.
- Smooth viseme transitions for fluid motion, adding keyframes or adjusting curve tangents to avoid snapping or robotic movements.
- Handle overlapping visemes (coarticulation) by compromising shapes for natural flow (e.g., blending “sh” into “s” in “fish soup”); a small coarticulation sketch appears at the end of this guide.
- Test readability without audio or from a distance, ensuring clear mouth shapes. For automated tools like MetaHuman Animator or Audio2Face, tweak consonant closures (e.g., sharper “B”/“P”). Combine automation with manual adjustments for clarity.
- Step 4: Sync Body and Head Movement (if applicable)
Add subtle head and body movements for realism, like nods or tilts for emphasis, using the neck or chest controllers in the Control Rig. Animate eye movements, including micro-darts or glances, and coordinate gestures (e.g., hand motions for “no way!”) in Sequencer with pre-made animations or body rig keyframes. This enhances the performance but may be skipped for face-only focus.
- Step 5: Enhance with Facial Expressions and Emotions
Layer emotions to reflect the character’s feelings:
- Use eyebrows and eyelids for tone (e.g., raised brows for excitement, furrowed for anger), applied as additive tracks or keyframes in Sequencer, modulating over time.
- Add blinks every 3-5 seconds, manually or via MetaHuman Animator’s auto-generate option, at natural points like sentence ends. Include squints or widened eyes for emphasis.
- Tweak mouth corners and facial muscles (e.g., cheek raiser, lip pucker) for micro-expressions like smirks or pursing, aligning with dialogue subtext (e.g., frowning on “I’m not sure”).
- Animate in layers: jaw/mouth for lip-sync, brows/eyes for emotion, then blinks/details, ensuring emotional beats match dialogue.
- Step 6: Review and Iterate
Review the animation in real-time and frame-by-frame, aligning audio and animation tracks precisely in Sequencer to avoid sync issues. Smooth out glitches (e.g., jittery lips) by editing keys or curves, ensuring no clipping or unnatural rig behavior (e.g., teeth/lip intersections). Seek feedback or revisit after a break to catch issues like slow blinks or prolonged mouth openings. Save the finalized animation for use in games or cinematics.
This workflow produces a MetaHuman that speaks convincingly with synchronized lip-sync and expressive emotions, with advanced techniques and troubleshooting to follow.
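To ground the coarticulation point from Step 3, the standalone Python sketch below overlaps adjacent viseme ramps so shapes blend rather than snap; the frame rate, ramp length, and viseme names are illustrative assumptions.

```python
# Standalone sketch: overlapping viseme ramps to approximate coarticulation.
# Each viseme ramps in before its nominal start and out after its end, so neighbouring
# shapes blend instead of snapping. Frame rate and ramp length are illustrative.
FPS = 30
RAMP = 0.05  # seconds of overlap on each side of a viseme

def viseme_weight(t, start, end):
    """Trapezoid weight: 0 outside [start-RAMP, end+RAMP], 1 inside [start, end]."""
    if t < start - RAMP or t > end + RAMP:
        return 0.0
    if t < start:
        return (t - (start - RAMP)) / RAMP
    if t > end:
        return ((end + RAMP) - t) / RAMP
    return 1.0

def bake_curves(viseme_track, duration):
    """viseme_track: list of (viseme_name, start_sec, end_sec). Returns per-frame weights."""
    frames = []
    for f in range(int(duration * FPS) + 1):
        t = f / FPS
        frames.append({name: round(viseme_weight(t, s, e), 3) for name, s, e in viseme_track})
    return frames

track = [("V_Open", 0.00, 0.16), ("V_Tongue", 0.16, 0.24), ("V_Round", 0.24, 0.40)]
for frame_index, weights in enumerate(bake_curves(track, 0.4)):
    print(frame_index, weights)
```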

Advanced Techniques for Refining Lip Sync
Advanced techniques for refining MetaHuman lip-sync animation include:
- Manual Viseme Refinement: Fine-tune visemes in the MetaHuman’s ARKit-based pose asset or animation curves, adjusting blendshape strength (e.g., enhancing “CH”/“TH” tongue position or reducing exaggerated shapes) for precise mouth shapes, guided by the MetaHuman Facial Description. Custom viseme poses can be created for complex expressions.
- Blendshape Curve Editing: Use Unreal’s Animation Curve Editor to smooth jittery mocap curves, simplify noisy keys, and adjust lip lead/lag by shifting keys for anticipation (e.g., forming phonemes slightly before the sounds land) to enhance naturalness; a simple smoothing sketch appears at the end of this section.
- Facial Rig Constraints & Physics: Optionally add secondary motion like cheek/jowl jiggle for energetic speech using Control Rig’s spring solver, or animate throat movement for hyper-realistic setups, though these are subtle and often unnecessary for standard shots.
- Emotional Layer Blending: Separate lip-sync and emotion into animation layers (e.g., base jaw/lip animation and additive smile/frown layers) using Sequencer’s blending or Control Rig’s additive tracks, allowing adjustable intensity (e.g., 50% vs. 100% smile) for flexible performance tuning.
- Speech Timing & Pauses: Add realism with pauses or tempo changes by holding rest shapes between words or editing audio to insert silence. Emphasize key words by extending visemes or widening mouth shapes to mimic human speech cadence.
- Optimize for Real-Time: For games/VR, bake facial animation to the Face AnimBP to disable continuous evaluation, reducing CPU load. Ensure LODs maintain lip-sync at lower levels by enabling Force LOD or Global LOD sync to prevent facial animation loss at distance.
- Viseme Reduction for Performance: Simplify lip-sync for background NPCs by mapping multiple phonemes to fewer visemes (e.g., one open-mouth shape for all vowels), sacrificing realism for compute efficiency in large scenes.
- Facial Animation Retargeting: Transfer animations between MetaHumans sharing the same rig, applying an animation asset to another character for time-saving reuse (e.g., crowd dialogue), with minor tweaks for facial feature differences like lip size.
- Use of Tongue and Teeth: Keyframe the tongue for sounds like “Th” or rolled “R” and ensure teeth visibility on wide mouth shapes (e.g., shouts). Adjust jaw openings to prevent unnatural teeth separation or clipping from mocap overshoots.
- Facial Hair and Accessories: Exaggerate lip or jaw movement for visibility under beards/mustaches and animate or use physics to keep long hair from obscuring the mouth during speech, ensuring clear lip-sync.
These techniques focus on fine control and optimization, requiring multiple iterations to elevate good animations to film-quality, lifelike results, even with high-quality motion capture.
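As a simple illustration of the curve-editing idea above, a centred moving average is one way to knock jitter out of a noisy mocap curve before hand-polishing; this is a generic standalone sketch, not Unreal's own filtering.

```python
# Standalone sketch: smooth a jittery blendshape curve with a centred moving average.
# 'jaw_open' stands in for baked per-frame values of a curve such as jawOpen.
def smooth(samples, window=5):
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

jaw_open = [0.10, 0.42, 0.38, 0.55, 0.31, 0.60, 0.58, 0.62, 0.20, 0.05]
print([round(v, 3) for v in smooth(jaw_open)])
```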
Using MetaHuman with NVIDIA Omniverse Audio2Face (AI-Driven Lip Sync)
Using NVIDIA Omniverse Audio2Face for MetaHuman voice animation leverages AI to generate facial animation from audio, streamlining the lip-sync process. Audio2Face, part of NVIDIA’s Omniverse suite, analyzes speech to predict facial movements, outputting blendshape weights for a character’s face rig. It includes a sample head (“Oscar”), but for MetaHumans, the Audio2Face Unreal Engine plugin maps its blendshapes to MetaHuman’s ARKit-based rig, streaming data via Live Link.
The workflow includes:
- Setup: Install NVIDIA Omniverse with Audio2Face and the Audio2Face Live Link plugin in Unreal, enabling it in the project with a prepared MetaHuman.
- Configure Unreal: Set the MetaHuman’s Face AnimBP to use the “Audio2Face” Live Link subject instead of the default iPhone ARKit subject, and add a Live Link Skeletal Animation component to the MetaHuman blueprint.
- Start Audio2Face: Open Audio2Face, load the sample face or a custom MetaHuman face, import the audio file, and generate or play the animation.
- Connect Live Link: In Unreal, add an NVIDIA Omniverse Audio2Face source in the Live Link panel, matching the port number to the one configured in Audio2Face’s streaming settings. Once connected, the MetaHuman’s face animates with the audio-driven data.
- Adjust Settings: Use plugin sliders to correct timing mismatches between audio and animation, adjusting Animation Delay or Audio Delay, and ensure stereo audio is downmixed to mono if needed.
- Fine-tune and Record: Record the streaming animation in Unreal using Take Recorder to create an Animation Sequence, allowing disconnection from Live Link for editing in the curve editor as needed.
Audio2Face delivers accurate lip-sync for basic phonemes with some default expressions, though it may struggle with fast speech or certain languages. It performs well for English, with features like confidence values for potential secondary motions (e.g., head nods). Users can refine results with MetaHuman Animator, Faceware, or manual Sequencer edits, such as keyframing mouth closures for minor inaccuracies. While real-time use requires a strong GPU, baking animations avoids runtime demands, making it practical for generating consistent lip-sync for extensive dialogue. This AI-driven approach reduces manual effort and aligns with evolving machine learning trends in character animation.
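If you take the baked-export route rather than live streaming, the exported per-frame weights typically need their curve names remapped to the MetaHuman's ARKit-style curves before they can drive the face. The sketch below is a generic standalone example; the JSON input format and the name pairs are assumptions for illustration, not Audio2Face's actual export schema.

```python
# Standalone sketch: rename exported per-frame blendshape weights to ARKit-style curves.
# The JSON layout and the name pairs below are illustrative assumptions, not a real schema.
import json

NAME_MAP = {
    "JawOpen": "jawOpen",
    "Mouth_Smile_L": "mouthSmileLeft",
    "Mouth_Smile_R": "mouthSmileRight",
    "Mouth_Pucker": "mouthPucker",
}

def remap_frames(path):
    with open(path) as fh:
        frames = json.load(fh)  # expected shape: [{"JawOpen": 0.4, ...}, ...]
    return [{NAME_MAP.get(name, name): weight for name, weight in frame.items()}
            for frame in frames]

# Example usage with a placeholder path:
# for frame in remap_frames("exported_weights.json"):
#     print(frame)
```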

Troubleshooting Common Issues
Common issues when animating a MetaHuman’s speech and their fixes include:
- Lip-Sync Mismatch (Out of Sync): If mouth movements don’t align with audio, verify that the animation and audio tracks start together in Sequencer. For Live Link, adjust delay settings (e.g., in the Audio2Face plugin) or offset the animation track by a few frames. Ensure consistent frame rates (e.g., a 30 fps recording played back at 30 fps) and use non-drop frame rates to avoid drift (a quick drift calculation appears at the end of this section). For long cinematics, use timecode to sync audio and animation, as supported by MetaHuman Animator.
- Mouth Not Moving At All: Check if the Face Animation Blueprint is receiving input. For Live Link, confirm an active subject (e.g., “iPhoneBlack” for ARKit, “Audio2Face”) in the Live Link window and ensure the MetaHuman blueprint references the correct subject name. Verify that animation assets are assigned to the MetaHuman in Sequencer and that keyframes are applied to the correct Control Rig, testing by scrubbing to confirm face movement.
- Visemes Look Wrong (Pronunciation Issues): Incorrect mouth shapes (e.g., “O” for “A”) may stem from mis-assigned viseme pose assets. Use Epic’s ARKit mapping to avoid this, or for Audio2Face, apply the mh_arkit_mapping_pose_A2F asset. For plugins like OVR LipSync, calibrate viseme weightings (e.g., increase “Th” strength). Compare to real mouth shapes and adjust blendshapes (e.g., firmer lip closure for “B”) to match phonemes.
- Facial Expressions Misaligned or Popping: Sudden expression changes (e.g., neutral to smile in one frame) require smoothing. Add transition frames or ease-in keyframes, and for mocap pops (e.g., Faceware tracking loss), manually smooth curves. When blending animations in Sequencer, overlap and crossfade clips to avoid snapping, using Unreal’s curve smoothing tool for interpolation.
- Jaw or Head Movement Issues: Adjust exaggerated or minimal jaw motion in Sequencer by amplifying or reducing the jaw open curve for vowels. For unwanted head bobbling from Live Link (e.g., ARKit phone movement), disable head rotation in the Live Link Face app or Anim BP, or keyframe a stable head pose post-capture.
- Frame Drops and Real-Time Performance: Laggy real-time animation suggests system overload. Record and bake animations instead of live solving for games, using precomputed assets. For live scenarios, reduce facial animation to 30fps updates or simplify blendshapes at distance using MetaHuman LODs. Profile the game to simplify the Anim Blueprint if needed, or use lower-quality MetaHumans for less critical scenes.
- LOD Sync Issue (Face Stops Animating at Distance): If facial animation vanishes at lower LODs, enable Global LOD synchronization in the MetaHuman Blueprint to align body and facial mesh LODs, ensuring lips move even at LOD3. Force LOD0 for close-ups, reverting to auto afterward for performance.
- Audio Issues Affecting Animation: Multi-speaker or echoey audio confuses solvers, causing muddled mouth movements. Use clean, single-speaker audio without reverberation, cleaning it if necessary to ensure accurate lip-sync input.
- Uncanny Valley / Over-Animation: An uncanny result usually comes from either a too-stiff face or over-animated twitching, so aim for balance. Add subtle cues (blinks, brows) to avoid stiffness, but avoid excessive movement. Introduce slight noise or overshoot to curves for natural muscle variability, applied subtly to prevent glitches.
To troubleshoot, isolate issues by disabling animation layers or testing simpler methods (e.g., manual visemes) in an empty level to identify whether the problem lies in the animation, method, or game logic, applying the relevant fix accordingly.
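As a quick sanity check on the frame-rate point in the out-of-sync item above, mismatched rates add up over long takes; this tiny standalone calculation shows how far playback at 29.97 fps drifts from a 30 fps assumption.

```python
# How far does playback at 29.97 fps drift from a 30 fps assumption over a long take?
def drift_seconds(duration_sec, assumed_fps=30.0, actual_fps=29.97):
    frames = duration_sec * assumed_fps        # frames authored at the assumed rate
    return frames / actual_fps - duration_sec  # extra playback time at the actual rate

for minutes in (1, 5, 10):
    print(f"{minutes:>2} min -> {drift_seconds(minutes * 60):.2f} s of drift")
```

More than half a second of drift accumulates over ten minutes, easily enough to make lip-sync look off, which is why matching frame rates (or relying on timecode) matters.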
Expert Tips and Best Practices
Expert tips and best practices for MetaHuman facial animation include:
- Start with Strong Reference: Record the voice actor’s face or find similar reference footage to study unique facial ticks and pronunciations, or act out the line yourself in a mirror to ground the animation in reality.
- Use the MetaHuman Face Control Board: Utilize Unreal’s face control board interface to intuitively adjust expressions and mouth positions with grouped sliders or puppet rig controls, simplifying tweaks without navigating numerous curves.
- Record in Layers with Take Recorder: Capture multiple performance passes (e.g., one for broad mouth movement, another for emotional expressions) using Take Recorder, merging or referencing them to combine the best elements, avoiding the need for a single perfect take.
- Iterate with the Audio On and Off: Animate with the audio playing to ensure timing accuracy, then review without it to verify the visemes convey the speech clearly, a studio technique to validate lip-sync readability.
- Don’t Neglect the Eyes: Animate eye saccades before new thoughts or clauses to mimic thinking, and add blinks (e.g., a double-blink for hesitation, a long blink for annoyance) to enhance storytelling through the eyes; a blink-spacing sketch appears at the end of these tips.
- Blend Body Language: Incorporate body gestures, shrugs, or posture changes to complement speech, even for shoulders-up shots, using MetaHuman shoulder animations (e.g., slight lift for shrugs) to create a cohesive character performance.
- Facial Animation Splining: After setting viseme keys, smooth interpolation in Unreal’s curve view by adjusting tangents for fast or slow movements (e.g., sharp for quick lip closure, flat for gradual), ensuring polished transitions.
- Use High Quality Shaders for Final Render: Apply top-quality skin, teeth, and eye shaders for rendering, adjusting lighting to avoid odd mouth shadows and adding bounce light to highlight tongue and teeth for clearer speech visuals.
- Optimize Workflow for Iteration: Streamline repetitive tasks for multiple dialogue lines by batch processing with Audio2Face or MetaHuman Animator (supports batch audio-to-animation in UE 5.5), creating Faceware actor profiles for better tracking, or using Unreal’s batch export/import for animations.
- Collaboration and Feedback: Seek feedback from others to catch subtle sync or expression issues, or review objectively after a break if working solo, incorporating notes to refine the animation.
- Continuous Learning: Keep up with MetaHuman documentation and community forums for updates (e.g., new MetaHuman Animator features), exploring guides and tutorials for hidden settings to enhance results.
These practices blend technical precision and artistic nuance, aligning with professional workflows to elevate MetaHuman talking animations’ quality and efficiency.
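For the eye-work tip above, the standalone sketch below scatters blink keys at roughly natural intervals; the 3-5 second spacing follows the guideline mentioned earlier in the guide, and the curve values and blink length are illustrative.

```python
# Standalone sketch: scatter blink keyframes at natural-ish intervals.
# Spacing follows the 3-5 second guideline mentioned earlier; other numbers are illustrative.
import random

def blink_keys(duration_sec, fps=30, blink_len=0.15, seed=7):
    """Return (frame, eye-blink weight) keys as open -> closed -> open triplets."""
    random.seed(seed)
    keys = []
    t = random.uniform(1.0, 2.0)  # first blink shortly after the line starts
    while t < duration_sec:
        frame = int(t * fps)
        half = max(1, int(blink_len * fps / 2))
        keys += [(frame - half, 0.0), (frame, 1.0), (frame + half, 0.0)]
        t += random.uniform(3.0, 5.0)  # next blink in 3-5 seconds
    return keys

print(blink_keys(12.0))
```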

Case Studies: Real-World Applications of Talking MetaHumans
MetaHumans are utilized across various projects, demonstrating their versatility in facial animation:
- “Blue Dot” Short Film (Epic Games & 3Lateral): This cinematic short showcased MetaHuman Animator’s capabilities, with a photorealistic MetaHuman delivering a monologue from a poem. Actor Radivoje Bukvić’s performance, captured via a head-mounted camera on a mocap stage, was converted into facial animation with minimal manual tweaks, achieving lifelike nuances and emotions. It highlights that real actor data paired with MetaHuman tools can produce high-quality results quickly, suitable for emotionally impactful cinematic work.
- Video Game NPCs and Cutscenes (Multiple Studios): Smaller studios and indie projects use MetaHumans for triple-A quality characters, as seen in “The Matrix Awakens” demo, where MetaHuman tech populated a city and supported close-ups like Trinity’s (though custom-modeled). Community projects on Unreal forums and YouTube feature MetaHumans in short films or cutscenes, often animated with Faceware or iClone, such as a cyberpunk short in UE4 using Faceware for believable narrative dialogue.
- Virtual Influencers and AI Avatars: Companies create virtual presenters using MetaHumans, like a GPT-3-powered demo combining microphone input, AI-generated responses, and text-to-speech with lip-sync for real-time conversation. While not Pixar-quality, it shows MetaHumans as interactive assistants or NPCs in games, with improving lip-sync for unscripted dialogue in live shows or customer service roles.
- Film and TV Productions: MetaHumans aid VFX by prototyping digital doubles or pre-visualizing CGI characters speaking lines, offering speed over custom rigging. They’re also used for background digital doubles in scenes, saving time on creating extras with talking or reactive roles.
- Live Events and Performances: MetaHumans serve as real-time hosts in broadcasts, like Unreal Engine showcases or a digital fashion show, where a backstage performer’s speech and movements drive the MetaHuman on screen. Dual capture setups ensure reliability, proving MetaHumans’ effectiveness in live scenarios with positive audience reception when executed well.
Lessons Learned and Best Use Cases:
- For high-fidelity narrative (films/AAA games), capturing real actor performances with tools like MetaHuman Animator excels for close-ups and emotional scenes.
- For large-scale or interactive content (open-world games, simulations), AI-driven tools like Audio2Face or Convai’s SDK enable efficient animation of many characters or dynamic dialogue.
- Faceware ensures reliable production quality, ideal when using existing video reference.
- Indie projects favor cost-effective tools like iClone or Audio2Face, as seen in JSFILMZ’s acclaimed project.
- Real-time applications, like AI-driven or live event MetaHumans, leverage game tech and AI for robust unscripted scenarios.
These cases confirm MetaHumans’ practical value in delighting audiences and streamlining workflows, guiding creators to choose an approach (cinematic, interactive, or real-time) based on project goals.

FAQ: Common Questions About Making MetaHumans Talk
- Do I need an iPhone or special hardware to animate a MetaHuman’s face?
No. An iPhone with a TrueDepth camera and the Live Link Face app is popular for capturing facial animation due to MetaHumans’ ARKit compatibility, but alternatives exist. You can manually keyframe in Unreal’s Control Rig, use a webcam or video with Faceware Studio, or employ AI tools like Audio2Face with just an audio file and an NVIDIA GPU, achieving great results without specialized hardware.
- Can MetaHumans lip-sync automatically to any voice audio?
Yes. MetaHuman Animator’s audio-driven animation (UE 5.5+) generates facial animation directly from audio, NVIDIA’s Audio2Face offers AI-driven lip-sync, and plugins like OVRLipSync analyze phonemes at runtime. Unreal has no one-click solution outside MetaHuman Animator, so some setup or add-ons are required, and results are high-quality but may still need animator polish.
- What are visemes and why are they important for MetaHuman speech animation?
Visemes are the visual mouth shapes corresponding to speech sounds (phonemes), like the top teeth on the bottom lip for “F”/“V”. The MetaHuman rig exposes viseme-relevant blendshapes (e.g., “mouthClose”, “jawOpen”), and animating them in sync with the audio’s phonemes is what creates lip-sync, simplifying animation by focusing on key shapes. Visemes are crucial for troubleshooting and for ensuring convincing speech visuals.
- How do I use Live Link Face with MetaHumans in Unreal Engine?
Using an iPhone (X or later) with the Live Link Face app, connect it to your PC on the same network and enable Unreal’s Live Link plugin. In the app, set the PC’s IP as the target; in Unreal’s Live Link window, add the iPhone as a source; then point the MetaHuman’s Blueprint or Anim BP at the app’s subject name (e.g., “iPhone” or “iPhoneFaceSubject”). The MetaHuman mirrors facial movements in real time, capturing ARKit blendshapes at up to 60 fps, and the performance can be recorded via Take Recorder for playback or refinement.
- Faceware vs. Live Link Face – which is better for MetaHuman facial animation?
Live Link Face is cost-effective (requiring only an iPhone) and easy to set up for quick real-time results, ideal for previews or even final animations, but it may lose tracking or miss subtle eye motion. Faceware, a paid professional tool, tracks video-based performances with machine learning, handling difficult head orientations and nuances with less cleanup, which suits high-end film and game work. Hobbyists generally find Live Link Face sufficient, while some teams use both: Live Link Face for drafts and Faceware for key moments.
- Can MetaHumans talk in real-time in a game (during gameplay, not just cinematics)?
Yes, MetaHumans can talk in gameplay using pre-made animations triggered via Animation Montage or Sequencer for dialogues. Dynamic speech (e.g., unscripted lines) requires runtime lip-sync plugins like OVR LipSync. With LODs and limited on-screen MetaHumans, performance is manageable, as seen in Fortnite events. Precomputed animations ensure quality without heavy CPU/GPU load, though tools like Convai’s SDK enable cutting-edge dynamic speech.
- The MetaHuman’s mouth sometimes stays a bit open or doesn’t close fully on certain sounds – how do I fix that?
If the “M/B/P” viseme doesn’t fully close the mouth, keyframe the jaw/lips to closed in the curve editor or Control Rig during silent gaps or relevant phonemes. Check the neutral pose for slightly parted lips and adjust if needed. For solvers or mocap, manually boost closures (e.g., soft “m”). Ensure no conflicting animation layers (e.g., half-smile) keep lips apart. Optionally, use Animation Blueprint to enforce closures for specific phonemes, slightly exaggerating for CG clarity.
- How can I make the MetaHuman’s speech animation more expressive?
Enhance expressiveness by adding emotional facial expressions (e.g., smiles for happiness, furrowed brows for anger), natural eye blinks, and subtle eye darts or squints for emphasis. Incorporate head tilts, nods, or body gestures like shrugs, varying viseme timing and intensity to mimic human speech nuances (e.g., slurring or emphasizing syllables). Add secondary motions like cheek shakes for loud speech, using reference from charismatic speakers. Perform an expression pass for brows and eyes after lip-sync to make the performance feel like acting.
- Is it possible to drive MetaHuman lip-sync from text (without recorded audio)?
Yes, indirectly: convert the text to speech (TTS) audio and generate visemes from it. Plugins like “Text To LipSync” map text to phonemes and visemes, while AI services like Amazon Polly or Convai’s plugin provide phoneme or viseme timings that an Animation Blueprint can consume (a small parsing sketch follows this FAQ). TTS audio with viseme timing is the simpler route and includes playable sound, ideal for dynamic dialogue, though expressiveness depends on TTS quality and emotional markup.
- Are MetaHumans and their facial animations suitable for use in films or is it only for games?
MetaHumans are suitable for film-quality work, designed for cinematic detail with realistic skin, eyes, and facial rigs for close-ups, as shown in the Blue Dot short film. Used in virtual production for TV/film, they serve as digital doubles or cinematic characters, rendered at high quality. Indie filmmakers use them for cost-effective shorts, and with proper animation, MetaHumans match film needs, though custom models may be needed for specific stylized or ultra-real faces.
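To ground the text-to-speech answer above, TTS services that emit viseme timing metadata can be parsed directly into keyframe data. The sketch below assumes Amazon Polly-style JSON-lines speech marks; the exact field names and the viseme-to-curve mapping are assumptions for illustration.

```python
# Standalone sketch: turn TTS viseme speech marks (JSON lines) into timed curve targets.
# Assumes Polly-style records such as {"time": 125, "type": "viseme", "value": "p"};
# treat the field names and the viseme-to-curve mapping as illustrative assumptions.
import json

VISEME_TO_CURVES = {
    "p": {"mouthClose": 1.0, "jawOpen": 0.0},  # bilabial closure (p/b/m)
    "a": {"jawOpen": 0.7},                     # open vowel
    "o": {"mouthPucker": 0.6, "jawOpen": 0.4}, # rounded vowel
}

def speech_marks_to_keys(lines):
    keys = []
    for line in lines:
        mark = json.loads(line)
        if mark.get("type") != "viseme":
            continue
        time_sec = mark["time"] / 1000.0  # milliseconds -> seconds
        curves = VISEME_TO_CURVES.get(mark["value"], {"jawOpen": 0.3})
        keys.append((time_sec, curves))
    return keys

sample = [
    '{"time": 0, "type": "viseme", "value": "p"}',
    '{"time": 120, "type": "viseme", "value": "a"}',
    '{"time": 260, "type": "viseme", "value": "o"}',
]
print(speech_marks_to_keys(sample))
```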

Conclusion
Animating a MetaHuman to speak combines art and technology, enhancing storytelling through realistic voice and facial animation that fosters empathy and believability. MetaHumans provide the foundation, and various techniques enable this realism. The process spans Unreal Engine’s built-in tools, like Control Rig and Live Link, to advanced solutions such as Audio2Face and MetaHuman Animator. A practical approach might blend these, using Audio2Face for a quick draft, refining with Control Rig, and adding expressions via Live Link, treating the MetaHuman as an actor to perfect its performance.
For beginners, the complexity of advanced software or numerous facial controls may intimidate, but starting with simple phone-based Live Link capture offers instant results, scalable with step-by-step refinement (audio setup, initial animation, polishing). Each phase builds progressively, making the process approachable. Professionals can elevate fidelity using tools like Faceware for mocap, precise viseme adjustments, and subtle cues like blinks and head tilts, where animation and acting fundamentals remain key despite technological shortcuts.
The future promises further innovation with AI and real-time capabilities, shrinking the gap between concept and execution. Tasks that once took months are now achievable in minutes with AI, enabling rapid iteration and creative exploration; animating a full dialogue scene overnight is now feasible. This craft merges audio production, animation skills, technical know-how, and performance, supported by accessible tools for all creators.
MetaHumans democratize digital character creation, and mastering their speech animation, whether for indie game cutscenes, short films, or AI-driven NPCs, unlocks their potential. This guide provides a clear path to experiment with methods, encouraging creators to enjoy the process. Effective talking MetaHumans enhance stories across games, films, and virtual worlds, deepening audience connection and pushing interactive storytelling forward.