EmoSphere App

Multimodal & Cultural-Aware Emotion Recognition

Consumer mobile app for real-time emotion detection. 8 emotion categories, 99 languages, 7 cultural profiles, 4 modalities (face, speech, voice and posture/gesture). All processing on-device — privacy first.

4 Modalities
8 Emotions
7 Cultural Profiles
100% On-Device

Three Modalities

EmoSphere captures emotional signals across face, voice, and text for comprehensive understanding.

🧑

Face — Vision Transformer (ViT)

Real-time facial expression analysis using Vision Transformers. Detects micro-expressions and maps to 8 emotion categories with high accuracy across diverse demographics.

🎙

Voice — Wav2Vec2

Speech emotion recognition powered by Wav2Vec2. Analyzes prosody, pitch, energy, and temporal patterns to identify emotional states from audio signals.

📝

Text — DistilRoBERTa

Sentiment and emotion analysis from text input using DistilRoBERTa. Interprets nuanced language, context, and emotional undertones in real time.

8 Emotion Categories

A balanced emotion space designed for real-world consumer applications.

😄 Joy
😢 Sadness
😮 Surprise
😨 Fear
🤢 Disgust
😐 Neutral
Love
😌 Calm

Cultural Awareness

Emotion expression varies across cultures. EmoSphere adapts its recognition models to 7 cultural profiles.

Cultural Profile | Region | Expression Style | Key Adaptation
Western | North America, Western Europe | Expressive | High facial weight
East Asian | China, Japan, Korea | Restrained | Higher voice/text weight
South Asian | India, Southeast Asia | Expressive with nuance | Balanced modalities
Middle Eastern | Arab states, Iran | Context-dependent | Text emphasis
African | Sub-Saharan Africa | Community-oriented | Voice prosody focus
Latin American | Central & South America | Highly expressive | Face + voice priority
Neutral/Global | Cosmopolitan contexts | Mixed | Default balanced weights

Multimodal Fusion

🧑

Face — 45%

Visual modality carries the highest default weight, reflecting the richness of facial expression data in emotion recognition.

🎙

Voice — 35%

Audio features including prosody, pitch contour, and energy patterns provide strong emotional signals, especially for arousal.

📝

Text — 20%

Linguistic content provides semantic context and emotional nuance. The text weight increases for cultural profiles where text expression dominates.
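
The default 45/35/20 split can be read as a weighted late fusion of per-modality emotion scores. The sketch below is a minimal illustration in Python, not the app's actual implementation: it assumes each modality model outputs a probability distribution over the 8 categories, and the names `fuse`, `top_emotion`, and `example_scores` are hypothetical.

```python
EMOTIONS = ["joy", "sadness", "surprise", "fear",
            "disgust", "neutral", "love", "calm"]

# Default modality weights stated on this page (face 45%, voice 35%, text 20%).
DEFAULT_WEIGHTS = {"face": 0.45, "voice": 0.35, "text": 0.20}


def fuse(scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-modality probability vectors.

    scores maps a modality name to a list of len(EMOTIONS) probabilities.
    Assumption: every modality named in `weights` is present in `scores`.
    """
    fused = [0.0] * len(EMOTIONS)
    for modality, weight in weights.items():
        for i, p in enumerate(scores[modality]):
            fused[i] += weight * p
    return dict(zip(EMOTIONS, fused))


def top_emotion(fused):
    """Return the category with the highest fused score."""
    return max(fused, key=fused.get)


# Made-up scores for illustration only (each row sums to 1).
example_scores = {
    "face":  [0.7, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.1],
    "voice": [0.2, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.6],
    "text":  [0.1, 0.0, 0.0, 0.0, 0.0, 0.8, 0.1, 0.0],
}

print(top_emotion(fuse(example_scores)))
```

Because the weights sum to 1 and each input is a probability distribution, the fused scores also sum to 1, so they can be compared directly across frames.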

Adaptive Weighting

Default weights (45/35/20) adapt based on cultural profile, modality availability, and confidence scores. When a modality is unavailable or low-confidence, weights redistribute automatically to maintain accuracy.
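
One simple way to picture the redistribution is to drop unusable modalities and renormalize the remaining weights. The sketch below assumes a hard confidence cutoff; the function name `redistribute` and the `min_confidence` value of 0.5 are hypothetical, and the page does not state the rule the app actually uses. Cultural-profile adjustments are not modeled here.

```python
# Default modality weights stated on this page (face 45%, voice 35%, text 20%).
DEFAULT_WEIGHTS = {"face": 0.45, "voice": 0.35, "text": 0.20}


def redistribute(weights, confidences, min_confidence=0.5):
    """Drop unavailable or low-confidence modalities, then renormalize.

    confidences maps a modality name to a confidence in [0, 1], or to
    None when the modality is unavailable (e.g. camera turned off).
    min_confidence is an assumed threshold, not a documented value.
    """
    kept = {
        modality: weight
        for modality, weight in weights.items()
        if confidences.get(modality) is not None
        and confidences[modality] >= min_confidence
    }
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no usable modality")
    # Scale the surviving weights so they sum to 1 again.
    return {modality: weight / total for modality, weight in kept.items()}


# Text input unavailable: face and voice share its 20%.
print(redistribute(DEFAULT_WEIGHTS, {"face": 0.9, "voice": 0.8, "text": None}))
```

With text missing, the face and voice weights keep their 45:35 ratio and become 0.5625 and 0.4375. A softer variant could scale each weight by its confidence instead of applying a cutoff.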

🔒

Privacy First

All emotion processing happens entirely on-device. No audio, video, or text data ever leaves the user's phone. Zero cloud dependency for inference.

📱

Cross-Platform

Available on Android and iOS with a consistent experience. Built with React Native and on-device ML runtimes for native performance.

Real-Time Performance

Optimized for mobile hardware with quantized models and efficient inference pipelines. Sub-second emotion detection on modern devices.

Available on

Android
iOS
Web