EmoSphere App

Multimodal & Cultural-Aware Emotion Recognition

Consumer mobile app for real-time emotion detection. 8 emotion categories, 99 languages, 7 cultural profiles, 4 modalities (face, speech, voice and posture/gesture). All processing on-device — privacy first.

Try Live Demo →

Live Demo

Multimodal & cultural-aware emotion recognition — AI detection of emotions from face, speech, voice and posture/gesture.

Having issues? Open demo in a new tab →

Three Modalities

EmoSphere captures emotional signals across face, voice, and text for comprehensive understanding.

🧑

Face — Vision Transformer (ViT)

Real-time facial expression analysis using Vision Transformers. Detects micro-expressions and maps to 8 emotion categories with high accuracy across diverse demographics.

🎙

Voice — Wav2Vec2

Speech emotion recognition powered by Wav2Vec2. Analyzes prosody, pitch, energy, and temporal patterns to identify emotional states from audio signals.

📝

Text — DistilRoBERTa

Sentiment and emotion analysis from text input using DistilRoBERTa. Understands nuanced language, context, and emotional undertones in real-time.

8 Emotion Categories

A balanced emotion space designed for real-world consumer applications.

😄 Joy

😢 Sadness

😮 Surprise

😨 Fear

🤢 Disgust

😐 Neutral

❤ Love

😌 Calm

Cultural Awareness

Emotion expression varies across cultures. EmoSphere adapts its recognition models to 7 cultural profiles.

Cultural Profile	Region	Expression Style	Key Adaptation
Western	North America, Western Europe	Expressive	High facial weight
East Asian	China, Japan, Korea	Restrained	Higher voice/text weight
South Asian	India, Southeast Asia	Expressive with nuance	Balanced modalities
Middle Eastern	Arab states, Iran	Context-dependent	Text emphasis
African	Sub-Saharan Africa	Community-oriented	Voice prosody focus
Latin American	Central & South America	Highly expressive	Face + voice priority
Neutral/Global	Cosmopolitan contexts	Mixed	Default balanced weights

Multimodal Fusion

🧑

Face — 45%

Visual modality carries the highest default weight, reflecting the richness of facial expression data in emotion recognition.

🎙

Voice — 35%

Audio features including prosody, pitch contour, and energy patterns provide strong emotional signals, especially for arousal.

📝

Text — 20%

Linguistic content provides semantic context and emotional nuance. Weights shift culturally where text expression dominates.

Adaptive Weighting

Default weights (45/35/20) adapt based on cultural profile, modality availability, and confidence scores. When a modality is unavailable or low-confidence, weights redistribute automatically to maintain accuracy.

🔒

Privacy First

All emotion processing happens entirely on-device. No audio, video, or text data ever leaves the user's phone. Zero cloud dependency for inference.

📱

Cross-Platform

Available on Android and iOS with a consistent experience. Built with React Native and on-device ML runtimes for native performance.

⚡

Real-Time Performance

Optimized for mobile hardware with quantized models and efficient inference pipelines. Sub-second emotion detection on modern devices.

Available on

Android

iOS

Web