1. Motivation
- Current smart kiosk systems
  - Depend mainly on speech and touch, without any visual information
  - LLM responses are limited in richness because speech is the only input
  - Operate passively, requiring the user to initiate interaction through the touchscreen
2. Research goal and issue
- Goal: Develop image-to-text conversion technology that lets an LLM interact with users actively
- Issue
  - Current face detection methods have difficulty identifying actual users
  - Most image-to-text models require large computational resources
3. Approach
- User detection
  - Current face detection methods often overlook practical applications, such as identifying the people who actually intend to use the system
  - Develop user identification criteria based on face detection methods
- Image-to-text generation
  - Image captioning describes the overall content of an image in words
  - Scene graph generation, which extracts the relationships between objects, is better suited to this task
  - Develop a scene graph generation method with a lightweight architecture
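
To illustrate how scene graph output could be handed to an LLM as text, here is a minimal sketch. It assumes the scene graph is available as (subject, predicate, object) triples; the `Triple` type, the `triples_to_text` helper, and the example labels are hypothetical and not taken from this work.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One scene graph edge: subject --predicate--> object (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


def triples_to_text(triples: list[Triple]) -> str:
    """Serialize scene graph triples into a plain-text description for an LLM prompt."""
    seen = set()
    sentences = []
    for t in triples:
        if t in seen:  # skip relations that were detected more than once
            continue
        seen.add(t)
        sentences.append(f"{t.subject} {t.predicate} {t.obj}")
    return (". ".join(sentences) + ".") if sentences else ""


# Made-up detections for illustration only (not results from this work).
graph = [
    Triple("person", "standing in front of", "kiosk"),
    Triple("person", "holding", "bag"),
    Triple("person", "standing in front of", "kiosk"),
]
description = triples_to_text(graph)
print(description)
```

Unlike a free-form caption, each sentence here maps to exactly one object relationship, so the text stays short and the LLM receives only the relations the detector produced.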



4. Result
- Face identification: Identify users by comparing their faces with pre-registered face vectors, using a pre-trained model
- Facial expression recognition: Developed a lightweight visual emotion recognition model. Emotion detection performance is suboptimal when the user is viewed from the side
- Face engagement: Engagement estimation is an essential preprocessing step for understanding user emotions. Engagement is determined from facial key points
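
The comparison against pre-registered face vectors can be sketched as a nearest-match search. This is a minimal sketch under stated assumptions: the face vectors are plain lists of floats already produced by some pre-trained model, cosine similarity is the comparison metric, and the 0.6 threshold is an arbitrary placeholder. The source specifies none of these, and `identify` and the example names are hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query: list[float],
             registered: dict[str, list[float]],
             threshold: float = 0.6) -> Optional[str]:
    """Return the best-matching registered name, or None if no score reaches the threshold.

    The threshold value is an assumption and would need tuning on real data.
    """
    best_name, best_score = None, threshold
    for name, vec in registered.items():
        score = cosine_similarity(query, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy 3-dimensional vectors for illustration; real face embeddings are much longer.
registered = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
known = identify([0.9, 0.1, 0.0], registered)
unknown = identify([0.0, 0.0, 1.0], registered)
print(known, unknown)
```

Returning `None` for an unmatched face lets the kiosk treat the person as an unregistered user instead of forcing a wrong match.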
