Motivation

  • Amazon Music’s biggest use case is voice: people use Alexa devices, such as the Echo, and ask Alexa to play them music. Hard samples, such as “Alexa, play good music”, where the request is vague, are something we struggle with.
  • We leverage and fine-tune Amazon Alexa’s foundation model to understand the user’s intent and emit the correct API calls to our downstream recommender-system models, which generate the actual recommendations.

Intent Identification

  1. Mood-Based Requests: “Play something upbeat,” “I need a relaxing playlist.”
    • “Play something upbeat” → API Call: GET /songs?mood=upbeat
    • “I need a relaxing playlist” → API Call: GET /playlists?mood=relaxing
  2. Genre-Based Requests: “Put on some classic rock,” “Can you play jazz music?”
    • “Put on some classic rock” → API Call: GET /songs?genre=classic_rock
    • “Can you play jazz music?” → API Call: GET /songs?genre=jazz
  • Data Collection:
    • Gather labeled data: Assemble a dataset where each input represents a user’s music-related query, and the corresponding output is the appropriate API function
    • Data augmentation: Increase the variety of your dataset by rephrasing music requests, introducing typical misspellings, or using diverse ways to request music.
    • Fine-tuning: Fine-tune Titan to emit the API calls; a sketch of the instruction-formatted training data follows this list.
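
A minimal sketch of what the instruction-formatted training data could look like, assuming a JSONL layout with one record per query; the file name, field names, and the augmented rows are illustrative, not the production schema.

```python
import json

# Each record pairs a user query with the API call the model should emit.
# Augmented rows (paraphrases, misspellings) increase variety.
examples = [
    {"input": "Play something upbeat",
     "output": "GET /songs?mood=upbeat"},
    {"input": "I need a relaxing playlist",
     "output": "GET /playlists?mood=relaxing"},
    {"input": "Put on some clasic rock",        # augmentation: misspelling
     "output": "GET /songs?genre=classic_rock"},
    {"input": "Could you throw on some jazz?",  # augmentation: paraphrase
     "output": "GET /songs?genre=jazz"},
]

# Write one JSON object per line (JSONL), a common fine-tuning input format.
with open("intent_to_api_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```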

Architecture

  • Options:
    1. Prompting (a few-shot prompt sketch for the API-emission task follows this list):
      • Lost in the middle: not all regions of the prompt are attended to equally; information placed in the middle of a long context tends to be used less reliably than information at the beginning or end.
      • RAG: Ingestion, Retrieval, Synthesis
      • Chain of Verification: the model drafts an answer, generates verification questions, answers them, and then revises the draft.
      • Chain of Thought (“let’s think step by step”): the model is prompted to reason step by step before producing its final answer.
      • Tree of Thought (ToT): used to solve complex problems; for example, a math problem can be broken down into a tree of possible step-by-step solutions, each branch evaluated for validity, and the most promising path followed to reach the correct answer.
      • Context length extension: position interpolation, i.e., interpolating positional embeddings between integer position indices so the model can handle sequences longer than it was trained on.
    2. Supervised Fine-tuning:
      • Full fine-tuning is off the table based on a cost-benefit analysis: updating all of the foundation model’s parameters is too expensive for this use case.
      • Instruction Tuning:
        • FLAN was the first to introduce instruction tuning, which fine-tunes the model on a large set of varied instructions that use a simple and intuitive description of the task, such as “Classify this movie review as positive or negative” or “Translate this sentence to Danish.”
        • “Creating a dataset of instructions from scratch to fine-tune the model would take a considerable amount of resources. Therefore, we instead make use of templates to transform existing datasets into an instructional format.” (source)
      • PEFT (a LoRA fine-tuning sketch follows this list):
        • LoRA: injects trainable low-rank matrices into the model’s layers while the original weights stay frozen. This approach reduces the number of trainable parameters, making the process more memory-efficient and faster.
        • QLoRA: Quantized LoRA, i.e., LoRA adapters trained on top of a quantized (e.g., 4-bit) base model to reduce memory further.
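
For comparison, a minimal sketch of the prompting route: a few-shot template that asks the model to reason step by step (Chain of Thought) before emitting the API call. The template wording and the placeholder inference step are assumptions, not the production prompt.

```python
# Hypothetical few-shot prompt; the examples mirror the intents listed above.
FEW_SHOT_TEMPLATE = """You map music requests to API calls.
Think step by step about the mood or genre being requested, then output only
the final API call on the last line.

User: Play something upbeat
API: GET /songs?mood=upbeat

User: Put on some classic rock
API: GET /songs?genre=classic_rock

User: {query}
API:"""

def build_prompt(query: str) -> str:
    """Fill the user's query into the few-shot template."""
    return FEW_SHOT_TEMPLATE.format(query=query)

# In practice this prompt would be sent to the foundation model's inference
# endpoint; here we just print it.
print(build_prompt("I need a relaxing playlist"))
```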
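
And a minimal sketch of the PEFT route using the Hugging Face peft library; Titan is not available through these libraries, so a small public checkpoint (gpt2) stands in as a placeholder, and the LoRA hyperparameters are illustrative.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; the production foundation model is not loaded this way.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA: keep the base weights frozen and train small low-rank adapter matrices
# injected into the attention projections.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```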

Evaluation

  1. Quantitative Metrics:
    • Accuracy: Calculate the percentage of user queries that the model classifies correctly into the intended API functions
    • Precision, Recall, and F1-score: Especially important if there’s a class imbalance across the API functions, for instance if users request music by mood far more often than by genre.
    • Confusion Matrix: Understand which categories or intents are commonly confused with one another (a metrics computation sketch follows this list).
  2. Qualitative Analysis:
    • User Testing: Engage a diverse group of users to interact with the model in a real-world setting. Gather feedback regarding its accuracy, relevance of music choices, and overall user satisfaction.
    • Error Analysis: Manually review a subset of misclassifications to identify common themes or patterns. This might reveal, for instance, that the model struggles with recognizing certain genres or artists.
  3. Real-world Performance Metrics:
    • Engagement Metrics: Monitor how often users engage with the music played. A decrease in skips or an increase in full song plays can be indicators of good recommendations (an engagement-metric sketch follows this list).
    • Retention Rate: Measure how often users return to use the recommendation feature. A higher return rate can indicate user satisfaction.
    • Feedback Collection: Allow users to provide feedback directly (e.g., “this wasn’t what I was looking for”) and use this feedback to iteratively improve the model.
  4. Guardrail Metrics: Metrics that must not regress while the primary metrics are optimized, e.g., response latency or the rate of malformed API calls.
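
A minimal sketch of the quantitative metrics using scikit-learn on toy labels; the intent names and example predictions are made up for illustration.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy gold vs. predicted intents for a handful of queries.
intents = ["mood_songs", "mood_playlist", "genre_songs"]
y_true = ["mood_songs", "mood_playlist", "genre_songs", "genre_songs", "mood_songs"]
y_pred = ["mood_songs", "mood_songs",    "genre_songs", "genre_songs", "mood_songs"]

print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-intent precision, recall, and F1; these matter under class imbalance.
print(classification_report(y_true, y_pred, labels=intents, zero_division=0))
# Rows are true intents, columns are predicted intents.
print(confusion_matrix(y_true, y_pred, labels=intents))
```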
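
And a minimal sketch of the engagement metrics, assuming a hypothetical per-track playback log that records the fraction of each recommended track that was actually played; the schema and thresholds are illustrative.

```python
# Hypothetical playback log for recommendations served by the model.
playback_events = [
    {"track_id": "t1", "fraction_played": 1.00},
    {"track_id": "t2", "fraction_played": 0.05},  # skipped almost immediately
    {"track_id": "t3", "fraction_played": 0.92},
    {"track_id": "t4", "fraction_played": 0.40},
]

SKIP_THRESHOLD = 0.30       # playing < 30% of a track counts as a skip (illustrative)
FULL_PLAY_THRESHOLD = 0.90  # playing >= 90% counts as a full play (illustrative)

skips = sum(e["fraction_played"] < SKIP_THRESHOLD for e in playback_events)
full_plays = sum(e["fraction_played"] >= FULL_PLAY_THRESHOLD for e in playback_events)

print(f"Skip rate:      {skips / len(playback_events):.2f}")
print(f"Full-play rate: {full_plays / len(playback_events):.2f}")
```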