xAPI & Alexa: My Voice Is My Passport

    
In the 1992 movie "Sneakers," Robert Redford’s character Marty uses a taped recording of an employee’s voice to break into an office building, all in an effort to clear his name. This almost-failed interaction adds phenomenal dramatic tension and is one of the first times I saw voice technology used in a movie that wasn't science fiction.

Copyright Universal Studios
Thankfully, voice technology has come a long way since then (or has it?). Products such as Amazon Echo, Google Home, Apple Siri, and Microsoft Cortana are used by people every day for personal and work purposes.

So what does this have to do with learning?

We often hear informal learning is one of the hardest forms of learning to track and report on. Can Amazon Echo and similar voice technology products provide a frictionless way to capture learning experiences? Maybe.

I decided to explore voice technology after we started doing regular mentoring chats at Watershed. We call these chats tweekers because they happen every two weeks. What if, at the conclusion of a tweeker, I could have an Amazon Echo record who I had the tweeker with and what I found meaningful during the chat? What if I could have that information go into the Watershed product to be stored and reported on?

Understanding how Amazon Echo works

With that idea in mind, I started researching how an Echo works. Besides the hardware, it uses a number of Alexa technology components from Amazon. At a minimum, creating an app requires these two components: an Alexa skill and an Alexa server application. The skill receives commands through a voice user interface (VUI). The server application contains the logic for what to do with those commands.

After reading more of Amazon’s developer documentation, I learned that Alexa is great at matching speech it hears to text it’s configured to listen for, handling multi-turn dialogs and prompting for required text, and pronouncing common words in English (U.S. and UK) and German. However, it isn’t great at transcribing unexpected speech to text or pronouncing uncommon words.

It’s also interesting that you don’t have to use an Amazon Echo to work with Alexa technology. With the developer tools Amazon provides, it’s possible to create an app that runs on your phone, the web, or even a Raspberry Pi. 


The first prototype

So, after learning about all I can do with Alexa, I decided to create a prototype that uses xAPI to send test statements to Watershed in response to a specific voice command.

This is the VUI I designed:

User: Alexa, start Watershed.
Alexa: Watershed is ready.
User: Send test statement.
Alexa: Sending test statement to Watershed.
Alexa: Statement saved. 

Let’s walk through some key concepts for how this works.

Alexa, start Watershed.

When you create a skill, you have to use an invocation name. This tells Alexa which skill to start and, subsequently, which commands to listen for.

Send test statement.

The commands an Alexa skill listens for are called utterances. When it recognizes an utterance, it sends an intent request to the server application. This is how the intent for sending a test xAPI statement is defined: 

SendTestStatementIntent send test statement
SendTestStatementIntent send test xapi statement
SendTestStatementIntent send test tin can statement

When someone says any of these utterances, the Alexa skill sends the intent request SendTestStatementIntent to the Alexa server application. It then runs this code:

[Image: Amazon Echo, xAPI Alexa & User Statements]

That code does three things:

  1. Constructs a new statement using the TinCanJS library
  2. Attempts to send the statement to Watershed
  3. Instructs Alexa to tell the user the statement is being sent and what the result of sending the statement was
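
The three steps above can be sketched roughly as follows. This is not the prototype's actual code: the prototype used the TinCanJS library, while this sketch builds the statement as a plain object so it runs on its own. The actor, activity ID, and the sendStatement and speak functions are all hypothetical stand-ins for the real LRS client and the Alexa response.

```javascript
// Hypothetical sketch: the actor, IRIs, and helper names are illustrative assumptions.

// Step 1: build a minimal xAPI statement (actor, verb, object) as a plain object.
function buildTestStatement() {
  return {
    actor: { name: "Test User", mbox: "mailto:test.user@example.com" },
    verb: {
      id: "http://adlnet.gov/expapi/verbs/experienced",
      display: { "en-US": "experienced" }
    },
    object: {
      id: "http://example.com/activities/alexa-test",
      definition: { name: { "en-US": "Alexa test statement" } }
    }
  };
}

// Steps 2 and 3: try to send the statement, then tell the user what happened.
// sendStatement(statement, callback) stands in for whatever client talks to the LRS;
// speak(text) stands in for whatever makes Alexa say something.
function handleSendTestStatement(sendStatement, speak) {
  speak("Sending test statement to Watershed.");
  sendStatement(buildTestStatement(), function (err) {
    speak(err ? "Sorry, the statement could not be saved." : "Statement saved.");
  });
}

// Usage with a stub sender that pretends the LRS accepted the statement.
const spoken = [];
handleSendTestStatement(
  function (statement, callback) { callback(null); },
  function (text) { spoken.push(text); }
);
console.log(spoken);
```

Passing the sender and the speech function in as arguments keeps the sketch independent of any particular LRS client or Alexa SDK version.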

Here is a video of this first prototype in action.

 

The second prototype

After having success with a simple prototype, I moved on to create something more complicated that could potentially be useful in a learning context.

Like before, this started with a VUI:

User: Alexa, start tweeker.
Alexa: Who had the tweeker?
User: Geoff Alday and Mike Rustici about Tesla.
Alexa: Mike Rustici had a tweeker with Geoff Alday about Tesla.
Alexa: Statement saved.

This VUI is more conversational in nature. Let’s walk through the main concepts that are unique to this prototype.

Geoff Alday and Mike Rustici about Tesla.

There is a lot going on with this utterance. It includes words that will be different each time: the name of the mentor, the name of the mentee, and an optional topic. These are known as intent slots. If you’re familiar with programming, you can think of intent slots as variables. Intent slots can be required or optional, and Alexa will prompt a user for a value for any required slot that hasn’t been filled. Each intent slot also has to be assigned a slot type.
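
To make the "slots as variables" idea concrete, here is a minimal sketch of how a server application might read slot values out of an intent request. The slot names are assumptions, since the post doesn't list the real ones; the shape of the intent object (a slots map where a filled slot carries a value) follows what Alexa sends to a skill.

```javascript
// Hypothetical slot names; the real skill's names are not shown in the post.
const REQUIRED_SLOTS = ["MentorFirstName", "MentorLastName", "MenteeFirstName", "MenteeLastName"];
const OPTIONAL_SLOTS = ["Topic"];

// Collect filled slot values and list any required slots that are still empty.
function readSlots(intent) {
  const slots = intent.slots || {};
  const values = {};
  const missing = [];
  for (const name of REQUIRED_SLOTS.concat(OPTIONAL_SLOTS)) {
    const value = slots[name] && slots[name].value;
    if (value) {
      values[name] = value;
    } else if (REQUIRED_SLOTS.includes(name)) {
      missing.push(name);
    }
  }
  return { values: values, missing: missing };
}

// Example intent shaped like the request.intent object in an Alexa IntentRequest.
// The optional Topic slot is present but unfilled, so it has no value.
const exampleIntent = {
  name: "RecordTweekerIntent", // hypothetical intent name
  slots: {
    MentorFirstName: { name: "MentorFirstName", value: "Geoff" },
    MentorLastName: { name: "MentorLastName", value: "Alday" },
    MenteeFirstName: { name: "MenteeFirstName", value: "Mike" },
    MenteeLastName: { name: "MenteeLastName", value: "Rustici" },
    Topic: { name: "Topic" }
  }
};

const result = readSlots(exampleIntent);
console.log(result.missing); // → []
```

If missing came back non-empty, that would be the point where the skill prompts the user for the remaining required values.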

The slots for the mentor and mentee first names are set to the AMAZON.US_FIRST_NAME slot type. Amazon provides a large number of slot types anyone can use. This slot type is populated with thousands of popular first names common in the United States. Given that’s where most of us at Watershed reside, it seemed like a good choice.

The slots for the mentor and mentee last names are set to a custom slot type that I defined. When you create a custom slot type, you have to populate it with example values. In this case, I added the last names of everyone I work with at Watershed. Likewise, the optional tweeker topic slot uses a custom slot type. I populated it with common topics we might talk about during tweekers (e.g., sales, product, marketing, Tesla).
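
The slot setup described above might look something like this. It's a sketch, not the skill's actual configuration: the schema is loosely modeled on the intent schema format Alexa used at the time, and the intent name, slot names, and custom type names are made up for illustration. The example values come from this post. The small helper just checks that every slot points at either a built-in AMAZON type or a custom type that has been defined.

```javascript
// Intent schema fragment, loosely modeled on Alexa's intent schema format.
// The intent, slot, and custom type names are hypothetical.
const intentSchema = {
  intents: [
    {
      intent: "RecordTweekerIntent",
      slots: [
        { name: "MentorFirstName", type: "AMAZON.US_FIRST_NAME" },
        { name: "MentorLastName", type: "WATERSHED_LAST_NAME" },
        { name: "MenteeFirstName", type: "AMAZON.US_FIRST_NAME" },
        { name: "MenteeLastName", type: "WATERSHED_LAST_NAME" },
        { name: "Topic", type: "TWEEKER_TOPIC" }
      ]
    }
  ]
};

// Custom slot types and a few example values (values taken from the post).
const customSlotTypes = {
  WATERSHED_LAST_NAME: ["Alday", "Rustici"],
  TWEEKER_TOPIC: ["sales", "product", "marketing", "Tesla"]
};

// Return slot types that are neither built in (AMAZON.*) nor defined as custom types.
function findUndefinedSlotTypes(schema, customTypes) {
  const missing = new Set();
  for (const intent of schema.intents) {
    for (const slot of intent.slots || []) {
      const builtIn = slot.type.startsWith("AMAZON.");
      const defined = Object.prototype.hasOwnProperty.call(customTypes, slot.type);
      if (!builtIn && !defined) missing.add(slot.type);
    }
  }
  return Array.from(missing);
}

console.log(findUndefinedSlotTypes(intentSchema, customSlotTypes)); // → []
```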

With all of that programmed and configured, it was time to test it out. 

 

So, yeah. It turns out, just like what often happens in real life, saying people’s names is hard for Alexa. Last names and uncommon first names are particularly difficult. There is a way to specify the pronunciation of expected slot values using Speech Synthesis Markup Language (SSML). This could potentially be helpful in this case.

Since Alexa isn’t great at taking unexpected speech and turning it into written text, saying a first name that wasn’t in its list of U.S. first names often caused Alexa to either respond with something completely wrong or just refuse to do anything at all. For example, Lizelle was interpreted as Mozelle. I remedied this by extending the first name slot types with values for our uncommon first names. There was a similar issue with the tweeker topic slot. I did more research and testing into ways to trick Alexa into accepting and transcribing unexpected words. The suggested approach involved filling custom slots with thousands of nonsense words and phrases. Despite what the Internet said, this didn’t work. Throw in people with different accents and oh my. Let the fun begin. 

[Image: Voice technology meets Amazon Echo and Alexa.]

There is also the Abby/Abbey, Irvin/Ervin, and Jeff/Geoff problem: names with different spellings that are pronounced the same. I only superficially remedied this by extending the first name slot types with the spellings used by people who work at Watershed. It seems impossible to deduce which spelling a person wants from the way they speak a name. One approach would be to have the person using the app log in before recording the tweeker. This isn’t a complete solution, given that Alexa still wouldn’t know which spelling to use for the other person involved in the tweeker.




Last, there is the problem of getting the correct name into its corresponding slot. Even though Alexa is great at knowing when to prompt users for required slot values, it’s hard to get this right with names. For example, David is common as both a first name and a last name. There were often cases where Alexa would put a name in the wrong slot and treat the result as correct. One approach to solving this problem would be to redesign the VUI to prompt for each name individually. That would increase the time it takes to record a tweeker and, honestly, sounds like a bad experience.

Overcoming bad data

All of these technical issues add up to one huge challenge with this prototype: bad data. While it was incredibly easy to get data into Watershed, adding bad data to drive insights around who is having tweekers and what they’re talking about is beyond problematic.

So, could voice technology be useful in this type of learning context? I think that if the mentor and mentee identification problem could be solved, it would definitely be useful.

There are tons of other learning use cases we could explore with voice technology. Here are just a few:

  • You could get a refresher about topics you’ve previously learned about. Tracking those interactions could help a learning organization understand the topics people need to reacquaint themselves with most often.
  • What are the top learning courses this week?
  • Who are the most active learners this week?
  • Who has compliance certifications expiring this month?
  • What are the most watched videos this week?
  • What percentage of people have completed the sales training program?
  • Who are the top five people that completed sales training and have the highest sales?

These ideas are less problematic because, using all the data we have in Watershed, we could easily feed content into Alexa so it knows what to listen for.


Want to see more?

If you’re interested in viewing the code for these two prototypes, I uploaded them to Watershed’s GitHub page:


[Editor's Note: This blog post was originally posted on May 31, 2017, and has been updated for comprehensiveness.]

Geoff Alday

About The Author

Geoff leads our design efforts for the Watershed product. He loves learning random things. He wants to believe Bigfoot is real.