CIS 422/522 Project 1:
Anti-IVR System

P1_spec.html, Version 1.1, September 28, 2010, A.Hornof

Deadlines

October 7. Initial ProjectPlan/SRS/SDS document is due at the start of class, as is a 10-minute (including setup time) presentation. Bring a printout of the document to class. Also, one student from each group will email a single PDF to the professor with the subject line "Group <your group name> Initial Project Plan" before class starts. If multiple documents arrive from different students, the PDF in the final email will be graded. The instructor will not merge any documents.

October 25, at 10 PM (Week 5): Project is due, including all source code files, documentation, and the final ProjectPlan/SRS/SDS. Follow the submission instructions.

October 26, in class: Each group will give a 10-minute (including setup time) presentation demonstrating their project and reflecting on the software-engineering lessons learned from the project.

The Motivation

"Computer, please call United Airlines and let me know when you get to a human."

A battle for human respect is playing out in the IVR (interactive voice response) and touch-tone phone "tree" systems used by an increasing number of companies. Even your corner grocer is starting to put computer-based answering machines between you and the store, such that you need to listen to slow and seemingly endless recordings and voice prompts, and press buttons or speak to voice-recognition systems, before you are permitted to speak to the grocer. Wouldn’t it be great if you could just say to your cell phone “Call United Airlines and let me know when you get to a human” and have your cell phone do all of the negotiation with the computer-based auditory robots that you are forced to deal with, and let you know when a person has picked up the phone on the other end? Your cell phone (or some other intermediary system) could not only negotiate all of the voice and keypress prompts such as “Press or say ‘1’ for store hours, press or say ‘2’ for the store location...,” but could even turn an IVR system back on the company, with a prompt such as “To speak to your customer, please press or say ‘1’” or perhaps periodically asking “Hi, is this a person yet?” in a very natural sounding voice until a person responds with “yes”, and then notify you that a person is on the line.

This sort of jockeying for who makes whom wait at the start of a phone call is commonplace among politicians. Aides dial the phone and pick up the phones, and try to wait until the big shot on the other end picks up before saying "please hold for whoever" and passing the phone to the big shot. Years ago, I interned on Capitol Hill in Washington, D.C., and witnessed another intern get yelled at because she told the Congressman that the Secretary of State was on the line before he actually picked up on the other end, along the lines of “Was the secretary of state himself actually on the line when you transferred the call to the congressman? Congressman Clarence Long is a senior member of Congress and outranks the Secretary of State!" A particularly clumsy example of this jockeying was captured at the start of this prank phone call to Governor Sarah Palin.

Project 1 will be a prototype IVR-versus-IVR application. To my knowledge, such systems are not yet available, with one exception, Fonolo, which has two major limitations: (1) It is a call-back system that requires you to give your phone number to a company to use the system, and (2) it is not open source and end-user modifiable. But I suspect that open source systems of this ilk will soon be built, perhaps even for use on end-user-programmable cell phones. When IVR-vs-IVR systems become commonly used, companies may adopt policies in which they refuse to talk to automated phone-based systems, while trying to insist that their customers are required to talk to the same automated phone-based systems. The irony will, if nothing else, provide an interesting social commentary on the gradual transition to the social acceptance of robots, and how robots will be used in power struggles at every level of society, not only for imposing themselves as you try to talk to a person on the phone, but for many other unforeseen tasks. Software agents have been driving trains for decades and are now routinely used by the military in the form of unstaffed vehicles, boats, and airplanes. The digital agents are coming, and people are slowly being trained to conform to their wishes, but the transition is so gradual that the public barely notices it. This project will hopefully raise your awareness of this fascinating transition while also engaging you with a number of emerging voice-over-IP, telephony, and IVR technologies in a project that, to my knowledge, is quite novel.

There has already been a backlash against IVR systems. A website called gethuman.com was created by Paul English to list all of the numbers that have to be pushed to get to a person at about 500 companies. And there has been a backlash against this backlash. The contents of websites along the lines of gethuman.com could be updated periodically and used to help guide the anti-IVR systems through the voice prompts. People who run call centers, not surprisingly, dislike Paul English. See this article and this article and the exchange at the end of this interview.

Technology has long been criticized as having dehumanizing effects. The use of IVR systems, and the possibility of creating new systems to bypass the IVR systems, as proposed here, is a harbinger (a thing that announces an approach) of the battles that will play out between humans with technology and humans with technology, with both sides believing, or at least arguing, that they each have people's best interests in mind.

Problem Statement

IVR (interactive voice response) and touch-tone phone "tree" systems used by an increasing number of companies often make it difficult for a person to call a store or company and talk to a real human being who works at that establishment. The systems often require a caller to spend a long time listening to slow voice prompts (“Press or say ‘1’ for store hours, press or say ‘2’ for the store location...”) just to ask a simple question such as “When will you close on New Year’s Eve?” or “Is that train usually on time?”, often a question that is not easily answered by the voice prompt system.

IVR systems are arguably dehumanizing and insulting in that they imply that the caller’s time is less important than that of the person being called, and that the caller is, at least for a long initial screening process, only worthy of talking to a computer, and not to a real person. One bypass that is sometimes installed to let callers get to a real person more easily is to permit the caller to just press “0” or say “agent.” Quite often, though, such bypasses are specifically not made available. Sometimes, pressing “0” restarts the entire voice prompt menu system back at the very start, effectively training customers to not try to talk to a real person, presumably generating more profits for the company while degrading the customer’s experience. The systems are an example of computers being used to degrade a human experience.

Proposed Solution

The solution is to use the power of computers to combat the power of computers. The same way that IVR systems are used to prevent a customer from getting to a human, an anti-IVR system could be used by a customer to get through an IVR system to a real human being. The anti-IVR system would run on a laptop, desktop, or smart phone; call a company; navigate through the IVR or phone tree system; and respond to prompts as needed, all internally to the computer or phone and silently to the customer, up until the point that the system reached a real human. The system would then come alive and connect the customer to the company’s representative who picked up the phone.

A super intelligent anti-IVR system could navigate through multiple menus and phone trees, listening to prompts, parsing phrases, and providing responses as needed. A less sophisticated system could be periodically trained up on the most recent sets of prompts for each company’s phone tree (stored in a database of voice prompts). A base-level preliminary system might simply be turned on by the user at any point that the user is on a phone call and put on hold; the system would just prompt, over and over, with “Hello? Are you a real person?” <pause> “Are you really there, yes or no?” <pause> and wait until the system parses a good “yes” in one of the pauses before alerting the user.
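The base-level system described above is essentially a loop: play the prompt, pause and record, analyze, and repeat. A minimal sketch of that loop in Python follows. The three hooks (play_prompt, record_response, sounds_human) are hypothetical placeholders for whatever telephony and analysis code a team writes; they are not part of Skype or any other library, and the default pause and timeout values are assumptions.

```python
import time
from typing import Callable


def prompt_until_human(
    play_prompt: Callable[[], None],
    record_response: Callable[[float], bytes],
    sounds_human: Callable[[bytes], bool],
    pause_seconds: float = 3.0,
    max_seconds: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Repeat the prompt until a response is judged human or time runs out.

    All three callables are hypothetical hooks supplied by the caller:
    play_prompt plays the pre-recorded question, record_response captures
    audio for the given number of seconds, and sounds_human analyzes it.
    """
    deadline = clock() + max_seconds
    while clock() < deadline:
        play_prompt()
        response = record_response(pause_seconds)
        if sounds_human(response):
            return True  # the caller should now alert the user
    return False  # the preset amount of time elapsed without a human
```

Passing the hooks in as arguments keeps the loop testable without a live phone call: a test can feed in canned responses and a fake clock. Handling a disconnected call or a user hang-up is left out of this sketch.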

Basic System Requirements

A proposed set of basic system requirements is as follows. Note that functional and non-functional requirements are mixed together here.

1. The user can make a phone call (such as to a landline) using the system as if it were just a phone, to just call and talk to whomever or whatever system picks up the phone on the other end. [Skype can be used to call land lines if you create an account and add a small amount of money to the account.]

2. The user, such as if put on hold, can switch the phone into a prompting mode that runs a loop that prompts the called party with a pre-recorded question or message. The message should be prerecorded in a sound file using a common file format such as gsm (used by the Asterisk IVR system), mp3, wav, or aif. If compression is used, it should be a common compression algorithm.

3. The system will repeatedly play that prompt message, pausing afterwards for a fixed amount of time, listening for and recording the response of the party called. The user can mute and unmute this back-and-forth dialog as it proceeds.

4. The called party’s response to each prompt will be analyzed within a fixed amount of time (that will not exceed the duration of the response recording, plus one second). First-level analysis (required) will include determining if a particular touch tone was played. Second-level analysis (optional) will count the number of discrete words or sounds that were played, so that the system could potentially determine simply if a single word was uttered. Third-level analysis (optional) will use voice recognition to determine the words that were spoken.

5. The response analysis will decide if a person is on the other line. If it decides “no,” it will continue looping until one of the following occurs: (a) a preset amount of time has elapsed, (b) the call is disconnected, such as by the other party, or (c) the user terminates the call, such as with a “hang up” button.

6. An audio recording of every phone call will be saved, starting from the moment that the other party picks up or, if the system is invoked by the user switching over to the system, as soon as the switchover is made. As soon as this recording starts, the A-IVR system will state “Your call may be recorded for quality purposes” (taken from the United Airlines reservation IVR system). It seems appropriate to have this stated up front so that the party you are calling has the option of hanging up if they do not want their phone call to be recorded. (It would also seem useful to record the entire phone conversation in the event that the other party gives up-front permission for the A-IVR system to record the conversation, as seems implied by a statement such as “Your call may be recorded for quality purposes.”)

7. All remote party responses, and the results of the analysis of each response, will be saved to disk in a unique, separate file that is not to be overwritten in subsequent runs. (This will help with system validation and verification.) [The obvious technical solution here is to date and timestamp (including seconds) each recording, and to put these all into a separate subdirectory, perhaps with a new subdirectory created for each phone call.]

8. The system events and system decisions of every phone call will be saved, including with a timestamp of what occurred at what time, such that each system event can be lined up with what was happening in the audio recording of the phone call at that time. For example, if the system decides that a particular called-party response results in a decision that “yes” a human is on the line, it should be possible to link that decision with the sound file used to make that decision.

9. When the system determines that a human being is on the other end, a visual and audio alert is played, and the user’s system immediately goes into a mode in which the user can talk with the called party, regardless of whether the sound was muted during the ‘are you a human’ prompting.

10. The user can modify the pre-recorded on-hold prompt, such as by replacing a sound file in the operating system, though the A-IVR system does not need to provide any facilities for creating such files.

11. Installation instructions should be provided, along with all source code and instructions for compilation, that could be followed by a computer programmer with an undergraduate degree in computer science, and used to install the system on a “fresh” machine within 30 minutes. No assistance should be necessary from a member of the project team to install the software.

12. Because the only Windows machines that are readily available in the CIS department run on Oracle VirtualBox, if you are submitting a system that requires Windows, the system should be thoroughly tested and demonstrated to work running on Windows XP on Oracle VirtualBox on a Macintosh.

13. A task-based “Quickstart” document should be provided that uses screenshots and examples to walk the user through the common tasks that the system would be used to accomplish. On a computer where the system is installed and started up, a user who is sufficiently technology savvy to use email and make web purchases but has no special training in computer science, should be able to study the quickstart and use the system to: (a) Call a company, (b) have the system prompt the called party with the ‘Are you a real person?’ prompt, (c) put the system into silent mode, (d) talk to the called party when a person finally picks up the phone, and (e) explain to someone else what the system is doing (in terms of the basic functionality, not the deeper explanation of signal processing and such).
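Requirements 7 and 8 can be met with very little code. The sketch below shows one possible approach in Python: a per-call subdirectory named with the date and time to the second, response files that are never overwritten, and an append-only event log that names the sound file behind each decision. The directory layout, the file names, and the JSON-lines log format are all assumptions made for illustration, not part of the requirements.

```python
import json
from datetime import datetime
from pathlib import Path


def new_call_dir(root: Path, now: datetime) -> Path:
    """Create a per-call subdirectory named by date and time, to the second."""
    call_dir = root / now.strftime("call_%Y-%m-%d_%H-%M-%S")
    # exist_ok=False makes this fail loudly rather than reuse an old directory.
    call_dir.mkdir(parents=True, exist_ok=False)
    return call_dir


def response_path(call_dir: Path, now: datetime, index: int) -> Path:
    """Build a unique file name for the index-th recorded response."""
    return call_dir / f"response_{index:04d}_{now.strftime('%H-%M-%S')}.wav"


def log_event(call_dir: Path, now: datetime, event: str, sound_file: str = "") -> None:
    """Append one timestamped event, linked to the sound file it was based on."""
    record = {
        "time": now.isoformat(timespec="milliseconds"),
        "event": event,
        "sound_file": sound_file,
    }
    with open(call_dir / "events.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
```

For example, after analyzing the third response a system might call log_event(call_dir, datetime.now(), "human_detected", "response_0003_22-00-05.wav"), so that the decision can later be lined up against the audio recording.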

The Mini Project Plan / SRS / SDS

The SRS is the Software Requirements Specification. The SDS is the Software Design Specification. The Mini Project Plan / SRS / SDS document is a small combination of elements that might appear in several different documents in a larger project: A proposal, a feasibility study, a project plan, a requirements statement, a specification and/or external design, and an architectural design overview. This document should convince management, a client, or an investor that this project is worth funding. The quality and content of the document will communicate the likelihood of success of the project if it were to be approved.

Your Mini Project Plan / SRS / SDS should include at least the following:

Final Project

The final project will be evaluated based on the Project Grading Criteria.

Possible Technical Approaches

There are a number of technologies that could be used for this project. Note that the version that you will do for Project 1 is a prototype, a proof of concept. It will probably be easiest to build so that it runs on a laptop or desktop machine and not a cell phone. Perhaps the most promising technological approach would be to use the Skype API. You can have Skype call landlines or cell phones very inexpensively by putting a few dollars into your Skype account. See the following links:

Skype Developer Zone - Look at the Tools & SDKs. There are APIs that work with a number of different languages.
Skype4Java - Note the downloads page.
API Reference for Skype API
Skype API for Java (Japanese)
Skype4Py (Skype for Python)

These might also be useful:

Asterisk.org and Skype for Asterisk
Session Initiation Protocol (SIP) and the SIP Charter
GNU Telephony

Java Telephony API - This is probably not useful on its own. It seems to need a switch; that is, a connection to a physical land line telephone. It is also just a specification and not an implementation. Furthermore, it seems to have been completed before the wide usage of VoIP. However, it might be useful in tandem with the XTAPI Java Telephony API Implementation.

Java Speech Recognition may be useful when you get to the point that you can listen for the operator’s voice response and want to do some speech recognition.

Possible systems that could be used for voice recognition:
http://julius.sourceforge.jp/en_index.php
Though this might require your code to interface with C code.
http://simon-listens.org (though possibly just an interface to Julius)
http://cmusphinx.sourceforge.net/wordpress/ (JSAPI seems to be available)

http://code.google.com/p/gethumandialer/
is a bare bones start of what we want, but it needs to be more intelligent, and interact with the phone tree.

The problem with Fonolo is that the core software is not open source, and that it is a dial-back system. We want something that just does the dialing, waiting, and navigating internally in your system. Note that Fonolo calls this “deep dialing”.

There is a fair amount of open source IVR software for creating the systems. It would be really cool to use some of this code to bypass the systems. Asterisk seems to be a large open source system for building IVR systems. http://www.asterisk.org/

Technical Challenges Students Have Identified

"Currently, Skype cannot handle multiple sound inputs or outputs. This prevents concurrent events such as streaming audio from a text to speech message from being played through the speaker as a user would expect and sent through microphone input for Skype to handle." (Lopez, Wolfe, Suhr, Rotert, 2010)

Nearly all teams have had difficulty providing a system that could be installed without assistance from the team members. Many of the problems could have been identified if groups had delivered their system to each other first, and each group tested the installation of the other group's system before submitting the system to the instructor for grading.

Terms

DTMF - Dual-tone multi-frequency signaling. These are the touch tones used on land lines; each key press sends two simultaneous tones, one identifying the keypad row and one identifying the column.
IVR - Interactive voice response.
VoIP - Voice over Internet Protocol. Skype is an example of a VoIP service.
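To make the DTMF term concrete, and as a starting point for the required first-level analysis (determining if a particular touch tone was played), here is a minimal sketch in Python. It uses the Goertzel algorithm, a common way to measure signal power at a few chosen frequencies; that choice is my suggestion, not a requirement. The row and column frequencies are the standard DTMF values. The 8000 Hz sample rate, the min_ratio threshold, and the function names are assumptions for illustration. A real detector would also need checks on absolute energy and tone duration to avoid false positives on speech or hold music.

```python
import math

ROW_HZ = (697, 770, 852, 941)      # one tone per keypad row
COL_HZ = (1209, 1336, 1477, 1633)  # one tone per keypad column
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, sample_rate, freq_hz):
    """Signal power near freq_hz, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_key(samples, sample_rate=8000, min_ratio=5.0):
    """Return the key whose row and column tones dominate, or None."""
    def strongest(freqs):
        powers = [goertzel_power(samples, sample_rate, f) for f in freqs]
        best = max(range(len(freqs)), key=lambda i: powers[i])
        others = [p for i, p in enumerate(powers) if i != best]
        # Require the strongest tone to stand clearly above the rest.
        if powers[best] <= 0 or powers[best] < min_ratio * max(others):
            return None
        return best

    row, col = strongest(ROW_HZ), strongest(COL_HZ)
    if row is None or col is None:
        return None
    return KEYS[row][col]


def synth_key(key, sample_rate=8000, seconds=0.1):
    """Synthesize a clean test tone for one key, to exercise the detector."""
    row = next(r for r, keys in enumerate(KEYS) if key in keys)
    col = KEYS[row].index(key)
    f1, f2 = ROW_HZ[row], COL_HZ[col]
    count = int(sample_rate * seconds)
    return [
        math.sin(2 * math.pi * f1 * i / sample_rate)
        + math.sin(2 * math.pi * f2 * i / sample_rate)
        for i in range(count)
    ]
```

The synthesizer exists only so the detector can be tried without a phone line, for example detect_key(synth_key("5")). Recordings from real calls will be noisier than these clean tones, so the threshold would need tuning against saved responses.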