I have wanted to write a blog post about voice technologies for a long time! The time has finally come, thanks to the opening talk I gave at the recent Voices workshop at CSIR in Pretoria (South Africa). Those who have been following my work since 2006 know how passionate I am about IVR and related technologies. I have spent the last 4 to 5 years developing expertise and experimenting with different types of services in different countries, from India to Mali to Ghana. I will try to explain why IVR is so exciting to me, what the major opportunities and the major challenges are, what the state of the art is, and what, IMHO, the future holds.
For me, voice technologies (aka IVR, Interactive Voice Response) are close to a magic bullet in the ICTD/M4D domain. There is no competing technology today that has all of the following characteristics:
- It can work on absolutely all phones
- It can work on all mobile networks
- It uses a very well-known functionality of the phone (dialing a number)
- It can be used in all languages of the world
- It is accessible to people who cannot read and write
SMS and USSD offer some of these features, particularly the first two and a bit of the third (but only a bit: see for instance the World Bank report Maximizing Mobile, which mentions SMS usage rates in some countries), but they can hardly cope with the last two. I’m not going to write a detailed analysis of the pros and cons of each mobile technology here (that deserves a dedicated blog post), but there is clearly no competing technology today, in particular when it comes to providing direct services to illiterate populations in rural areas.
On the other hand, obviously, if IVR were a complete magic bullet, there would be no debate. There are also a few challenges that are quite hard to tackle. I believe we can sort these challenges into 3 categories: User Interface design, Speech technologies, and Hosting/Deployment.
The first class of challenges, and probably the most difficult one, is user interface design. Lots of people have tried using IVR for BOP services, and most of them have largely failed because of user interface and user interaction design. First of all, training is essential. Most people know how to dial a number, but until now they have always reached a human. Now they dial and reach a voice that is not a human and that usually does not respond to the normal greeting procedure. This is the first barrier. In my experience, training users is critical, and providing support material that users can use to remember the process helps a lot. For instance, the use of visuals, like a flyer showing the call flow of the application (the options and the different steps; see the picture alongside), makes a huge difference.
Then, navigating between menus and selecting an option from a long list is just a no-go. People without experience of and exposure to technology have lots of problems navigating between menus. Having one or two menus with up to 4 short choices each is really the maximum. The accessible interfaces are the simplest ones. Each new feature increases the drop rate by an order of magnitude. In order to simplify the interface as much as possible, applications have to use as much metadata as possible (information you know about the users and their tasks without asking them directly). An example to illustrate this: in agriculture, I’ve seen lots of people trying to build IVR systems in which a farmer selects the crop he is interested in among e.g. 10 categories of 10 choices each. This is just impossible. It is more efficient to have a very simple registration phase, where the farmer declares the crop(s) he is interested in; then, when he calls, only information related to these crops is provided. In the case of an extension service, one can refine even further by offering support only on topics related to e.g. the period of the year (harvest-related support, planting, etc.). Taking all these elements into account in the design of the service is an essential part of success.
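As a rough illustration of the registration idea above, here is a minimal Python sketch of how a service could decide what to play from what it already knows about the caller. Everything in it is an assumption made for the example: the phone numbers, the prompt names, the `season_topic` calendar (which is a placeholder, not agronomic advice) and the `build_call_flow` function are all hypothetical.

```python
from datetime import date

# Hypothetical registration data: caller number -> crops declared at registration.
REGISTRATIONS = {
    "+22370000001": ["millet"],
    "+22370000002": ["maize", "sorghum"],
}


def season_topic(month: int) -> str:
    """Map a month to an advice topic (placeholder calendar, assumed for the example)."""
    if month in (5, 6, 7):
        return "planting"
    if month in (10, 11, 12):
        return "harvest"
    return "general"


def build_call_flow(caller_id: str, today: date, max_choices: int = 4) -> dict:
    """Choose the shortest possible interaction from what is known about the caller."""
    crops = REGISTRATIONS.get(caller_id)
    if not crops:
        # Unknown caller: send them to the simple registration phase.
        return {"action": "register", "prompts": ["welcome", "register_crop"]}
    topic = season_topic(today.month)
    if len(crops) == 1:
        # One registered crop: no menu at all, play the content directly.
        return {"action": "play", "prompts": ["welcome", f"{crops[0]}_{topic}"]}
    # Several crops: one short menu, capped at max_choices entries.
    choices = {str(i + 1): f"{crop}_{topic}" for i, crop in enumerate(crops[:max_choices])}
    return {"action": "menu", "prompts": ["welcome", "choose_crop"], "choices": choices}
```

With this data, a call from the first number in June would skip menus entirely and play `millet_planting`, while the second caller would get a single two-choice menu.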
Another dimension to take into account in the design is the profile of the targeted user. As I said earlier, voice technology is the only accessible option for illiterate people. However, illiteracy is not a binary status (being or not being illiterate) but a continuum that ranges from people who cannot handle numbers to people who are fluent in reading and writing. Understanding the level of literacy of the targeted users is essential to designing a usable interface. For instance, some people are not able to associate the sound “FIVE” with the corresponding symbol “5” on the keypad of their phone. In such cases, using the keypad to navigate in the application is a no-go, and the design should either have no navigation options or use other techniques such as speech recognition.
In some cases, people are perfectly familiar with numbers, and it is thus possible to ask for a numerical entry on the keypad. For instance, you can ask someone to enter the number of liters of honey (s)he wants to sell. This makes managing the information a lot easier. In other cases, this is just not possible, and any attempt of this kind will fail miserably. There is currently a clear lack of literature on how to map literacy levels to the types of ICT or interface/interaction that people can use. Everything I’m saying here is based on my own experience, but I believe that some formal research is needed.
Finally, the last element, but not the least important, is trust. Information is valuable only if it is trusted by the recipient. In my experience, the voice used in the application is critical for trust: sometimes the gender is critical (e.g. maternal health advice or cooking advice must come from a woman in some/most cultures), and sometimes the voice must be a local voice (i.e. a Bambara Malian voice is not a Bambara Senegalese voice) or a known voice (the voice of the usual extension worker, or the voice of a community radio speaker).
The second class of challenges is related to speech technologies. Speech technologies are the branch of voice technologies dealing with speech synthesis (aka text-to-speech, or TTS, engines) and speech recognition. As I said above, it is possible to build an IVR service in all languages of the world. However, this is made possible through the use of recorded audio prompts. One records all the prompts of the application and then arranges and plays them according to the application logic. This is good, but it is very limited. It is very difficult to build advanced services with dynamic data using pre-recorded audio. In the same way, without speech recognition, it is not possible to use anything other than keypad navigation, which is sometimes problematic as explained above. Speech technologies are really interesting for complex and more advanced services. However, while you can easily find such modules for French, English, Spanish and a few other languages, you cannot find anything for e.g. Wolof, Bambara or Swahili, not to mention even smaller languages like Moore, Peul or Bomu. The key challenge here is making high-quality, low-cost modules. My colleagues and I have been exploring this area for a few years. One possible solution is to design modules that are not complete, but simpler ones built for specific applications. We have e.g. developed TTS for Bomu and Bambara for a market price information service in Mali. The major problem here is the cost of such modules, but otherwise our studies (see e.g. the Voice Browsing Acceptance and Trust (VBAT) project) have demonstrated that there is a relatively high acceptance of computer-generated content, and that there are no significant issues related to trust compared to human-recorded audio prompts.
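To make the recorded-prompt approach more concrete, here is a minimal Python sketch of how a price announcement could be assembled from a small, fixed set of recordings. It is only an illustration of prompt arrangement, not the TTS modules mentioned above. The prompt names (`num_200`, `crop_millet`, `currency`, ...) are hypothetical, and the way the number is split into thousands, hundreds, tens and units is an assumption that would have to be adapted to the number system and word order of each language.

```python
def number_prompts(n: int) -> list[str]:
    """Split a number into the recorded fragments needed to speak it.

    Assumes one recording per value 1..9, 10..90, 100..900, 1000..9000,
    plus one for zero (37 recordings in total for the range 0..9999).
    """
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return ["num_0"]
    prompts = []
    for unit in (1000, 100, 10, 1):
        digit = (n // unit) % 10
        if digit:
            prompts.append(f"num_{digit * unit}")
    return prompts


def price_announcement(crop: str, market: str, price: int) -> list[str]:
    """Return the ordered list of recordings to play for one price."""
    return [
        "price_intro",
        f"crop_{crop}",
        f"market_{market}",
        *number_prompts(price),
        "currency",
    ]
```

For example, `price_announcement("millet", "bamako", 250)` yields the sequence `price_intro`, `crop_millet`, `market_bamako`, `num_200`, `num_50`, `currency`. The limits show up quickly: every new crop, market or sentence pattern needs a new recording session with the same voice.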
The third class of challenges is related to hosting and deployment. This is IMHO one of the highest barriers to wider adoption of IVR services. The infrastructure needed to deploy a service is expensive. There are some low-cost options, the best known being Freedomfone, but such an approach quickly reaches its limits. Indeed, such a solution does not allow concurrent calls, so lots of users eventually get a busy signal, which usually causes a high level of frustration! An individual cannot really set up a hosting environment that can handle multiple concurrent phone calls to the same phone number. The only option in most cases is operator hosting, or investment in e.g. a primary access (e.g. a T1 line), which is expensive. For a long-term service this is not a very big deal, but for a short-term service (e.g. an election monitoring platform) or for a cheap setup it is inappropriate. The situation may change in the future, and is already changing in a few countries where VoIP companies are now appearing and offering private VoIP PBX solutions at very low cost. If such solutions become widespread, the hosting issue will be solved. Today, given the quality of internet bandwidth in most developing countries, it is not possible to rely on a VoIP solution hosted outside a given country without experiencing terrible delays and very poor quality, which degrade the user experience.
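To get a feeling for how quickly a single line saturates, one can use the classic Erlang B formula from telephony, which estimates the share of calls that get a busy signal for a given traffic load and number of lines. This is a generic textbook model brought in for illustration, not something measured on the services discussed here, and it assumes random (Poisson) call arrivals and callers who do not retry immediately. The function names below are hypothetical.

```python
def offered_load(calls_per_hour: float, mean_call_minutes: float) -> float:
    """Traffic in Erlangs: average number of calls that would be in progress at once."""
    return calls_per_hour * mean_call_minutes / 60.0


def erlang_b(load: float, lines: int) -> float:
    """Probability that a call finds all lines busy (Erlang B, iterative form)."""
    blocking = 1.0  # with zero lines, every call is blocked
    for m in range(1, lines + 1):
        blocking = load * blocking / (m + load * blocking)
    return blocking


def lines_needed(load: float, target_blocking: float) -> int:
    """Smallest number of lines keeping the busy-signal rate at or below the target."""
    if not 0.0 < target_blocking <= 1.0:
        raise ValueError("target_blocking must be in (0, 1]")
    lines = 0
    while erlang_b(load, lines) > target_blocking:
        lines += 1
    return lines
```

Under these assumptions, 30 calls per hour of 2 minutes each represent a load of 1 Erlang: with a single line, half of the callers get a busy signal in this model, and a second line brings that down to 20%.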
On a more positive note, compared to other technologies, voice applications have a well-known, widely adopted open standard for application development: VoiceXML. Developing applications in VoiceXML ensures that services are independent of the underlying architecture, as long as the infrastructure offers a VoiceXML layer. One can use open source software (e.g. Asterisk+Voiceglue) or proprietary professional software (e.g. Voxeo Prophecy) without impacting the application or requiring a single line of adaptation.
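As a small illustration of what such an application looks like, the Python sketch below generates a one-menu VoiceXML 2.0 document that plays a recorded prompt and routes keypad choices. The file names (`menu.wav`, `prices.vxml`, `weather.vxml`) and the `build_menu_vxml` helper are hypothetical, and a real service would add error and timeout handling.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_menu_vxml(prompt_audio: str, choices: dict[str, str]) -> str:
    """Build a minimal VoiceXML 2.0 menu: one recorded prompt, one DTMF key per choice."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    prompt = ET.SubElement(menu, "prompt")
    # The prompt is a recording, so no TTS engine is required for this menu.
    ET.SubElement(prompt, "audio", {"src": prompt_audio})
    for key, target in choices.items():
        # Pressing `key` on the keypad moves the caller to the `target` document.
        ET.SubElement(menu, "choice", {"dtmf": key, "next": target})
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_menu_vxml("menu.wav", {"1": "prices.vxml", "2": "weather.vxml"}))
```

The generated document contains no platform-specific element, which is the point made above: in principle the same file can be served to any VoiceXML-capable platform.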
In terms of business modeling, the implementation of specific business models is relatively similar to other technologies such as SMS. There are three major options:
- Normal call rate: people are charged per minute of airtime spent on the phone connected to the application
- Toll-free numbers: callers are not charged any airtime to call the number, and the costs are borne by the service provider (NB: a toll-free number can be simulated through a missed-call/call-back mechanism)
- Premium-rate numbers: calls cost more than normal rates, and the revenue is shared with the service provider
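The missed-call/call-back mechanism mentioned in the toll-free option can be sketched as a simple queue: the service never answers the incoming call (so the caller pays nothing), records the caller's number, and calls back at its own expense. The Python class below is a hypothetical, minimal sketch of that bookkeeping only; the actual telephony integration (detecting the missed call and placing the outgoing call) depends on the platform and is left out.

```python
from collections import deque
from typing import Optional


class CallbackQueue:
    """Keep track of callers waiting to be called back, in arrival order."""

    def __init__(self) -> None:
        self._pending: deque = deque()
        self._queued: set = set()

    def on_missed_call(self, caller_id: str) -> bool:
        """Register a missed call; return False if this caller is already waiting."""
        if caller_id in self._queued:
            # Impatient callers often ring several times: queue them only once.
            return False
        self._pending.append(caller_id)
        self._queued.add(caller_id)
        return True

    def next_callback(self) -> Optional[str]:
        """Return the next number to call back, or None if nobody is waiting."""
        if not self._pending:
            return None
        caller_id = self._pending.popleft()
        self._queued.discard(caller_id)
        return caller_id
```

A dialer loop would then repeatedly take `next_callback()` and place the outgoing call on whatever lines are available, so the airtime cost lands on the service provider, as with a real toll-free number.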
The big difference with e.g. SMS is that it is far easier to implement cross-operator services. Indeed, toll-free and premium-rate numbers are concepts that come from the telephony age, not the mobile age. Regulations on these special numbers were in place before mobile networks existed, and the numbers are easily recognized by users (e.g. 800 numbers). Prefixes are therefore usually well established, and it is easier to get numbers that are independent of the mobile operator. In the SMS or USSD world, shortcodes are valid only within a given operator's network, and a service provider must deal with each and every operator to deploy an application across networks. From the user's perspective, most of these shortcodes give no indication of cost. They can be free or premium; nobody knows before using them.
On the downside, though, airtime usually costs more than an SMS (though the time needed to speak 160 characters may be very short!). If you are interested in learning more on this topic, I wrote a post a while ago about Sustainability in ICTD.
Apart from these challenges, there are also features that are sometimes great and sometimes very problematic. Let’s take the example of recording people's voices in the field. This is a great thing when e.g. developing a citizen journalism service. One wants to capture voices in the field, and IVR is the only option that offers such a feature. But in the case of e.g. an anti-corruption or human rights violation reporting service, using people’s voices may be very dangerous, as voices can be used to identify people. In the same way, the fact that people can be overheard while using a service (because it relies on voice) may be harmful (e.g. in the case of a health service). It is therefore essential to take the specificities of the channel into account when evaluating its relevance for a given service. Other potential issues include the need to be connected to use the service (no offline mode) and the lack of any way to store and reuse content (compared to e.g. SMS).
As for the future, I don’t see the importance of IVR decreasing in the near term (e.g. 5 years), as the primary issue is the capabilities of the devices in the hands of people in rural areas, and this is not going to change much in a few years. However, in the longer term, my guess is that more advanced phones and data services will be widely available in the next decade. This will open a new era of opportunities. First of all, while voice services may still be essential, we may move from phone-based communication services to data-based services and on-device applications, like the voice apps available on smartphones today. This will be a slight change. The major change will come with new paradigms in user interface and user interaction design. I’m convinced that interfaces that mix icons and sound are the future of accessible interfaces for illiterate people. We are still quite far from that, and only a few research teams are starting to investigate this, but I’m convinced that this area will take off.
To conclude, IVR is by no means the magic tech that solves all issues. We all know that using IVR is always painful, and that any other choice is preferable. For people who have a choice, it is clear that SMS, USSD, Web, smartphone apps etc. are far more powerful and definitely the preferred option. For that reason, it is unlikely that an application would be an IVR-only application; it would also have other interfaces, like a Web interface, for users who can use other channels (e.g. the organization deploying the service). I would never say that IVR is the best solution for everything, because this is definitely not the case, but it fills a specific gap and it addresses challenges that no other technology is able to solve today. As I have tried to say here, designing, building and delivering a service that has an IVR component is not easy, is full of challenges, and is often relatively costly, but when it comes to delivering direct services to people at the base of the pyramid, living in rural areas, there is no other choice. There might be strategies to avoid deploying direct services to people who cannot use text-based technologies (see e.g. my previous post on ICT for small-holder farmers), but at the end of the day, if you come to the conclusion that this is what you need, then you have no choice!
PS: and if you need a (team of) consultant(s) to help you in this journey, on any aspect of IVR presented here, then come and talk to me 😉