Phase 3 (2007 Deadline)
Telephone-Based Speech Interfaces for Access to Information by
Roni Rosenfeld, Carnegie Mellon University (CMU)
Sarmad Hussain, National University of Computer and Emerging Sciences
Pakistani Funding (HEC): $60,000
US Funding: $125,000
Project Dates on US side: June 1, 2008 - January 31, 2011
Information access is an essential, yet often overlooked, tool for socioeconomic development. While literate and affluent members of society have many ways to obtain information, there are alarmingly few options for the relatively impoverished nonliterate majority. Print media are unusable due to literacy issues, television and radio are mostly noninteractive, and face-to-face training is expensive. Although computers can provide an interactive learning experience, they are not viable for a variety of reasons. Cell phones provide a mechanism for human-computer communication for automated, self-service information access, as well as a host of other automated services. However, limited expertise in speech technology, the dearth of computer-based local language resources, and the lack of targeted research on speech interfaces for nonliterate users have meant that such interfaces have not been developed, much less evaluated.
Dr. Rosenfeld and Dr. Hussain devised their project to take the first steps in this direction in Pakistan. They aimed to design, develop, and evaluate an actual information access system for health information in Pakistan. Through this research, they investigated the use of speech interfaces in a field-deployed system and also developed a speech recognition engine that could be easily adapted to other domains. The project was also expected to build the R&D capacity of Pakistani universities in the field of speech technology and to enable wider dissemination of this capacity through the development of coursework, paving the way for the creation of similar capacities in multiple Pakistani languages.
- Developed and tested a speech-based, telephone-based automated dialog system in both Urdu and Sindhi for healthcare information access for low-literate community health workers
- Designed, collected, and prepared an Urdu speech corpus consisting of 42 hours of speech from 82 speakers, completely transcribed and accompanied by a transliteration lexicon
- Constructed and released three Urdu acoustic models (male, female, both) using Carnegie Mellon University's Sphinx speech recognition system
- Developed a technique for cross-language pronunciation modeling that allows the rapid deployment of small vocabulary dialog systems in low-resource languages such as Sindhi, Balochi or any local dialects
- Provided direct training to eight students (four of them Pakistani nationals) and reached more than 18 individuals (13 of them Pakistani) through informal training and involvement in the project
- Published eight papers in peer-reviewed international conferences
The project deliverables have been completed and the project has closed on both sides. This project was affected by financial, visa, and security-related challenges as well as issues related to Dr. Hussain’s 2010 departure from his university to take another job. Nevertheless, the project produced several positive outcomes, including the collection of a speech corpus, development of acoustic and language models and speech processing tools for public release, curriculum enhancement, facilitation of one student’s master’s thesis, and completion of one publication and eight conference presentations. Dr. Rosenfeld reports that he continues to send students to Pakistan to collaborate with his partners there, and their results have been impressive. The most recent development arising from this collaboration has been the release of Polly, a telephone-based system for reaching low-literate populations via a simple voice-based game, then providing them with development-related voice-based services. As of August 2012, Polly is in active use in Pakistan and has reached nearly 100,000 people. A brief video demonstrating the system is available through this link. Additional reports on the project from inception through completion are available through the links below.
Progress Report Summaries
2010
During 2010, the complete speech corpus and other speech processing resources were prepared for release under a Creative Commons license. The release items include the speech corpus, acoustic models, language models, linguistic resources, speech recognition results, curriculum, an MS thesis, speech processing tools, and publications. As the project was due to close in January 2011, a complete phonetic review and update of all transcribed segments was started to further refine the data and improve the results. To further leverage the Urdu acoustic models, as well as commercially available, well-trained acoustic models in other source languages, the researchers developed a technique for cross-language pronunciation modeling for small-vocabulary speech recognition in low-resource target languages. In November 2010, the results of the project were written up in a paper entitled "An ASR System for Spontaneous Urdu Speech," which was submitted to the O-COCOSDA meeting in Nepal.
In 2010, three students on the Pakistani side and five students on the U.S. side participated in this project. One of the Pakistani students, Mr. Agha Ali Raza, became interested in pursuing further studies at Carnegie Mellon University and was accepted to its PhD program. He began his studies in the U.S. in September 2010.
2009
In addition to completing work on the health information access system, ongoing consultations will focus on strengthening Pakistani capabilities in speech technology research and development and providing advice on the collection and curation of Urdu speech and language resources. In this regard, the Pakistani partners on the project held a week-long training session in early June 2009 for HANDS staff involved in designing the health information access system. Two linguists and a technical staff member of the Pakistani team also attended a phonetics and phonology course at NUCES to build their capabilities to transcribe and tag the speech data being acquired.
Because of the continuing security problems, all collaboration in 2009 was carried out through conference calls and e-mail. Both sides worked on various aspects of the project, including construction of an Urdu digit recognizer, design of a speech database, collection of additional data in recording sessions with a total of 60 Urdu speakers in Lahore, submission of data to the speech recognition engine, and testing of the engine's performance with both read and spontaneous speech. Dr. Rosenfeld and a Pakistani PhD candidate involved in the project, Jahanzeb Sherwani, had a joint paper accepted for the IEEE/ACM International Conference on Information and Communication Technologies and Development in Doha in April 2009, and they and two other colleagues had another joint paper appear in the December 2009 issue of Information Technologies and International Development.
2008
The project is specifically focused on applying speech recognition technology to create a dialogue system for health information access. In the summer of 2008, the researchers began by conducting a usability study testing a baseline prototype system for health information access in Dadu, Sindh. This was done in collaboration with the Health and Nutrition Development Society (HANDS), a local nongovernmental organization that is headquartered in Karachi and has regional offices in various parts of Sindh. The aim of the study was to test the use of a baseline spoken information access system by low-literate community health workers and to compare its use with traditional methods such as text-based brochures. The researchers used Sindhi-language health brochures that HANDS had already created and recorded a native Sindhi speaker reading the content aloud. They then designed a simple telephone-based system that would play back the “audiobook” on demand. Initial testing showed that while users preferred receiving information in spoken rather than written form, they could not easily absorb long passages of spoken material and needed the system to be more interactive. After modifications were made, further testing was conducted with 23 community health workers in Umarkot as volunteer users. The initial testing has already highlighted interesting differences in how literate and nonliterate users process spoken information, and a paper on these findings has been submitted for publication.
The Pakistani side reportedly received its grant funds later in 2008, which slowed their research efforts. On the US side, Dr. Rosenfeld faced a complication because the first graduate student assisting him on the project, who was of Pakistani origin, was preparing to complete his degree and leave the university. Another graduate student not of Pakistani origin was recruited, and the two students made a very successful visit to Pakistan on the project in August 2008, during which they participated in the testing described above. However, in the wake of the Marriott bombing in September, the second student’s family no longer allowed him to travel to Pakistan. A third student was subsequently recruited but dropped out after the Mumbai attacks, so Dr. Rosenfeld was left to seek another student assistant. Another problem arose when the Pakistani partner, Dr. Hussain, was unable to obtain a US visa in time to make a planned visit to Carnegie Mellon in December 2008. Instead, the research teams have consulted by conference calls to make up for the lack of in-person visits.