Vaccine Information Resource Assistant (VIRA) Conversation Corpus – Data Sheet
Researchers are invited to contact the VIRA team to access its conversation corpus. Below, you will find background on the corpus and a link to submit a request.
The VIRA chatbot was launched in July 2021 by the International Vaccine Access Center at the Johns Hopkins Bloomberg School of Public Health, working with IBM Research. The project aimed to disseminate reliable guidance on COVID-19 vaccines to counteract misinformation shared through digital channels.
IBM researchers have worked on this project with the JHU team since December 2020, analyzing common concerns and queries related to the vaccines and the different ways in which people express them. The system is designed to learn continually from new conversations by detecting emerging concerns and learning new ways in which people express existing ones. The system does not generate text on its own. Rather, all the information it conveys has been curated and reviewed by experts in the field from Johns Hopkins University. Moreover, the team works to keep this information up to date with the latest data and guidelines.
The VIRA chatbot is made up of several components. Together, these components enable VIRA to 'understand' the user's concern and respond with the most relevant answer in its database. VIRA's knowledge base covers over 180 distinct concerns, referred to as Key Points, which were expressed by real-world users, along with responses from vaccine scientists and public health specialists at Johns Hopkins. VIRA can recognize the many thousands of ways in which different people express these Key Points, and it learns and improves based on user feedback.
Overall, the data set covers 10 months of chats (July 2021 to May 2022) and contains approximately 8,000 dialogs comprising about 28,000 user inputs.
This work was made possible through in-kind engineering contributions from IBM, start-up funding from Bloomberg Philanthropies, donated marketing credits from Meta, and technical support from WhatsApp and Vonage to adapt VIRA to WhatsApp.
The data was collected by storing the dialogs and their respective messages in a database. The provided data is a dump of all logs from July 2021 to May 2022. No externally identifying information (e.g., IP address, WhatsApp number) was stored.
The system does not collect data about users, and users are informed that the system does not provide medical advice. Further, users are informed that the chats may be used for research, for education, and to improve the system. To ensure that no identifiable data appears in the data set, all text that could potentially be a name, location, phone number, birthdate, email, or address was scrubbed from the data set by automated means. The data set is being used to show how queries about COVID-19 vaccines changed during the pandemic and how the system responded. It can also serve as a benchmark for methods that aim to extract emerging intents from chatbot logs.
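As an illustration of what automated scrubbing of this kind can look like, here is a minimal Python sketch that masks email addresses and phone-like digit sequences with placeholder tokens. It is not the VIRA team's actual procedure, which this data sheet does not describe; the patterns, the `scrub` function, and the `[EMAIL]` and `[PHONE]` tokens are all hypothetical. The phone pattern is deliberately broad, so it will also mask other long digit runs such as numeric dates.

```python
import re

# Hypothetical patterns for illustration only; the actual scrubbing rules
# used for the VIRA corpus are not described in this data sheet.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least seven digits/separators, then a closing digit.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(scrub("mail me at jane.doe@example.com or call +1 (410) 555-0100"))
```

Names, locations, and street addresses cannot be caught reliably with regular expressions alone; a real pipeline would more likely combine patterns like these with a named-entity recognizer and manual spot checks.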
The data is licensed under the Community Data License Agreement.
Please send a request to access the conversation corpus.