How good is the agreement between machine and humans? Use of RobotReviewer to evaluate the risk of bias of randomized trials

Session: 

Oral session: Innovative solutions to challenges of evidence production (2)

Date: 

Sunday 16 September 2018 - 16:00 to 16:20

Location: 

All authors in correct order:

Armijo-Olivo S1, Craig R1, Campbell S2
1 Institute of Health Economics, Canada
2 University of Alberta, Canada
Presenting author and contact person

Presenting author:

Susan Armijo-Olivo

Abstract text
Background: Evidence on new technologies and treatments is growing, along with demand for evidence to inform policy decisions. It is therefore anticipated that the need for knowledge synthesis products (e.g. health technology assessments (HTAs) and systematic reviews (SRs)) will increase. Increased demand will create challenges in completing assessments in a timely manner. New technologies such as RobotReviewer, a semi-automated risk of bias (RoB) assessment tool, aim to reduce the time and resource burden of completing HTAs/SRs. However, current evidence validating the existing software for use in the HTA/SR process is limited.
Objectives: To test the accuracy of RobotReviewer and its agreement with RoB assessments generated by consensus between human reviewers.
Methods: We used a random sample of randomized controlled trials (RCTs). We compared the consensus assessments of two human reviewers with the RoB ratings generated by RobotReviewer. We assessed agreement between RobotReviewer and the human reviewers using weighted kappa (K), and assessed the accuracy of RobotReviewer by calculating its sensitivity and specificity against the human consensus ratings.
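As an illustration of the statistics named above, the sketch below shows how weighted kappa and sensitivity/specificity might be computed for a single RoB domain. The rating coding, the linear weighting scheme, the low-vs-not-low dichotomization, and the example data are assumptions for illustration only; the abstract does not report these implementation details.

```python
# Minimal sketch of the agreement and accuracy statistics described above.
# Assumes ratings are coded as ordered categories (0 = low, 1 = unclear,
# 2 = high risk) and that accuracy is computed after collapsing ratings to a
# binary low-vs-not-low judgment; these codings are assumptions, not details
# reported in the abstract.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical example ratings for one RoB domain
human = [0, 1, 2, 0, 2, 1, 0, 0, 2, 1]   # human consensus ratings
robot = [0, 1, 1, 0, 2, 2, 0, 1, 2, 1]   # RobotReviewer ratings

# Weighted kappa (linear weights assumed here)
kappa = cohen_kappa_score(human, robot, weights="linear")

# Sensitivity and specificity, treating "unclear/high risk" as the positive class
human_bin = [int(r > 0) for r in human]
robot_bin = [int(r > 0) for r in robot]
tn, fp, fn, tp = confusion_matrix(human_bin, robot_bin).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"weighted kappa = {kappa:.2f}")
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```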
Results: In total, 372 trials were included in this study. Agreement on individual domains of the RoB tool ranged from K = -0.01 (95% CI -0.03 to 0.001; no agreement) for overall RoB, to K = 0.62 (95% CI 0.534 to 0.697; good agreement) for random sequence generation. Agreement was fair for allocation concealment (K = 0.41; 95% CI 0.31 to 0.51), slight for blinding of outcome assessment (K = 0.23; 95% CI 0.13 to 0.34), and poor for blinding of participants and personnel (K = 0.06; 95% CI 0.002 to 0.1). More than 70% of the quotes supporting the RoB judgments for blinding of participants and personnel (72.6%) and blinding of outcome assessment (70.4%) were irrelevant.
Conclusions: This is the first study to provide a thorough analysis of the usability of RobotReviewer. Agreement between RobotReviewer and human reviewers ranged from no agreement to good agreement. However, RobotReviewer selected a high percentage of irrelevant quotes in making its RoB assessments. Use of RobotReviewer in isolation as a first or second reviewer is not recommended at this point.
Patient or health consumer involvement: It is hoped that the results will help knowledge synthesis teams decide whether to use such a tool to speed up the process of knowledge synthesis.

Relevance to patients and consumers: 

This study is targeted at systematic reviewers and health technology assessment teams. The results will provide information on the usefulness and accuracy of one software tool used to evaluate the risk of bias of randomized controlled trials in the context of knowledge synthesis. It is hoped that the results will help knowledge synthesis teams decide whether to use such a tool to speed up the process of knowledge synthesis.