Machine learning tools to expedite citation screening and risk of bias appraisal in systematic reviews: evaluations of Abstrackr and RobotReviewer

Session: 

Oral session: Innovative solutions to challenges of evidence production (3)

Date: 

Monday 17 September 2018 - 15:10 to 15:20

Location: 

All authors in correct order:

Gates A1, Vandermeer B1, Johnson C1, Hartling L1
1 Alberta Research Centre for Health Evidence, Department of Pediatrics, University of Alberta, Canada
Presenting author and contact person

Presenting author:

Lisa Hartling

Contact person:

Abstract text
Background: Abstrackr and RobotReviewer are emerging tools that semi-automate citation screening and 'Risk of bias' (RoB) appraisal in systematic reviews (SRs).
Objectives: We evaluated the reliability of Abstrackr’s predictions of relevant records and of RobotReviewer’s RoB judgments by comparing them with human reviewer consensus.
Methods: We used a convenience sample of SRs completed at our centre. For Abstrackr, we selected four SRs that were heterogeneous with respect to search yield, topic, and screening complexity. We uploaded the records to Abstrackr and screened until a prediction of the relevance of the remaining records became available. We compared the predictions with human reviewer consensus and calculated precision, proportion missed, and workload savings. For RobotReviewer, we used 1180 trials from 10 SRs or methodological research projects that varied by topic. We compared RobotReviewer’s RoB judgments for six domains with human reviewer consensus and calculated reliability (Cohen’s kappa coefficient), sensitivity, and specificity.
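For reference, a minimal sketch of the metrics named above, assuming the standard definitions (the abstract does not state the exact operational definitions used): for screening, TP denotes records predicted relevant and included by reviewer consensus, FP records predicted relevant but excluded, FN records predicted irrelevant but included, and TN records predicted irrelevant and excluded; for RoB agreement, p_o and p_e denote the observed and chance-expected agreement between RobotReviewer and reviewer consensus, with sensitivity and specificity computed analogously over the RoB judgments.

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{proportion missed} = \frac{FN}{TP + FN}, \qquad
\text{workload savings} = \frac{\text{records not screened manually}}{\text{total records}}
\]
\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}
\]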
Results: Abstrackr’s precision varied by screening task (median 27%, range 15% to 65%). The proportion missed was 0.1% for three of the SRs and 6% for the final SR, accounting for a median of 4% (range 0 to 12%) of the records included in the final reports. The workload savings were often large (median 27%, range 10% to 88%). RobotReviewer’s reliability (95% confidence interval (CI)) was moderate for random sequence generation (0.48 (0.43 to 0.53)), allocation concealment (0.45 (0.40 to 0.51)), and blinding of participants and personnel (0.42 (0.36 to 0.47)). Reliability (95% CI) was slight for blinding of outcome assessors (0.10 (0.05 to 0.14)), incomplete outcome data (0.14 (0.08 to 0.19)), and selective reporting (0.02 (-0.02 to 0.05)). Sensitivity and specificity (95% CI) ranged from 0.20 (0.18 to 0.23) to 0.76 (0.72 to 0.80) and from 0.61 (0.56 to 0.65) to 0.95 (0.93 to 0.96) across topics, respectively.
Conclusions: Abstrackr’s reliability and the workload savings varied by SR, and the workload savings came at the expense of missing potentially relevant records. For most domains, RobotReviewer’s reliability was similar to the reliability reported between author groups. These promising tools should be tested on large samples of heterogeneous SRs to establish their practical utility and to inform guidance for their use.
Patient or healthcare consumer involvement: None.

Relevance to patients and consumers: 

When reviewers cannot keep pace with the publication of new trial data, clinicians may rely on out-of-date evidence to inform healthcare decisions, and patients may therefore receive treatments that do not reflect current evidence. Following further development and testing, incorporating machine learning into standard systematic review processes will improve the efficiency of systematic review production. If gains in efficiency can be appropriately balanced with methodological rigour, the result will be more up-to-date, high-quality systematic reviews and, in turn, better healthcare for patients.