Data science can ease the necessary burden of replicating large-scale studies

March 30, 2021
Data science can help make giant study groups comprehensible. (AP Photo/Mel Evans)

Researchers have found a way to leverage statistical modeling methods to simplify the vital but costly and time-consuming process of reproducing results from extensive meta-analysis studies, which can include millions of subjects.

They say their proposed technique, described in a paper published March 30 in Nature Communications, uses rigorous, objective and statistically sound methods to indicate the strength of associations between study variables and validate the results without having to redo the whole study.

"There's a practical constraint that's limiting people's capability of replicating the study," said Dajiang Liu, an associate professor in the Department of Public Health Sciences at Pennsylvania State College of Medicine and lead author of the published work. "We basically figure out, potentially, a data science-based approach to evaluate the replicability without an extensive replication study." 

Replication is important because understanding the strength of associations between variables in a meta-analysis study can indicate whether the results should be used as a foundation for future research.

Without such knowledge, taking great time and care to do a study all over again could end up being a fruitless attempt. Additionally, because of the sheer magnitude of meta-analysis studies, sometimes it's nearly impossible to correctly replicate them, potentially leading to unvalidated results being available for citation.

"There's a reproducibility crisis in the medical field," Liu told The Academic Times. "If a signal happens to be irreproducible, and happens to be wrong, a lot of people who follow up your study could potentially waste their time because they're following something incorrect." 

Liu's team focused on genome-wide association meta-analysis studies, in which each participant's genome is scanned for single-letter variations in the DNA sequence, called single nucleotide polymorphisms. These variants are then tested for association with a person's observable characteristics, including disease traits, which are called phenotypic traits.

"In the context of genome-wide association studies, the gold-standard approach is to find a replication sample to confirm the findings," Liu explained. "The replication sample would have to be independent of the discovery sample, but nowadays, the scale of genetic studies has grown too large." 

Right off the bat, these studies include millions of subjects. The new paper notes an example: Liu's study of smoking and drinking habits that included 1.2 million people and has been cited nearly 400 times.

"It's very hard to find another 1.2 million individuals to replicate the findings," Liu said, "so we actually innovate by developing an approach that could assess the replicability without a replication data set." 

The team's model takes advantage of the breadth of associations found in these consortium studies. Like the smoking and drinking study explored in the paper, consortium studies are unique in that they include data from several different cohorts around the world.

"A true signal would be consistently observed across different cohorts," Liu said. "In each cohort, it would be strongly associated with a phenotype of interest — in this particular case, smoking, drinking addiction." 

By collecting the available information from each cohort, Liu's team created a standardized rule that confirms which associations are strong, and which are questionable. 

"It can look at the consistency and the strength of the associations of signals at different study sites, to evaluate whether those signals are trustable," Liu explained. 

The researchers tested their method using the smoking and drinking study.

"Currently, the statistical approach confirms the validity of our results," Liu said. "We can say that has a 99% probability of being reproducible; it can offer magic like that."

Liu says his mechanism is the first to truly offer a standardized, meticulous model through which researchers can run their study results. Presently, such quality control is done rather leniently, he says.

"There's not a principled way, a statistically rigorous way of doing this. There are some heuristics that people look at by eyeballing, basically," he said, warning that, "Those are very hard to generalize; you'd have results that are difficult to interpret."

Going forward, Liu hopes that applying his modeling system to all types of large-scale meta-analysis studies can drastically improve their efficiency and output. 

"I think it's going to be very important for people who want to use their results to follow up, design therapeutics or design prevention strategies," he said. "Our approach could be used in those contexts."

Liu also said that COVID-19 studies seeking to understand how the coronavirus' genetic variants influence each person's reaction to infection could greatly benefit from the approach as well.

The paper, "Model-based assessment of replicability for genome-wide association meta-analysis," published March 30 in Nature Communications, was authored by Daniel McGuire, Yu Jiang, J. Dylan Weissenkampen, Scott Eckert, Lina Yang, Fang Chen, Arthur Berg, Bibo Jiang and Dajiang J. Liu, Pennsylvania State College of Medicine; Mengzhen Liu and Scott Vrieze, University of Minnesota; and Qunhua Li, Pennsylvania State University.
