Researchers Create Public Database of Nearly 40,000 Control Samples for Genetic Association Studies

By Lauren Dembeck

Genetic association studies provide a powerful means of discovering relationships between genetic variants and disease risk. However, these studies require enrolling very large numbers of individuals with the disease (cases) and healthy individuals (controls), as well as collecting genetic data for all of those participants, which is a time-consuming and expensive endeavor.

“Sequencing DNA for a healthy control group in a case-control study potentially takes half of the study budget. And such control data is generated independently by many research groups. This effort is needed because we are unable to freely share control data between research groups,” explains Mykyta Artomov, PhD, principal investigator at the Steve and Cindy Rasmussen Institute for Genomic Medicine at Nationwide Children’s Hospital. “If we can share a large, pooled set of control samples, then we can spend more of our budgets sequencing more individuals with the disease, increasing the power of the study to detect disease-causing genetic variants.”

While sharing data from control participants sounds convenient and worthwhile, human genetic data are private health information, meaning researchers, even within the same institution, cannot share these data without abiding by strict consent and privacy regulations.

There are well-established repositories for sharing genetic data, such as the National Institutes of Health database of Genotypes and Phenotypes (dbGaP), explains Dr. Artomov. However, to access such data, researchers must go through an application process, and they still need considerable time and funds to process the data, which often come from numerous sources and are distributed in a raw, unprocessed format.

To overcome these challenges, Dr. Artomov and his colleagues Alexander Loboda, Maxim N. Artyomov and Mark J. Daly have created an online database of 39,472 exome sequencing control samples that complies with personal data protection regulations. This public platform eliminates the need to share individual-level data and will serve as a valuable resource, enabling association studies for case cohorts that lack controls.

The study detailing the development of the platform was recently published in Nature Genetics. It describes the extensive process the researchers used to evaluate the performance and robustness of their approach, using large-scale genetic data that span multiple technical platforms as well as major continental and fine-scale ancestry groups.

The algorithm that underlies the platform selects the best-matching control samples from an external pool of samples. In under 10 minutes, the system selects control samples to a pre-specified matching accuracy, eliminating the need for data processing by the user and ensuring an unbiased dataset.
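
For readers who want a concrete picture of what "selecting best-matching controls to a pre-specified accuracy" can look like, the following is a minimal illustrative sketch. It is not the platform's algorithm, whose internals are not described here. It assumes, purely for illustration, that each sample is represented as a point in a low-dimensional ancestry space (for example, principal component coordinates) and uses simple greedy nearest-neighbor matching. The function name `select_matched_controls` and the parameter `max_distance` are hypothetical.

```python
import math


def select_matched_controls(cases, controls, max_distance):
    """Greedy one-to-one nearest-neighbor matching (illustrative only).

    cases, controls: lists of coordinate tuples, e.g. ancestry PCs
                     (an assumed representation, not the platform's).
    max_distance:    hypothetical matching-accuracy threshold; a case
                     with no control this close is left unmatched.
    Returns a list of (case_index, control_index) pairs.
    """
    available = set(range(len(controls)))  # controls not yet used
    pairs = []
    for case_idx, case in enumerate(cases):
        best_idx, best_dist = None, math.inf
        # Scan remaining controls in index order for the closest one.
        for ctrl_idx in sorted(available):
            dist = math.dist(case, controls[ctrl_idx])
            if dist < best_dist:
                best_idx, best_dist = ctrl_idx, dist
        # Accept the match only if it meets the accuracy threshold.
        if best_idx is not None and best_dist <= max_distance:
            pairs.append((case_idx, best_idx))
            available.remove(best_idx)
    return pairs


if __name__ == "__main__":
    cases = [(0.0, 0.0), (1.0, 1.0)]
    controls = [(0.1, 0.0), (5.0, 5.0), (1.0, 1.2), (0.0, 0.05)]
    print(select_matched_controls(cases, controls, max_distance=0.5))
```

In this toy setup, tightening `max_distance` yields fewer but closer matches, which mirrors the trade-off between sample size and matching accuracy that any control-selection scheme has to manage.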

Dr. Artomov and colleagues hope their technology will open the door to more public sharing of valuable control datasets, which would increase not only the number of samples but also the diversity of the overall control population available in the database.

“In countries where genetic data is strictly regulated and not permitted to be included in any public repository, it enables researchers to share their data more broadly; they can use our system to make their data available without making any security compromises, giving the research community access to data from more diverse populations,” adds Dr. Artomov.

“We’re making it much faster and cheaper for researchers to make biological discoveries of new disease-associated genes and causative variants by putting valuable control data that may have previously only been used once, or only by a single research group, to continue to be put to good use,” says Dr. Artomov.


Artomov M, Loboda AA, Artyomov MN, Daly MJ. Public platform with 39,472 exome control samples enables association studies without genotype sharing. Nature Genetics. 2024.

Image credit: Adobe Stock


About the author

Lauren Dembeck, PhD, is a freelance science and medical writer based in New York City. She completed her BS in biology and BA in foreign languages at West Virginia University. Dr. Dembeck studied the genetic basis of natural variation in complex traits for her doctorate in genetics at North Carolina State University. She then conducted postdoctoral research on the formation and regulation of neuronal circuits at the Okinawa Institute of Science and Technology in Japan.