TeraPCA To Help Resolve Vast Genetic Testing Data Storage Issue

The market for genetic testing has been growing at full pace in recent years. The number of people using at-home DNA tests has doubled since 2017, most of them in the U.S. Approximately 1 in 25 American adults now know where their ancestors came from, thanks to companies such as 23andMe and AncestryDNA.

As the tests become more popular, those companies are grappling with how to store all of the data they collect and how to process results quickly. A new tool named TeraPCA, created by researchers at Purdue University, is now available to help. The results were published in the journal Bioinformatics.

Despite people’s many physical differences (depending on factors such as ethnicity, gender or lineage), any two humans are about 99 percent identical genetically. Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation and are responsible for the 1 percent that makes us different.

On average, one SNP occurs in every 1,000 nucleotides, which means there are 4 to 5 million SNPs in every person’s genome. That’s a lot of information to keep track of for even one person, but doing the same for tens of thousands or millions of people is a real challenge.

Most studies of population structure in human genetics use a tool known as Principal Component Analysis (PCA), which takes a massive set of variables and reduces it to a smaller set that still contains most of the same information. The reduced set of variables, known as principal components, is much simpler to analyze and interpret.
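To make the idea concrete, here is a minimal sketch of PCA on a toy genotype table, written in Python with NumPy. The data, the 0/1/2 allele coding and the function name `pca` are illustrative assumptions, not anything taken from TeraPCA or FlashPCA2; the sketch uses a plain singular value decomposition, which only works when the whole table fits in memory.

```python
import numpy as np

def pca(genotypes, n_components):
    """Project samples onto their top principal components via SVD."""
    # Center each SNP column so components capture variation, not the mean.
    centered = genotypes - genotypes.mean(axis=0)
    # Thin SVD: rows of vt are the principal axes, ordered by variance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Toy data (assumed): 6 individuals x 5 SNPs, coded as 0/1/2 allele copies.
genotypes = np.array([
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
], dtype=float)

scores, explained = pca(genotypes, n_components=2)
print(scores.round(3))     # each individual described by 2 numbers instead of 5
print(explained.round(3))  # share of total variance each component retains
```

In this toy table most of the variation lies along a single direction, so two components retain nearly all of the information carried by the five original SNP columns.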

Typically, the data to be analyzed is stored in the system memory, but as datasets get larger, running PCA becomes infeasible because of the computational overhead, and researchers need to use external software. For genetic testing companies, storing the data is not just costly and challenging; it also raises privacy issues. The companies have a duty to safeguard the personal and detailed health information of tens of thousands of people, and storing it all can make them an attractive target for hackers.

Like other out-of-core algorithms, TeraPCA was designed to process data too big to fit in a computer’s main memory at one time. It makes sense of large datasets by reading them in chunks.
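The chunk-by-chunk idea can be sketched in a few lines of Python with NumPy. This is an illustration of out-of-core processing in general, not TeraPCA's own code: the file layout (one row per SNP, one column per individual), the chunk size and the function name `chunked_sample_gram` are all assumptions made for the example.

```python
import os
import tempfile
import numpy as np

def chunked_sample_gram(path, chunk_rows):
    """Accumulate the individuals-by-individuals Gram matrix chunk by chunk."""
    data = np.load(path, mmap_mode="r")  # memory-mapped: nothing is read yet
    n_snps, n_individuals = data.shape
    gram = np.zeros((n_individuals, n_individuals))
    for start in range(0, n_snps, chunk_rows):
        # Only this block of SNP rows is pulled into memory.
        block = np.asarray(data[start:start + chunk_rows], dtype=float)
        # Centering is per SNP, so each block can be centered on its own.
        block -= block.mean(axis=1, keepdims=True)
        gram += block.T @ block
    return gram

# Assumed toy dataset: 1000 SNPs x 20 individuals, stored on disk.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(1000, 20)).astype(np.int8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genotypes.npy")
    np.save(path, genotypes)
    gram = chunked_sample_gram(path, chunk_rows=128)

# Check against centering and multiplying the whole matrix at once.
full = genotypes - genotypes.mean(axis=1, keepdims=True)
print(np.allclose(gram, full.T @ full))
```

Only one block of SNPs and the small individuals-by-individuals matrix are ever held in memory, and the leading eigenvectors of that small matrix give the principal components for the individuals.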

“In 2017, I met a few people from the big genetic testing firms and I asked them what they were doing to run PCA. They were using FlashPCA2, which is the industry standard, but they weren’t pleased with how long it was taking,” said Aritra Bose, a Ph.D. candidate in computer science at Purdue. “To run PCA on the genetic data of a thousand individuals and as many SNPs with FlashPCA2 would take a few days. It can be done with TeraPCA in five or six hours.”

By computing approximations of the principal components, the new tool cuts down on time. Rounding to three or four decimal places yields results that are accurate enough for practitioners, Bose said.

“People working in genetics do not need 16 digits of accuracy; that will not help the practitioners,” he said. “They want just three to four. If it’s possible to reduce it to that, then you can probably get your results pretty quickly.”
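One way to see why a looser accuracy target saves time is to run an iterative solver and stop it as soon as its estimates settle to about three digits. The sketch below uses subspace iteration in Python with NumPy purely as an illustrative choice; it is not a description of TeraPCA's actual code, and the function name `top_components`, the tolerance values and the synthetic data are all assumptions.

```python
import numpy as np

def top_components(matrix, k, tol=1e-3, max_iter=500, seed=0):
    """Approximate the top-k eigenpairs of a symmetric PSD matrix by subspace iteration."""
    rng = np.random.default_rng(seed)
    # Start from a random orthonormal basis.
    basis, _ = np.linalg.qr(rng.standard_normal((matrix.shape[0], k)))
    previous = np.zeros(k)
    for iteration in range(1, max_iter + 1):
        # Multiply by the matrix, then re-orthonormalize.
        basis, _ = np.linalg.qr(matrix @ basis)
        # Rayleigh-Ritz step: eigenvalue estimates within the current subspace.
        values = np.linalg.eigvalsh(basis.T @ matrix @ basis)[::-1]
        # Stop once successive estimates agree to the requested relative tolerance.
        if np.max(np.abs(values - previous)) <= tol * values[0]:
            break
        previous = values
    return values, basis, iteration

# Assumed synthetic data: three strong directions of variation plus noise.
rng = np.random.default_rng(1)
scales = np.concatenate(([10.0, 7.0, 5.0], np.ones(47)))
samples = rng.standard_normal((200, 50)) * scales
cov = samples.T @ samples / (len(samples) - 1)

loose_values, _, loose_iters = top_components(cov, k=3, tol=1e-3)
tight_values, _, tight_iters = top_components(cov, k=3, tol=1e-12)
exact = np.linalg.eigvalsh(cov)[::-1][:3]

print(loose_iters, tight_iters)                    # iterations needed at each tolerance
print(np.abs(loose_values - exact) / exact)        # relative error of the loose run
```

Because both runs follow the same sequence of iterates, the loose tolerance can never need more iterations than the tight one; on larger problems, each iteration saved is a full pass over the data.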

TeraPCA’s running time was also improved by using several threads of computation, known as “multithreading.” A thread is like a worker on an assembly line; if the process is the supervisor, the threads are its hardworking employees. Those employees rely on the same dataset, but they execute their own stacks.
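The worker analogy can be sketched with Python's standard thread pool and NumPy. This is a generic illustration of threads sharing one dataset, not TeraPCA's implementation; the function name `threaded_gram`, the thread count and the random data are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def threaded_gram(matrix, n_threads):
    """Compute matrix.T @ matrix by splitting rows across worker threads."""
    # Each worker gets a view onto a slice of the same shared array (no copies).
    blocks = np.array_split(matrix, n_threads, axis=0)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # NumPy typically releases the GIL inside large matrix products,
        # so the workers' blocks can be processed concurrently.
        partials = list(pool.map(lambda block: block.T @ block, blocks))
    # The "supervisor" combines the workers' partial results.
    return sum(partials)

# Assumed toy data: 4000 rows x 200 columns of already-centered values.
rng = np.random.default_rng(0)
centered = rng.standard_normal((4000, 200))

result = threaded_gram(centered, n_threads=4)
print(np.allclose(result, centered.T @ centered))
```

Every thread reads from the same array in memory but keeps its own call stack and its own partial result, which is the division of labor the analogy describes.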

Today, most universities and large businesses have multithreading architectures, but FlashPCA2 does not leverage them. For tasks such as analyzing genetic information, Bose thinks that is a missed opportunity.

“We thought we should build something that leverages the multithreading architecture which exists right now, and also our strategy scales really well,” he said. “TeraPCA scales linearly with the number of threads you have. FlashPCA2 does not do this, which means it would take very long to achieve your desired precision.”

Compared with FlashPCA2, TeraPCA performs as well or better on a single thread and better with multithreading. The code is available on GitHub.

Preety