Advait Balaji

Birla Institute of Technology and Science, Pilani - Goa Campus

ORCID: 0000-0001-9858-9578

Publishes on Genomics and Phylogenetic Studies, Gene expression and cancer classification, and RNA and protein synthesis mechanisms. 33 papers and 645 citations.

Top publications by citations

Current progress and open challenges for applying deep learning across the biosciences
Nicolae Sapoval, Amirali Aghazadeh, Michael Nute et al.|Nature Communications|2022
Cited by 371|Open Access

Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.

Multiple genome alignment in the telomere-to-telomere assembly era
Bryce Kille, Advait Balaji, Fritz J. Sedlazeck et al.|Genome biology|2022
Cited by 52|Open Access

With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically-relevant insights.

EEG-based classification of bilingual unspoken speech using ANN
Cited by 38

The ability to interpret unspoken or imagined speech through electroencephalography (EEG) is of therapeutic interest for people suffering from speech disorders and 'locked-in' syndrome. It is also useful for brain-computer interface (BCI) techniques not involving articulatory actions. Previous work has involved using particular words in one chosen language and training classifiers to distinguish between them. Such studies have reported accuracies of 40-60% and are not ideal for practical implementation. Furthermore, in today's multilingual society, classifiers trained in one language alone might not always have the desired effect. To address this, we present a novel approach to improve accuracy of the current model by combining bilingual interpretation and decision making. We collect data from 5 subjects with Hindi and English as primary and secondary languages respectively and ask them 20 'Yes'/'No' questions ('Haan'/'Na' in Hindi) in each language. We choose sensors present in regions important to both language processing and decision making. Data is preprocessed, and Principal Component Analysis (PCA) is carried out to reduce dimensionality. The reduced data is input to Support Vector Machine (SVM), Random Forest (RF), AdaBoost (AB), and Artificial Neural Network (ANN) classifiers for prediction. Experimental results reveal best accuracy of 85.20% and 92.18% for decision and language classification respectively using ANN. Overall accuracy of bilingual speech classification is 75.38%.

SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning
Advait Balaji, Bryce Kille, Anthony D. Kappell et al.|Genome biology|2022
Cited by 33|Open Access

The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen.

To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
R. A. Leo Elworth, Qi Wang, Pavan K. Kota et al.|Nucleic Acids Research|2020
Cited by 30|Open Access

As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
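
The abstract above names MinHash as a sketching algorithm used for sequence similarity calculations. As a rough illustration only, and not the method of any tool discussed in the review, the idea can be sketched in Python as follows. The function names, the k-mer length `k=4`, the sketch size of 128, and the use of seeded BLAKE2b hashes are all assumptions chosen for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hypothetical helper: a seeded 64-bit hash built from BLAKE2b."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=4, num_hashes=128):
    """Keep, for each seeded hash function, the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch slots."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=4):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    s1 = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCG"
    s2 = "ACGTACGTTAGCTAGCTAGGATCCGTTCGATCG"  # one substitution relative to s1
    est = estimate_jaccard(minhash_sketch(s1), minhash_sketch(s2))
    print("estimated:", est, "exact:", exact_jaccard(s1, s2))
```

Each sketch has a fixed size regardless of sequence length, which is why such techniques scale to large datasets; the estimate approaches the exact Jaccard value as the number of hash functions grows.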