The application of deep learning methods in the analysis of livestock genomes (Preludium Bis)
Key investigators: Joanna Szyda, Krzysztof Kotlarz
Period of work: 2020-2024
Funded: The National Science Centre (NCN)
Objectives:
Deep learning (DL) is a subfield of machine learning that has rapidly gained importance in many fields of science. It was originally developed mainly for image recognition, but it is now increasingly used in other fields, including genomics. According to an editorial in Nature Genetics (January 2019), deep learning algorithms are “to revolutionize genome analysis”. Their applications range from gene expression analysis, through modeling of gene expression regulation, to proteomics. However, in livestock genomics, analyses involving deep learning remain very sparse. Therefore, the goal of our project is to introduce deep learning algorithms into this field.
Applications:
In particular, our project is going to use deep learning for four different aspects of whole-genome DNA sequence analysis:
- The first, dimensionally the smallest, data set involves the classification of single nucleotide polymorphisms (SNPs) of four bulls genotyped by two technologies – next-generation sequencing and an oligonucleotide microarray – into true-positive and false-positive polymorphisms. False-positive polymorphisms are SNPs whose genotypes are discordant between the two genotyping technologies. The classification algorithm will identify DNA sequence features with the highest impact on generating a false-positive SNP.
- The second data set involves the classification of cows into mastitis-resistant and mastitis-prone individuals, using point (SNPs) and structural (copy number variants, CNV) types of mutations, identified based on whole-genome DNA next-generation sequences of 32 Polish Holstein-Friesian cows. The classification algorithm will identify SNPs and CNVs with the highest impact on mastitis resistance.
- The third data set will be used for a multilevel classification problem, in which SNPs and InDels identified from whole-genome DNA sequences will be used to assign individuals to breeds. For this purpose, we will use data from the 1000 Bull Genomes Project (run7), consisting of 3,103 individuals representing various cattle breeds. The classification algorithm will identify SNPs and InDels characteristic of the breeds.
- The fourth problem will use the same data set as described for problem one and apply recurrent neural networks in the context of SNP genotype imputation, i.e. prediction of the full set of SNP genotypes corresponding to the whole genome sequence, based on a subset of SNP genotypes identified by an oligonucleotide microarray.
1. The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
Introduction
A downside of next-generation sequencing technology is its high technical error rate. We built a tool that uses array-based genotype information to classify next-generation sequencing–based SNPs into correct and incorrect calls. The deep learning algorithms were implemented in Keras.
Materials and methods
Several algorithms were tested: (i) the basic, naïve algorithm, (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric, and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs.
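The weighting and re-sampling variants above can be sketched as follows. This is a minimal illustration, not the published implementation. The inverse-frequency weighting formula, the function names, and the decision to draw the whole re-sampled set from the incorrect SNPs are assumptions, since the text above gives neither the weights nor the exact sampling procedure.

```python
import random

def oversample_incorrect(correct, incorrect, ratio, seed=0):
    """Draw incorrect SNPs with replacement until they number ratio * len(correct).

    Assumption: the re-sampled set replaces the original incorrect SNPs.
    """
    rng = random.Random(seed)
    target = round(ratio * len(correct))
    resampled = rng.choices(incorrect, k=target)  # sampling with replacement
    return correct + resampled

def inverse_frequency_weights(n_correct, n_incorrect):
    """Hypothetical class weights: the rarer class gets the larger weight.

    Keys are class labels (0 = correct SNP, 1 = incorrect SNP).
    """
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}

# Class counts of the training data set quoted above.
weights = inverse_frequency_weights(2_227_995, 46_920)

# Toy example of variant (iii): incorrect SNPs re-sampled to 30% of the correct ones.
toy = oversample_incorrect(list(range(100)), ["bad_1", "bad_2"], ratio=0.3)
```

A dictionary of this shape can be passed to the `class_weight` argument of a Keras model's `fit` method, which is one common way to weight classes in the loss.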
Results
The results showed that for a rare-event classification problem, like incorrect SNP detection in NGS data, the most parsimonious naïve model and the model with weighting of SNP classes provided the best results for the classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, and resulted in an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less closely fitted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models.
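To show how the reported rates relate to the F1 score, the sketch below derives F1 from sensitivity, specificity and class sizes. It assumes that incorrect SNPs are the positive class. Because the rates quoted above are rounded to whole percentages, the value it returns only approximates the reported 0.21.

```python
def f1_from_rates(sensitivity, specificity, n_positive, n_negative):
    """F1 score of the positive class, derived from rates and class sizes."""
    true_pos = sensitivity * n_positive
    false_neg = n_positive - true_pos
    false_pos = (1.0 - specificity) * n_negative
    # Equivalent to the harmonic mean of precision and sensitivity.
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

# Validation data set: 14,908 incorrect (positive) and 749,506 correct SNPs.
f1 = f1_from_rates(0.19, 0.99, n_positive=14_908, n_negative=749_506)
```

The low score despite 99% specificity illustrates the effect of class imbalance: even a 1% error rate among the many correct SNPs yields more false positives than there are true positives.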
2. An explainable deep learning classifier of bovine mastitis based on whole genome sequence data – circumventing the p>>n problem
Introduction
The most serious drawback underlying the biological annotation of Whole Genome Sequence data is the p>>n problem, meaning that the number of polymorphic variants (p) is much larger than the number of available phenotypic records (n). Therefore, the major aim of the study was to propose a way to circumvent the problem by combining a LASSO logistic regression model with Deep Learning (DL). This was illustrated by a practical biological problem: the classification of cows as mastitis-susceptible or mastitis-resistant, based on genotypes of Single Nucleotide Polymorphisms (SNPs) identified in their whole genome DNA sequences. The first part of the analysis, i.e., the bioinformatic pipeline, aims to estimate the set of SNPs that forms the input for the DL-based classification scheme. Then, the statistical pipeline is used to select the single best-classifying model, comprising its underlying neural network architecture, hyperparameters, subset of SNPs, and estimated cut-off values. Finally, the biological pipeline is applied to the significant SNPs from the best-classifying model to provide a biological explanation relevant to the data set, consisting of genome annotation and enrichment analysis.
Materials and methods
DL hyperparameters were optimised with the Optuna software on different SNP subsets, each defined by a LASSO logistic regression with a different penalty value. Among the resulting architectures, the one based on 204,642 SNPs was selected as the best. This architecture was composed of 2 layers with 7 and 46 units, respectively, and respective drop-out rates of 0.210 and 0.358.
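The idea of defining SNP subsets through LASSO penalties of different strength can be illustrated with a small sketch. It fits an L1-penalised logistic regression by proximal gradient descent on simulated genotypes. The solver, the learning rate, the penalty values and the toy data are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np

def lasso_logistic(X, y, penalty, lr=0.1, n_iter=500):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient step on the (unpenalised) logistic loss.
        w = w - lr * (X.T @ (prob - y)) / n_samples
        b = b - lr * np.mean(prob - y)
        # Soft-thresholding step: shrinks weights and sets small ones to zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * penalty, 0.0)
    return w, b

# Simulated p >> n setting: 32 animals, 200 SNPs coded 0/1/2, then centred.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(32, 200)).astype(float)
X -= X.mean(axis=0)
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # hypothetical phenotype

# Each penalty defines one candidate SNP subset (the non-zero coefficients).
subset_sizes = {}
for penalty in (0.01, 0.05, 0.2):
    w, _ = lasso_logistic(X, y, penalty)
    subset_sizes[penalty] = int(np.count_nonzero(w))
```

Stronger penalties keep fewer SNPs, so each penalty value yields a different input dimension for the downstream network.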
Results
The classification of the test data set resulted in an AUC of 0.750, an accuracy of 0.650, a sensitivity of 0.600, and a specificity of 0.700. This model was selected as the best one and was therefore carried forward to genomic and functional annotation. Significant SNPs were selected based on the SHapley Additive exPlanation values transformed to Z-scores of the standard normal distribution. These SNPs were then genomically annotated to genes. As a final result, a single gene ontology (GO) term related to the biological process and thirteen GO terms related to the molecular function were significantly enriched in the gene set that corresponded to the significant SNPs.
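The selection of significant SNPs from Z-scores can be sketched as below. The input is assumed to be one importance value per SNP (for example, a mean absolute SHAP value), and the one-sided test at a 5% level is likewise an assumption. The text above does not specify how the SHAP values were aggregated or which threshold was applied.

```python
from statistics import NormalDist, mean, pstdev

def significant_by_zscore(importance, alpha=0.05):
    """Standardise per-SNP importance values and keep those above the cut-off."""
    values = list(importance.values())
    mu = mean(values)
    sigma = pstdev(values)
    # Upper-tail quantile of the standard normal distribution.
    cutoff = NormalDist().inv_cdf(1.0 - alpha)
    z_scores = {snp: (value - mu) / sigma for snp, value in importance.items()}
    return {snp: z for snp, z in z_scores.items() if z > cutoff}

# Toy importance values: twenty uninformative SNPs and one influential SNP.
toy_importance = {f"snp_{i}": 0.01 for i in range(20)}
toy_importance["snp_influential"] = 1.0
selected = significant_by_zscore(toy_importance)
```

The selected SNP identifiers would then be mapped to genes for the annotation and enrichment steps.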
3. Exploring the impact of sequence context on errors in SNP genotype calling with Whole Genome Sequencing data using AI-based AutoEncoder approach
Introduction
The rise of next-generation sequencing (NGS) has transformed genomics, enabling rapid and affordable analysis of large genomic datasets. Variant calling, a crucial step in NGS data analysis, identifies genetic variations like single nucleotide polymorphisms (SNPs) and structural variants (SVs). However, variant calling is prone to errors stemming from sequencing inaccuracies, misalignment to the reference genome, and algorithmic limitations.
This study focuses on improving SNP genotype classification, addressing the imbalance between correctly and incorrectly called SNPs. We propose an anomaly detection approach using an autoencoder (AE) architecture, a type of neural network designed for unsupervised learning tasks. The AE learns to reconstruct its input data, so anomalies can be identified by their high reconstruction errors.
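The reconstruction-error principle can be shown with a deliberately simplified stand-in. The sketch below uses a linear projection onto principal components in place of a trained neural autoencoder, and a plain quantile threshold in place of a dedicated outlier detector. The simulated data and the 99% threshold are assumptions for illustration only.

```python
import numpy as np

def reconstruction_errors(X, n_components):
    """Squared reconstruction error after projecting onto the top components."""
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]
    reconstructed = centred @ components.T @ components
    return np.sum((centred - reconstructed) ** 2, axis=1)

# Simulated data: 200 regular observations lying close to one direction.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
regular = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
anomaly = np.array([[4.0, -4.0, 4.0]])  # does not follow the common pattern
X = np.vstack([regular, anomaly])

errors = reconstruction_errors(X, n_components=1)
threshold = np.quantile(errors, 0.99)
flagged = np.where(errors > threshold)[0]
```

Observations that follow the dominant structure are reconstructed almost exactly, whereas the anomalous row has by far the largest error and is flagged.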
Materials and methods
Whole genome sequences (WGS) and array-based SNP genotypes of twenty unrelated Holstein-Friesian cows were available for analysis. The data from the genotyping array were preprocessed. SNPs incorrectly called by the WGS processing pipeline were identified by comparing their genotypes with those obtained from the genotyping array. In particular, incorrect SNPs were defined as SNPs with a mismatch involving at least one allele between the genotype reported on the array and the genotype reported by the SNP calling workflow. The following explanatory variables from the Variant Call Format (VCF) file were considered as potential features for the reconstruction of the SNP genotypes in the autoencoder: SNP genotype (CALL), sequencing depth at the SNP site (DP), and SNP genotype quality (GQ), expressed as the probability that the called genotype is the true genotype.
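The definition of an incorrect SNP given above can be expressed as a short check. The genotype encoding (a two-letter string of alleles, with allele order ignored) and the function name are assumptions made for this sketch.

```python
def is_incorrect_call(array_genotype, wgs_genotype):
    """Return True if the two genotypes differ in at least one allele.

    Genotypes are unordered allele pairs, so "AG" and "GA" are treated as equal.
    """
    return sorted(array_genotype) != sorted(wgs_genotype)

# Hypothetical SNPs: (genotype on the array, genotype called from WGS).
toy_calls = {
    "snp_1": ("AG", "GA"),  # same genotype, different allele order
    "snp_2": ("AG", "AA"),  # one allele differs
    "snp_3": ("CC", "TT"),  # both alleles differ
}
labels = {snp: is_incorrect_call(arr, wgs) for snp, (arr, wgs) in toy_calls.items()}
```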
Results
The residuals corresponding to all PCs were used to classify the SNPs as correctly called or as outliers characterised by high reconstruction error. The precision of the models, visualised in Figure 5, varied from 48.21% (±1.56%) in the L model fitting 3 PCs to 59.92% (±0.59%) in the L model with 19 PCs, classified by iForest. For both algorithms (iForest and OCSVM), the most complex model L resulted in the highest precision.
Papers:
- Kotlarz, K., Mielczarek, M., Suchocki, T., Czech, B., Guldbrandtsen, B., and Szyda, J.
The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
Journal of Applied Genetics, 2020
Next project’s results soon!