The application of deep learning methods in the analysis of livestock genomes (Preludium Bis)

Key investigators: Joanna Szyda, Krzysztof Kotlarz

Period of work: 2020-2024

Funded by: The National Science Centre (NCN)

Objectives:

Deep learning (DL) is a subfield of machine learning that has rapidly gained importance in many areas of science. It was originally developed mainly for image recognition, but it is now increasingly used in other fields, including genomics. According to an editorial in Nature Genetics (January 2019), deep learning algorithms are “to revolutionize genome analysis”. Their applications range from gene expression analysis, through modelling of gene expression regulation, to proteomics. In livestock genomics, however, analyses involving deep learning remain very sparse. The goal of our project is therefore to introduce deep learning algorithms into this field.

Methods:

In particular, our project will apply deep learning to four different aspects of whole-genome DNA sequence analysis:

  • The first and dimensionally smallest data set involves the classification of single nucleotide polymorphisms (SNPs) of four bulls, genotyped by two technologies (next-generation sequencing and an oligonucleotide microarray), into true-positive and false-positive polymorphisms. False-positive polymorphisms are SNPs whose genotypes are discordant between the two genotyping technologies. The classification algorithm will identify the DNA sequence features with the highest impact on generating a false-positive SNP.
  • The second data set involves the classification of cows into mastitis-resistant and mastitis-prone individuals, using point mutations (SNPs) and structural mutations (copy number variants, CNVs) identified from whole-genome next-generation sequencing data of 32 Polish Holstein-Friesian cows. The classification algorithm will identify the SNPs and CNVs with the highest impact on mastitis resistance.
  • The third data set will be used for a multilevel classification problem, in which SNPs and InDels identified from whole-genome DNA sequences will be used to assign individuals to breeds. For this purpose, we will use data from the 1000 Bull Genomes Project (run 7), consisting of 3 103 individuals representing various cattle breeds. The classification algorithm will identify the SNPs and InDels characteristic of the breeds.
  • The fourth problem will use the same data set as described for problem one and apply recurrent neural networks in the context of SNP genotype imputation, i.e. the prediction of the full set of SNP genotypes corresponding to the whole-genome sequence, based on a subset of SNP genotypes identified by an oligonucleotide microarray.
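
The labelling rule behind the first problem (a SNP counts as a false positive when its genotypes are discordant between the two technologies) can be illustrated with a minimal Python sketch. This is not the project's pipeline: the SNP identifiers, the genotype strings, and the function names normalise and label_snps are hypothetical. The sketch also assumes that only SNPs called by both technologies are labelled and that allele order within a genotype is irrelevant.

```python
from typing import Dict

# Hypothetical representation: SNP identifier -> genotype call such as "AG".
Genotypes = Dict[str, str]


def normalise(call: str) -> str:
    """Sort the alleles so that "AG" and "GA" compare as the same genotype."""
    return "".join(sorted(call.upper()))


def label_snps(sequencing: Genotypes, microarray: Genotypes) -> Dict[str, str]:
    """Label each SNP called by both technologies.

    Concordant genotypes give "true-positive", discordant ones "false-positive".
    SNPs missing from either platform are skipped (an assumption of this sketch).
    """
    labels = {}
    for snp_id in sequencing.keys() & microarray.keys():
        concordant = normalise(sequencing[snp_id]) == normalise(microarray[snp_id])
        labels[snp_id] = "true-positive" if concordant else "false-positive"
    return labels


# Toy, made-up genotype calls for one animal.
seq_calls = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": "AC"}
array_calls = {"snp1": "GA", "snp2": "CT", "snp3": "TT"}

labels = label_snps(seq_calls, array_calls)
for snp_id in sorted(labels):
    print(snp_id, labels[snp_id])
```

Labels of this kind would serve as the target variable of a classifier, with sequence-derived features of each SNP as the inputs.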

Results:

  1. Kotlarz, K., Mielczarek, M., Suchocki, T., Czech, B., Guldbrandtsen, B., and Szyda, J.
    The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
    Journal of Applied Genetics, 2020

Further project results coming soon!