Using Fuzzy Equivalence Relations to Model Position Specificity in Sequence Kernels

Abstract

One of the most fundamental problems in computational biology is the classification of biological sequences, particularly nucleotide (DNA and RNA) and amino acid (protein) sequences. Sequence statistics were widely used in the early days of computational biology; these approaches were subsequently improved upon by Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs). Support Vector Machine (SVM) classifiers are essentially linear classifiers in their most basic form. SVMs differ from other linear classifiers, such as perceptrons or logistic regression, in that they define the classification function by maximizing the margin between the classes, a principle known to be optimal with respect to bounds on the generalization error. This paper reviews previously published sequence kernels based on the occurrence of specified patterns, as well as some position-specific variants. We then use fuzzy equivalence relations to model position specificity, which yields a generalization of position-specific sequence kernels. This is not merely an interesting connection; it also leads to the creation of new kernels, such as the Elin kernel. We find that these kernels admit an explicit representation, allowing for improved computational efficiency and feature extraction.

Introduction

One of the most fundamental problems in computational biology is the classification of biological sequences, particularly nucleotide (DNA and RNA) and amino acid (protein) sequences. Sequence statistics were widely used in the early days of computational biology; these approaches were subsequently improved upon by Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs) (Durbin et al., 1998). Support Vector Machines (SVMs) for sequence classification have grown in popularity over the last decade (Scholkopf et al., 2004). SVMs have outperformed rival methods in numerous tasks, in computational biology as well as in many other disciplines, and are now widely regarded as one of the most powerful classes of classification methods. SVMs have been applied effectively to a wide range of sequence classification tasks, from promoter and splice site detection (both on DNA data) to protein fold and secondary structure prediction (both on amino acid sequences; Ratsch and Sonnenburg, 2004; Ratsch and Sonnenburg, 2005; Meinicke et al., 2004).

SVM classifiers are essentially linear classifiers in their most basic form. SVMs differ from other linear classifiers, such as perceptrons or logistic regression, in that they define the classification function by maximizing the margin between the classes, a principle known to be optimal with respect to bounds on the generalization error (Cortes and Vapnik, 1995; Vapnik, 1998). Because linear SVMs access input vectors only through scalar products, they enable the so-called kernel trick: the scalar product is replaced by a kernel, a positive semi-definite and generally nonlinear function of two arguments. Thanks to the use of kernels, support vector machines can be applied to nearly any type of data, including raw sequence data, signals, images, or graphs; all that is required is a suitable kernel. It is no surprise, then, that one of the most important research areas in sequence classification with SVMs is the development of adequate sequence kernels. The majority of these sequence kernels are based on comparing sub-sequences and do not take the positions of these occurrences into account. However, in many cases, the presence of a given sub-sequence is only significant if it occurs at a specific location or region, as the sketch below illustrates.
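To make this concrete, the following Python sketch contrasts a position-independent spectrum kernel (counting shared k-mers) with a position-aware variant in which a shared k-mer contributes according to a Gaussian similarity of its positions, in the spirit of the oligo kernel of Meinicke et al. (2004). This is a minimal illustration, not the method developed in this paper; the function names and the smoothing parameter sigma are assumptions chosen for exposition.

from math import exp

# Collect, for every k-mer, the positions at which it occurs in the sequence.
def kmer_positions(seq, k):
    occ = {}
    for i in range(len(seq) - k + 1):
        occ.setdefault(seq[i:i + k], []).append(i)
    return occ

# Position-independent spectrum kernel: inner product of k-mer count vectors.
def spectrum_kernel(s, t, k=3):
    occ_s, occ_t = kmer_positions(s, k), kmer_positions(t, k)
    return sum(len(ps) * len(occ_t[w]) for w, ps in occ_s.items() if w in occ_t)

# Illustrative position-aware variant: a shared k-mer contributes according to
# a Gaussian similarity of its positions in the two sequences. sigma controls
# position specificity: sigma -> 0 demands exact positional agreement, while a
# large sigma approaches the position-independent kernel (up to scaling).
def positional_kernel(s, t, k=3, sigma=2.0):
    occ_s, occ_t = kmer_positions(s, k), kmer_positions(t, k)
    total = 0.0
    for w, ps in occ_s.items():
        for p in ps:
            for q in occ_t.get(w, ()):
                total += exp(-(p - q) ** 2 / (2.0 * sigma ** 2))
    return total

# Example: two sequences with similar 3-mer content at shifted positions.
s, t = "ACGTACGT", "TTACGTAC"
print(spectrum_kernel(s, t, k=3))             # counts shared 3-mers regardless of position
print(positional_kernel(s, t, k=3, sigma=1))  # discounts matches at differing positions

Because both functions are symmetric similarity measures, either can be used to fill a precomputed Gram matrix for any kernel-based learner; the point of the sketch is simply that the positional variant down-weights a matching sub-sequence as its positions in the two sequences drift apart.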