The distant segments kernel and the support vector machine: an alignment-free method for HIV type 1 coreceptor usage prediction.
Sébastien Boisvert, Mario Marchand, François Laviolette, and Jacques Corbeil.
Robert Cedergren Bioinformatics Colloquium 2008 (Université de Montréal).

HIV type 1 infects human cells through interactions between ligands and receptors. This retrovirus uses the CD4 receptor in conjunction with a chemokine receptor to penetrate target cells. In vivo, the chemokine receptor is either CCR5 or CXCR4. Bioinformatic methods have been described to predict coreceptor usage, but they all rely on sequence alignments, so sequences with too many indels cannot be processed. To overcome this drawback, we developed an alignment-free approach using string kernels and support vector machines (SVMs). The SVM has strong theoretical support and is very robust to noise. We created a new string kernel, the distant segments kernel, and compared it to existing string kernels from the literature, such as the local alignment kernel and the blended spectrum kernel.
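To illustrate what "alignment-free" means for a string kernel, the following is a minimal sketch of the blended spectrum kernel, one of the existing kernels mentioned above. It is not the distant segments kernel, whose definition is not given here. The function names, the `max_length` parameter, and the toy sequences are assumptions made for illustration only.

```python
from collections import Counter


def substring_counts(sequence, max_length):
    """Count every substring of length 1..max_length in the sequence."""
    counts = Counter()
    for k in range(1, max_length + 1):
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def blended_spectrum_kernel(s, t, max_length=3):
    """Inner product of the substring-count vectors of s and t.

    No alignment is needed: s and t may have different lengths.
    """
    counts_s = substring_counts(s, max_length)
    counts_t = substring_counts(t, max_length)
    return sum(n * counts_t[sub] for sub, n in counts_s.items() if sub in counts_t)


def gram_matrix(sequences, max_length=3):
    """Pairwise kernel values, the form a kernel-based SVM trainer consumes."""
    return [
        [blended_spectrum_kernel(a, b, max_length) for b in sequences]
        for a in sequences
    ]


# Made-up toy strings of different lengths (not real viral sequences).
toy_sequences = ["CTRPNNNTRK", "CTRPNTRK", "CARPGNKTIR"]
toy_gram = gram_matrix(toy_sequences)
```

The resulting Gram matrix is symmetric, and each entry is computed directly from substring counts, so insertions and deletions change the kernel value without making any sequence unprocessable.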

With the distant segments kernel, we obtained an accuracy (1 - empirical risk) of 94.80% on a testing set of 1425 examples, using a classifier trained on a set of 1425 examples. Our algorithm outperforms the current state-of-the-art method for this classification task. Out of the 1425 training examples, only 577 were used as support vectors by the support vector machine, which indicates that a large-margin linear classifier exists in a large feature space. Our method allows fast and accurate prediction of all possible coreceptor usages, namely CCR5, CXCR4, and CCR5-and-CXCR4. We implemented a web server to perform automatic classification through a CGI interface. This web server is available at
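The accuracy reported above is defined as one minus the empirical risk. A minimal sketch of that computation follows, assuming the zero-one loss (each misclassified example counts as one error); the function names and the four toy labels are hypothetical. The last line computes the fraction of training examples retained as support vectors from the two counts stated above.

```python
def empirical_risk(true_labels, predicted_labels):
    """Fraction of misclassified examples (zero-one loss, assumed here)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    errors = sum(1 for y, p in zip(true_labels, predicted_labels) if y != p)
    return errors / len(true_labels)


def accuracy(true_labels, predicted_labels):
    """Accuracy expressed as 1 - empirical risk."""
    return 1.0 - empirical_risk(true_labels, predicted_labels)


# Toy labels for illustration only; these are not results from the study.
toy_truth = ["CCR5", "CXCR4", "CCR5", "CCR5-and-CXCR4"]
toy_predicted = ["CCR5", "CCR5", "CCR5", "CCR5-and-CXCR4"]
toy_accuracy = accuracy(toy_truth, toy_predicted)

# Fraction of the 1425 training examples used as support vectors (577).
support_vector_fraction = 577 / 1425
```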

Support vector machines and string kernels have broad applicability in bioinformatics, including remote protein homology detection, gene finding, and clustering. Furthermore, kernels are not limited to bioinformatics: they can also be applied to many tasks in chemoinformatics, such as virtual screening trials.