Background

The primary objective of comparative genomics is to infer attributes of an unknown gene by comparing it with well-studied genes. Gene orthology plays a pivotal role here, because orthologous genes remain less diverged over the course of evolution. However, identifying orthologs is often difficult, slow, and idiosyncratic, especially when proteins contain multiple domains, when evolutionary dynamics (gene duplication, transfer, loss, introgression, etc.) are involved, when multiple paralogous genes exist, when genome data are incomplete, and when the species are distantly related and similarity is hard to recognize.

Motivation

Advances in this domain have mostly been limited to building gene databases, so that a BLAST search can be performed to estimate orthologous relationships, or, at the other extreme, to constructing computationally expensive phylogenetic trees to identify gene duplications and speciations. These algorithms usually involve parameters that are tuned manually and that set one method apart from another. Machine-learning-based approaches show greater promise, but the existing work is too constrained.

In this work, our motivation is to investigate how much information about gene orthology can be determined from the gene sequence alone, using machine learning.

Methods

NORTH starts by breaking the input protein sequence of the gene into k-mers. It uses the k-mer frequencies as features and applies a multinomial Naive Bayes classifier with the bag-of-words model to classify the input sequence into one of the predefined orthologous clusters. Finally, it reports whether the gene is a member of one of the predefined orthologous clusters or an outlier.
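The pipeline described above can be illustrated with a minimal sketch. This is not the NORTH implementation: the function and class names, the choice of k = 3, the Laplace smoothing constant, and the toy sequences and cluster labels are all assumptions made for illustration. The outlier-reporting step is left out because its criterion is not described here.

```python
import math
from collections import Counter, defaultdict


def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (the 'words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts (hypothetical helper class)."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k          # k-mer length (assumed value)
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, sequences, labels):
        self.kmer_counts = defaultdict(Counter)  # cluster -> k-mer counts
        self.cluster_sizes = Counter(labels)     # cluster -> number of sequences
        for seq, label in zip(sequences, labels):
            self.kmer_counts[label].update(kmers(seq, self.k))
        self.vocabulary = {km for c in self.kmer_counts.values() for km in c}
        self.n_sequences = len(labels)
        return self

    def log_scores(self, sequence):
        """Return log prior + log likelihood of the sequence for every cluster."""
        counts = Counter(kmers(sequence, self.k))
        scores = {}
        for label, size in self.cluster_sizes.items():
            total = sum(self.kmer_counts[label].values())
            denom = total + self.alpha * len(self.vocabulary)
            score = math.log(size / self.n_sequences)
            for km, n in counts.items():
                if km in self.vocabulary:  # skip k-mers never seen in training
                    seen = self.kmer_counts[label][km]
                    score += n * math.log((seen + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Toy training data: made-up sequences and made-up cluster names.
train_sequences = ["MKVLAAGIVMKVLAAG", "MKVLAAGLVMKVLA",
                   "GHWPETRRGHWPET", "GHWPETKRGHWPE"]
train_labels = ["cluster_A", "cluster_A", "cluster_B", "cluster_B"]

model = KmerNaiveBayes(k=3).fit(train_sequences, train_labels)
print(model.predict("MKVLAAGIV"))
```

Working in log space avoids numerical underflow when many k-mer probabilities are multiplied, and the smoothing constant keeps a k-mer that is absent from one cluster from forcing that cluster's probability to zero.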

NORTH Server

NORTH is accessible through a web interface backed by a server-side application. Users can submit raw protein sequences, FASTA files, or UniProt IDs as input. We also provide APIs for using the system programmatically.

The server can be accessed using the following button. We intend to keep updating the NORTH server application as new genomic data become available.

A step-by-step tutorial is also available.

NORTH Server

Desktop App

NORTH is also available as a standalone, cross-platform desktop application for the convenience of researchers. The apps can be downloaded free of charge from the following links.

Source Code

NORTH is available as open-source code at the following links.

Code for the Paper

Code for the Server App

Code for the Desktop App

Publication

If you use NORTH in your research, please cite the following paper:
NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene clustering algorithm.
Nabil Ibtehaz, Shafayat Ahmed, Bishwajit Saha, M. Sohel Rahman, Md. Shamsuzzoha Bayzid

Contact Us

If you have any issues, queries, or suggestions related to the NORTH project, please contact the following people:
Md. Shamsuzzoha Bayzid, Assistant Professor, CSE, BUET (shams_bayzid@cse.buet.ac.bd)
Nabil Ibtehaz, CSE, BUET (1017052037@grad.cse.buet.ac.bd)
Shafayat Ahmed, CSE, BUET (shafayat@ra.cse.buet.ac.bd)
Bishwajit Saha, CSE, BUET (1205043.bs@ugrad.cse.buet.ac.bd)
M. Sohel Rahman, Professor, CSE, BUET (msrahman@cse.buet.ac.bd)