SSG-LUGIA

SSG-LUGIA: Single Sequence based Genome Level Unsupervised Genomic Island Prediction Algorithm

SSG-LUGIA combines several sequence-based features to infer genomic islands (GIs) using an unsupervised anomaly detection pipeline.

w : int, default : 10000
Length of the sliding window.

dw : int, default : 100
Step size of the sliding window.

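As a rough illustration of how `w` and `dw` interact, the sketch below lists the start positions of the windows that would tile a sequence. The helper name `window_starts` is hypothetical, and keeping only full-length windows is an assumption; how SSG-LUGIA itself treats the tail of a sequence is not stated here.

```python
def window_starts(seq_len, w=10000, dw=100):
    """Return the start index of every full-length window (assumed behaviour)."""
    if w <= 0 or dw <= 0:
        raise ValueError("w and dw must be positive")
    if seq_len < w:
        return []  # sequence shorter than one window
    return list(range(0, seq_len - w + 1, dw))


starts = window_starts(seq_len=10500)
print(len(starts), starts[0], starts[-1])  # 6 0 500
```

With the defaults, consecutive windows overlap by `w - dw` = 9900 bases, so a smaller `dw` gives finer resolution at the cost of more windows to score.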
karlin_mode : str {'normalized', 'original', 'raw'}, default : 'normalized'
The mode of Karlin (dinucleotide) feature computation:
- If normalized, $$\text{2-mer}(XY) = \frac{freq(XY)}{\sqrt{freq(X) \times freq(Y)}}$$
- If original, $$\text{2-mer}(XY) = \frac{freq(XY)}{freq(X) \times freq(Y)}$$
- If raw, $$\text{2-mer}(XY) = freq(XY)$$
Here, $$X, Y \in \{A, T, C, G\}$$.

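The three modes can be sketched in plain Python as follows. This is an illustrative reimplementation, not the code used by SSG-LUGIA: the function name `dinucleotide_features` is hypothetical, overlapping dinucleotides are counted, the raw mode is taken to mean the plain dinucleotide frequency freq(XY), and a feature is set to 0.0 when either base is absent.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ATCG"


def dinucleotide_features(seq, mode="normalized"):
    """Return a dict mapping each of the 16 dinucleotides to its feature value."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    features = {}
    for x, y in product(BASES, repeat=2):
        f_xy = di[x + y] / (len(seq) - 1)
        f_x, f_y = mono[x] / len(seq), mono[y] / len(seq)
        if mode == "raw":
            value = f_xy
        elif mode == "original":
            value = f_xy / (f_x * f_y) if f_x and f_y else 0.0
        elif mode == "normalized":
            value = f_xy / sqrt(f_x * f_y) if f_x and f_y else 0.0
        else:
            raise ValueError(f"unknown karlin_mode: {mode!r}")
        features[x + y] = value
    return features


feats = dinucleotide_features("ATAT", mode="original")
print(round(feats["AT"], 3))  # 2.667
```

In `"ATAT"`, freq(AT) = 2/3 and freq(A) = freq(T) = 1/2, so the original mode gives (2/3) / (1/4) = 8/3.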
pca_dn : int [1,16], default : 2
Number of PCA components for dinucleotide features.

pca_amino_acid : int [1,20], default : 2
Number of PCA components for amino acid features.

pca_kmer4 : int [1,256], default : 2
Number of PCA components for 4-mer features.

entropy_features : bool, default : True
Whether to use entropy features.

contamination_model1 : float [0.0,0.99], default : 0.15
The expected proportion of outliers in the first anomaly detection scan.
This is a parameter of sklearn.covariance.EllipticEnvelope.

support_fraction_model1 : float [0.0,0.99], default : 0.75
The proportion of points to be included in the support of the raw MCD estimate in the first anomaly detection scan.
This is a parameter of sklearn.covariance.EllipticEnvelope.

contamination_model2 : float [0.0,0.99], default : 0.05
The expected proportion of outliers in the second anomaly detection scan.
This is a parameter of sklearn.covariance.EllipticEnvelope.

support_fraction_model2 : float [0.0,0.99], default : 0.9
The proportion of points to be included in the support of the raw MCD estimate in the second anomaly detection scan.
This is a parameter of sklearn.covariance.EllipticEnvelope.

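To show how these four values map onto scikit-learn, the sketch below builds two `EllipticEnvelope` detectors with the default settings and fits them on synthetic data. The random matrix `X` is only a stand-in for per-window features, and the example does not attempt to reproduce how SSG-LUGIA chains the two scans, which is not described here. Note that scikit-learn documents `contamination` as lying in (0, 0.5].

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic stand-in for a (windows x features) matrix; not real genome data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# One detector per scan, configured with the default parameter values.
scan1 = EllipticEnvelope(contamination=0.15, support_fraction=0.75, random_state=0)
scan2 = EllipticEnvelope(contamination=0.05, support_fraction=0.9, random_state=0)

# predict() returns 1 for inliers and -1 for outliers.
labels1 = scan1.fit(X).predict(X)
labels2 = scan2.fit(X).predict(X)

print("scan 1 outliers:", int((labels1 == -1).sum()))
print("scan 2 outliers:", int((labels2 == -1).sum()))
```

A larger `contamination` flags more windows as anomalous, which is consistent with the recall-oriented variant using the largest values.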
median_filter_window_len : int, default : 400
Window size used in median filtering.

min_island_len : int, default : 10000
Minimum length of a predicted genomic island. Shorter islands are discarded.
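The last two parameters describe post-processing. The sketch below shows one plausible reading, under several assumptions that are not confirmed by this document: each window carries a 0/1 anomaly label, the median filter runs over those labels, consecutive flagged windows are merged into one island, and island coordinates are derived from `dw` and `w`. The helper names `median_filter` and `extract_islands` are hypothetical.

```python
from statistics import median


def median_filter(labels, window_len):
    """Centered sliding median over 0/1 labels; edges use a truncated window."""
    half = window_len // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        smoothed.append(1 if median(labels[lo:hi]) > 0.5 else 0)
    return smoothed


def extract_islands(labels, dw, w, min_island_len):
    """Merge runs of flagged windows into (start, end) spans; drop short ones."""
    islands, run_start = [], None
    for i, flag in enumerate(list(labels) + [0]):  # sentinel closes a trailing run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            start, end = run_start * dw, (i - 1) * dw + w
            if end - start >= min_island_len:
                islands.append((start, end))
            run_start = None
    return islands


labels = median_filter([0, 1, 1, 0, 1, 1, 0, 0, 1, 0], window_len=3)
print(extract_islands(labels, dw=100, w=1000, min_island_len=1000))
```

The median filter removes isolated flags and fills single-window gaps, so that short-lived fluctuations do not split or create islands.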

We present three variants of SSG-LUGIA: SSG-LUGIA_F, SSG-LUGIA_P and SSG-LUGIA_R, which aim to optimize the F1-score, precision and recall, respectively. The parameter values used in these models are as follows:

Parameter	SSG-LUGIA_F	SSG-LUGIA_P	SSG-LUGIA_R
w	10000	10000	10000
dw	100	100	100
karlin_mode	'normalized'	'normalized'	'normalized'
pca_dn	2	2	2
pca_amino_acid	2	2	2
pca_kmer4	2	2	2
contamination_model1	0.15	0.075	0.2
support_fraction_model1	0.75	0.75	0.75
contamination_model2	0.05	0.075	0.25
support_fraction_model2	0.9	0.9	0.9
median_filter_window_len	400	400	400
min_island_len	10000	10000	10000
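For reference, the table can be transcribed into Python dictionaries. The names `BASE` and `PRESETS` are hypothetical and not part of the package; `entropy_features` is left out because the table does not list it. The snippet also checks which parameters actually differ between the variants.

```python
# Values shared by all three variants, copied from the table above.
BASE = {
    "w": 10000,
    "dw": 100,
    "karlin_mode": "normalized",
    "pca_dn": 2,
    "pca_amino_acid": 2,
    "pca_kmer4": 2,
    "support_fraction_model1": 0.75,
    "support_fraction_model2": 0.9,
    "median_filter_window_len": 400,
    "min_island_len": 10000,
}

PRESETS = {
    "SSG-LUGIA_F": {**BASE, "contamination_model1": 0.15, "contamination_model2": 0.05},
    "SSG-LUGIA_P": {**BASE, "contamination_model1": 0.075, "contamination_model2": 0.075},
    "SSG-LUGIA_R": {**BASE, "contamination_model1": 0.2, "contamination_model2": 0.25},
}

# Collect the parameters whose value is not identical across the variants.
keys = PRESETS["SSG-LUGIA_F"].keys()
differing = sorted(k for k in keys if len({p[k] for p in PRESETS.values()}) > 1)
print(differing)  # ['contamination_model1', 'contamination_model2']
```

Only the two contamination values change between the variants; all other parameters are shared.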

Users can create custom models using Python dictionaries containing the parameters.

custom_model = {}
custom_model['w'] = 15000
custom_model['dw'] = 250
custom_model['karlin_mode'] = "normalized"
custom_model['pca_dn'] = 2
custom_model['pca_amino_acid'] = 2
custom_model['pca_kmer4'] = 2
custom_model['entropy_features'] = True
custom_model['contamination_model1'] = 0.25
custom_model['support_fraction_model1'] = 0.75
custom_model['contamination_model2'] = 0.125
custom_model['support_fraction_model2'] = 0.9
custom_model['median_filter_window_len'] = 1000
custom_model['min_island_len'] = 4000
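Before passing a custom dictionary on, it can be useful to check it against the ranges listed in the parameter descriptions. The helper below is a hypothetical convenience, not part of SSG-LUGIA, and it only checks the ranges and the `karlin_mode` options documented above.

```python
# Inclusive ranges as stated in the parameter descriptions.
RANGES = {
    "pca_dn": (1, 16),
    "pca_amino_acid": (1, 20),
    "pca_kmer4": (1, 256),
    "contamination_model1": (0.0, 0.99),
    "support_fraction_model1": (0.0, 0.99),
    "contamination_model2": (0.0, 0.99),
    "support_fraction_model2": (0.0, 0.99),
}
KARLIN_MODES = {"normalized", "original", "raw"}


def check_model(params):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name}={params[name]} is outside [{low}, {high}]")
    if params.get("karlin_mode", "normalized") not in KARLIN_MODES:
        problems.append(f"unknown karlin_mode: {params['karlin_mode']!r}")
    return problems


print(check_model({"pca_dn": 20, "karlin_mode": "normalised"}))
```

For the `custom_model` dictionary shown above, `check_model(custom_model)` would return an empty list.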

Custom models can be used by passing them as model_parameters:

from main import SSG_LUGIA

SSG_LUGIA(sequence_fasta_file_path='sample_data/NC_003198.1.fasta',model_parameters=custom_model)

Such models can be stored as .json files and loaded for future use.

import json

with open('custom_model.json', 'w') as outfile:
    json.dump(custom_model, outfile)
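Loading a stored model back into a dictionary uses `json.load`. The sketch below is a self-contained round trip with a shortened dictionary and a temporary directory, both chosen only for illustration; all parameter values are plain numbers, strings or booleans, so they survive the round trip unchanged.

```python
import json
import os
import tempfile

# Shortened example dictionary; a real model would list every parameter.
custom_model = {"w": 15000, "dw": 250, "karlin_mode": "normalized", "entropy_features": True}

# Temporary location used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "custom_model.json")

with open(path, "w") as outfile:
    json.dump(custom_model, outfile, indent=2)

with open(path) as infile:
    loaded_model = json.load(infile)

print(loaded_model == custom_model)  # True
```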

Models stored in .json format can be used by passing the path to the .json file as model_name:

from main import SSG_LUGIA

SSG_LUGIA(sequence_fasta_file_path='sample_data/NC_003198.1.fasta',model_name='custom_model.json')
