IS Identification
The families in ISfinder are defined using an initial manual BLAST analysis,
often followed by reiterative BLAST analyses in which the primary transposase
(Tpase) sequence of a representative element is used as the query in a BLASTP
(Altschul, et al., 1990) search of microbial genomes. Potential full-length
Tpases are retained, and the one with the lowest score is then used as the
query in a second BLASTP search. This is continued until no new potential
candidates are detected. The retained sequences are then aligned using the
ClustalW multiple alignment algorithm (Thompson, et al., 1994) and the results
displayed using the Jalview alignment editor (Clamp, et al., 2004) for
assessment. The corresponding DNA, together with 1000 base pairs upstream and
downstream, is then extracted and examined manually for the inverted repeats
(IRs) or other typical features such as secondary structures and flanking
direct repeats (DRs). This, together with comparison of the DNA extremities of
various elements, allows identification of both ends of the collected elements.
In cases where more than a single IS copy is identified, BLASTN can be used to
define the IS ends. Where only a single copy is found, the ends can often be
defined by identifying empty sites and comparing them with the occupied site.
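The reiterative search described above can be sketched as a simple loop. The Python below is illustrative only and is not the ISfinder implementation: the name iterative_search and the callables search and is_full_length are hypothetical stand-ins for a BLASTP run and for the manual judgement of whether a hit is a potential full-length Tpase. It also assumes that the next query is the lowest-scoring hit among those newly retained in the current round, which is one reading of the procedure.

```python
from typing import Callable, Iterable, Set, Tuple

Hit = Tuple[str, float]  # (subject identifier, similarity score)


def iterative_search(
    seed: str,
    search: Callable[[str], Iterable[Hit]],
    is_full_length: Callable[[str], bool],
    max_rounds: int = 50,
) -> Set[str]:
    """Collect candidates by repeatedly re-querying with the weakest new hit.

    `search` is a hypothetical wrapper around a similarity search (for
    example a BLASTP run); `is_full_length` is a hypothetical filter that
    keeps only potential full-length Tpases.
    """
    collected = {seed}
    query = seed
    for _ in range(max_rounds):
        # Keep only hits that pass the full-length filter.
        retained = [(sid, score) for sid, score in search(query)
                    if is_full_length(sid)]
        # Hits not seen in any earlier round.
        new_hits = [(sid, score) for sid, score in retained
                    if sid not in collected]
        if not new_hits:
            break  # no new potential candidates: stop
        collected.update(sid for sid, _ in new_hits)
        # Assumption: the lowest-scoring new hit becomes the next query.
        query = min(new_hits, key=lambda hit: hit[1])[0]
    return collected


# Toy data standing in for real search results (entirely made up).
toy_hits = {
    "A": [("A", 500.0), ("B", 300.0), ("C", 120.0)],
    "C": [("C", 480.0), ("A", 120.0), ("D", 90.0), ("frag", 60.0)],
    "D": [("D", 450.0), ("C", 90.0)],
}


def toy_search(query: str) -> Iterable[Hit]:
    return toy_hits.get(query, [])


def toy_full_length(seq_id: str) -> bool:
    return seq_id != "frag"  # pretend "frag" is a truncated Tpase


print(sorted(iterative_search("A", toy_search, toy_full_length)))
# prints ['A', 'B', 'C', 'D']
```

Using the weakest retained hit as the next query pushes each round towards more distant relatives of the seed, which is why the loop can reach "D" even though it is not detected directly from "A".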
In a second step, we use the Markov Cluster Algorithm (MCL) (http://micans.org/mcl/) (Van Dongen, 2000; Enright, et al., 2002) to weigh the relationships between clusters
of ISs and to validate the prior ISfinder classification of ISs into families and
subgroups (Siguier, et al., 2009). This is explained in detail in Siguier, et al. (2009) and is based on the parameters used in the MCL (Fig 1.5.1) in addition to characteristics such as the
specificity of target site duplications, the detailed sequence of the ends
and the genetic organisation. It
should be understood that the distinction between families and subgroups can
evolve as the number of ISs in the database increases.
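To give a feel for what MCL does, the following is a minimal pure-Python sketch of its two alternating steps, expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation). It is a teaching toy, not the mcl program itself, and not the analysis of Siguier, et al. (2009): the function names, the unit self-loops, the inflation value of 2.0 and the toy weight matrix are all assumptions made for illustration. In practice the edge weights would presumably be derived from pairwise sequence similarity.

```python
def normalize_columns(m):
    """Scale each column of a square matrix (list of lists) to sum to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering of a symmetric, non-negative weight matrix."""
    n = len(weights)
    m = [[float(weights[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = 1.0  # assumption: add a unit self-loop to every node
    normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one step of random-walk spreading (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        inflated = [[x ** inflation for x in row] for row in expanded]
        normalize_columns(inflated)
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Made-up example: two tight triangles joined by one weak edge (2-3).
toy_weights = [
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1,   0, 1],
    [0, 0, 0, 1,   1, 0],
]

print(mcl(toy_weights))  # prints [[0, 1, 2], [3, 4, 5]]
```

The weak edge between nodes 2 and 3 is progressively suppressed by inflation, so the two triangles end up as separate clusters. Raising the inflation value generally yields finer clusters, which is one reason the boundary between families and subgroups depends on the parameters chosen.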
Several semi-automatic IS annotation
pipelines are now available. The interested reader is directed to three of
these: ISsaga (Varani, et al., 2011), which is now integrated into the ISfinder
platform (Siguier, et al., 2006), IScan (Wagner, et al., 2007) and OASIS (Robinson, et al., 2012). At present, de novo prediction of ISs is not efficient and these pipelines all
rely on the ISfinder database. While all three pipelines permit
identification of IS fragments as well as full-length ISs, a certain level of
manual assessment remains essential.
References:
- Altschul SF, Gish W, Miller W, Myers EW & Lipman DJ (1990) Basic local
alignment search tool. J Mol Biol 215: 403-410.
- Clamp M, Cuff J, Searle SM & Barton GJ (2004) The
Jalview Java alignment editor. Bioinformatics 20: 426-427.
- Enright AJ, Van Dongen S & Ouzounis CA (2002) An
efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575-1584.
- Robinson DG, Lee MC & Marx CJ (2012) OASIS: an
automated program for global investigation of bacterial and archaeal insertion
sequences. Nucleic Acids Res 40: e174.
- Siguier P, Gagnevin L & Chandler M (2009) The new
IS1595 family, its relation to IS1 and the frontier between insertion sequences
and transposons. Res Microbiol 160: 232-241.
- Siguier P, Perochon J, Lestrade L, Mahillon J &
Chandler M (2006) ISfinder: the reference centre for bacterial insertion
sequences. Nucleic Acids Res 34: D32-36.
- Thompson JD, Higgins DG & Gibson TJ (1994) CLUSTAL
W: improving the sensitivity of progressive multiple sequence alignment through
sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
- Van Dongen S (2000) A cluster algorithm for graphs. Technical Report INS-R0010, National
Research Institute for Mathematics and Computer Science in the Netherlands.
Amsterdam.
- Varani A, Siguier P, Gourbeyre E, Charneau V &
Chandler M (2011) ISsaga is an ensemble of web-based methods for high throughput
identification and semi-automatic annotation of insertion sequences in
prokaryotic genomes. Genome Biol 12: R30.
- Wagner A, Lewis C & Bichsel M (2007) A survey of bacterial insertion sequences
using IScan. Nucleic Acids Res 35: 5284-5293.