Tuesday, December 30, 2014

Technically Speaking: Converting Glimmer predict and gff3 Gene Annotations

Example screenshot of open reading frame
annotation within the Geneious program.
Predicting open reading frames within genomic sequences is one of the most basic yet important tasks in bioinformatics and sequencing analysis. Given an organism's genomic sequence, or a section of it, we predict which regions are potential genes. At its most basic level, this can be done by looking for sequence regions between start and stop codons (the sequence signals for the beginning and end of a gene). While there are many programs for predicting open reading frames, I often use the common Glimmer3 toolkit. This program works well overall, but one drawback is that it can be hard to visualize your open reading frames on your genome or genomic region (using Geneious or the Integrative Genomics Viewer) because Glimmer3 does not give you a '.gff3' formatted file, which is commonly used by these programs. In this technical post, I will describe the file types you get from Glimmer3, explain the '.gff3' file type, and leave you with a Perl script to convert between the two.
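
To make the start-to-stop idea concrete, here is a minimal Perl sketch that scans the three forward-strand reading frames for stretches running from an ATG to the next in-frame stop codon. The function name, the toy sequence, and the frame numbering (offset plus one) are assumptions made for illustration. Glimmer3 itself scores candidates with a statistical model, so this shows only the basic idea, not its algorithm.

```perl
use strict;
use warnings;

# Return [start, end, frame] for every ATG...stop stretch on the forward strand.
# Coordinates are 1-based and the end is the last base of the stop codon.
# Hypothetical helper for illustration only; the reverse strand is not scanned.
sub find_forward_orfs {
    my ($seq) = @_;
    $seq = uc $seq;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @orfs;
    for my $frame (0 .. 2) {
        my $orf_start;    # 0-based index of the currently open ATG, if any
        for (my $i = $frame; $i + 3 <= length($seq); $i += 3) {
            my $codon = substr($seq, $i, 3);
            if (!defined $orf_start && $codon eq 'ATG') {
                $orf_start = $i;
            } elsif (defined $orf_start && $is_stop{$codon}) {
                push @orfs, [$orf_start + 1, $i + 3, $frame + 1];
                undef $orf_start;
            }
        }
    }
    return @orfs;
}

# Toy sequence containing ATG AAA TGA starting at base 3.
my @orfs = find_forward_orfs('CCATGAAATGACC');
printf "start=%d end=%d frame=+%d\n", @$_ for @orfs;
```

A real predictor would also scan the reverse complement and filter by minimum length, which this sketch leaves out.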

As you predict open reading frames with a Glimmer workflow, you get a few different files in a few different formats. The two main output files are the '.detail' file and the '.predict' file. The '.detail' file includes a lot of information about all of the candidate open reading frames, while the '.predict' file includes only the final open reading frame predictions. The '.predict' file may contain predictions from one or many genomes or genomic segments. Each block in the file can be broken into six parts (a header line followed by five columns), which I outline below.

  • Header: The first line is the genome identification, which is the same ID that the sequence had in the fasta file. Under the header is a set of five columns.
  • Column 1: The name (ID) of the predicted open reading frame.
  • Column 2: The sequence base number of the first base in the open reading frame. In other words, the starting location.
  • Column 3: The sequence base number of the final base in the open reading frame (last base of the stop codon). In other words, the ending location.
  • Column 4: The reading frame, written as a sign for the strand (+ for forward, - for reverse) followed by the frame number (1, 2, or 3).
  • Column 5: The 'per-base raw' score of the predicted open reading frame.

An example of a '.predict' file is found below. This file contains various predicted open reading frames for three different genomic sequence fragments.

>1_
orf00001      522      689  +3     0.85
orf00004     5600     4083  -3     2.97
orf00006     8925     9050  +3     0.11
orf00007    10514     9444  -3     2.95
orf00008    10836    10961  +3     2.96
orf00016    16597    15113  -2     2.97
>2_
orf00001     4684       94  +2     2.91
orf00002      353      207  -3     1.19
orf00003      464      194  +2     2.92
orf00004       33       27  -3     1.11
>3_
orf00004     5600     4083  -3     2.97
orf00006     8925     9050  +3     0.11
orf00007    10514     9444  -3     2.95
orf00008    10836    10961  +3     2.96
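
The block structure above (a '>' header line followed by five whitespace-separated columns) can be read into a simple data structure. The following is a minimal sketch, assuming headers begin with '>' and that the sequence ID is everything up to the first whitespace; the function name and hash keys are my own choices for illustration.

```perl
use strict;
use warnings;

# Parse Glimmer3 '.predict' text into { sequence_id => [ {orf record}, ... ] }.
# Hypothetical helper; lines that do not look like ORF rows are skipped.
sub parse_predict {
    my ($text) = @_;
    my (%orfs_by_contig, $contig);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>(\S+)/) {
            # Header line: start a new block for this sequence.
            $contig = $1;
            $orfs_by_contig{$contig} = [];
        } elsif (defined $contig
            && $line =~ /^(\S+)\s+(\d+)\s+(\d+)\s+([+-])([123])\s+(\S+)/) {
            # ORF row: ID, start, end, strand sign, frame number, raw score.
            push @{ $orfs_by_contig{$contig} }, {
                id     => $1, start => $2, end   => $3,
                strand => $4, frame => $5, score => $6,
            };
        }
    }
    return \%orfs_by_contig;
}

my $example = join "\n",
    '>1_',
    'orf00001      522      689  +3     0.85',
    'orf00004     5600     4083  -3     2.97',
    '>2_',
    'orf00002      353      207  -3     1.19';
my $parsed = parse_predict($example);
for my $id (sort keys %$parsed) {
    printf "%s: %d ORFs\n", $id, scalar @{ $parsed->{$id} };
}
```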

This file format is fine for many downstream applications, but it won't get you very far in visualizing the ORFs on your sequences in programs such as Geneious or the Integrative Genomics Viewer. A good standard format for visualizing predicted open reading frames on genomic sequences is the '.gff3' file format. It includes most of the same information as the '.predict' file, but it is formatted differently. Because it is a standard format, it will play nicely with more programs, and more people will be familiar with it.

The '.gff3' (or Generic Feature Format Version 3) file format is a common and robust format used to describe genomic features (primarily genes, predicted open reading frames, etc.). Like the '.predict' format, '.gff3' is broken into columns, but '.gff3' is tab delimited, does not use per-sequence header lines, and consists of nine columns. The nine columns are described below.

  • Column 1: (Seqid) The genome or genomic fragment that the predicted open reading frame belongs to.
  • Column 2: (Source) The name of the algorithm, program, or workflow that generated the open reading frame.
  • Column 3: (Type) The type of the feature described by the line, which is a gene in our case.
  • Column 4: (Start) The sequence base number of the first base in the open reading frame. In other words, the starting location. Note that the GFF3 specification expects start to be less than or equal to end, with orientation given by the strand column, whereas my script passes Glimmer's coordinates through unchanged.
  • Column 5: (End) The sequence base number of the final base in the open reading frame (last base of the stop codon). In other words, the ending location.
  • Column 6: (Score) The score of the feature is loosely defined; it is a number, often an e-value or other score associated with the feature. In my case, I used this column to hold the identifier of the predicted open reading frame, which worked well in my own downstream analyses, but you can use the 'per-base raw' score instead.
  • Column 7: (Strand) A plus or minus to identify which strand the open reading frame is a part of.
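  • Column 8: (Phase) This number describes the reading frame associated with the feature. In the GFF3 specification, phase is 0, 1, or 2 and is required only for CDS features; my script writes Glimmer's frame number (1 to 3) here.
  • Column 9: (Attributes) A semicolon-separated list of additional attributes associated with the open reading frame feature, typically written as tag=value pairs such as ID=orf00001.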

An example of a '.gff3' file can be found below. It was generated using my Perl script (described below) and is derived from the '.predict' file above. Note that this file contains only the open reading frames associated with sequence 2.

2    GLIMMER    gene    4684    94    orf00001    +    2    ID=orf00001; NOTE: Glimmer ORF prediction;
2    GLIMMER    gene    353    207    orf00002    -    3    ID=orf00002; NOTE: Glimmer ORF prediction;
2    GLIMMER    gene    464    194    orf00003    +    2    ID=orf00003; NOTE: Glimmer ORF prediction;
2    GLIMMER    gene    33    27    orf00004    -    3    ID=orf00004; NOTE: Glimmer ORF prediction;
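
For comparison, here is a minimal sketch of how the nine columns could be assembled if you want lines that follow the GFF3 conventions more closely: start less than or equal to end, a numeric score, '.' for the phase of a non-CDS feature, and tag=value attributes. The function name and argument names are hypothetical, and this is not what my script does. It also assumes the ORF does not wrap around the origin of a circular sequence, which would need extra handling.

```perl
use strict;
use warnings;

# Build one tab-delimited GFF3 line from Glimmer-style fields.
# Hypothetical helper: start and end are reordered so that start <= end,
# and orientation is carried by the strand column alone.
sub gff3_line {
    my (%f) = @_;
    my ($start, $end) = sort { $a <=> $b } @f{qw(start end)};
    my $attributes = join ';', "ID=$f{id}", 'Note=Glimmer ORF prediction';
    return join "\t",
        $f{seqid}, 'GLIMMER', 'gene',
        $start, $end,
        $f{score},     # Glimmer raw score
        $f{strand},    # '+' or '-'
        '.',           # phase is only defined for CDS features
        $attributes;
}

print gff3_line(
    seqid => '2', id => 'orf00002',
    start => 353, end => 207, strand => '-', score => 1.19,
), "\n";
```

Whether a viewer accepts the looser layout shown earlier or needs the stricter one depends on the program, so it is worth testing with your own tools.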

To end this post, I am including the Perl script I used for the conversion from '.predict' to '.gff3'. I am also storing it with the rest of my microbiome analysis tools on my GitHub account. To use the script, call it with perl, followed by the input file, the sequence ID you want to extract information for, and the output file.

Using the script:
perl GlimmerPredict2Gff3.pl ~/TestIn.predict 2 ~/TestOut.gff3

#!/usr/local/bin/perl -w
# GlimmerPredict2Gff3.pl
# Geoffrey Hannigan
# Elizabeth Grice Lab
# University of Pennsylvania
# This script will take in a .predict file from glimmer and will convert it to gff3 format.

# Set use
use strict;
use warnings;

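# Read the command-line arguments and open the input and output files
my $usage = "Usage: perl $0 <INFILE> <CONTIGID> <OUTFILE>\n";
my $infile = shift or die $usage;
# Check definedness rather than truth so that a contig ID of 0 is accepted
my $contigid = shift;
die $usage unless defined $contigid && length $contigid;
my $outfile = shift or die $usage;
open(IN, '<', $infile) || die "Unable to open $infile: $!";
open(OUT, '>', $outfile) || die "Unable to write to $outfile: $!";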

# Confirm the contig identification
print "Contig ID is $contigid.\n";

# Flag is 0 until the header of the contig of interest is found, then 1
my $flag = 0;

while (my $line = <IN>) {
    if ($flag == 0) {
        # Skip lines until the header of the contig of interest, then increment the flag.
        # \Q...\E keeps regex metacharacters in the ID from being interpreted.
        ++$flag if $line =~ /^>\Q$contigid\E_/;
        next;
    }
    # The next header line marks the end of the ORFs of interest. We are done here.
    last if $line =~ /^>/;
    chomp $line;
    $line =~ s/\s+/\t/g;
    # Captures: ORF ID, start, end, strand sign, frame number, raw score.
    # Print only when the line matches, so blank or malformed lines never leave a partial record.
    if ($line =~ /^(\S+)\t(\S+)\t(\S+)\t([+-])(\d)\t(\S+)/) {
        print OUT "$contigid\tGLIMMER\tgene\t$2\t$3\t$1\t$4\t$5\tID=$1; NOTE: Glimmer ORF prediction;\n";
    }
}

#Close out files and print completion note to STDOUT
close(IN);
close(OUT);
print "Fin.\n";
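
If you are not sure which sequence IDs a '.predict' file contains, a short helper can list them before you run the conversion. This is a minimal sketch that assumes headers look like '>ID_', the same pattern the script above searches for; the function name and the in-memory example data are made up for illustration.

```perl
use strict;
use warnings;

# List the sequence IDs found in '.predict' header lines, in file order.
# Hypothetical helper; assumes each header starts with '>' followed by the ID and an underscore.
sub contig_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        push @ids, $1 if $line =~ /^>([^_\s]+)_/;
    }
    return @ids;
}

# Demonstrate on an in-memory file; open a real '.predict' file the same way in practice.
my $predict = ">1_\norf00001      522      689  +3     0.85\n"
            . ">2_\norf00002      353      207  -3     1.19\n";
open my $fh, '<', \$predict or die "Unable to open in-memory file: $!";
print "$_\n" for contig_ids($fh);
close $fh;
```

Each ID printed this way can be passed as the second argument to GlimmerPredict2Gff3.pl.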


