Monday, May 8, 2017

A Primer on Downloading Sequencing Data from MG-RAST & the SRA

One of the best sets of resources we have for bioinformatics, and especially microbiome research, is the collection of extensive and freely available DNA sequence archives. For the past few years, most studies have archived (and in most cases been required to archive) their relevant sequence datasets so that they are freely available to the public and other researchers. This is becoming an increasingly valuable resource for data mining and meta-analyses now that we have about a decade of archiving behind us. Just as these datasets can be highly valuable research tools, they can also be particularly difficult resources to download and prepare for analysis. I have been meaning to get to this for a while, so this week I want to go through an introduction to downloading these datasets. My goal is to equip you to easily get the sequence sets onto your own computer and start your own analysis.

The Sequence Read Archive (SRA)

One of the largest (if not the largest) sequence dataset archives available to the public is the United States National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). This sequence archive has years of DNA sequencing studies readily available, but getting the reads can be a little bit of a challenge. They do have instructions (and other tools for downloading) in their documentation, but to make things easier, we will go through it here while including some custom scripts that you can use.

An easy way to get SRA datasets using command line tools is to download the data from their FTP site (no worries if you don't know what that is; it's just a site to download data from). As long as you are downloading a small-ish dataset, the wget tool works great. A nice subroutine you can use is as follows.

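DownloadFromSRA () {
 # The accession number is passed as the first argument
 local line="$1"
 # Assumption: the archive groups accessions by their first six
 # and first three characters (e.g. SRP000001 -> SRP000 -> SRP)
 local shortLine="${line:0:6}"
 local shorterLine="${line:0:3}"
 echo Processing SRA Accession Number "${line}"
 # Output names a subdirectory of ./data and must be set beforehand
 mkdir -p ./data/"${Output}"/"${line}"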
 echo Looking for ${shorterLine} with ${shortLine}
 # Recursively download the contents of the accession directory
 wget -r --no-parent -A "*"${shorterLine}/${shortLine}/${line}/
 mv ./${shorterLine}/${shortLine}/${line}/*/*.sra ./data/${Output}/"${line}"
 rm -r ./
}

export -f DownloadFromSRA

If you copy and paste this into your command line (Linux/Mac), you can just type the subroutine name "DownloadFromSRA", followed by the project ID that you want to use, and it will download all of the samples for you. If you are using a Mac, be sure to install wget using something like Homebrew (which I highly suggest for installing tools in general). The files you get will be in the SRA format, so remember to convert them to fastq format using NCBI's SRA Toolkit (for example, with fastq-dump).
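
The conversion step mentioned above can be sketched with fastq-dump, which ships with NCBI's SRA Toolkit. This is only a minimal sketch: it assumes the toolkit is installed and on your PATH, the subroutine name ConvertSraToFastq and the DRY_RUN switch are made up for illustration, and the --split-files and --gzip options are one reasonable choice rather than the only one.

```shell
# Convert every .sra file in a directory to gzipped fastq.
# ConvertSraToFastq and DRY_RUN are hypothetical names for this sketch.
# Set DRY_RUN=1 to print the commands instead of running them.
ConvertSraToFastq () {
 local sradir="$1" sra
 for sra in "${sradir}"/*.sra; do
  # Skip the unexpanded pattern when the directory holds no .sra files
  [ -e "${sra}" ] || continue
  if [ "${DRY_RUN:-0}" = "1" ]; then
   echo fastq-dump --split-files --gzip --outdir "${sradir}" "${sra}"
  else
   fastq-dump --split-files --gzip --outdir "${sradir}" "${sra}"
  fi
 done
}

# Example (hypothetical directory name):
#   DRY_RUN=1 ConvertSraToFastq ./data/MyStudy/SRP000001
```

Running it with DRY_RUN=1 first is a cheap way to check which files would be converted before committing to a long job.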

You don't have to be a superhero hacker to get DNA data from public archives.

The Metagenomics RAST Server (MG-RAST)

Although used less than the SRA, the Metagenomics RAST Server (MG-RAST) is another of the major archives available for free public use. MG-RAST is a nice sequence repository, but it is unfortunately more difficult to use than the SRA (for downloading sequences at least). Downloading MG-RAST data with command line tools is honestly complicated at first, and the key details are sort of hidden in the documentation. Again, we can use some custom scripts to make things easier.

The trick to getting the MG-RAST sequence files using a project ID is that you have to first download the project metadata, and then use the parsed metadata information to download the actual files (this is done in the while loop below). The actual URL to use with their API is also kind of confusing, but once you get it you are ready to go.

DownloadFromMGRAST () {
 # The project ID is passed as the first argument
 local line="$1"
 echo Processing MG-RAST Accession Number "${line}"
 mkdir -p ./data/"${line}"
 # Download the raw information for the metagenomic run from MG-RAST
 wget -O ./data/"${line}"/tmpout.txt "${line}?verbosity=full"
 # Parse the raw metagenome information for individual sample IDs
 sed 's/metagenome_id\"\:\"/\nmgm/g' ./data/"${line}"/tmpout.txt \
  | sed 's/\".*//' \
  | grep mgm \
  > ./data/"${line}"/SampleIDs.tsv
 # Get rid of the raw metagenome information now that we are done with it
 rm ./data/"${line}"/tmpout.txt
 # Now loop through all of the accession numbers from the metagenome library
 while read -r acc; do
  echo Loading MG-RAST Sample ID "${acc}"
  # file=050.1 means the raw input that the author meant to archive
  wget -O ./data/"${line}"/"${acc}".fa "${acc}?file=050.1"
 done < ./data/"${line}"/SampleIDs.tsv
 # Get rid of the sample list file
 rm ./data/"${line}"/SampleIDs.tsv
}

export -f DownloadFromMGRAST

These files will be in the fasta format instead of the SRA format you get from the SRA. Also note that this uses GNU sed, which is not installed on Mac computers by default (Mac has a different version of sed. I know, it's kind of annoying). So if you are running this on a Mac, make sure to install GNU sed using Homebrew again (note that Homebrew installs it under the name gsed by default).
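
If you would rather not install GNU sed, the ID-parsing step can be approximated with grep -o plus a sed expression that avoids the newline replacement, which is the part that behaves differently on a Mac. Here is a minimal sketch, assuming the metadata contains fields of the exact form metagenome_id":"<number>" with no extra whitespace, which is what the sed pattern in the subroutine implies. The name ExtractMetagenomeIDs and the sample JSON are made up for illustration and are not real MG-RAST output.

```shell
# Read project metadata on stdin and print one sample ID per line,
# each prefixed with "mgm" like the sed pipeline in the subroutine.
# ExtractMetagenomeIDs is a hypothetical name for this sketch.
ExtractMetagenomeIDs () {
 # grep -o prints each match on its own line, so no "\n" is needed in sed
 grep -o 'metagenome_id":"[^"]*' \
  | sed 's/.*:"/mgm/'
}

# Made-up sample input, only to show the shape of the output
printf '%s\n' '{"metagenomes":[{"metagenome_id":"4440037.3","name":"a"},{"metagenome_id":"4440038.3","name":"b"}]}' \
 | ExtractMetagenomeIDs
```

With the sample input above, this prints mgm4440037.3 and mgm4440038.3 on separate lines.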

To give it a try, copy and paste this subroutine into your command line, and then type its name followed by the project ID, like below.

DownloadFromMGRAST 4843
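
If you have more than one project to fetch, the same call can be wrapped in a small loop that reads IDs from a text file. This is just a sketch under a few assumptions: the helper name RunForEachAccession and the file name projects.txt are made up for illustration, the file holds one ID per line, and blank lines and lines starting with # are skipped.

```shell
# Call a download subroutine once for every ID listed in a file.
# RunForEachAccession is a hypothetical helper for this sketch.
#   $1 = name of the subroutine to call (e.g. DownloadFromMGRAST)
#   $2 = text file with one ID per line
RunForEachAccession () {
 local cmd="$1" list="$2" acc
 # Read the list on file descriptor 3 so the called subroutine
 # cannot accidentally consume it from standard input
 while read -r acc <&3 || [ -n "${acc}" ]; do
  # Skip blank lines and comment lines
  case "${acc}" in ''|\#*) continue ;; esac
  "${cmd}" "${acc}"
 done 3< "${list}"
}

# Example (hypothetical file name):
#   RunForEachAccession DownloadFromMGRAST projects.txt
```

The downloads run one after another, which keeps things simple and avoids hitting the archive with many requests at once.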


So there you have it: a very brief introduction to downloading SRA and MG-RAST datasets, with an emphasis on providing you the tools to do it yourself. Go ahead and give it a try. Let me know how it works, and if you run into problems, feel free to reach out with questions, comments, or concerns!

Finally, thanks for reading! If you are a frequent reader, you might have noticed that my posts have been less frequent lately. I apologize for that. This has been an eventful year, which is great in general but bad for keeping up with the blog. As usual, it means I have some other exciting projects going on, and I am excited to share those experiences on here later. So for now the posts will be less frequent, but I look forward to getting back in a more frequent writing groove in the near future.
