Public Cloud Databases For Bioinformatics

At some point in its evolution, every biopharmaceutical company faces the problem of finding new targets to keep its pipeline fully loaded.  Usually, they’ve exhausted the existing set of targets that they were working on, and are trying to decide what’s next.

The answer to that question results from a lot of blood, sweat and tears, usually in the form of literature mining and sequence analysis workflows.  Rather than running the risk of someone sniffing their internet traffic or otherwise sussing out what they’re working on, scientists often download PubMed and GenBank, fill up several terabytes of disk space, load the data into databases, write a series of queries, and perhaps throw a web front end on it.  And of course, once you’ve gone through this exercise, you have to repeat it because the data is constantly updated.  Moreover, you’re now the de facto bioinformatics guy because you’re the one using up all of the servers and disk space.  This is not a cheap proposition, and if you’re a startup company, it’s one of those necessary evils that drains your limited resources.
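To make the do-it-yourself workflow concrete, here is a minimal sketch of the "load the data into databases, write a series of queries" step. It is only an illustration under stated assumptions: the FASTA records are made up (they are not real GenBank entries), the table layout and the function names parse_fasta, load_sequences and find_by_keyword are hypothetical, and an in-memory SQLite database stands in for whatever database server you would really use.

```python
import sqlite3


def _split_header(header):
    """Split a FASTA header into (identifier, description)."""
    identifier, _, description = header.partition(" ")
    return (identifier, description)


def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


def load_sequences(conn, records):
    """Create the (hypothetical) table if needed and insert or refresh records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    # INSERT OR REPLACE lets a re-run overwrite records that changed upstream.
    conn.executemany(
        "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)", records
    )
    conn.commit()


def find_by_keyword(conn, keyword):
    """Return (id, sequence length) for records whose description matches."""
    rows = conn.execute(
        "SELECT id, length(sequence) FROM sequences "
        "WHERE description LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


# Made-up records for illustration only; not real GenBank entries.
SAMPLE_FASTA = """\
>SEQ001 hypothetical kinase, example organism
ATGGCGTACG
TTAGC
>SEQ002 hypothetical protease, example organism
ATGCCC
>SEQ003 hypothetical kinase-like protein
ATGAAATTTGGG
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_sequences(conn, parse_fasta(SAMPLE_FASTA))
    print(find_by_keyword(conn, "kinase"))
```

Even this toy version hints at the maintenance burden: every time the upstream data changes, the load step has to be run again, and at real scale that means terabytes rather than three records.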

The folks at Amazon Web Services have come up with a different approach.  They download the public databases and keep them up to date for you.  You pay for the storage required for the data, and your access is secured through Amazon’s Virtual Private Cloud technology.  You don’t have to buy servers or storage area networks; you simply rent the disk space and servers that you need, for as long as you need them.

Once you’ve gone through the mining effort, you simply collect the results, and turn off the servers once you’re no longer using them.  So if your efforts included BLASTing a number of sequence databases, once you’ve finished, there’s no need to keep the servers around. If your needs expand, then you can simply add servers as you need them.  No capital costs involved, no long boring budget meetings to sit in.

What databases are available?

  • PubMed
  • PubChem/PubChem 3D
  • UGI Virtual Conformer Library – handy for virtual screening
  • UniGene
  • GenBank (both FASTA and MySQL)
  • Ensembl (both FASTA and MySQL)

To find out more, visit Amazon Web Services here.


About Mark Fortner

I write software for scientists doing drug discovery and cancer research. I'm interested in Design Thinking, Agile Software Development, Web Components, Java, Javascript, Groovy, Grails, MongoDB, Firebase, microservices, the Semantic Web, Drug Discovery and Cancer Biology.
This entry was posted in Bioinformatics, Informatics.

3 Responses to Public Cloud Databases For Bioinformatics

  1. Hi!
    You could also include Bio4j in that list 😉


    • aspenbio says:

      Hi Pablo,
      While I agree that bio4j is an interesting database, I don’t see it listed in the AWS Public Data Sets list.


      • Hi Mark,

        I’m glad you find it interesting.
        You’re right it’s not listed in AWS Public Data Sets, thanks for pointing it out.
        I just completed a submission form so that they can include it if they find it appropriate. I’ll let you know if they do 😉

