A New Year, A New Site, A New Service

It’s the start of a new year, and nothing says New Year like a website refresh. With the rise in the number of visitors to the site using mobile browsers, we’ve updated the site to make it more mobile friendly. Not only is it more easily viewable on smartphones and tablets, but you can add it to your home screen just like any other app.


That last bit is important because we’re also announcing the launch of a new service simply called Aspen Gene. With it you can look up information on any gene. The service is powered by the MyGene.info web service developed by Chunlei Wu at The Scripps Research Institute’s Su Lab.  To visualize the data, we’ve developed a series of web components and are making them available through the new open source BioPolymer project.
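If you'd like to poke at the underlying data yourself, here's a minimal sketch of a gene-symbol lookup against the public MyGene.info v3 query endpoint. To be clear, this is not the code behind Aspen Gene: the class name, the method names, and the choice of returned fields are assumptions made for illustration, and it needs Java 11 or later for the built-in HTTP client.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Aspen Gene or BioPolymer
class MyGeneLookup {
    static final String BASE_URL = "https://mygene.info/v3/query";

    // Build a query URL that searches human genes by symbol
    static URI buildQuery(String symbol) {
        String q = URLEncoder.encode("symbol:" + symbol, StandardCharsets.UTF_8);
        // The field list is an arbitrary choice for this sketch
        String fields = URLEncoder.encode("symbol,name,entrezgene", StandardCharsets.UTF_8);
        return URI.create(BASE_URL + "?q=" + q + "&species=human&fields=" + fields);
    }

    // Send the request and return the raw JSON response body
    static String fetch(String symbol) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildQuery(symbol)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Print the URL only; call fetch("KRAS") to actually hit the service
        System.out.println(buildQuery("KRAS"));
    }
}
```

The URL building is kept separate from the network call so that the query can be inspected (or unit tested) without touching the web service.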


The Aspen Gene Search interface

Let’s take a look at the service. You start by entering the symbol for a gene of interest, in this case KRAS. Then tap the “Search” button to start the search. The search results appear as a series of cards at the bottom of the screen. Tap the arrow icon on a result card, and the gene summary will appear as shown below.

Gene Summary

The Summary tab provides an overview of the gene, including its symbol, synonyms, and IDs in related databases. We’re currently linking to NCBI’s Entrez Gene database, Online Mendelian Inheritance in Man (OMIM), the HUGO Gene Nomenclature Committee (HGNC), UniGene, and PharmGKB. You can tap the icon to the right of a field to open the record in a new window.


Gene Summary Tab

Protein Information

The Protein tab shows the UniProt ID, along with a list of InterPro domains found in the protein.  You can tap on any of the domains to see more information.  The Protein Database section shows a list of PDB IDs. You can tap on any ID to display the associated protein structure.


Protein Information Tab

Pathway Information

The Pathways tab shows a list of all of the pathways that the gene participates in. This includes entries from KEGG, Reactome, PharmGKB, WikiPathways, and more. You can tap on any pathway name to see a diagram of the pathway.


Pathway Information Tab



Publication Information

The Publications tab shows a list of GeneRIF publications. GeneRIFs (Gene References Into Function) are papers that describe the function of a gene, and they are drawn from the NCBI Entrez Gene database. Tap a card to display the PubMed record for the article.


Publication Information Tab

So, come visit us, and give our new site (and our new service) a try!

Posted in Bioinformatics, Informatics, Science Blogging

Drug Target Identification: And then a miracle occurs


At the front end of most drug discovery programs lies a step called Target Identification. A few months ago I sat down with a colleague to discuss their approach to it, and in particular the question “How do you characterize a target?” I was surprised at how much that process can vary from company to company.

As I set out to describe my workflow for this blog post, I was reminded of this cartoon, and how much work goes on between the starting point and the end point when researching the function of genes.

I should preface what I’m about to say with the words “this is the way I work.” Your goals and tools might be different, and I’m always curious about the way other people work, so please feel free to comment.

At a macroscopic level (regardless of your ultimate research goals) there are three levels of research:

  1. Foundational Research: where you familiarize yourself with the general “landscape” of a particular research topic.
  2. Deep Dive Research: where you examine certain concepts uncovered in step 1 in depth.
  3. Current Research: where you create a “surveillance” program to keep yourself up-to-date with the latest developments in a particular area of research.

In the examples which follow, I’ll be showing you the steps that I take and the tools that I use to learn more about the target space for pancreatic cancer.

Foundational Research
My goal at this stage in the game is to answer the following questions:

  • What is the etiology of the disease? (What syndromes predispose people to the disease and what percentage of the patient population do they account for?)
  • How does the disease progress? (What are the clinical stages?)
  • What genes & proteins are involved in the progression of the disease?
  • What pathways & disease processes do they participate in?
  • What is the current standard of care, and what genes are targeted by that standard of care?
  • Who are the thought leaders in this area?
  • Is research in this area heating up?

Since my workflow is very disease-centric, I usually start by searching the OMIM database. OMIM provides a good overview of the disease, with information on the genes involved and relevant literature. Recently, I’ve also added Wikipedia to the list; I’ve been pleasantly surprised by the depth of information available there, both for diseases and for genes. In addition to these more general sources, the National Cancer Institute’s PDQ site provides a good overview of the clinical stages of the disease and the standards of care applied at each stage. This information is critical for two reasons: it gives discovery scientists insight into the clinical presentation of the disease, and it makes it possible to design a drug or cocktail that targets a particular patient population.

For the literature itself, my usual starting point is PubMed, and I begin by looking for review papers on the topic of interest. In this case my query looks like this:
(pancreatic cancer) AND “review”[Publication Type] 

You can further restrict the results by limiting hits to the last few years. Sorting by publication date also helps focus your attention on the latest developments.  You’ll find more tips and tricks for using PubMed here.
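The same search can be scripted. Here's a rough sketch that builds a request URL for NCBI's E-utilities ESearch service using the query above, restricted to a recent window and sorted by publication date. The class and method names are mine, invented for illustration, and the five-year window and result cap in the example are arbitrary choices. It only builds the URL and leaves the actual HTTP call to you.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for building a PubMed review search
class PubMedReviewSearch {
    static final String ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    // Wrap a topic in the review-paper query used in this post
    static String reviewTerm(String topic) {
        return "(" + topic + ") AND \"review\"[Publication Type]";
    }

    // Build an ESearch URL limited to papers published in the last `days` days
    static String buildUrl(String topic, int days, int maxResults) {
        String term = URLEncoder.encode(reviewTerm(topic), StandardCharsets.UTF_8);
        return ESEARCH
                + "?db=pubmed"
                + "&term=" + term
                + "&datetype=pdat"          // filter on publication date
                + "&reldate=" + days        // relative window, in days
                + "&sort=pub_date"          // newest first
                + "&retmax=" + maxResults;  // cap on the number of IDs returned
    }

    public static void main(String[] args) {
        // Roughly the last five years, 20 results (both arbitrary)
        System.out.println(buildUrl("pancreatic cancer", 1825, 20));
    }
}
```

ESearch returns a list of PubMed IDs rather than the papers themselves, so a real script would follow this up with a second request to fetch the records.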

As I read through the review papers, I compile a list of genes which I keep in two “piles” — targets and biomarkers.  I also compile a list of pathways, and attempt to connect those to specific biological processes involved in the disease.

Gene-Centric vs Pathway-Centric vs Disease-Centric Workflows
When I first started out in this industry, I thought, perhaps rather naively, that drug discovery research always followed the same path, and consequently that every company used the same approach to identify new drug candidates.  However, I quickly learned that this wasn’t the case.

Some companies used a traditional compound-centric approach to drug discovery.  They would screen a compound through a particular target panel, find some interesting binding characteristics for a target, and then back-track to an indication or set of indications.

In a gene-centric approach, the process starts with a gene.  The function of the gene is determined (at least initially) by the Gene Ontology terms, by literature, by sequence homology, by protein domain, etc. Depending on the drug class (small molecule vs peptide or antibody, siRNA, etc) certain types of genes/transcripts/proteins may be more or less amenable to being addressed. For example, antibodies may be more appropriate for targets that have extracellular domains to which the antibody can attach.

A few years ago, Novartis espoused a more pathway-centric approach to drug discovery, the aim of which was to use signaling pathways to help identify new targets, whether for monotherapies, collections of targets for drug cocktails, or repurposing existing drugs.

In a disease-centric approach, the disease biology and the genes that drive that biology are used to drive the strategy for therapeutic development. This approach, originally pioneered by organizations with a vested interest in research in particular disease areas, appears to be the most promising. These organizations, which I loosely classify as “Translational Medicine Companies”, have a great deal of knowledge and experience in a particular indication, and thus tend to take a systems biology approach to identifying potential targets and drug candidates. Organizations like the Michael J. Fox Foundation and globalCure (an initiative of the Translational Genomics Research Institute to find new treatments for pancreatic cancer) spring to mind.


Posted in Bioinformatics, Cancer Research, Drug Development, Informatics, pancreatic cancer, Science Blogging

Exchanging Drug Pipeline Data

The pharmaceutical R&D environment has always been collaborative in nature, and never more so than today. Key to the success of those collaborations is the ability to share information about drug R&D programs with a wide variety of potential partners and investors. Traditionally, pharmaceutical companies have relied on expensive databases to identify potential partners. These databases usually do a good job of identifying programs in other pharmaceutical companies, but their results vary widely when it comes to identifying programs in academic labs and small biotech companies. And it’s increasingly these types of organizations that pharmaceutical companies are turning to in an effort to reduce R&D costs and gain specialist expertise in certain indications.

In the past, small biotech companies have relied upon events like BIO and EBD (and the previously mentioned commercial databases) to get on the radar of pharmaceutical companies. However, these events occur only once a year and a year can be a long wait for a startup company.  In addition, any discrepancies in the project information in a commercial database can take months to resolve, which can lead to more lost opportunities.

A cursory survey of pharmaceutical company web sites reveals that despite the dazzling variety of ways that pharmaceutical pipelines are represented, the data is by and large the same across all of the sites: target, therapeutic area and class, indication, and project status are a part of every pipeline page. However, because the webpage and the data are tightly bound together, it’s impractical to scrape the data reliably and search across all of the organizations.

But suppose for a moment that every drug pipeline, at every company involved in pharmaceutical discovery, was just a Google search away. Suppose that, regardless of the size of your company, the work that you’re doing was instantly discoverable by potential partners and investors.

The first step in such an effort would be to make drug project information accessible to search engines. Two years ago, Google (along with Yahoo and Bing) announced support for a new site metadata standard using JSON-LD at the Google I/O developer conference. This new data format makes it possible for companies to describe themselves and their products (albeit in very generic terms). Google, Yahoo and Bing display this information in a summary to the right of your search results.

Recently, we proposed a pre-competitive collaborative project with the Pistoia Alliance (an industry-wide organization with representatives from numerous pharmaceutical companies) to define a new standard for representing pharmaceutical project information.
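As a purely illustrative sketch, here's what a machine-readable pipeline entry might look like when embedded in a web page as JSON-LD. Everything specific in it is hypothetical: the vocabulary URL, the DrugProject type, the property names (organization, target, indication, therapeuticArea, status), and the sample company are placeholders, since no such standard has been agreed upon. Only the JSON-LD keywords (@context, @vocab, @type) and the application/ld+json script type are real.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical generator for a pipeline entry; not a proposed standard
class PipelineJsonLd {
    // Placeholder vocabulary; a real standard would publish its own URL
    static final String VOCAB = "https://example.org/drug-pipeline#";

    // Wrap a value in double quotes, escaping backslashes and quotes
    // (enough for this sample, not a full JSON encoder)
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Serialize one project as a JSON-LD object with a default vocabulary
    static String toJsonLd(Map<String, String> project) {
        List<String> lines = new ArrayList<>();
        lines.add("  \"@context\": {\"@vocab\": " + quote(VOCAB) + "}");
        lines.add("  \"@type\": \"DrugProject\"");
        for (Map.Entry<String, String> e : project.entrySet()) {
            lines.add("  " + quote(e.getKey()) + ": " + quote(e.getValue()));
        }
        return "{\n" + String.join(",\n", lines) + "\n}";
    }

    // Wrap the JSON-LD in the script tag used to embed it in a page
    static String toScriptTag(String jsonLd) {
        return "<script type=\"application/ld+json\">\n" + jsonLd + "\n</script>";
    }

    // Made-up sample data covering the fields common to pipeline pages
    static Map<String, String> sampleProject() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("organization", "Example Biotech, Inc.");
        p.put("target", "KRAS");
        p.put("indication", "pancreatic cancer");
        p.put("therapeuticArea", "Oncology");
        p.put("status", "Preclinical");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(toScriptTag(toJsonLd(sampleProject())));
    }
}
```

The point of the sketch is the separation it shows: the same handful of fields that every pipeline page already displays, published as data alongside the page rather than buried in its layout.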

Our goal is to create a level playing field that ultimately helps the members of the pharmaceutical R&D ecosystem (academic labs, biotech companies, research foundations, and pharmaceutical companies) identify new collaborative opportunities and answer the following types of questions:

  • Which organizations currently have drug programs for indication X?
  • Which organizations are currently working on complementary drug programs in pathway Y?
  • Which organizations have a drug program that targets gene Z?
  • I have a drug program for indication X, and the target also plays a role in indication Y.  Who has expertise in that area that I can leverage?
  • Which potential partners are best suited for my drug program?
  • Who do I contact at company X about my cancer drug program?
  • Who is currently conducting clinical trials for indication X?
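Once pipeline entries exist as structured records, questions like these become simple filters. The sketch below is only meant to illustrate that point: the Program class, its three fields, and the sample organizations are all made up, and a real implementation would sit on top of whatever format the standard ends up defining.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory search over collected pipeline records
class PipelineSearch {
    // One drug program, reduced to three illustrative fields
    static final class Program {
        final String organization;
        final String target;
        final String indication;

        Program(String organization, String target, String indication) {
            this.organization = organization;
            this.target = target;
            this.indication = indication;
        }
    }

    // "Which organizations have a drug program that targets gene Z?"
    static List<String> organizationsTargeting(List<Program> programs, String gene) {
        List<String> result = new ArrayList<>();
        for (Program p : programs) {
            if (p.target.equalsIgnoreCase(gene) && !result.contains(p.organization)) {
                result.add(p.organization);
            }
        }
        return result;
    }

    // "Which organizations currently have drug programs for indication X?"
    static List<String> organizationsWithIndication(List<Program> programs, String indication) {
        List<String> result = new ArrayList<>();
        for (Program p : programs) {
            if (p.indication.equalsIgnoreCase(indication) && !result.contains(p.organization)) {
                result.add(p.organization);
            }
        }
        return result;
    }

    // Made-up records standing in for data gathered from many sites
    static List<Program> samplePrograms() {
        return Arrays.asList(
                new Program("Example Biotech", "KRAS", "pancreatic cancer"),
                new Program("Sample Pharma", "EGFR", "lung cancer"),
                new Program("Demo Labs", "KRAS", "colorectal cancer"));
    }

    public static void main(String[] args) {
        System.out.println(organizationsTargeting(samplePrograms(), "KRAS"));
        System.out.println(organizationsWithIndication(samplePrograms(), "lung cancer"));
    }
}
```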

To learn more about this project and how you can help, please join the conversation at the Pistoia Alliance.

Posted in Uncategorized

Parsing Data Files With BioGroovy

In a previous post we showed how you can use BioGroovy to download a data file from a webservice.  In this post, we’ll show you how to parse the file that you’ve downloaded.

After handling all of the imports (shown in the first block of code), the real work starts with the EntrezGeneSlurper.  We create a file reference to the datafile we downloaded earlier, and then call the fetcher.read() method to read the datafile and print the results out to the console.

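// Pull BioGroovy from the repository; @Grab annotates the first import
@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.io.eutils.EntrezGeneSlurper
import org.biogroovy.models.Gene
import java.io.File
import java.io.FileInputStream

// The slurper parses Entrez Gene XML into a Gene model
EntrezGeneSlurper fetcher = new EntrezGeneSlurper();
// Reference the data file downloaded earlier
File file = new File("/path/to/file.gene.xml");
// Read the file and print the parsed gene to the console
Gene gene = fetcher.read(new FileInputStream(file));
println gene;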

Posted in Uncategorized

Fetching Data Files With BioGroovy

It’s often useful to fetch data from a web service and cache it in your local directory. This cuts down on the amount of traffic between your machine and the web service you’re invoking, and it makes unit testing your code easier. In this example, we’ll fetch data for 3 genes and write it out as XML files.

 @Grab(group='org.biogroovy', module='biogroovy', version='1.1')
 import org.biogroovy.io.eutils.EntrezGeneSlurper;
 import org.biogroovy.models.Gene;

 EntrezGeneSlurper slurper = new EntrezGeneSlurper();
 File dir = new File(System.getProperty('user.home'), 'biogroovy_test_files');
 slurper.fetchAsFile(dir, "675,1034,133")

 // double check that the files were written out
 dir.list().each{println it}

Posted in Bioinformatics, Informatics

Groovy vs Python

At a recent Informatics Lunch meeting, surrounded by a host of dedicated Python users, I confessed to being somewhat ignorant of the business drivers behind using Python in a bioinformatics setting. Although I have used Python on occasion, it was usually to solve a customer’s problem, and it was never my first choice when reaching into the toolbox. My feeling is that Python has become the de facto standard for bioinformatics scripting more as a historical fluke than as a result of duking it out against other languages for the top of the technology pile.

Oddly enough, a lot of that conversation revolved around Python’s shortcomings rather than its strengths. These included the backwards incompatibility between Python 2 and Python 3, and the fact that compiling Python into native code is tricky and requires you to keep platform-specific differences in mind. Luckily, neither of these problems has been inflicted on Java or Groovy users.

Some of the typical use cases cited for Python include:

It’s scriptable

Groovy has always been scriptable. The Groovy console lets you compose and execute scripts, and Groovy code can also be compiled to Java bytecode and thus run on any JVM. With the release of Java 9, Java itself is also scriptable using the JShell console.

I can read data files with it

The Apache POI library provides support for reading and writing Microsoft Office files like PowerPoint, Word and Excel. Here’s a simple example that shows you how to iterate through the cells in an Excel spreadsheet.

import java.io.File;
import org.apache.poi.ss.usermodel.*;

// Pad short rows out to at least this many columns (example value)
final int MY_MINIMUM_COLUMN_COUNT = 10;

// read the Excel file
Workbook wb = WorkbookFactory.create(new File("MyExcel.xls"));

// get the first sheet
Sheet sheet = wb.getSheetAt(0);

// Decide which rows to process
int rowStart = Math.min(15, sheet.getFirstRowNum());
int rowEnd = Math.max(1400, sheet.getLastRowNum());

for (int rowNum = rowStart; rowNum < rowEnd; rowNum++) {
    Row r = sheet.getRow(rowNum);
    if (r == null) {
        // This whole row is empty
        // Handle it as needed
        continue;
    }

    int lastColumn = Math.max(r.getLastCellNum(), MY_MINIMUM_COLUMN_COUNT);

    for (int cn = 0; cn < lastColumn; cn++) {
        // POI 3.15 and later; older releases spell this Row.RETURN_BLANK_AS_NULL
        Cell c = r.getCell(cn, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
        if (c == null) {
            // The spreadsheet is empty in this cell
        } else {
            // Do something useful with the cell's contents
        }
    }
}

I can query databases with it

Groovy can make use of any JDBC data source, NoSQL database, graph database, or cloud database.  Here’s a simple example that shows you how to execute a JDBC query and iterate through the results.

import groovy.sql.Sql

// create a database connection to an in-memory hsql database
def db = [url:'jdbc:hsqldb:mem:testDB', user:'sa', password:'', driver:'org.hsqldb.jdbc.JDBCDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)
// query the 'PROJECT' table, starting at row 2 and reading at most 2 rows
sql.eachRow('select * from PROJECT', 2, 2) { row ->
     println "${row.name.padRight(10)} ($row.url)"
}
sql.close()  // release the connection when done

I can do text mining with it

The WEKA library is a machine-learning toolkit whose filters and classifiers can be applied to text mining.  Since the library is Java-based, we can easily add it to a project, and execute it as a script.  There are plenty of examples that show how to use WEKA here.

I can graph data with it

The Java UI library JavaFX includes a variety of graphing components including pie charts, line charts, area charts, etc. The subject is a little beyond the scope of this post, but Oracle provides some great tutorials on the subject here.  And here’s an example from StackOverflow that shows how to output the chart as a PNG image without displaying it.


These are just some of the use-cases that were cited during the meeting.  So, I’d like to throw it out there to the audience — if you have some additional use cases for Python, I’d love to hear more about them in the comments section.



Posted in Uncategorized

BD2K – The NIH’s Big Data To Knowledge Effort

The topic at a recent Bioinformatics lunch discussion was the NIH’s Big Data To Knowledge (BD2K) program.  The BD2K’s Mission Statement (below) gives you some ideas for where the program is heading…

BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.

The BD2K initiative addresses four major aims that, in combination, are meant to enhance the utility of biomedical Big Data:

  • To facilitate broad use of biomedical digital assets by making them discoverable, accessible, and citable.
  • To conduct research and develop the methods, software, and tools needed to analyze biomedical Big Data.
  • To enhance training in the development and use of methods and tools necessary for biomedical Big Data science.
  • To support a data ecosystem that accelerates discovery as part of a digital enterprise.

Overall, the focus of the BD2K program is to support the research and development of innovative and transforming approaches and tools to maximize and accelerate the integration of Big Data and data science into biomedical research. [BD2K Mission Statement]

But as the interview below with project director and noted bioinformaticist Phil Bourne indicates, the scope of the initiative spans the entire research life cycle (ideas, hypotheses, experiments, data, analysis, comprehension, dissemination). In concrete terms, that means tools for authoring, collecting, analyzing, and visualizing data.  In part, the project hopes to address some of the issues around reproducibility, discoverability, and provenance of data and software that have impeded the ability of industry to leverage academic research. [Assessing Credibility Via Reproducibility] [Reproducibility and Provenance]

The Commons is a program within the BD2K initiative that seeks to improve the discoverability of data and to provide open APIs (for data and tools), unique IDs for research objects, and containers for packaged applications that run in cloud and HPC environments.

My goal at this month’s meeting was to learn more about the program from some of the local participants.

Ben Good from the Su Lab at The Scripps Research Institute described some of the work they were doing.  Earlier this year, they held a Network of Biothings/BD2K Hackathon to kickstart some projects.  During that meeting the

Chunlei Wu, also of the Su Lab, is developing a “Community Platform for Data Wrangling of Gene and Genetic Variant Annotations”.

At the PDB, Peter Rose has been working on a compression technique for 3D protein structures as part of the BD2K’s Targeted Software Development program.  This technique makes it possible to stream complex 3D structures in the same way that you might stream a YouTube video.

Posted in Uncategorized