Remember, Remember…

“… the Fifth of November!”, so the old rhyme goes. And as every British schoolchild knows, this day marks the day that Guy Fawkes attempted to blow up the Houses of Parliament in 1605. For families of pancreatic cancer patients, November is Pancreatic Cancer Awareness Month, a month filled with fundraising and awareness-raising activities.

For my family, today marks my father’s birthday, and the day when my mother was diagnosed with pancreatic cancer 21 years ago.  For me, it’s a time to reflect on how far we’ve come in our understanding of the disease, and how far we have to go.

The advent of the genomic era brought with it a slew of technologies that fundamentally changed our understanding of pancreatic cancer. Affymetrix GeneChips let us identify genes that were differentially expressed in pancreatic cancer; Next Generation Sequencing, Whole Genome Sequencing, Whole Exome Sequencing and RNASeq helped us see the mutational landscape of pancreatic cancer, and much more.

The first of these discoveries was the PanIN (Pancreatic Intraepithelial Neoplasia) model that describes the early neoplastic changes that occur in pancreatic cancer. These early lesions had been described nearly 100 years earlier, and had been known by various names including ductal hyperplasia, hypertrophy, metaplasia and dysplasia, but a progressive model that described the underlying genetic changes had heretofore never been attempted. In 2000, Ralph Hruban of Johns Hopkins outlined the histopathologic changes and identified mutations in KRAS, CDKN2A, TP53, and SMAD4 as drivers in this process in his paper entitled “Progression Model for Pancreatic Cancer”.



In a follow-up paper entitled “Update on Pancreatic Intraepithelial Neoplasia”, Hruban described how the progression model had been used to create genetically engineered mouse models, which are essential to helping researchers create and test new drugs. He also described how the model could be used for improved early diagnostics.

In 2002, Christine Iacobuzio-Donahue used Affymetrix GeneChips to identify differentially expressed genes in pancreatic cancer that might be used to help diagnose the disease. This paper, entitled “Discovery of Novel Tumor Markers of Pancreatic Cancer using Global Gene Expression Technology”, identified 97 differentially expressed genes that could potentially be used as biomarkers in future diagnostic tests.

This early research gave us some clues about the early progression of the disease and potential diagnostics, but we still didn’t have an appreciation for the genetic complexity of pancreatic cancer until 2008, when Sian Jones of Johns Hopkins published a paper entitled “Core signaling pathways in human pancreatic cancers revealed by global genomic analyses” [Jones et al]. The paper used a limited number of tumor samples (n=24) and found an average of 63 genetic alterations per tumor.

The genes identified in the paper fell into the following categories/pathways: KRAS signalling, TGFB signalling, JNK signalling, integrin signalling, Wnt/Notch signalling, hedgehog signalling, control of G1/S phase transition, apoptosis, DNA damage control, small GTPase signalling, invasion, and cell-cell adhesion.

A subsequent paper, “Distant Metastasis Occurs Late during the Genetic Evolution of Pancreatic Cancer” [Yachida & Jones et al], published in 2010, established a timeline for the progression of pancreatic cancer of over 20 years, thus providing us with a longer potential window of opportunity to diagnose and treat this disease.


A follow-on paper, also by Yachida, “Clinical significance of the genetic landscape of pancreatic cancer and implications for identification of potential long-term survivors” [Yachida et al], further established how alterations in KRAS, CDKN2A, TP53, and SMAD4 (the most commonly mutated genes in pancreatic cancer) can directly influence patient outcomes.

Additional tools began to make their way into the lab and helped us gain a better understanding of the importance of epigenetic changes in driving pancreatic cancer. We were beginning to understand how a gene like CDKN2A could become inactivated in pancreatic cancer due to promoter hypermethylation, as described in “Hypermethylation of multiple genes in pancreatic adenocarcinoma” [Ueki et al].

And beyond epigenetics, we were beginning to see the roles that microRNAs play in pancreatic cancer, acting sometimes as tumor suppressors, and inhibiting invasion and migration. These new potential drug targets also brought with them a whole new potential therapeutic class: oligonucleotides, stretches of man-made RNA that could bind to the microRNAs and interfere with them in ways that small molecules could not. In addition, researchers began exploring how circulating microRNAs could be used as diagnostic tools in pancreatic cancer.

These new tools brought with them the promise of new diagnostics, and new therapies, and a deeper understanding of the disease necessary to begin to make progress.  In the posts that follow, we’ll take a look at some of the new pathways that were discovered, the role of familial genetics and smoking in pancreatic cancer, and the promise of precision medicine and pancreatic subtypes.  We’ll also take a closer look at the pipelines of drug companies both large and small, and what promises they hold for the pancreatic cancer patients of tomorrow.


Posted in Cancer Research, pancreatic cancer

PubMed vs EuropePMC: Let’s Get Ready To Rumble


For most researchers, PubMed is the go-to resource for all biomedical literature. But from a programmatic standpoint it has some real challenges that make it difficult to integrate into many informatics applications.  Let’s take a look at a typical application.

Suppose we have an internal application used for target identification and tracking, and we want to add the ability to perform literature searches, and add selected hits to a specific target.

To do this using PubMed’s eUtils API requires two calls, both of which return overly verbose results that must be parsed.  (Click on the Sample Call links below to see an example of what the call looks like, as well as the server response).

  1. Perform a search, and get a list of PubMed IDs. [Sample Call]
  2. Fetch the PubMed records, allow the user to review them, and then save the selected records. [Sample Call]
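
The two steps above can be sketched in Python as a pair of URL builders. This is a minimal sketch rather than how BioGroovy does it: the function names are made up for illustration, while the esearch.fcgi and efetch.fcgi endpoints and the db, term, retmax, id and retmode parameters are part of NCBI's E-utilities. No request is sent here; the resulting URLs can be handed to whatever HTTP client your application already uses.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(term, max_results=20):
    """Step 1: ESearch returns only a list of PubMed IDs for the query."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_fetch_url(pmids):
    """Step 2: EFetch returns the full XML records for the given IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical search term; the PubMed ID is the one used later in this post.
    print(build_search_url("pancreatic cancer KRAS"))
    print(build_fetch_url(["28094263"]))
```

Keeping the two builders separate mirrors the API itself: nothing useful can be shown to the user until the second call has returned and been parsed.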

The first problem is that the search only returns IDs and search metadata. It doesn’t return titles, or abstracts, or anything else that a user might find useful in making a decision about which article to download or view.

The second problem is that the results when fetching PubMed articles are too verbose. The response is only available in XML, not JSON, and this has a performance impact. For example, all of the dates found in the record appear as separate tags:

<DateCreated>
<Year>2017</Year>
<Month>01</Month>
<Day>27</Day>
</DateCreated>

Rather than:
<date-created date="2017-01-27"/>
or even
<date-created year="2017" month="01" day="27"/>

That’s 79 characters vs 33 or 47 (depending on which format you prefer).

A simple author name appears like this:

<Author ValidYN="Y">
<LastName>Tao</LastName>
<ForeName>Huimin</ForeName>
<Initials>H</Initials>
</Author>

where this would do:
<author lastname="Tao" firstname="Huimin" initials="H"/>

That’s 107 characters vs 56.

On the surface these seem like niggling complaints, but when you take into account the fact that the record size negatively impacts the speed and responsiveness of your application, and the amount of memory and processing power required to parse the data, then it has some serious implications for your application. For each author or date you could reduce the number of characters by roughly half.

Aside from the verbosity of the results though, PubMed does not attempt to text mine abstract data. The record does not contain gene, protein, pathway or compound information which would make it truly useful in a drug discovery or literature mining application. The closest we come to getting article metadata are the MeSH (Medical Subject Headings) terms.

Although BioGroovy makes it easy to search, download and parse PubMed records, it (like other libraries and applications) is not immune to the limitations of the eUtils API.
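
The character counts quoted above for the date and author elements can be checked with a few lines of Python. This is only a sketch: the one-tag-per-line layout of the PubMed-style elements is an assumption (each line is counted with its line break), and the compact forms are the alternatives suggested in this post, not anything PubMed actually returns.

```python
def one_tag_per_line(lines):
    """Join tags the way they appear in a pretty-printed record."""
    return "".join(line + "\n" for line in lines)


# PubMed-style elements (layout assumed: one tag per line).
verbose_date = one_tag_per_line([
    "<DateCreated>",
    "<Year>2017</Year>",
    "<Month>01</Month>",
    "<Day>27</Day>",
    "</DateCreated>",
])
verbose_author = one_tag_per_line([
    '<Author ValidYN="Y">',
    "<LastName>Tao</LastName>",
    "<ForeName>Huimin</ForeName>",
    "<Initials>H</Initials>",
    "</Author>",
])

# The compact alternatives proposed in this post.
compact_date = '<date-created date="2017-01-27"/>'
compact_date_parts = '<date-created year="2017" month="01" day="27"/>'
compact_author = '<author lastname="Tao" firstname="Huimin" initials="H"/>'


def saving(verbose, compact):
    """Fraction of characters saved by using the compact form."""
    return 1 - len(compact) / len(verbose)


for label, verbose, compact in [
    ("date (single attribute)", verbose_date, compact_date),
    ("date (three attributes)", verbose_date, compact_date_parts),
    ("author", verbose_author, compact_author),
]:
    print(f"{label}: {len(verbose)} vs {len(compact)} characters, "
          f"{saving(verbose, compact):.0%} smaller")
```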


Perhaps the best alternative to PubMed is EuropePMC.  The database includes both PubMed abstracts, and PubMed Central full text articles.  The EuropePMC API provides you with both XML and JSON response formats. Let’s take a look at our previous algorithm, and how EuropePMC’s API differs from PubMed’s.

  1. Perform a search. [Sample Call]
  2. Fetch the selected records [Sample Call]
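
The same two steps against EuropePMC might look like the sketch below. The function names are hypothetical; the search endpoint, the format and resultType parameters, and the EXT_ID and SRC query fields come from the EuropePMC REST service. As before, the code only builds URLs and sends nothing.

```python
from urllib.parse import urlencode

# Search endpoint of the EuropePMC REST service.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_epmc_search_url(query, page_size=25):
    """Step 1: search; 'lite' hits already carry title, authors and journal."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"


def build_epmc_record_url(pmid):
    """Step 2: fetch one selected record in full ('core') by its PubMed ID."""
    params = {
        "query": f"EXT_ID:{pmid} AND SRC:MED",
        "format": "json",
        "resultType": "core",
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_epmc_search_url("pancreatic cancer"))
    print(build_epmc_record_url("28094263"))
```

Because the first call already returns displayable records, the second call is only needed for the articles the user actually selects.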

One of the first things you’ll notice is that the search results actually contain useful information.  In the sample below, we can see a title, the DOI, a well-formatted author, the journal. We can even see if the article has text-mined terms associated with it.

id: "28094263",
source: "MED",
pmid: "28094263",
doi: "10.1038/nrclinonc.2017.3",
title: "Pancreatic cancer: Pancreatic cancer cells digest extracellular protein.",
authorString: "Sidaway P.",
journalTitle: "Nat Rev Clin Oncol",
pubYear: "2017",
journalIssn: "1759-4774; 1759-4782; ",
pubType: "journal article",
isOpenAccess: "N",
inEPMC: "N",
inPMC: "N",
hasPDF: "N",
hasBook: "N",
citedByCount: 0,
hasReferences: "N",
hasTextMinedTerms: "N",
hasDbCrossReferences: "N",
hasLabsLinks: "Y",
epmcAuthMan: "N",
hasTMAccessionNumbers: "N"


What makes this especially useful is that the results can easily be used in a user interface, and contain enough information to allow a user to determine whether or not the article is potentially useful.
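
As a rough illustration of how little work is needed to get from a search response to something a user can read, here is a small Python sketch. The embedded response is trimmed to a handful of the fields from the sample record above and wrapped in the resultList/result structure that the search call returns; the summarize function and its output format are made up for this example.

```python
import json

# A trimmed search response; field values are taken from the sample record.
SAMPLE_RESPONSE = """
{
  "hitCount": 1,
  "resultList": {
    "result": [
      {
        "id": "28094263",
        "source": "MED",
        "pmid": "28094263",
        "doi": "10.1038/nrclinonc.2017.3",
        "title": "Pancreatic cancer: Pancreatic cancer cells digest extracellular protein.",
        "authorString": "Sidaway P.",
        "journalTitle": "Nat Rev Clin Oncol",
        "pubYear": "2017",
        "hasTextMinedTerms": "N"
      }
    ]
  }
}
"""


def format_hit(hit):
    """Build a one-line citation from a single search hit."""
    authors = hit.get("authorString", "")
    title = hit.get("title", "")
    journal = hit.get("journalTitle", "")
    year = hit.get("pubYear", "")
    pmid = hit.get("pmid", "")
    return f"{authors} {title} {journal} ({year}). PMID {pmid}"


def summarize(response_text):
    """Turn every hit in a search response into a display-ready line."""
    hits = json.loads(response_text).get("resultList", {}).get("result", [])
    return [format_hit(hit) for hit in hits]


if __name__ == "__main__":
    for line in summarize(SAMPLE_RESPONSE):
        print(line)
```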

You can also fetch text-mined terms, such as genes, diseases, and chemicals, from EuropePMC records. [Sample Call]  For example, the previous call returns all terms from a particular record. One of those terms is a record for the chemical taxol, which is used as a chemotherapeutic agent. The compound metadata includes information from the ChEBI chemical database.


Posted in Bioinformatics, Informatics

A New Year, A New Site, A New Service

It’s the start of a new year, and nothing says New Year like a website refresh. With the rise in the number of visitors to the site using mobile browsers, we’ve updated the site to make it more mobile friendly. Not only is it more easily viewable on smartphones and tablets, but you can add it to your home screen just like any other app.


That last bit is important because we’re also announcing the launch of a new service simply called Aspen Gene. With it you can look up information on any gene. The service is powered by the web service developed by Chunlei Wu at The Scripps Research Institute’s Su Lab.  To visualize the data, we’ve developed a series of web components and are making them available through the new open source BioPolymer project.


The Aspen Gene Search interface

Let’s take a look at the service. You start by entering the symbol for a gene of interest, in this case KRAS. Then tap the “Search” button to start the search.  The search results will then appear as a series of cards at the bottom of the screen. Tap on the arrow icon on the result card, and the gene summary will appear as shown below.

Gene Summary

The Summary tab provides an overview of the gene, including its symbol, synonyms, and IDs in related databases.  We’re currently linking to NCBI’s EntrezGene database, the Online Mendelian Inheritance In Man (OMIM), the HUGO Gene Nomenclature Committee (HGNC), UniGene, and PharmGKB.  You can tap on the icon to the right of the field to open the record in a new window.


Gene Summary Tab

Protein Information

The Protein tab shows the UniProt ID, along with a list of InterPro domains found in the protein.  You can tap on any of the domains to see more information.  The Protein Database section shows a list of PDB IDs. You can tap on any ID to display the associated protein structure.


Protein Information Tab

Pathway Information

The Pathways tab shows a list of all of the pathways that the gene participates in. This includes entries from KEGG, Reactome, PharmGKB, WikiPathways and more. You can tap on any pathway name to see a diagram of the pathway.


Pathway Information Tab



Publication Information

The Publications tab shows a list of GeneRIF publications. These are Gene References Into Function, or papers that indicate the function of a gene, and are found in the NCBI EntrezGene database. Tap on the card to display the PubMed record for the article.


Publication Information Tab

So, come visit us, and give our new site (and our new service) a try!

Posted in Bioinformatics, Informatics, Science Blogging

Drug Target Identification: And then a miracle occurs


At the front end of most drug discovery programs lies a step called Target Identification, and a few months ago I sat down with a colleague to discuss their approach to target identification.  In particular, “how do you characterize a target”? I was surprised at how much that process can vary from company to company.

As I set out to describe my workflow for this blog post, I was reminded of this cartoon, and how much work goes on between the starting point and the end point when researching the function of genes.

I should preface what I’m about to say with the words “this is the way I work”. Your goals and tools might be different, and I’m always curious about the way people work, so please feel free to comment.

At a macroscopic level (regardless of your ultimate research goals) there are three levels of research:

  1. Foundational Research: where you familiarize yourself with the general “landscape” of a particular research topic.
  2. Deep Dive Research: where you examine certain concepts exposed in step 1 in-depth.
  3. Current Research: where you create a “surveillance” program to keep yourself up-to-date with the latest developments in a particular area of research.

In the examples which follow, I’ll be showing you the steps that I take and the tools that I use to learn more about the target space for pancreatic cancer.

Foundational Research
My goal at this stage in the game is to answer the following questions:

  • What is the etiology of the disease? (What syndromes predispose people to the disease and what percentage of the patient population do they account for?)
  • How does the disease progress? (What are the clinical stages?)
  • What genes & proteins are involved in the progression of the disease?
  • What pathways & disease processes do they participate in?
  • What is the current standard of care, and what genes are targeted by that standard of care?
  • Who are the thought leaders in this area?
  • Is research in this area heating up?

Since my workflow is very disease-centric, I usually start by searching the OMIM database.  OMIM provides a good overview of the disease, with information on the genes involved, and relevant literature.  Recently, I’ve also added Wikipedia to the list.  I’ve been pleasantly surprised with the depth of information available on Wikipedia, both for diseases and for genes.  In addition to these more general sources, the National Cancer Institute’s PDQ site provides a good overview of the clinical stages of the disease, and the standards of care applied at each stage.  This information is critical for two reasons.  It gives discovery scientists insight into the clinical presentation of the disease, and makes it possible to design a drug or cocktail that targets a particular patient population.

My usual starting point for most literature research is PubMed, and I start by looking for review papers on a topic of interest.  In this case my query looks like this:
(pancreatic cancer) AND "review"[Publication Type]

You can further restrict the results by limiting hits to the last few years. Sorting by publication date also helps focus your attention on the latest developments.  You’ll find more tips and tricks for using PubMed here.
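
If you want to run the same review search programmatically, the query and the date restriction translate into E-utilities parameters roughly as sketched below. The function names and the example years are hypothetical; datetype, mindate, maxdate and sort are ESearch parameters, with pdat selecting the publication date and pub_date sorting the newest papers first.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def review_query(topic):
    """Wrap a topic in the review filter used in the query above."""
    return f'({topic}) AND "review"[Publication Type]'


def build_review_search_url(topic, first_year, last_year):
    """Search for reviews on a topic published between two years."""
    params = {
        "db": "pubmed",
        "term": review_query(topic),
        "datetype": "pdat",      # filter on publication date
        "mindate": first_year,   # mindate and maxdate must be given together
        "maxdate": last_year,
        "sort": "pub_date",      # most recent publications first
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Example years only; pick whatever window counts as "recent" for you.
    print(build_review_search_url("pancreatic cancer", 2015, 2017))
```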

As I read through the review papers, I compile a list of genes which I keep in two “piles” — targets and biomarkers.  I also compile a list of pathways, and attempt to connect those to specific biological processes involved in the disease.

Gene-centric vs Pathway Centric vs Disease Centric Workflows
When I first started out in this industry, I thought, perhaps rather naively, that drug discovery research always followed the same path, and consequently that every company used the same approach to identify new drug candidates.  However, I quickly learned that this wasn’t the case.

Some companies used a traditional compound-centric approach to drug discovery.  They would screen a compound through a particular target panel, find some interesting binding characteristics for a target, and then back-track to an indication or set of indications.

In a gene-centric approach, the process starts with a gene.  The function of the gene is determined (at least initially) by the Gene Ontology terms, by literature, by sequence homology, by protein domain, etc. Depending on the drug class (small molecule vs peptide or antibody, siRNA, etc) certain types of genes/transcripts/proteins may be more or less amenable to being addressed. For example, antibodies may be more appropriate for targets that have extracellular domains to which the antibody can attach.

A few years ago, Novartis espoused a more pathway-centric approach to drug discovery, the aim of which was to use the signaling pathways to help identify new targets, either for monotherapies, or collections of targets for drug cocktails, or for repurposing existing drugs.

In a disease-centric approach, the disease biology, and the genes that drive that biology, are used to drive the strategy for therapeutic development.  This approach, originally pioneered by organizations with a vested interest in research in particular disease areas, appears to be the most promising.  These organizations, which I loosely classify as “Translational Medicine Companies”, have a great deal of knowledge and experience in a particular indication, and thus tend to take a systems biology approach to identifying potential targets and drug candidates.  Organizations like the Michael J. Fox Foundation, and globalCure (an initiative of the Translational Genomics Research Institute to find new treatments for Pancreatic Cancer) spring to mind.


Posted in Bioinformatics, Cancer Research, Drug Development, Informatics, pancreatic cancer, Science Blogging

Exchanging Drug Pipeline Data

The pharmaceutical R&D environment has always been collaborative in nature, and never more so than today. Key to the success of those collaborations is the ability to share information about drug R&D programs with a wide variety of potential partners and investors. Traditionally pharmaceutical companies rely on expensive databases to identify potential partners. These databases usually do a good job of identifying programs in other pharmaceutical companies, but their results vary widely when it comes to identifying programs in academic labs and small biotech companies. And it’s increasingly these types of organizations that pharmaceutical companies are turning to in an effort to reduce R&D costs, and gain specialist expertise in certain indications.

In the past, small biotech companies have relied upon events like BIO and EBD (and the previously mentioned commercial databases) to get on the radar of pharmaceutical companies. However, these events occur only once a year and a year can be a long wait for a startup company.  In addition, any discrepancies in the project information in a commercial database can take months to resolve, which can lead to more lost opportunities.

A cursory survey of pharmaceutical company web sites reveals that despite the dazzling variety of ways that pharmaceutical pipelines are represented, the data is by and large the same across all of the sites: target, therapeutic area and class, indication, and project status are a part of every pipeline page. However, because the webpage and the data are tightly bound together, it’s impossible to scrape the data programmatically, and search across all of the organizations.

But suppose for a moment, that every drug pipeline, at every company involved in pharmaceutical discovery was just a Google search away.  Suppose, that regardless of the size of your company, the work that you’re doing was instantly discoverable by potential partners and investors.

The first step in such an effort would be to make drug project information accessible by search engines. Two years ago, Google (along with Yahoo and Bing) announced support for a new site metadata standard using JSON-LD at their Google IO developers conference. This new data format makes it possible for companies to describe themselves and their products (albeit in very generic terms). Google, Yahoo and Bing display this information in a summary to the right of your search results.
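
To make this concrete, here is a hedged Python sketch of what a machine-readable pipeline entry could look like. Organization and Drug are existing schema.org types, and embedding JSON-LD in a script tag of type application/ld+json is the usual way to publish it on a page. Everything under the pipeline: prefix (the namespace URL and the program, target, indication and status terms) is purely hypothetical, as are the company and drug names; no such standard exists yet.

```python
import json


def pipeline_entry(company, url, drug_name, target, indication, status):
    """Describe one drug program as a JSON-LD document."""
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            # Hypothetical namespace for pipeline-specific terms.
            "pipeline": "https://example.org/pipeline#",
        },
        "@type": "Organization",
        "name": company,
        "url": url,
        "pipeline:program": {
            "@type": "Drug",
            "name": drug_name,
            "pipeline:target": target,
            "pipeline:indication": indication,
            "pipeline:status": status,
        },
    }


def to_script_tag(doc):
    """Wrap the document in the script tag that search engines read."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    entry = pipeline_entry(
        "Example Biotech", "https://example.org",
        "EX-101", "KRAS", "pancreatic cancer", "Preclinical",
    )
    print(to_script_tag(entry))
```

Separating the data from the page layout in this way is what would let a crawler read the same four or five fields from every company's site, however the pipeline is drawn.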

Recently, we proposed a pre-competitive collaborative project with the Pistoia Alliance (an industry-wide organization with representatives from numerous pharmaceutical companies) to define a new standard for representing pharmaceutical project information.

Our goal is to create a level playing field that ultimately helps the members of the pharmaceutical R&D ecosystem (academic labs, biotech companies, research foundations, and pharmaceutical companies) identify new collaborative opportunities and answer the following types of questions:

  • Which organizations currently have drug programs for indication X?
  • Which organizations are currently working on complementary drug programs in pathway Y?
  • Which organizations have a drug program that targets gene Z?
  • I have a drug program for indication X, the target also plays a role in indication Y.  Who has expertise in that area that I can leverage?
  • Which potential partners are best suited for my drug program?
  • Who do I contact at company X about my cancer drug program?
  • Who is currently conducting clinical trials for indication X?
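
Once pipeline data is available in a common structure, several of the questions above reduce to simple filters. The sketch below assumes a hypothetical Program record with four fields and a made-up set of sample programs; it is meant to show the kind of query a shared standard would enable, not a proposed data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    """One drug program as it might be harvested from a pipeline page."""
    organization: str
    target: str
    indication: str
    status: str


# Made-up sample data for illustration only.
SAMPLE_PROGRAMS = [
    Program("Alpha Bio", "KRAS", "pancreatic cancer", "Phase I"),
    Program("Beta Pharma", "EGFR", "lung cancer", "Phase II"),
    Program("Gamma Labs", "KRAS", "lung cancer", "Preclinical"),
]


def organizations_for(programs, *, indication=None, target=None):
    """Names of organizations with a program matching every given filter."""
    matches = set()
    for program in programs:
        if indication is not None and program.indication.lower() != indication.lower():
            continue
        if target is not None and program.target.upper() != target.upper():
            continue
        matches.add(program.organization)
    return sorted(matches)


if __name__ == "__main__":
    # Which organizations have drug programs for indication X?
    print(organizations_for(SAMPLE_PROGRAMS, indication="pancreatic cancer"))
    # Which organizations have a drug program that targets gene Z?
    print(organizations_for(SAMPLE_PROGRAMS, target="KRAS"))
```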

To learn more about this project and how you can help, please join the conversation at the Pistoia Alliance.

Posted in Uncategorized

Parsing Data Files With BioGroovy

In a previous post we showed how you can use BioGroovy to download a data file from a webservice.  In this post, we’ll show you how to parse the file that you’ve downloaded.

After handling all of the imports (shown in the first block of code), the real work starts with the EntrezGeneSlurper.  We create a file reference to the datafile we downloaded earlier, and then call the method to read the datafile and print the results out to the console.

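@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.models.Gene
// EntrezGeneSlurper also needs an import; its package is not shown in this post.

EntrezGeneSlurper fetcher = new EntrezGeneSlurper();
// Point at the datafile downloaded in the previous post.
File file = new File("/path/to/file.gene.xml");
// The method name is missing from the original listing; read(...) is an assumed name.
Gene gene = fetcher.read(new FileInputStream(file));
println gene;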

Posted in Uncategorized

Fetching Data Files With BioGroovy

It’s often useful to fetch data from a web service and cache it in your local directory. This cuts down on the amount of traffic between your machine and the web service you’re invoking, and it makes unit testing your code easier. In this example, we’ll fetch data for 3 genes and write it out as XML files.

@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.models.Gene;
// EntrezGeneSlurper also needs an import; its package is not shown in this post.

EntrezGeneSlurper slurper = new EntrezGeneSlurper();
// Cache the files in a folder under the user's home directory.
File dir = new File(System.getProperty('user.home'), 'biogroovy_test_files');
// Fetch the three genes by their Entrez Gene IDs.
slurper.fetchAsFile(dir, "675,1034,133")

// double check that the files were written out
dir.list().each{println it}

Posted in Bioinformatics, Informatics