Mining RSS Feeds For Cancer News

I sometimes get asked how I keep up to date with new discoveries and technologies.  The simple answer is RSS feeds.  What’s RSS?  The short answer is that it’s a way to collect news articles and journal articles without having to visit each individual site.  The long answer… is available on Wikipedia.

I use Google Reader to aggregate feeds from various sites.  This makes it easier to read all biotech or pharma-related news with fewer clicks.  You’ll find some of my previous postings about Google Reader here.

But what happens when you have a broad set of RSS feeds and you want to extract specific information from them?  For example, in my latest posting, I collected a number of recent news articles on pancreatic cancer.  Normally, I would simply read each article, star it, and add a tag like “pancreatic cancer” so that I could find it again.  At the end of the week, I would collect the articles and provide a short write-up along with the list of the latest papers added to the Mendeley group.

One way to make this process easier is to use a free web-based workflow technology called Yahoo Pipes.  Unlike tools like Taverna and Pipeline Pilot, Yahoo Pipes is a generic technology for mining anything in XML or HTML, which makes it well suited for mining cancer-related articles.  With Yahoo Pipes, you simply add an RSS feed source, apply a filter, and output the results.  You can even subscribe to the results as an RSS feed.  So if you wanted to get a list of recent articles on breast cancer, and another set of articles on prostate cancer, you would simply add the appropriate filters and get the results.  Let’s take a look…

  1. First, go to the Yahoo Pipes site.  If you don’t have a Yahoo account, you’ll need to sign up for one.
  2. Click on the Create Pipe button on the left-hand side of the screen.  A blank workflow area will appear.  The left-hand side of the screen contains a palette of widgets, including data sources that you can add to your workflow.
  3. Click the Sources node, and then drag the Fetch Feed node from your palette to your work area.  Enter the URL for the feed that you want to mine.  In my case, to find the URL I wanted to use, I clicked on the Google Reader folder where I aggregate all of my biotech and pharma news feeds, and selected the “Folder Settings…/View details and statistics” menu item.  I pasted the URL shown there into my workflow’s Fetch Feed processor.
  4. Back in Yahoo Pipes, I then selected the Operators/Filter node and dragged it into my workspace.  I dragged the output of the Fetch Feed processor to the input of the Filter processor.
  5. I added two rules to the filter so that it would show me only those articles that contain “pancreatic cancer” in the title or the description, and then connected the output to the Pipe Output processor.
  6. To run your workflow, simply click the Run Pipe node, and the output will appear at the bottom of your screen in the Debugger window.  To subscribe to the results of your pipe as RSS, simply click the “Back to My Pipes” link in the upper right corner, hover over the name of the pipe you just made, and click View Results.  You can subscribe to them on Yahoo, Google Reader, or simply get the raw RSS feed as XML or JSON.
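
For readers who would rather script this than use the visual editor, here is a minimal Python sketch of the same fetch, filter, and output idea.  It is not how Yahoo Pipes works internally.  The sample feed, the `filter_items` function name, and the case-insensitive matching are all assumptions made for illustration; a real script would download the feed text from your own feed URL instead of using an inline string.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the output of the Fetch Feed step.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example biotech news</title>
    <item>
      <title>New trial results in pancreatic cancer</title>
      <link>http://example.com/a</link>
      <description>Phase II data reported.</description>
    </item>
    <item>
      <title>Quarterly earnings roundup</title>
      <link>http://example.com/b</link>
      <description>Pharma revenue summary.</description>
    </item>
    <item>
      <title>Imaging advances</title>
      <link>http://example.com/c</link>
      <description>Early detection of Pancreatic Cancer with MRI.</description>
    </item>
  </channel>
</rss>"""


def filter_items(rss_text, phrase):
    """Return items whose title or description contains the phrase.

    Matching is case-insensitive here; that is an assumption of this
    sketch, not a statement about how the Pipes Filter operator matches.
    """
    root = ET.fromstring(rss_text)
    needle = phrase.lower()
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Two rules joined by "or": one on the title, one on the description.
        if needle in title.lower() or needle in description.lower():
            matches.append({
                "title": title,
                "link": item.findtext("link", default=""),
                "description": description,
            })
    return matches


if __name__ == "__main__":
    # Emit the filtered items as JSON, similar in spirit to a pipe's JSON output.
    print(json.dumps(filter_items(SAMPLE_RSS, "pancreatic cancer"), indent=2))
```

Changing the phrase to “breast cancer” or “prostate cancer” gives the other filtered lists mentioned earlier; with this sample feed, either of those returns no items.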

About aspenbio

I write software for scientists. I'm interested in Java/Groovy/Grails, the Semantic Web and Cancer Biology.