Improved RSS feed and web page change history

The RSS feed from siteupdatenotification.com, which alerts users to changes made to web pages in their watchlist, has had a makeover. Feeds quickly grew large when a web page changed regularly or when a user monitored more than a few pages, and large RSS feeds make for a less responsive user experience.

The feed has now been modified to contain only the most recent set of changes to each page in the user’s watchlist. Most users want to know about new content, so this serves their purpose well. In place of the full change history, the feed now includes links to images that have changed or newly appeared. This means that new and updated images can be seen directly in the RSS reader, or on the siteupdatenotification.com website, before visiting the changed site.
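
The "most recent set of changes per page" rule could be sketched roughly as below. This is an illustration only, not the site's actual code: the `Change` record, its field names, and the `latest_change_per_page` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    # Hypothetical record; the field names are assumptions for illustration.
    page_url: str          # watched page the change belongs to
    detected_at: datetime  # when the crawler noticed the change
    summary: str           # short description of what changed


def latest_change_per_page(changes):
    """Return one Change per watched page: the most recently detected one."""
    latest = {}
    for change in changes:
        current = latest.get(change.page_url)
        if current is None or change.detected_at > current.detected_at:
            latest[change.page_url] = change
    # Newest first, so the freshest change appears at the top of the feed.
    return sorted(latest.values(), key=lambda c: c.detected_at, reverse=True)
```

The feed size then grows with the number of watched pages rather than with the number of changes recorded, which is the point of the redesign.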

For users who would like to see the change history of a page, a link is provided to a page showing that history as seen by the sunaweb crawler.

Comments

  1. Hmm, while I can see the benefit of this change, it means that Google Reader never shows changed pages in the RSS feed as new items. (At least, I’d assume this is what’s causing the effect.)

    I’ll leave it to you to figure out the best way to fix this, but it pretty much destroys the value of this service for me.

    (Note: Google does update the already-read items, so it is seeing the changes. It’s just not marking them as new items in the feed.)

  2. This was an unintended side effect of the change - thanks for pointing it out.

    I think I’ve fixed the problem now. The RSS feed now contains all change summaries in date/time order, each with a hyperlink to the site change history page. You may need to hit Refresh in Google Reader, or even re-subscribe, to pick up the changes.

    Regards
    – Craig

  3. It worked. Both of the pages that I’m watching that had changes showed up as new items. I didn’t have to do anything.

    However, 2 new bugs were noted:
    1) None of the links to the change summaries worked.

    In Firefox it just keeps trying to load forever.
    In IE it gives the error message:

    The XML page cannot be displayed
    Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
    ——————————————————————————–
    An invalid character was found in text content. Error processing resource ‘http://www.siteupdatenotification.com/watch/chan…
    <![CDATA[

    2) If the title of the watched page isn’t in English, it comes through as gibberish. (The feed probably just needs to declare the correct charset.) The page I’m watching where I noticed this is http://www.fya.jp/~kj2/main.html

    (We’ll work all the bugs out of this eventually.)

  4. OK, I’ve fixed the invalid XML. Tested in:
    * Internet Explorer v8 Beta 2
    * Firefox 3.0.5
    * Chrome 1.0.154.36
    * Opera 9.10 Build 8679

    The problem with Japanese characters needs some more investigation. In the meantime, I’ve added text processing to replace non-printable characters with question marks.

    Thanks for your help.
    – Craig
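
The stopgap mentioned in comment 4, replacing non-printable characters with question marks, might look something like the sketch below. It is not the site's actual code; the function name `replace_non_printable` and the choice to keep tabs and line breaks are assumptions made for illustration.

```python
def replace_non_printable(text, placeholder="?"):
    """Replace characters that are not printable with a placeholder.

    Tabs and line breaks are kept (an assumption), since they are
    harmless inside XML text content.
    """
    allowed_whitespace = "\t\n\r"
    return "".join(
        ch if ch.isprintable() or ch in allowed_whitespace else placeholder
        for ch in text
    )


# Control characters such as NUL or ESC are replaced with the placeholder.
cleaned = replace_non_printable("Page\x00 title\x1b")
```

Note that Python's `str.isprintable()` treats correctly decoded Japanese text as printable, so a filter like this only hides stray control characters. It does not address the underlying charset problem, which still needs the investigation Craig mentions.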