Implementing a web-based aggregator

A while back I was thinking of making a web-based RSS aggregator like Bloglines. The concept seemed pretty simple, so I figured it shouldn’t be that hard to implement, but there are a couple of trouble areas that you run into when developing an RSS aggregator.

  • Parsing the feeds.  This can get really hairy at times because you want to make sure that you can parse all the feeds that your users are subscribing to, even though those feeds may not be valid RSS/Atom feeds.  To account for all the different feeds, you need a really robust feed parser, which can be a nightmare to write yourself, and I didn’t really find a good .NET library that would handle this for me.  I looked into grabbing the code from RSS Bandit to do it, and I also looked at using the Universal Feed Parser, which is written in Python, and using IronPython to incorporate it.
  • Scaling.  This is huge and probably the biggest problem.  The aggregator that I ended up writing works great for valid RSS feeds, and as long as I’m the only one using it :)  It can run on my single desktop just fine, but I suspect that if you loaded up any more than 30 to 40 users, the thing would crash and burn, especially if those users all had a nontrivial number of feeds that they subscribed to. The scalability problem has been well documented as FeedLounge has been growing. They have been really transparent about the issues they’ve been hitting as they ramp up their service, and scaling has been their biggest obstacle.
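
To make the parsing problem concrete, here is a minimal Python sketch (not the code from my aggregator, and not how RSS Bandit or the Universal Feed Parser work) that pulls titles and links out of RSS 2.0 and Atom 1.0 feeds using only the standard library. The function name `parse_feed` is made up for illustration. Note that it simply gives up on XML that is not well-formed, which is exactly the case a robust feed parser has to handle.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 elements live in this namespace.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of (title, link) tuples, or [] if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A robust parser would try to recover here instead of giving up.
        return []

    items = []
    if root.tag == ATOM_NS + "feed":
        # Atom: entries are direct children of <feed>, links are attributes.
        for entry in root.findall(ATOM_NS + "entry"):
            title = entry.findtext(ATOM_NS + "title", default="")
            link_el = entry.find(ATOM_NS + "link")
            link = link_el.get("href", "") if link_el is not None else ""
            items.append((title.strip(), link))
    else:
        # RSS 0.9x / 2.0: un-namespaced <item> elements under <channel>.
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            items.append((title.strip(), link.strip()))
    return items
```

Even this small version has to branch on the feed format, and it does nothing about bad encodings, unescaped HTML, or RSS 1.0 (RDF) feeds.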

Another idea I had for alleviating the parsing problem was to use Bloglines as a sort of proxy.  Bloglines republishes the feeds that their users are subscribed to, and the feeds that they publish are guaranteed to be valid RSS feeds, so I wouldn’t have to worry about parsing any invalid feeds, which makes life a lot easier.

In the end, I realized that I wouldn’t be able to afford the hardware that would be necessary to run a web based aggregator, so the project died off, but it was an interesting problem to work on as there are some different ways to implement the aggregator, so making some design choices was fun.

6 Comments so far »

  1. Anonymous said,

    Wrote on January 23, 2006 @ 12:58 pm

    Bloglines has an API. How about just writing a front end for it? (I *hate* the current Bloglines UI.)

    As for the scalability, I can’t believe the FeedLounge guys had trouble fetching and processing 74 feeds per user. Were they trying to do this in real-time? Was there no overlap between the user subscriptions? This makes absolutely no sense (to me), surely it makes more sense to have a central list of feeds that are fetched and then show each user’s view of that central store of items. Maybe they did do it that way and I am missing something, but building a scalable back-end for this sort of thing is something I’d *love* to do. Maybe I’ll get around to it.
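
A minimal Python sketch of the "central list of feeds, per-user view" idea described above. This is only an in-memory illustration; the class and method names (`FeedStore`, `feeds_to_fetch`, and so on) are hypothetical and do not come from FeedLounge, Bloglines, or the aggregator in the post.

```python
from collections import defaultdict


class FeedStore:
    """Fetch each feed once; every user sees a filtered view of the shared items."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # feed URL -> set of user names
        self.items = defaultdict(list)       # feed URL -> item ids, in arrival order
        self.read = defaultdict(set)         # user name -> item ids marked read

    def subscribe(self, user, url):
        self.subscribers[url].add(user)

    def feeds_to_fetch(self):
        # One entry per distinct feed, no matter how many users share it.
        return sorted(url for url, users in self.subscribers.items() if users)

    def add_item(self, url, item_id):
        # Ignore items that were already stored for this feed.
        if item_id not in self.items[url]:
            self.items[url].append(item_id)

    def mark_read(self, user, item_id):
        self.read[user].add(item_id)

    def unread(self, user):
        # A user's view: items from subscribed feeds, minus what they have read.
        return [
            (url, item_id)
            for url in sorted(self.subscribers)
            if user in self.subscribers[url]
            for item_id in self.items[url]
            if item_id not in self.read[user]
        ]
```

With two users subscribed to the same feed, `feeds_to_fetch()` lists that feed once, and marking an item read for one user leaves it unread for the other.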

  2. breichelt said,

    Wrote on January 23, 2006 @ 1:13 pm

    Ross, that’s the way that I implemented my little aggregator. If people have feeds in common, then you can use that to your advantage and only grab the feed once, just like you describe, and I’m sure that the FeedLounge guys have done that.

    But there must be some number of unique feeds per user, and if there are 100 users, each with 74 feeds, there’s a grand total of 7,400 unique feeds in the worst case. If you want to update each subscriber’s feeds each hour, that amounts to 7,400 web requests each hour, and for each web request there is the cost of indexing the new or updated items in that feed. If we assume one second per feed to get the feed over the web and to index it, that’s 7,400 seconds, and since there are only 3,600 seconds in an hour, you can see where you might run into problems :)
    Granted, this example was assuming one machine and one processor doing the feed updates. If you throw multiple boxes at the problem, it gets better (obviously), but that’s where the expense comes into play.
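
The worst-case arithmetic above can be written out as a short Python sketch. The helper name `workers_needed` is made up for illustration, and the one-second-per-feed cost is the assumption from the comment, not a measurement.

```python
import math


def workers_needed(feeds, seconds_per_feed, window_seconds):
    """Sequential workers required to process every feed once per window."""
    total_seconds = feeds * seconds_per_feed
    return math.ceil(total_seconds / window_seconds)


users = 100
feeds_per_user = 74
unique_feeds = users * feeds_per_user      # worst case: no overlap between users

print(unique_feeds)                        # 7400 feeds, so 7400 seconds of work
print(workers_needed(unique_feeds, 1, 3600))  # 3 workers to finish within an hour
```

A single sequential worker would need 7,400 seconds for a cycle that has to finish in 3,600, which is the bottleneck described above.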

    I’m not sure how Bloglines or NewsGator does their feed updating. They could just have a massive infrastructure, but I don’t see a way around requesting those feeds on some sort of timed cycle.

    (One thing I know FeedLounge did was to be smart about which feeds they updated: feeds that were rarely updated were checked less frequently than feeds that were updated more often.)
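
One simple way to check quiet feeds less often is to adjust each feed's polling interval after every fetch. The Python sketch below is an assumption for illustration only: FeedLounge's actual scheduling is not described here, and the function name, the halve/double rule, and the 15-minute and 24-hour bounds are all made up.

```python
def next_interval(current, changed, minimum=900, maximum=86400):
    """Return the next polling interval in seconds.

    Halve the interval when the last fetch found new items, double it
    when nothing changed, and clamp the result to [minimum, maximum].
    """
    proposed = current // 2 if changed else current * 2
    return max(minimum, min(maximum, proposed))
```

A feed polled hourly that keeps coming back unchanged drifts to 2 hours, then 4, up to once a day, while a busy feed settles near the 15-minute floor.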

  3. Anonymous said,

    Wrote on January 23, 2006 @ 2:28 pm

    Ben,

    7,400 feeds an hour is probably actually do-able. I’d posit that a large number of requests would end early at the if-modified-since stage (the server answers 304 Not Modified when nothing has changed), and there is no reason you would need to do one per second. Two or even three threads sharing even a small-ish connection should be enough. Assuming, of course, minuscule latency :)
    How you spread the load on the database would be an interesting problem, but most queries would be relatively simple (and hopefully lightweight), at least using the design I came up with last time I thought about this, but getting all of the feeds in might be the problem. I’m going to go and get some more coffee and think about it some more.
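
For reference, a conditional fetch with the If-Modified-Since header might look like the following Python sketch, using only the standard library. The function names are hypothetical, and sending an ETag via If-None-Match is an addition that the comment above does not mention.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request that lets the server answer 304 Not Modified."""
    req = urllib.request.Request(url)
    if last_modified:
        # Value of the Last-Modified header from the previous fetch.
        req.add_header("If-Modified-Since", last_modified)
    if etag:
        # Assumption beyond the comment: also send the previous ETag.
        req.add_header("If-None-Match", etag)
    return req


def fetch_if_changed(url, last_modified=None, etag=None, timeout=10):
    """Return (body, last_modified, etag), or None if the feed is unchanged."""
    req = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (
                resp.read(),
                resp.headers.get("Last-Modified"),
                resp.headers.get("ETag"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # nothing to download, parse, or index
        raise
```

When the server returns 304 there is no body to transfer or index, so the per-feed cost for unchanged feeds drops to roughly one round trip.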

  4. breichelt said,

    Wrote on January 23, 2006 @ 3:05 pm

    Yep, you’re right, you could use a couple more threads, and the if-modified-since header would also filter out some more work, but that’s only a stopgap solution, because once you get enough users and enough unique feeds, the problem will crop up again.

    The database queries were pretty trivial and the database schema as a whole was pretty simplistic, and since there aren’t that many writes occurring, mainly to mark items as read, you could optimize it pretty well. I’ll have to check, but I’m pretty sure I stored the body of the posts in the db tables. Another option would be to save the body to the file system and use the db as just an index for that.
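
The "bodies on the file system, db as an index" option could be sketched in Python like this. It is an illustration under assumptions, not the schema from the aggregator: the table layout, the function names, and the choice of SQLite and SHA-1 file names are all made up for the example.

```python
import hashlib
import os
import sqlite3


def open_index(db_path=":memory:"):
    """Create (if needed) the index table that points at body files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, title TEXT, body_path TEXT NOT NULL)"
    )
    return conn


def save_item(conn, body_dir, guid, title, body):
    """Write the post body to a file and record its path in the index."""
    name = hashlib.sha1(guid.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(body_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    conn.execute(
        "INSERT OR REPLACE INTO items (guid, title, body_path) VALUES (?, ?, ?)",
        (guid, title, path),
    )
    conn.commit()


def load_body(conn, guid):
    """Look up the path in the index and read the body back, or return None."""
    row = conn.execute(
        "SELECT body_path FROM items WHERE guid = ?", (guid,)
    ).fetchone()
    if row is None:
        return None
    with open(row[0], encoding="utf-8") as f:
        return f.read()
```

The trade-off is that the database rows stay small, while each body read costs an extra file access.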

  5. Anonymous said,

    Wrote on January 24, 2006 @ 6:33 am

    I’m playing with the idea of individual databases, almost on a per-user basis, to handle the read items. I wonder how much of a strain SQLite would put on the system, although I suspect the disk would very quickly become the bottleneck.

    I guess the problem is really stated as, how do you scale transparently without taking your system offline or your users noticing a degradation of performance whilst you scale up. It’s an issue that has been discussed elsewhere on blogs basically saying don’t bother designing for scalability too early - which I think is just wrong.

  6. breichelt said,

    Wrote on January 25, 2006 @ 1:26 pm

    I agree with you, it would be pretty hard to scale the service using software alone. You would need more hardware, and the problem then becomes the transparency, as you’ve mentioned. When Bloglines moved datacenters, for instance, the service was down for a few hours (which is actually pretty damn good, I think).
