Implementing a web-based aggregator
A while back I was thinking of building a web-based RSS aggregator like Bloglines. The concept seemed simple enough that it shouldn’t be hard to implement, but there are a couple of trouble areas that you run into when developing an RSS aggregator.
- Parsing the feeds. This can get really hairy because you want to be able to parse every feed your users subscribe to, even when those feeds are not valid RSS/Atom. To account for all the variations you need a really robust feed parser, which can be a nightmare to write yourself, and I didn’t find a good .NET library that would handle this for me. I looked into borrowing the code from RSS Bandit, and I also looked at the Universal Feed Parser, which is written in Python, and at using IronPython to incorporate it.
- Scaling. This is probably the biggest problem. The aggregator I ended up writing works great for valid RSS feeds, as long as I’m the only one using it :) It runs on my single desktop just fine, but I suspect that with more than 30 to 40 users it would crash and burn, especially if each of those users subscribed to a nontrivial number of feeds. The scalability problem has been well documented as FeedLounge has grown. They have been really transparent about the issues they’ve hit while ramping up their service, and scaling has been their biggest obstacle.
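To make the parsing point a bit more concrete, here is a minimal sketch in Python of what a strict parser gives you. It is not the code from my aggregator; the function name `extract_titles` and the three sample feeds are made up for illustration, and it only uses the standard library's `xml.etree.ElementTree`. It handles well-formed RSS 2.0 and Atom 1.0 and simply gives up on anything that is not well-formed XML, which is exactly the case a robust parser has to recover from.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 elements live in this XML namespace.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_titles(feed_text):
    """Return the entry titles of an RSS 2.0 or Atom 1.0 document.

    Returns None when the text is not well-formed XML, and an empty
    list when the root element is neither <rss> nor an Atom <feed>.
    (Hypothetical helper, for illustration only.)
    """
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # A strict XML parser cannot continue past a syntax error.
        return None

    if root.tag == "rss":
        items = root.findall("./channel/item")
        return [item.findtext("title", default="") for item in items]

    if root.tag == ATOM_NS + "feed":
        entries = root.findall(ATOM_NS + "entry")
        return [e.findtext(ATOM_NS + "title", default="") for e in entries]

    return []


# Invented sample feeds: one RSS, one Atom, one with a mismatched tag.
rss_sample = (
    "<rss version='2.0'><channel><title>Blog</title>"
    "<item><title>First post</title></item>"
    "<item><title>Second post</title></item>"
    "</channel></rss>"
)
atom_sample = (
    "<feed xmlns='http://www.w3.org/2005/Atom'><title>Blog</title>"
    "<entry><title>Hello</title></entry></feed>"
)
broken_sample = "<rss><channel><item><title>Unclosed</item></channel></rss>"

for sample in (rss_sample, atom_sample, broken_sample):
    print(extract_titles(sample))
```

The broken sample is the interesting one: a single mismatched tag is enough to lose the whole feed, even though a person could still read the title. Closing that gap for every kind of breakage found in real feeds is what makes writing the parser yourself such a nightmare.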
Another idea I had for alleviating the parsing problem was to use Bloglines as a sort of proxy. Bloglines republishes the feeds that its users are subscribed to, and the feeds it publishes are guaranteed to be valid RSS, so I wouldn’t have to worry about parsing invalid feeds, which makes life a lot easier.
In the end, I realized that I wouldn’t be able to afford the hardware needed to run a web-based aggregator, so the project died off. It was still an interesting problem to work on: there are several different ways to implement an aggregator, so making the design choices was fun.