Proxy Service Optimization

Over the past few days, many people have noticed some problems with the MySpace feeds. That’s because I’ve been experimenting with ways to speed up these feeds. Some of these experiments were less successful than others, and a few broke the feeds altogether. But I think I’ve come up with some successful solutions, which I’ll document here.

The big bottleneck in the MySpace feed services, as with any proxy service, is getting the original content. MySpace pages weigh in at around 45 KB each. At a dozen or so requests every minute, that’s roughly half a megabyte per minute, which is a lot. So step one, taken a long time ago, was reducing that load.

It’s the nature of feeds that they’re requested much more often than there’s actually new content. That’s sort of the point of feeds. Ideally the feeds would only reload content from MySpace when there’s something new to load. There are various methods built into HTTP, e.g. ETags, to ask a server “is there anything new on this page?” Unfortunately, MySpace doesn’t support any of them, so the only way to find out if there’s something new is to look at the actual page.
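For illustration, here is a minimal Python sketch of what such a conditional request looks like from the client side, assuming a server that does support it. The function names are hypothetical and are not part of the feed code; only the header names and the 304 status come from HTTP itself.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to skip the body if nothing changed.

    etag and last_modified are the values the server sent with the
    previous response (hypothetical inputs for this sketch).
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_unchanged(status_code):
    """A 304 Not Modified response means the cached copy is still current."""
    return status_code == 304
```

A server that honors these headers answers with a short 304 response instead of the full page, which is exactly the saving that isn’t available here.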

My solution to this was to only look at the actual page once per hour. I figure a one hour lag between when your friend updates her MySpace page and when that update shows up in your feed is okay. If you need to know sooner than that, you should 1) go outside or 2) check MySpace directly. So if a page was already reloaded in the past hour, I use a local copy instead of requesting an update from MySpace.
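The once-per-hour rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual feed code: the in-memory dictionary, the `fetch` callback, and the names are all assumptions made for the example.

```python
import time

CACHE_TTL_SECONDS = 60 * 60  # one hour, as described above


def get_page(url, cache, fetch, now=None):
    """Return the page for url, calling fetch(url) at most once per hour.

    cache is a plain dict here (an assumption for this sketch);
    fetch is any function that downloads the page and returns its text.
    """
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is not None and now - entry["fetched_at"] < CACHE_TTL_SECONDS:
        # Fresh enough: serve the local copy without contacting the origin.
        return entry["content"]
    # Missing or older than an hour: reload and remember when we did it.
    content = fetch(url)
    cache[url] = {"content": content, "fetched_at": now}
    return content
```

However many times a feed is requested within the hour, the origin server sees only one request for that page.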

And this was all the optimization the feed service needed for a long time. But lately the feeds have become popular enough that more was needed. I was saving the cached pages in a database, which held both the content and the time of the last update. Fetching the time was almost instant, but fetching 45 KB of text out of a database is a relatively slow process. Because every request goes through the same central database, a few dozen relatively slow queries will quickly create a backlog that takes a few minutes (forever in internet time) to clear up.

While that backlog is clearing up, a script is waiting for a response from the database. But scripts take up resources, so they can’t just wait around forever. If the database request takes too long, the script times out, and that’s when you see an error message saying my server did not respond. That’s no good.

So the next optimization step was to move the content from the database to individual files. Files don’t take as long to read and write because they don’t need to be indexed for searching like a database record. And if one file does take a long time to read, that doesn’t slow down all the other files as it would in a database. I still kept the update times in the database, because those need to be requested by URL, and I didn’t want to deal with working out some sort of URL-to-file-name mapping.
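One way to split the cache like this, sketched in Python: the database keeps only the URL and update time, and the content goes into a file named after the database row ID, which sidesteps any URL-to-file-name mapping. The row-ID naming, the table layout, and the use of SQLite are assumptions for this example, not a description of the real service.

```python
import os
import sqlite3
import time


def open_index(db_path=":memory:"):
    """Open the small index that maps each URL to its last update time."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, updated_at REAL)"
    )
    return conn


def save_page(conn, cache_dir, url, content, now=None):
    """Record the update time in the database and write the content to a file."""
    now = time.time() if now is None else now
    row = conn.execute("SELECT id FROM pages WHERE url = ?", (url,)).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO pages (url, updated_at) VALUES (?, ?)", (url, now)
        )
        page_id = cur.lastrowid
    else:
        page_id = row[0]
        conn.execute("UPDATE pages SET updated_at = ? WHERE id = ?", (now, page_id))
    conn.commit()
    # The file is named after the row ID, so the URL never has to be a file name.
    path = os.path.join(cache_dir, "%d.html" % page_id)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path


def load_page(conn, cache_dir, url):
    """Return (content, updated_at) for url, or None if it was never cached."""
    row = conn.execute(
        "SELECT id, updated_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    page_id, updated_at = row
    with open(os.path.join(cache_dir, "%d.html" % page_id), encoding="utf-8") as f:
        return f.read(), updated_at
```

The database query now returns two small values, and the large read happens on the file system, where one slow file doesn’t hold up the others.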

So that sped things up quite a bit, but I’d still notice a bit of backlog in the database requests every now and then. Why was it taking so long to read and update a simple time in the database? It turns out the slow part was finding the record for a specific URL.

I was saving the URLs in a TEXT field in the database. I knew VARCHAR fields were indexed faster, but they’re limited to 255 characters, and I wasn’t sure if I’d be getting URLs longer than 255 characters. So I used TEXT, but I created an index, so searches would be faster. The additional time required to update a TEXT index on adding new URLs was apparently more than the time saved in searching, though, so the index actually slowed down the database even more.

So I revisited VARCHAR. Some quick research showed that, out of the 21,000 or so URLs I had cached, only one URL was longer than 255 characters, and it wasn’t even valid. So my third step, after creating a cache and then moving the cache content from the database to the file system, was to index URLs as VARCHAR instead of TEXT.
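The “quick research” step amounts to counting how many stored URLs would not fit in the shorter column. A minimal Python sketch, assuming a hypothetical `pages` table with a `url` column and using SQLite as a stand-in for whatever database is actually in use:

```python
import sqlite3

MAX_VARCHAR_LENGTH = 255  # the VARCHAR limit mentioned above


def count_long_urls(conn, limit=MAX_VARCHAR_LENGTH):
    """Count cached URLs that are longer than limit characters."""
    row = conn.execute(
        "SELECT COUNT(*) FROM pages WHERE LENGTH(url) > ?", (limit,)
    ).fetchone()
    return row[0]
```

If the count is zero, or the few long URLs turn out to be invalid anyway, switching the column type loses nothing.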

Now everything seems to be nice and speedy. There’s still a lag when a page actually needs to be requested from MySpace, but for all other requests, the feeds seem to be loading almost instantly. There’s one question I was asked several times by people who noticed the feeds were slow and/or broken: why don’t I just spread out the load by giving out copies of the code? That would certainly remove the need for optimization. But I actually have a few different reasons for not doing that, which I’ll save for the next article.
