Tips for Importing Large Datasets Using phpMyAdmin

I recently imported a large dataset using phpMyAdmin (no SSH) and learned a couple things along the way that I thought worth sharing (if only with my future self).

1) Use split

phpMyAdmin’s import has a maximum file size, which is significantly smaller than the dataset I was importing. My initial solution to this problem was to manually break up the text file containing the data. After doing this for a few minutes, I thought “I should make a tool to do this automatically.” Then I thought “Maybe someone already made that tool.” So I headed over to Google to find an application to download, only to discover it’s already on my machine. It’s a Unix command, aptly named “split.” Here’s how I used it:

split -l 250000 bigfile.csv

That gave me a bunch of files of 250,000 lines each, about 8MB in size, comfortably under my 10MB maximum.

2) Use gzip

phpMyAdmin can unpack compressed files. I got an error about file size headers when I tried using a zip file, so I went with gzip instead, and that worked fine, reducing my 8MB files to about 1.5MB. I used gzip like this:

gzip x*

You’re probably wondering: if my maximum upload size is 10MB, why didn’t I split into 30MB files and get 6MB gzips? I didn’t do that because of another limit: time. PHP scripts (including phpMyAdmin) can only run for so long at one go, and it takes a while to run 250,000 MySQL queries.
After that, it’s just a series of imports, which should run smoothly. If you’re willing and able to SSH to your database, of course, command-line MySQL makes this all a single step. But when that’s not an option, this will at least make the slow alternative less so.

Download Daily Show

You may remember that Viacom sued YouTube for a billion dollars over “Inducement of Copyright Infringement.” What YouTube does, specifically, is make it possible to download clips of, for example, the Daily Show, without watching Comedy Central’s ads. That’s certainly not the purpose of YouTube, but making it possible to download is a side affect of posting anything on the internet.

What many people don’t know is those Daily Show clips were also available on Comedy Central’s website, and could be downloaded through almost the exact same process through which they’re downloaded from YouTube. But Comedy Central’s website being incredibly difficult to access made such downloading more difficult. For one thing, it was almost impossible to locate any but the most recent clips on Comedy Central’s site.

Enter TheDailyShow.com, Viacom’s apparent attempt to compete with YouTube by offering a more accessible interface. Now every Daily Show clip has it’s own page at a semi-permanent address, and these pages are relatively easy to locate. So you can find and watch your favorite old Daily Show clip, just like you once could on YouTube, only without the same “Inducement of Copyright Infringement.” Why didn’t YouTube just do that?

It turns out they did. The increased accessibility of TheDailyShow.com also makes it incredibly easy to download Daily Show clips just as they were downloaded from YouTube. If you follow that link, you’ll see just how easy it is. Paste in a link, start downloading. The lesson here: on the internet, “Inducement of Copyright Infringement” is pretty much synonymous with “accessible.” You can’t view something on the internet without first downloading it, so the only way to make it difficult-to-download is to make it difficult-to-view. I suspect someone at Viacom imagined they could built an easy-to-view but difficult-to-download video website to demonstrate what YouTube could be, and why they’re suing. That someone was completely wrong.

25 Statements About Microformats

At the last Refresh Denver meeting, I talked about microformats. Rather than doing a slideshow presentation, I just wrote down and passed out 25 statements about microformats that I hoped would be interesting enough to drive discussion. It worked out well enough, I think. In hindsight, it probably would have been better to focus less on the abstract concepts and more on the specifics of how microformats work. But I definitely plan to use this kind of “25 statements” style for future talks, as it ends up being more flexible and more participatory than a typical presentation.

MySpace Code Licensing

As I mentioned previously, many people have asked me about releasing the source code for various MySpace feed services. Back when I first created the blog service, it was actually released under an open source license. That was a mistake. The code was written to work on my specific server, and most people downloading it and trying to install it on different servers didn’t understand it well enough to make the adjustments necessary to get it working. So what I got in exchange for releasing the source code was a series of requests for help installing and customizing, and not a single code contribution.

Why don’t I just rewrite the code to work better on other servers? To turn that back around, why would I? I created the original services to solve problems for myself. Specifically, I want to be able to keep up with my friends’ words and events without checking MySpace regularly. Then I made the services available for others to freely use. I’d already made them for myself, so that wasn’t much work. But packaging the code to run better on other servers is work, and that work doesn’t benefit me at all. If it doesn’t get me code contributions and it doesn’t make my life easier, why should I do it?

It turns out money is a pretty good reason. When people started offering me money for the code, suddenly that work became more interesting. This money for work exchange is a novel concept, but apparently others have tried it before. So I mostly copied what they’ve done and I’m now offering licenses for the code behind my various MySpace feed services. You can buy a license for either the blog feed creator, the comment feed creator, or the event feed creator for $100 each, or you can buy all three for $250.

The license is pretty straightforward, but there is one somewhat abnormal clause that requires you to keep your own copy of the code updated as I release new versions. I don’t want a lot of broken versions of the code out there with my name on it, and I don’t think keeping it working is too much to ask. You can license it and not keep it updated if you really want, but then I’m going to stop sending you updates. So you can either update promptly, or not at all. Other than that it basically boils down to you giving me money and me giving you code.

Proxy Service Optimization

Over the past few days, many people have noticed some problems with the MySpace feeds. That’s because over the past few days I’ve been experimenting with potential ways to speed up these feeds. Some of these experiments have been less successful, some breaking the feeds altogether. But I think I’ve come up with some successful solutions, which I’ll document here.

The big bottleneck in the MySpace feed services, as with any proxy service, is getting the original content. MySpace pages weigh in at around 45kb each. At a dozen or requests every minute, that’s a lot. So step one, taken a long time ago, was reducing that load.

It’s the nature of feeds that they’re requested much more often than there’s actually new content. That’s sort of the point of feeds. Ideally the feeds would only reload content from MySpace when there’s something new to load. There are various methods built into HTTP, e.g. etags, to ask a server “is there anything new on this page?” Unfortunately, MySpace doesn’t support any of them, so the only way to find out if there’s something new is to look at the actual page.

My solution to this was to only look at the actual page once per hour. I figure a one hour lag between when your friend updates her MySpace page and when that update shows up in your feed is okay. If you need to know sooner than that, you should 1) go outside or 2) check MySpace directly. So if a page was already reloaded in the past hour, I use a local copy instead of requesting an update from MySpace.

And this was all the optimization the feed service needed for a long time. But lately the feeds have become popular enough that more was needed. I was saving the cached pages in a database, which held both the content and the time of the last update. Fetching the time was almost instant, but fetching 45kb of text out of a database is a relatively slow process, and due to the centralized nature of a database, a few dozen relatively slow database queries will quickly create a backlog that takes a few minutes (forever in internet time) to clear up.

While that backlog is clearing up, a script is waiting for a response from the database. But scripts take up resources, so they can’t just wait around forever. If the database request takes too long, the script times out, and that’s when you see an error message saying my server did not respond. That’s no good.

So the next optimization step was to move the content from the database to individual files. Files don’t take as long to read and write because they don’t need to be indexed for searching like a database record. And if one file does take a long time to read, that doesn’t slow down all the other files as it would in a database. I still kept the update times in the database, because those need to be requested by URL, and I didn’t want to deal with working out some sort of URL-to-file-name mapping.

So that sped up things quite a bit, but I’d still notice a bit of backlog in the database requests every now and then. Why was it taking so long to read and update a simple time in the database? It turns out it was taking so long to find the record of a specific URL.

I was saving the URLs in a TEXT field in the database. I knew VARCHAR fields were indexed faster, but they’re limited to 255 characters, and I wasn’t sure if I’d be getting URLs longer than 255 characters. So I used TEXT, but I created an index, so searches would be faster. The additional time required to update a TEXT index on adding new URLs was apparently more than the time saved in searching, though, so the index actually slowed down the database even more.

So I revisited VARCHAR. Some quick research showed that, out of the 21,000 or so URLs I had cached, only one URL was longer than 255 characters, and it wasn’t even valid. So my third step, after creating a cache and then moving the cache content from the database to the file system, was to index URLs as VARCHAR instead of TEXT.

Now everything seems to be nice and speedy. There’s still a lag when a page actually needs to be requested from MySpace, but for all other requests, the feeds seem to be loading almost instantly. There’s one question I was asked several times by people who noticed the feeds were slow and/or broken: why don’t I just spread out the load by giving out copies of the code? That would certainly remove the need for optimization. But I actually have a few different reasons for not doing that, which I’ll save for the next article.

MySpace Comments Feed Creator

A few different people have notified me about a bug in the MySpace Comments RSS Creator that caused some comments to not show up in the feed. The problem was with comments that had “Currently Online” notices in them. This was kind of difficult to track down because it only showed up when the people leaving the comments were online at the time the comment was being converted to a feed. But it’s fixed now.

Along with that bug fix, Atom feeds are now possible, and so the name is changed from “RSS Creator” to “Feed Creator,” bringing the comments in line with the blog and event feed creators.

MySpace Feed Creator

Pei Huang helpfully pointed out that the MySpace RSS Creator wasn’t working quite right on some blogs. Apparently some blogs have the time at the top of each post and others have it at the bottom. The tool would work on both, but the post times would all be wrong when the time was on top. I don’t know how if this is something new at MySpace or if it’s always been like this, but the tool will handle both types now.

Also, it can now produce Atom feeds in addition to RSS feeds. If you don’t know what that means, you probably have no reason to care, but the name has been changed accordingly. As always, please let me know if you notice any problems.

LANrev OSVersion in Database Export

At my day job I’m working with a database export from a LANrev system. If you don’t already know what LANrev is, you might as well stop reading now. I’m just putting this up on the web because I couldn’t find it there when I looked for it, and hopefully this will save someone the trouble of contacting LANrev support.

Still here? Okay then. Many fields in the database tables LANrev exports are numeric indexes to other tables, which are themselves short lists of values. For example, in the software_info table, there is a field named OSPlatform, but rather than actual platforms (e.g. “Mac OS X”, this field just contains a number, which can be used to lookup the actual platform in the table named enum_osplatform.

And that’s all well and good until you get to the OSVersion field, see it contains numbers (e.g. 4167), and look for the corresponding enum_osversion table. But there is no such table. So then you search through other tables, thinking maybe the table is just poorly named, but again, no dice. So then you search online. Previously, you would have found no explanation of this mystery online, but now hopefully such a search brought you here.

The secret? That number, 4167, is the actual version number; it’s just in an obscure format. Specifically, it’s a decimal representation of a hexadecimal representation of three decimal numbers. It’s all obvious now, right? Hopefully an example will clarify. The hexadecimal value of 4167 is 0×1047. The first two digits of that are the major version number, followed by the minor, and then the really minor or “bug” version. That is, put in some decimal points and 0×1047 becomes 10.4.7, the actual version number in it’s familiar format.

Why does LANrev do this crazy conversion? Apparently because Apple does this. Why does Apple do this? I have no idea. To save 2 bytes? And why doesn’t LANrev document this? Again, no idea. But it’s online now. Hopefully it will help someone.

hAtom Makes rel-nofollow Irrelevant

Of all the microformat specs, rel-nofollow has always been my least favorite. A lot of people don’t like rel-nofollow because it makes publishers even more prone to treat links as an implicit endorsement or agreement. And I share that concern, but my main problem with rel-nofollow is the first open issue on the microformat wiki: nofollow indicates a behavior rather than a relationship from which the behavior should be inferred as appropriate for the useragent. HTML is for describing content. JavaScript or some other programming language should be used for describing behavior.

Philipp Lenssen recently suggested removing rel-nofollow from links after a certain amount of time, in an attempt to solve another problem with rel-nofollow: it’s generally used much too broadly. Because few of us have time to inspect every link submitted via an open comment form to see whether or not we want to apply rel-nofollow to the link, most publishers just apply it to all unverified links. Philipp’s logic is that a link published several days ago can be assumed to be okay (i.e. not spam), so we can go ahead and remove the rel-nofollow.

When I read this, I suddenly realized that rel-nofollow has already been made irrelevant by hAtom. Comments marked up in hAtom already have machine-readable dates attached, so consuming applications (e.g. search engines) could makeĀ  date-based decisions about whether or not to follow the links. Those date-based decisions would be better at accurately excluding spam (and only spam) than the current decisions based on overly broad rel-nofollow markup. And as an added bonus, publishers get all the other benefits of hAtom, e.g. free comment feeds.

Playing Here

Playing Here is a website I made based on the MySpace Events Feed Creator. It uses the same process to index MySpace pages for events, but instead of creating feeds by band, it creates feeds by state, city, and venue. It also has a microformatted web interface, email subscriptions, and JavaScript includes for the events. E.g.: