Database Optimizations

When we first started working on ridewithgps, we made the decision to store all route information as JSON text inside the database. The downside should be pretty obvious: a route or trip can be a fairly large string. I have some trips saved from my GPS unit that are 1.5 megs of XML and around 600KB of JSON. To cope with the performance hit when using ActiveRecord to pull routes/trips from the database, we used a plugin called attribute_lazy that delayed loading of those fields until they were explicitly queried for. However, this added another layer of metaprogramming complexity, and turned out to be a pain to work with in certain situations.

The final straw was an ever-growing database: dumps were hitting 3GB with our current ~22k routes/trips. Considering we need offsite backups as well as an easy way to pull the dataset down locally for development, I desperately wanted to use rsync for incremental, efficient transfer of only the modified and/or new routes! The solution was to move the JSON out of the database entirely and store it as flat files on disk, one set of files per route/trip.

There was one thing to think about when deciding on this solution: filesystems use a single inode per stored file, and you can actually run out of inodes before you run out of disk space, depending on your inode size and the size of the files you are storing. Since each route and trip takes about 4 separate files (I store reduced versions of the route for low-resolution maps, which need fast load times), I did some calculations based on having around 120 million free inodes. Without considering other uploads such as images, that gives me around 30 million routes/trips before running into the inode limit on my current 1 terabyte RAID10 array. If we can't afford to expand our storage by the time we have another couple million routes/photos, then we are doing something wrong.
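For the curious, the back-of-the-envelope math looks like this (the variable names are just for illustration; the free inode count is what `df -i` reports for the array, rounded):

free_inodes      = 120_000_000 # roughly what `df -i` shows free on the RAID10 array
files_per_record = 4           # full JSON plus the reduced versions for maps
free_inodes / files_per_record # => 30_000_000 routes/trips before hitting the limit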

This task was fairly straightforward, with one exception: storing a file for a particular trip or route requires that record's database ID, which is not available until after the record is saved. So we have to hold onto the JSON until the ID exists. The canonical Rails way of doing this is of course the after_save callback. I keep the JSON around by shoving it into an instance variable on the route/trip instance being saved, then clear that variable once the file has been written.

def data_full=(json)
  # No ID yet means no file path yet; stash the JSON until after_save.
  if id.nil?
    @data_full = json
    return
  end

  if json
    # Write (or overwrite) the flatfile; df_path is the path for this record.
    File.open(df_path, 'w+') { |file| file << json }
  elsif File.exist?(df_path)
    # Assigning nil means the stored data should be removed.
    File.delete(df_path)
  end
  @data_full = nil
end

A simplified version of what we are using is above, loaded into routes and trips via a shared module. Basically, we first check if the ID exists; if not, we stash the JSON in an instance variable. If it does, we open the flatfile in truncate mode (overwriting it if it already exists) and write the JSON. If a nil value is passed in, we delete the file if it exists.
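For illustration, the shared module might look something like this; FlatfileData, FLATFILE_ROOT, and the df_path layout are made-up names standing in for our actual code:

module FlatfileData
  # Hypothetical path scheme: one JSON file per record, grouped by model name.
  def df_path
    File.join(FLATFILE_ROOT, self.class.name.downcase.pluralize, "#{id}.json")
  end

  # Read the flatfile on demand, falling back to any not-yet-written JSON.
  def data_full
    @data_full || (File.read(df_path) if File.exist?(df_path))
  end
end

class Route < ActiveRecord::Base
  include FlatfileData
end

class Trip < ActiveRecord::Base
  include FlatfileData
end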

The after_save method for routes and trips is trivial as well, since we just call the assignment again. On this second call the record has an ID, so the assignment falls past the nil-ID guard and the file write occurs.

def after_save
  # By now the record has an ID, so the setter writes the flatfile
  # and clears @data_full.
  self.data_full = @data_full if @data_full
  # some boring stuff also happens...
end
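Putting it together, the life cycle of a new record looks roughly like this (names invented for the example):

route = Route.new(:name => 'Council Crest loop')
route.data_full = big_json_string # no ID yet, so the JSON is stashed in @data_full
route.save                        # after_save calls the setter again and writes the flatfile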

The final bit of the puzzle is removing the flatfiles when a route/trip is destroyed, which is pretty trivial; a sketch of that cleanup is below. The change as a whole will greatly improve throughput under heavy load, which will be welcome when we get a huge influx of users next riding season. The best part? Our 3GB database dump has been whittled down to a mere 26MB! We can easily hit 100k users and 500k routes/trips without breaking the 1GB mark, which means database scaling will be a matter of performance rather than space. The directory structure containing the flatfiles is currently at 2.6 gigs, and with rsync, backups are now speedy and efficient.
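For completeness, here is a minimal sketch of that cleanup, assuming the same df_path helper from the shared module:

def after_destroy
  # Remove the flatfile along with the database record.
  File.delete(df_path) if File.exist?(df_path)
end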