Fine tuning Airbrake for Resque at Ticket Evolution.

Hi, my name is Spike Grobstein and I'm a Sr. DevOps Engineer at Ticket Evolution; since 2010,
we've been a SaaS provider that handles real-time execution, clearing and settlement of event ticket trades. If you ever purchased a ticket to a sports game online, or called a broker for seats to a concert, chances are our system was used somewhere in between.


At Ticket Evolution, we have a lot of background processes that we need to keep track of. Be they
Resque workers munching through hundreds of thousands of jobs or scheduled tasks being run throughcron, when companies are depending on your services to be accurate and available, silent failure is not an option.

The biggest pain point for being notified about failure in our system was cron. Although all of
our cron tasks are run through rake, we have a multitude of applications spread across servers and we need to know when they fail, what caused them to fail, what the state was when they failed, and what exactly failed. For this, we built a rake wrapper that I named timevault that will detect when a rake task fails by inspecting the exit code and notify us via the airbrake gem.

Until now, we simply ran the rake tasks like so:

* * * * * cd /path/to/project/current && bundle exec rake some:task --trace 2>&1 | logger -t some_task

This was fine for when we would need to see why something failed, but it didn't let us know when
it failed and didn't give us a pretty interface for finding what went wrong. As everyone knows,
grep'ing through logs is only pretty for a select few.

Requirements for timevault's Airbrake notifications were the following:

* Which rake task failed?
* What was the exit code?
* What was the output of the run?
* What was the stack trace (the tasks should be run with --trace)?
* Which environment failed (production, sandbox, staging)?

An early version of this Ruby script did something along the lines of the following:

Dir.chdir(project_directory) do
line ='/usr/local/bin/bundle', 'exec rake :rake_task --trace 2>&1',
:logger => @logger

@output = => rake_task) @output
rescue => e
@logger.fatal "An error occurred:"
@logger.fatal e.message

timevault was designed to be called like so:

    timevault `path_to_project` `rake_task`

So it would Dir.chdir to the project and run rake with the passed task name.

This caused issues because each Airbrake notification would stand alone. Airbrake had no idea
how to group the errors, there was no metadata in them, and the title of the error was the entire
message from the exception. If we had a cron job that ran once a minute and failed every time, we'd be receiving 60 airbrake errors per hour. This would not stand.

This is where we began to pick apart the airbrake gem to see how it ticks and to see what exactly we could get away with and force my caputured data into the Airbrake messages. Since the rake task was run with --trace, failure would mean that it would spit out a stack trace. We also have the rake task and project path on-hand, so once we parse the output, we should be able to properly
report to Airbrake.

After rummaging through the airbrake gem's source and seeing which parameters we could set and a seemingly endless task of sending various types of contrived rake failures through this script, we came away with the following code in the rescue block:

    rescue => e
# log the error:
@logger.fatal "An error occurred:"
@logger.fatal e.message

# pull the backtrace off the output, which is found after a double n:
backtrace = e.message.split("nn").last

# notify airbrake
rake_task, # the name of the rake task (some:task)
:error_message => "#{ rake_task } FAILED", # the title of the error
:error_class => rake_task, # for grouping

# build the backtrace; only lines that don't start with a leading /
:backtrace => backtrace.split("n").delete_if { |l| l.strip.length == 0 || !l.match(/^//) },

# the environment name, defaulting to development
:environment_name => ENV['RAILS_ENV'] || 'development',

# some extra metadata:
:parameters => {
:output => e.message, # the output of the command
:base_dir => base_dir, # the path to the project
:rake_task => rake_task # the rake task in question

The hardest part was figuring out exactly what values should be put into which fields to get the
desired result in the Airbrake dashboard and in notification emails. It wasn't the most straight-forward task to decide to use environment_name over rack_env. Also, knowing exactly how each field translated to the fields on the dashboard required a bit of trial and error.

There is still some work to be done on this, but this is where we're at, now. When reporting errors,
they're tagged with the correct environment name, they have a well-formed backtrace, and the
title and class of error directly correlates to which rake task it was that failed. Airbrake then groups similar errors as expected.

Missing functionality includes lack of what the exit code was of the task (although we do have that
as part of the output field) and we don't yet notify Airbrake of our deployments to reset errors like we do with the applications themselves. On Airbrake, we have a project dedicated to Cron, but in the future, we plan on merging each individual job into the project that it belongs.

Overall, this was a great learning experience into the internals of how Airbrake handles reported
errors and it solved many of our issues, also greatly simplifying how we call rake tasks.

This was a guest blog post by Spike Grobstein a Sr. DevOps Engineer at Ticket Evolution. If you would like to guest blog, please contact