Big Ruby 2014

How to scale a Ruby application to process 1 million jobs per hour

Tanner Burson

Transcript

Excerpt from the automatic transcript of the video generated by YouTube.

Alright, so thanks for the introduction, Mark. As he said, I'm here to talk about job processing. I think this is a really exciting and interesting topic that doesn't get the press it deserves in the Ruby community. Everybody knows about Resque, everybody knows about Sidekiq, but much past that it just kind of drops off into oblivion. This is stuff I've been working on for pretty much the last year, so I'm really excited to get a chance to come talk to everybody about this. I'm excited to be here with an audience of actual people today. In preparing this talk I've given it at home I don't know how many times, and I think my dog is sick of hearing about job processing, so I'm really glad to have people here to listen. So, obviously, I'm

Tanner, and I work for Tapjoy out of our Boston office. For those of you not familiar with Tapjoy, we're an in-app mobile ad platform. What does that mean? We have an SDK for mobile app developers that lets them put ads, offers, and other monetization options into their applications, so that users have ways to pay them that aren't just "give me cash." Along with that, we've got about 450 million monthly active users. That's a big number, and because we're an ad network it means we're dealing with lots and lots of requests: millions and millions of requests a minute. Things get hard. So what does that really boil down to? We're talking about scale. This is Big Ruby; we're here to talk about scale, so let's talk about scale. This brings

up interesting problems of scale. Anything we put into production goes out to this load immediately, which means the naive solution fails, not in weeks or months: the naive solution fails in minutes. We have to avoid single points of failure. We have to have distributed systems from the start. We have to think about hard things up front, immediately, every time. I think this is really fun; these are really great problems to get to go solve every day, and I'm excited to get to share some of the things we've been doing.

So let's set the scene a little bit. This all started way back in 2012. What were we all doing in 2012? We were upgrading our apps to Ruby 1.9, we were watching Gangnam Style mashups, and we were waiting for the Mayan apocalypse.

Outside of those things, things were going really pretty well. Everything was good; we were pretty happy with the way things were. Mostly good news: we're an ad network and our traffic is going up, way up. We're seeing month-after-month, quarter-over-quarter huge increases in traffic: fifty-plus-percent increases inside of six months. We're moving really fast and scaling up really big. This is great for us, because more requests means more money, and we like that.

But not everything is perfect. Our job system is starting to show some issues, predominantly during our high-traffic periods. Jobs are backing up, and not just a little bit: jobs are backing way up. We're seeing ten-thousand-plus messages in queues

that aren't clearing out. They'll stay that way until our peak traffic starts to trend back down, and finally we can get all caught back up. This is not awesome. Secondly, we have kind of an event processing system, similar to what Coraline was talking about just a few minutes ago, and much like she was saying, these are important things to us: this data is really useful. But we've discovered that we're losing small amounts of this data. We don't know how much, and we don't know exactly where in the pipeline we're losing it, but we're dropping some of it, and that's real bad. And for the worst news: all of this is our fault, because the job system and the event processing system are entirely homegrown, emphasis on grown.

These are the kinds of systems you discover one day are just there, and you can tell no one sat down and thought hard and planned and prepared and set up a design review. This system just kind of accreted code over years, and now it's there in front of you, and it's a problem. Particularly in the job system, we can't scale up individual jobs. We can't identify the one job that's going slow and say, "hey, let's make sure we can do more of those right now, let's task more servers to do that, let's task more processors to do that." We can't really do that in the way it's designed.

And event processing: let's talk a little bit about what event processing means to us. For us, these are data points in our application. These are things

like: a user has seen an ad, a user has clicked an ad, and, if we're lucky, a user has converted on an ad, which means we're getting money. This is good stuff; this is the core of our business analytics and business intelligence system. And we generate a lot of those, on the order of hundreds and hundreds of thousands per minute: five, six hundred thousand per minute. Lots and lots of these, and we're losing a very small percentage of them, but we don't know how many that is, and we don't know which ones we're losing, and that's bad. Part of the reason we don't know is the way the system is currently designed: it's basically syslog. We write messages out to syslog.
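A single event write of the kind just described might look like the following minimal sketch, using Ruby's standard-library Syslog and JSON modules. The event schema, field names, and the "ad_events" program name are invented for illustration; the talk doesn't specify the real format:

```ruby
require "json"
require "syslog"

# Serialize one analytics event as a single syslog-friendly line.
# The field names ("event", "ad_id", "at") are made up for this sketch.
def format_event(name, attrs = {})
  JSON.generate({ "event" => name, "at" => Time.now.utc.to_i }.merge(attrs))
end

# Write the event out to syslog, where custom log handlers can pick it
# up for aggregation and shipping to the data warehouse.
def log_event(name, attrs = {})
  line = format_event(name, attrs)
  Syslog.open("ad_events", Syslog::LOG_PID, Syslog::LOG_LOCAL0) do |log|
    log.info("%s", line) # "%s" guards against format characters in the data
  end
  line
end
```

Once everything downstream has to be reconstructed from flat log lines like these, it's easy to see why the pipeline is hard to observe and why lost messages are hard to count.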

We have some custom log handlers that bundle and aggregate and do things to the data, and then ship it off to our actual data warehousing system. But syslog is kind of a mess, and it's really not easy to tell what's going on in there, other than that sometimes you get exit codes that aren't zero, and that doesn't tell you much. So this is not awesome. It's time for us to do something about these problems. We're seeing all this growth, we're seeing all of this coming up ahead

of us; we can't stay this way. We've got to fix things. But change is hard, and where do you even start with a problem this big? We can't just dive in and go fix all of this at once; these are big systems with lots of data. So we kind of need to figure out what's important to us first, and what we really need to be approaching. What we came down to is: we're talking about message queues, and we're talking about job processing. And when you start talking about that, you're really talking about two components. First, a queue: something that takes data in on one side and lets data out on the other side. It's pretty simple. And secondly, you need something to take data out of that queue and do something meaningful to your application. Without that,

it's just data flowing through a queue, and that doesn't mean anything; we need something to make it mean something to us. So let's break this down a little bit. When we talk about queues, there are some choices, a lot of choices. You may be familiar with some of these; I wasn't familiar with all of them. I actually googled them the other day, and there's just a crazy number of them. I think most people in this room have probably heard of RabbitMQ, or have seen Redis used as a queue, or maybe Beanstalkd if you're really up on some of the cool things in Ruby right now. But there's just a ton of choices. How do we even figure out which of these things is useful and meaningful to us? We decide the same way we decide with

every problem in computer science, right? We get three choices and we only get to pick two of them. Great, we do this every time, but here we go again. In the case of queues, there are really three things that mattered more than anything else: we need durability, we need availability, and we need throughput. So let's break those terms down a little bit.

Durability: what do we really mean there? All we mean is, I want to put a message in a queue, and I want to be reasonably assured I'm going to get that message back out. We're talking about surviving a disk failure, or a process crash, or the simple kinds of things that just happen every day in real systems. We need to have an idea of how many of those we can see, and what the likelihood is that our messages survive.

Next, we're talking about availability. This is really critical. What we're talking about is: is this thing a single point of failure? Hopefully not, because we don't want single points of failure. We have big systems, but

[ ... ]
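The two components described in the talk — a queue that takes data in on one side and lets it out on the other, and a consumer that makes the data mean something — can be sketched in a few lines of Ruby. This uses the standard library's in-process Thread::Queue purely as a stand-in for a real broker like RabbitMQ, Redis, or Beanstalkd, and the "ad_click" event shape is invented for the example:

```ruby
require "json"

queue = Thread::Queue.new # stand-in for a real broker (RabbitMQ, Redis, ...)

# The consumer: pulls messages off the queue and does something
# meaningful with them -- here, tallying clicks per ad.
clicks = Hash.new(0)
worker = Thread.new do
  while (message = queue.pop) # pop blocks until a message arrives
    event = JSON.parse(message)
    clicks[event["ad_id"]] += 1 if event["event"] == "ad_click"
  end
end

# The producer: puts messages in on the other side.
3.times do |i|
  queue.push(JSON.generate("event" => "ad_click", "ad_id" => i % 2))
end
queue.push(nil) # sentinel so the worker loop can exit
worker.join

clicks # => {0 => 2, 1 => 1}
```

Swapping the in-process queue for a networked broker is exactly where the durability, availability, and throughput trade-offs above come in: a Thread::Queue has full throughput but no durability (a process crash loses everything in it) and no availability story at all.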

Note: the remaining 4,234 words of the full transcript have been omitted to comply with YouTube's "fair use" guidelines.