Presentation
HTML (click to download)
Video
Transcript
Excerpt from the automatic transcript of the video generated by YouTube.
Hello there. You all hear me okay? Good. A couple of things I wanted to say before I started. First, I was very glad that there was actually an hour's lunch break between the last talk and mine, because I don't think that my comedic timing is going to be quite as spot-on as that, so it's good to have a little bit of space from that act to follow. Thank you, Kelsey, for the introduction. And also just to say that it is quite an honor to be up here; it's a bit humbling to share this stage with some people whose names I've been following for a long time.
I don't consider myself much of an expert in anything, but I have been playing around with Go and trying to do some data snarfing and data crunching for the last year or so, and I'm going to share what me and my small team of folks have been working on. Hopefully you'll find something useful and enjoyable in it.
So this is "Adventures in Snarfing and Crunching", starring Heka. I am an engineer for Mozilla Cloud Services and, as Rob Pike was saying earlier today, we used to just be the Mozilla Services team, but now we're Cloud Services, because that's what you do these days. The interesting thing about cloud services is that even though it's cloud, you still need cloud servers. So yes, we have a lot of them; some of them are physical, some of them are virtual, but we have to have our services run somewhere to support our phone operating system and our browser.
And these cloud servers produce a lot of cloud data, and this was a problem that I started to tackle about a year and a half ago. One of my bosses came to me and said: Rob, you know, we're writing all these services, they're producing all this data, they're running on all these servers, which are also producing all this data; maybe you want to take a look at that problem
and see how we might deal with this fire hose and make some sense of it.
There are many flavors of cloud data, or as we like to call it, data, running on these servers. We've got log files, and along with the many flavors of data there are many flavors of tools to deal with this data. So you've got the general logging problem space, with tools like rsyslog, syslog-ng, Logstash, Splunk, etc. You've got the numeric time-series data toolkit: statsd, Graphite, Ganglia, etc. You've got sliding-window complex event processing, simple event processing, anomaly detection, real-time processing. You've got monitoring tools, for when you're doing your real-time processing and you notice that someone needs to be notified about something, and the tools that go along with that: Nagios, Zenoss, etc.
You've got app instrumentation: you've got New Relic tools, and you've got the bespoke timing and other data gathering that the engineers are actually putting in your application service code. We've got metrics that the servers themselves are generating, the stuff that your ops people care about: how much RAM am I using, what do my file systems look like, what does the CPU load look like, etc., etc., etc. There's a lot
of different types of data.
So when we were looking at this problem in the general sense, we realized that even though there are lots of different types of data, all of this adheres to basically one simple pattern: you get data, you transform and/or transport that data, and then you deliver it. That's obviously a bit reductive, but if you look at all of the different tools that we saw in this slide, they all basically follow that set of steps. So we started by playing around with some of those tools and experimenting with them, and we found some interesting ideas in there, and we decided to take it a little bit farther and drill into more detail on this pattern.
So, drilling into those three steps in a little bit more detail,
when we talk about getting data, the first three here have to do with that. Getting data is first accessing some stream of bits from somewhere, be it over the network, on the file system, or generated by a process that's running on your machine: some stream of data coming in. Then you want to identify and split that data on record boundaries. Now, you might have UDP packets, and it's one record per packet; that's the simple, trivial case. But you might have streams of data coming in that have specific framing around them, and that's a case where you need to divide the stream up into records. And then, if you're going to do anything meaningful with this data, you need to parse those records, do some decoding, and convert them into a common format that you can actually work with inside your tooling.
Once you've got this common format, there's some work that you can actually do. You can route your records to the appropriate consumers. Some of those consumers are going to be doing crunching and processing on this data in real time, and as you process the data you might be generating new data, new messages, again in that common format, because we're inside our controlled zone here. But eventually you do want to talk to the outside world again, so you start converting from the common format to some other serialized format, so that you can finally push it out to the world again in some form or another.
That is the basic
design that we ended up putting together for Heka.
Heka, just as an FYI, is specifically targeting the problem of doing real-time data processing: grabbing data, pipelining it, and letting you examine it as it flows. It's meant to interoperate with and deliver to data warehousing solutions and offline analytics, but it's not meant to replace those, so it would be used in conjunction with some kind of big multi-terabyte back end where you might be pushing your data.
But you'll notice this first part, inputs, splitters, decoders: that covers the data input process. We found that some of the other tools we were looking at had this idea of breaking a pipeline down, but we refined it, I think, to a pretty good sweet spot for being able to manage these things in a way that reuses the code well.
One of the tools we looked at originally was Logstash. Raise your hand if you use Logstash. Logstash is a great tool; it solves some really wonderful problems. We actually started out prototyping the system using Logstash, and we stole a lot of really good ideas from it. Unfortunately, it wasn't up to what we wanted to do with it as far as scaling, and we looked at trying to improve it, and we didn't think that it was actually
[ ... ]
Note: the remaining 3,243 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.
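The stages the speaker walks through in the transcript (access a stream, split it into records, decode into a common format, route, process, encode, deliver) can be sketched in a few lines of Go. This is only an illustration of the pattern, not Heka's actual API: the `Message` type, the `split`, `decode`, `encode` and `runPipeline` functions, the line-per-record framing, the `type key=value` input format and the `type{key=value}` output format are all assumptions made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// Message is a hypothetical common format that every record is decoded into.
type Message struct {
	Type   string
	Fields map[string]string
}

// split divides a raw stream into records. The framing assumed here is
// one record per line; blank lines are skipped.
func split(stream string) []string {
	var records []string
	scanner := bufio.NewScanner(strings.NewReader(stream))
	for scanner.Scan() {
		if line := strings.TrimSpace(scanner.Text()); line != "" {
			records = append(records, line)
		}
	}
	return records
}

// decode parses a record of the assumed form "type key=value key=value"
// into the common Message format.
func decode(record string) (Message, error) {
	parts := strings.Fields(record)
	if len(parts) == 0 {
		return Message{}, fmt.Errorf("empty record")
	}
	msg := Message{Type: parts[0], Fields: map[string]string{}}
	for _, part := range parts[1:] {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return Message{}, fmt.Errorf("malformed field %q", part)
		}
		msg.Fields[kv[0]] = kv[1]
	}
	return msg, nil
}

// encode serializes a Message into an output format (made up for this
// sketch), with fields sorted by key so the result is deterministic.
func encode(msg Message) string {
	keys := make([]string, 0, len(msg.Fields))
	for key := range msg.Fields {
		keys = append(keys, key)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, key := range keys {
		pairs = append(pairs, key+"="+msg.Fields[key])
	}
	return msg.Type + "{" + strings.Join(pairs, ",") + "}"
}

// runPipeline wires the stages together: split, decode, route, process,
// encode, and finally collect what an output stage would deliver.
func runPipeline(stream string) []string {
	errorsPerApp := map[string]int{}
	var delivered []string
	for _, record := range split(stream) {
		msg, err := decode(record)
		if err != nil {
			continue // drop records that cannot be decoded
		}
		// Routing: "error" messages also go to a counting consumer.
		if msg.Type == "error" {
			errorsPerApp[msg.Fields["app"]]++
		}
		delivered = append(delivered, encode(msg))
	}
	// Processing can generate new messages in the same common format.
	apps := make([]string, 0, len(errorsPerApp))
	for app := range errorsPerApp {
		apps = append(apps, app)
	}
	sort.Strings(apps)
	for _, app := range apps {
		summary := Message{Type: "error_count", Fields: map[string]string{
			"app":   app,
			"count": fmt.Sprint(errorsPerApp[app]),
		}}
		delivered = append(delivered, encode(summary))
	}
	return delivered
}

func main() {
	stream := "request app=web ms=12\nerror app=web code=500\n\nerror app=web code=502\n"
	for _, line := range runPipeline(stream) {
		fmt.Println(line)
	}
}
```

Keeping each stage as a separate function mirrors the reuse argument made in the talk: a different framing only needs a different `split`, and a different destination format only needs a different `encode`, while everything in between works on the common format.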