PyCon Australia 2013

Designing data flows with Celery and SQLAlchemy

Roger Barnes  · 




Excerpt from the automatic transcription of the video generated by YouTube.

I'm here to talk about data processing today. Thank you all for braving it: data processing is the last talk of PyCon before the lightning talks, so I'm going to try to be engaging enough not to put you all to sleep, and if you already know a little bit about SQLAlchemy or Celery, I'll try to shout at the big junctures to wake you up for the next section.

I'm Roger Barnes; you can find me on Twitter and by email, and I'll put my slides on SlideShare after today. I'm going to talk a bit about data warehousing and data integration. Looking at the program today, there are a lot of people here who do data a lot more than I do, so I'm not going to try to cover much in the way of big data and analytics. What I'm really looking at is a framework for doing data processing within a smaller environment. That's not to say you couldn't scale this stuff up, but this is really about an experience I had taking disparate systems and managing their data together using a pure-Python framework. To do that, we're going to look at a bit of SQLAlchemy and also Celery, which is an unlikely contender here, and how I've used them.

I've been doing software development, among other things, for quite a while now. I spent eleven years at a business intelligence vendor; while I wasn't directly involved with the software they produced, I certainly cut my teeth on a lot of these kinds of concepts at the time.

I'm currently contracting, and this particular talk is about a specific reporting system I built for one of those contracts.

First of all, for those who aren't familiar, a bit of an introduction to the idea of data warehousing and data integration. There are lots of different systems in an organization: different data quality levels, different data life cycles, different departments collecting and doing different things at all sorts of levels. What that means is that you end up with a very disparate, heterogeneous set of data, which is also potentially used in different, ambiguous ways that aren't actually consistent across the organization. What we really want is some sort of centralized reporting

that is timely, accurate, unambiguous, and complete, and that ideally isn't impacting production systems. A lot of this is about making sure you don't have an Excel empire, where Bob in marketing and Jane the IT director are each getting their little bit of data from their own little spot: they might be hitting the systems too hard, or they might be getting it completely wrong, reporting on data they thought was accurate or complete when they're not getting the right information at all.

Looking at a data warehouse at the top level, it really is just a central repository of data that integrates all of these disparate systems. Beneath that there's a huge amount of theory and genuinely good practice in data warehousing that I'm not going to go into today; that concept of integrating everything together is what I'm going to focus on.

This particular slide isn't designed to be readable, so I'm not going to

apologize for it. It's a slide from the School of Data, and it shows the different skill sets involved in what's called the data processing pipeline: a very large range of things, from understanding and analyzing data through to governance, compliance, and so on. What I'm going to focus on is the middle bit of that diagram, the extraction and transformation of data. In other words (and this is a nicer diagram, I think): extracting from multiple systems, transforming that through a staging layer, and then putting it into a central database for consumption. We're all here for Python, and Python can help do this; I'm probably preaching to the converted here.
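The extract → staging → central-database flow just described can be sketched with SQLAlchemy Core. This is a minimal sketch, not the speaker's actual system: the table names, columns, and the unit-normalising transform are all invented for illustration, and an in-memory SQLite database stands in for the real source and target.

```python
# Sketch of the extract -> transform -> staging pattern with SQLAlchemy Core.
# All names here are illustrative; a real system would use separate engines
# for the source system and the warehouse.
from sqlalchemy import (Column, Integer, MetaData, Table,
                        create_engine, select)

engine = create_engine("sqlite://")  # stands in for real source/target DBs
meta = MetaData()

source = Table("source_orders", meta,
               Column("id", Integer, primary_key=True),
               Column("amount_cents", Integer))
staging = Table("stg_orders", meta,
                Column("id", Integer, primary_key=True),
                Column("amount", Integer))

meta.create_all(engine)

def extract(conn):
    """Pull rows out of the source system."""
    return conn.execute(select(source)).all()

def transform(rows):
    """Normalise units on the way into staging."""
    return [{"id": r.id, "amount": r.amount_cents // 100} for r in rows]

def load(conn, rows):
    """Write the transformed rows into the staging table."""
    if rows:
        conn.execute(staging.insert(), rows)

with engine.begin() as conn:
    # Seed the fake source, then run one pass of the pipeline.
    conn.execute(source.insert(), [{"id": 1, "amount_cents": 1250},
                                   {"id": 2, "amount_cents": 300}])
    load(conn, transform(extract(conn)))
    total = conn.execute(select(staging)).all()
```

Keeping `extract`, `transform`, and `load` as separate functions is what later lets the flow be orchestrated independently of the logic inside each step.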

Python is really good for rapid prototyping, with lots of potential for code reuse, and there are a ton of libraries out there: whether you're parsing XML, scraping web pages, reading CSV files, or reading out of databases, there's likely a library somewhere to get at that data. Finally, Python is quite good for decoupling, so the data flow you're managing doesn't have to be tied to the actual processing, and the business logic doesn't have to live with either of them.

Good design allows you to separate those three concerns quite well. In terms of existing solutions, I did a survey before I embarked on this project, and there's not a whole lot in the Python space around business intelligence.

There are some things out there, and I'm happy to hear about more of them if I've missed any. People do tend to roll their own, quite often because the business problem is different each time, and the domain, the scale, or the shape of the data really informs the decision you might make about a solution. There is a recently released framework called Bubbles (formerly Brewery); databrewery.org is where you can find it, and there's a link on the resources slide at the end. It's basically a framework that lets you build up operations and patterns that you can string together in a pipeline, so it's not unlike what I'm going to talk about. I haven't dug very deeply into it, though, so I can't really give you a comparison at this point.

I'll talk quickly about ways to move data around. Often you have no choice about the source data you're given, but once you have your data you could be looking at things like flat files: CSV, XML, or what have you.
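Reading one of those flat-file sources needs nothing beyond the standard library. A minimal sketch, with invented column names and an in-memory buffer standing in for a real exported file:

```python
# Pulling rows out of a CSV source file with the stdlib csv module.
import csv
import io

# Stands in for a file handle onto an exported flat file.
flat_file = io.StringIO("id,amount\n1,1250\n2,300\n")

# DictReader yields one dict per row, keyed by the header line;
# values arrive as strings, so coerce them on the way in.
rows = [{"id": int(r["id"]), "amount": int(r["amount"])}
        for r in csv.DictReader(flat_file)]
```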

[ ... ]

Note: the other 2,876 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.