PyCon 2014

Real-time predictive analytics with scikit-learn and RabbitMQ

Michael Becker

Excerpt from the automatic transcript of the video generated by YouTube.

Hello everyone. We're gonna be hearing from Michael Becker, and his topic is real-time predictive analytics using scikit-learn and RabbitMQ. Thanks, everybody. So thanks for coming out. My name is Michael Becker. I work for the Data Analysis and Management

Ninjas at AWeber. AWeber is an email service provider located just outside of Philadelphia with over 120,000 customers. If you've never heard of AWeber before, that's cool: we're the company with the empty booth downstairs. So, anybody else from the

Philadelphia area? Okay. So I'm also the founder of the DataPhilly meetup, with over 600 members. If you're not already a member, well, shame on you. You can find me on Twitter, I'm @beckerfuffle. My blog is beckerfuffle.com, and I'll probably

post a blog post on my website with follow-up material. Oh, and you can also find these slides and other material on my GitHub. So, working on the DAMN team at AWeber, I've introduced several predictive algorithms into our application. My co-workers think that

I'm some sort of math superhero, reading scholarly papers by day and fighting crime by night. The truth is much more sinister. Fortunately for me, I don't have to be a math genius to look like a superhero; I just have to use scikit-learn. So while I'll

cover some math in this talk, I'll mainly be keeping things high level, and this is because I'm not a math genius. The developers of scikit-learn are math geniuses, and if I get any of the math wrong, feel free to call me out on the Internet. So this talk

will cover a lot of the logistics behind utilizing a trained scikit-learn model in a real-life production environment. I'll start off by giving a brief overview of supervised machine learning and text processing with scikit-learn. I'll cover

how to distribute your model. I'll discuss how to get new data to your model for prediction. I'll introduce RabbitMQ and why you should care. I'll demonstrate how we can put all this together into a finished product. I'll discuss how to

scale your model, and finally I'll cover some additional things to consider when using scikit-learn models in a real-time production environment. So for the demo in this talk, I'm gonna demonstrate a supervised learning algorithm, so let's start off

by defining what the training process looks like for a supervised model. You start off with some input that may or may not be numerical; for example, you might have text documents as input. You also have labels for each piece of training data. You vectorize

your training data, which means converting it to a numerical form. Then you train your machine learning algorithm using your vectorized training data and the labels as input. This is often referred to as fitting your model, and at this point you have

a model that can take a new piece of unlabeled data and predict the label. So again, you need to vectorize your new data point, then you input it into your trained algorithm, and your algorithm will spit out a predicted label for this new data point.
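
The vectorize, fit, and predict steps described above can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the speaker's actual code: the toy sentences and labels are made up, the vectorizer counts short character sequences, and `LogisticRegression` is an arbitrary classifier choice, since the transcript excerpt does not name the one used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up for illustration): raw text plus one label per text.
train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "hello, how are you doing today",
    "el rápido zorro marrón salta sobre el perro perezoso",
    "hola, cómo estás hoy",
]
train_labels = ["en", "en", "es", "es"]

# Step 1: vectorize, i.e. turn each text into a row of numbers.
# Assumption: counts of character sequences of length 1 to 3.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_texts)

# Step 2: fit the model on the vectorized training data and the labels.
# Assumption: LogisticRegression stands in for whatever classifier you prefer.
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

# Step 3: vectorize the new, unlabeled text with the SAME fitted vectorizer,
# then ask the trained model for a predicted label.
new_texts = ["where is the dog"]
X_new = vectorizer.transform(new_texts)
predicted = classifier.predict(X_new)
print(predicted)
```

Note that the new text goes through `transform`, not `fit_transform`: the vectorizer has to map new data onto the same columns it learned from the training data, otherwise the model's inputs would not line up.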

There are other types of machine learning algorithms, but today I'll be concentrating mainly on supervised learning. So in this talk I'm going to demonstrate one of the first models I ever created, and it's a model that predicts the language of input

text. At the time, I needed a way to identify the language of content created by our customers. So to create this model, I used 38 of the top Wikipedias, based on the number of articles, and I dumped several of the most popular articles from

each of these Wikipedias. So going back to my diagram from before, the first thing we need to do is vectorize the text. To start off, I converted the wiki markup to plain text, and I had read about an approach online that worked well for language classification.

Basically, it involves counting all the combinations of n-character sequences in a data set. These are commonly called n-grams, and n-grams are a lot easier to understand if you visualize them, so let me show you an example. So to generate this word cloud, I

downloaded six of H. G. Wells's books from Project Gutenberg, and here's what The War of the Worlds looks like if we visualize the raw word counts for each word. You can accomplish this with one line of scikit-learn. So the size of the word is based on the number

of times a word shows up in the book. One thing you'll notice looking at the text is that the word "Martians" occurs about as frequently as the words "people" and "time", and this is counterintuitive: you'd think that Martians are much more important to The War

of the Worlds than these other words. Fortunately, there's another algorithm called tf-idf that can help solve this problem. So tf-idf stands for term frequency-inverse document frequency, and it reflects how important a word is to a specific document.
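
To make the idea concrete, here is a small worked example in plain Python. It is a sketch under stated assumptions: the three "documents" are made up, and it uses the simplest textbook weighting, raw count times log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalizes each row by default, so its numbers would differ, but the effect is the same: words that appear in every document are pushed down, and words specific to one document stand out.

```python
import math
from collections import Counter

# Three tiny made-up "documents", already split into words.
documents = {
    "doc_a": "the martians came and the people ran".split(),
    "doc_b": "the people had time".split(),
    "doc_c": "time passed and the people slept".split(),
}


def inverse_document_frequency(term, docs):
    """log(N / df): N documents in total, df of them contain the term."""
    containing = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / containing)


def tf_idf(doc_name, docs):
    """Weight each word in one document by its raw count times its idf."""
    counts = Counter(docs[doc_name])  # term frequency: raw word counts
    return {
        term: count * inverse_document_frequency(term, docs)
        for term, count in counts.items()
    }


raw_counts = Counter(documents["doc_a"])
scores = tf_idf("doc_a", documents)

# Print words from highest to lowest tf-idf score, next to their raw counts.
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term:10s} count={raw_counts[term]} tf-idf={scores[term]:.3f}")
```

By raw count, "the" is the most frequent word in `doc_a`, but because it appears in all three documents its idf is log(3/3) = 0, so its tf-idf score drops to zero. "martians" appears only in `doc_a`, so it ends up among the highest-scoring words.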

[ ... ]

Note: the other 2,328 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.