PyCon 2014

Improving Hacker News with machine learning

Ned Jackson Lovely

Presentation

Video

Transcript

Excerpt from the automatic transcript of the video generated by YouTube.

...and the mic's on. Hi. We have Ned Jackson Lovely on stage, and he's going to talk to us about machine learning. Please welcome him.

Hi everyone, how y'all doing? So my talk is Enough Machine Learning to Make Hacker News Readable Again. Yes, that's a little snarky. I'm sorry.

So what I want to talk about... I should introduce myself too, shouldn't I? And plug in my pointer. All right, so this is me. Oh, by the way, the slides are up on my website, njl.us, and I'm @njl on the Twitters. You can email me at njl@njl.us too, I don't know, ask me questions or yell at me or whatever.

What I tried to do here is a simple, achievable project, something that is machine learning related but that I think all of us are capable of: a personalized filter for Hacker News. You take Hacker News, sort the articles into what I'm going to call dreck and not dreck, and then you read the not dreck and you feel much better about the world in general.

So the whole point of my talk is: I can machine learn, and you can too. Machine learning has this crazy name. It feels like artificial intelligence, like the machine is learning, man, and that's a lot of intimidation. There is a lot of science involved. There's a lot of terminology. You're going to spend a lot of time googling, and a lot of time reading Wikipedia articles and survey overviews and stuff. But the essence of machine learning is actually approachable engineering, with approachable tools you can use.

Machine learning is just applying statistics to big piles of data. That's really all it is. We gave it this wonderful name, but when you get down to it, you're building statistical models of a bunch of input data, and you're using them to understand that data better, or to make predictions, or to go further with the data. So yeah, there's terminology, but I'm going to give you a bunch of it and then try to set you loose.

Here's the workflow when you're doing machine learning. You have to get the data, which is usually not as easy as you think. You need to engineer the data, which means you take this messy real-world stuff that you scraped off the web or got out of some database and turn it into nice little NumPy arrays with well-formed values. Then you need to train and tune the model. This is the science part, the training and tuning of the model, but other people have written libraries, and you can just use their libraries. It's great. And then there's applying the model to new data, where once again you just use the libraries.

If you talk to folks who do this for a living, who have doctorates, who have spent a lot of time working on machine learning, they spend all their time getting the data and engineering the data. They burn a lot of CPU cycles, but not a lot of their own time, training and tuning models. And then just applying the model is very straightforward.

I'm going to be talking mostly about scikit-learn. The documentation for scikit-learn is fantastic. You're a couple of pip installs away from having a fully functioning, awesome machine learning library sitting on your machine. The hardest part is the installing, SciPy in particular. You've got a tutorial, user guides, API docs, and this flowchart.
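The four workflow steps described above (get the data, engineer it, train the model, apply the model) can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the speaker's code: the titles and labels are invented, and the choice of `CountVectorizer` with `MultinomialNB` is an assumption made only to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# 1. Get the data. These titles and labels are invented for illustration;
#    True means "not dreck" (worth reading), False means "dreck".
titles = [
    "numpy array tricks for python",
    "python testing tutorial",
    "celebrity founder gossip roundup",
    "startup gossip and rumors",
]
labels = [True, True, False, False]

# 2. Engineer the data: turn messy text into a numeric matrix of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)  # sparse matrix, one row per title
dense = X.toarray()                   # the same counts as a plain NumPy array

# 3. Train the model (the library does the statistics).
model = MultinomialNB()
model.fit(X, labels)

# 4. Apply the model to new data, reusing the already fitted vectorizer.
new_titles = ["python array tutorial", "founder rumors"]
predictions = model.predict(vectorizer.transform(new_titles))

for title, keep in zip(new_titles, predictions):
    print(title, "->", "not dreck" if keep else "dreck")
```

Note that the same fitted `vectorizer` is used for both training and new titles, so a given word always maps to the same column.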

This flowchart is one of the coolest things ever. The hard part with machine learning is that there's too much terminology, there are too many things you can do, and there are too many options. People will say, "Oh, use support vector machines for everything," and they're probably right, but who knows? Maybe you're trying to do something else. So this chart gives you some steps you can go through to figure out what you want to start googling. It's really that straightforward.

You say, "Oh, I've got fewer than 50 samples? Never mind." And so you step your way through this flowchart and figure out what you're trying to do with your data, and you can start figuring out which parts of scikit-learn you need to research more and what you need to find a couple of papers on. There are lots of good papers that give you an overview of the field without bogging you down too much. And when you don't understand the math, go "blah blah blah" and keep reading. That's my advice.
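The idea of stepping through the flowchart can be written down as a small decision function. The sketch below is a simplified paraphrase of the top-level branches of the "Choosing the right estimator" chart in the scikit-learn documentation, reconstructed from memory rather than copied from it. The function name, the `goal` values, and the returned strings are all hypothetical.

```python
def suggest_starting_point(n_samples, goal, has_labels=False):
    """Return a rough family of techniques to start reading about.

    goal is a hypothetical parameter: "category", "quantity", or "explore".
    """
    # The first question on the chart: is there enough data at all?
    if n_samples < 50:
        return "get more data"
    # Predicting a category: labels decide between the two families.
    if goal == "category":
        return "classification" if has_labels else "clustering"
    # Predicting a number rather than a category.
    if goal == "quantity":
        return "regression"
    # Just looking at the data.
    if goal == "explore":
        return "dimensionality reduction"
    return "unclear; rethink the question"


# A labeled "dreck / not dreck" filter with a few thousand articles:
print(suggest_starting_point(3000, "category", has_labels=True))
```

The returned string is only a search term to start from; the chart itself goes on to name specific estimators within each family.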

So something that we hear a lot about in machine learning, and that I think is worth bringing up and making people aware of, is the concept of supervised versus unsupervised learning.

Supervised learning is when you have input data and then there's some output value. For me, the input data was a bunch of stuff I scraped off of Hacker News and the sites linked from Hacker News, and the output was: is this crap or not? So a boolean value, true or false. True is something I want to read; false is something I wish didn't exist. That's classification. That's supervised learning: you have a bunch of labels or output values, and you try to turn input data into those labels or output values.

Unsupervised learning is more about understanding your data. That would be: I have thousands of Hacker News articles. Can I visualize them somehow? Can I group them by subject? I don't know how many subjects there are, but can I find subjects and group the articles? So that's unsupervised learning, trying to understand the data a little bit better. There's a lot of power to it and a lot of interesting bits there, but I'm going to set that aside for this talk, and I'm pretty much only going to be talking about supervised learning.

There are a lot of really accessible books on the topic. Natural Language Processing with Python and Programming Collective Intelligence, the O'Reilly books, aren't directly dealing with the topic of this talk, but

[ ... ]
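The supervised versus unsupervised distinction drawn in the excerpt above shows up directly in how scikit-learn estimators are fitted: a classifier is given inputs and labels, while a clustering estimator is given inputs only. The following is a minimal sketch under stated assumptions, not the speaker's code: the titles and labels are invented, and `KMeans` is told to look for two groups, whereas in the situation described in the talk the number of subjects is not known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented example titles; not data from the talk.
titles = [
    "python numpy tutorial",
    "python numpy tricks",
    "startup funding news",
    "startup funding rumors",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

# Supervised: inputs X plus one label per title (True = want to read).
want_to_read = [True, True, False, False]
classifier = MultinomialNB().fit(X, want_to_read)
prediction = classifier.predict(vectorizer.transform(["numpy tricks"]))[0]
print("want to read 'numpy tricks'?", bool(prediction))

# Unsupervised: inputs X only, no labels. n_clusters=2 is an assumption.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = clusterer.labels_
for title, group in zip(titles, groups):
    print("group", group, ":", title)
```

The cluster numbers themselves carry no meaning; only which titles end up sharing a group does.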

Note: the remaining 2,857 words of the full transcript have been omitted to comply with YouTube's "fair use" guidelines.