PyCon 2014

Comparing the performance of the different scraping libraries for Python

Katharine Jarmul




Excerpt from the automatic transcription of the video generated by YouTube.

Okay, ready? Okay, hello everybody. Up next we've got Katharine Jarmul, who's going to give us a performance and accuracy review of the top Python scraping libraries. Please welcome her. Hi everybody, I'm here for the Python scraping showdown; hopefully

you are too. If not, please see the directory in the hallway. My name is Katharine Jarmul, also known fondly as kjam. A little bit about me: I've been building scrapers since 2010, after Asheesh inspired me; I went to his scraping tutorial. If you get bored of

this talk, he's giving a rather fun talk with Karen a couple of rooms down. Evidently my Google slides are a little delayed. A little bit more about me: I am the PyLadies co-founder, or one of the PyLadies co-founders, and I'm happy to see a lot of

PyLadies here with me today. Let me try to move us along. Apologies, evidently everybody is on the same network, imagine that. Let me see if I can... apologies again. Well, Mac, you have failed me. Not sure if there's a better way to go about this. Um, yeah, sure,

sorry, everyone. Okay, all right, let's see if I have better luck here. Apologies again. Okay, so, all right, let's get started here; apologies again for that delay. Hopefully I can speed us up a bit and get us going. Okay, so I am a PyLadies co-founder, and I am

relocating to Berlin in a couple of days, so if you live in Berlin or know somebody I should meet, please come say hi. So, why scrape the web? This is becoming more and more of a question: there are a lot of public APIs, and there are lots of REST- or JSON-

enabled endpoints, both public and not so public but easily found upon investigating websites. There are also a lot of well-maintained open-source API libraries, so those, again, allow us to scrape Twitter, Facebook, whatever it is we're attempting

to use, without really using the traditional scraping libraries for Python. Selenium is still your best bet if you need anything that does JavaScript interaction or that's loaded after the DOM, so you're kind of a little bit shackled there. There are

plenty of Node libraries out there, but in Python that's really all you have. Albeit with all those caveats, I still think that there are a lot of sites where you can build a traditional scraper and scrape content that way, and I still find it

very fun to build those for my own use. So, what this talk is going to cover: I'm going to look at performance differences between lxml and Beautiful Soup, and I'll go into a little bit more of why I narrowed it down to those. I'm going to talk about finding elements

with Selenium: for a lot of us who've used Selenium, there are many options for finding elements, and I'm going to talk about which is the fastest. And I'm going to take a look at Scrapy, or "scrap-ee," I'm not quite sure of the pronunciation, and see how

fast we can go with Python and scraping the web. So, a bit of a disclaimer: when I was first putting together the tests for this, I wanted to use quite a lot of scraping libraries. What I found as I began to look at it is that a lot of them use similar dependencies,

and so I decided that it might be best to home in on some of the most popular and most widely used ones, which is what led me to lxml and Beautiful Soup. In this I also kind of wanted to find some broken pages: I've been scraping using lxml for

some years now, and it would often happen that I'd come across a page where Beautiful Soup was truly my only option to accurately parse the page. I think HTML5 is changing this landscape and hopefully allowing us to have more standardized web content,

which allows us to scrape the web utilizing the ElementTree APIs and other similar things. If you do of course find broken pages, Beautiful Soup and html5lib also have quite a lot of helpers for scraping, you know, truly broken pages that don't follow

[ ... ]
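The broken-pages point above can be sketched without either library: a strict XML-style parser rejects malformed markup outright, while a lenient, Beautiful Soup-style parser recovers what it can. This is a standard-library-only illustration (the talk's actual lxml/Beautiful Soup code is not in this excerpt), and the snippet of broken HTML is an invented example.

```python
# Sketch: strict ElementTree parsing vs. lenient HTML parsing (stdlib only).
# Beautiful Soup and html5lib behave like the lenient path, recovering data
# from markup that strict XML-style parsers reject.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BROKEN_PAGE = "<html><body><p>first<p>second<br></body></html>"  # unclosed <p>, void <br>

def parse_strict(markup):
    """Strict parse; returns None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None

class TextCollector(HTMLParser):
    """Lenient parser: tolerates unclosed tags and collects text anyway."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def parse_lenient(markup):
    collector = TextCollector()
    collector.feed(markup)
    return collector.chunks

print(parse_strict(BROKEN_PAGE))   # None: the unclosed tags break strict parsing
print(parse_lenient(BROKEN_PAGE))  # ['first', 'second']
```

In practice Beautiful Soup lets you pick among backends (including `html.parser`, `lxml`, and `html5lib`), which differ in exactly how much broken markup they repair.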

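The benchmark methodology the talk describes, timing the same extraction task across parsers, can also be sketched with the standard library alone. The page contents, repetition count, and `html.parser`-based extractor below are illustrative assumptions standing in for the lxml and Beautiful Soup code the excerpt omits; with those libraries installed, the same `timeit` harness would simply be pointed at each one in turn.

```python
# Sketch of the benchmark methodology: time repeated runs of one
# link-extraction task with timeit, as one would when comparing
# lxml against Beautiful Soup. html.parser keeps the sketch stdlib-only.
import timeit
from html.parser import HTMLParser

PAGE = "<html><body>" + "<a href='/item'>item</a>" * 200 + "</body></html>"

class LinkCollector(HTMLParser):
    """Collect href attributes from every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

def extract_links(markup):
    collector = LinkCollector()
    collector.feed(markup)
    return collector.links

# Time many repetitions of the same parse, one candidate library at a time.
seconds = timeit.timeit(lambda: extract_links(PAGE), number=200)
print(f"html.parser: {seconds:.3f}s for 200 parses")
print(f"links per parse: {len(extract_links(PAGE))}")
```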
Note: the remaining 1,944 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.