PyCon Australia 2013

Detecting Duplicates in Big Data with Python

Andrew Rowe

Transcript

Excerpt from the automatic transcript of the video, generated by YouTube.

Hello. Our next presenter is a senior developer with the Australian Bureau of Statistics with more than 20 years of computing experience. He has presented before, at PyCon Australia 2012, and he's here today to share his experiences in big data deduplication

and data matching using Python. Please welcome Andrew Rowe. Thank you for the nice introduction. First I'd like to start off with a little clip which sort of shows you some of the issues we're trying to deal with here at the Australian Bureau of Statistics. So here we

go. Is the sound turned off? Have you got the sound turned on? Okay, hold on, it's a lot funnier with the sound. I've got it plugged in; I can hear it on my computer. Malkovich, Malkovich, Malkovich, Malkovich, Malkovich, Malkovich, Malkovich, Malkovich. All right,

so that's good. So that's a little bit of the difficulty we have at the Australian Bureau of Statistics of telling people apart: you know, from afar they all look the same. All right, let's skip to the presentation. Okay, I'm going to talk about big data deduplication

and data matching using Python. So this is where I work, ABS House. We have been stating the best ways to get a six-pack since 1905, and I think some of our size has shown the best way to get a six-pack for most people is texture. That's not quite true, actually.

The Australian Bureau of Statistics is our national statistical agency. We're responsible for the census, CPI, unemployment, balance of payments and so many other statistics you just wouldn't believe it; if you came to our site you'd just be blown away. Okay, so I've been working

for the last few years on a project called census data matching. As a sort of overview, it's where we're taking data from the census and linking it together with other things so we can have, like, an enhanced census. So we've been linking the census to Indigenous

deaths, and we were going to do the Australian Cancer Database, but I was told just before I left to come here that actually we'd worked, so there's still information on our website; I don't believe that you see it. But we're doing other things. So another thing

we do is we take the data from the census, and then we have another survey after the census to see how accurate the census was, which is called the PES, or the Post Enumeration Survey. And the other part of it is detecting duplicates. So, just to make sure that you

know that: confidentiality is extremely important to the Australian Bureau of Statistics, to ensure everyone loves filling out forms and can be sure that we don't fib and lie and tell everyone what you're doing. So during the census processing, it happens

a certain time after the census, when we've got your forms; then we have your name and address. And in the processing period, which was just at the end of last year, we take all the paper, all the information, all the data sets, every bit of computer code, every

single thing which has your name and address, and we pulp it: the paper gets pulped and the data sort of gets disappeared. Then all the machines that were used are wiped, the data center is closed down and everything gets moved away, so you can be sure that none

of your name and address is ever available for anyone to do anything nefarious with. And that means that when we're doing things that we need to do with your name and address, we've only got a legislatively defined period and we have to finish by that time. We can't

just go, at like two minutes to midnight, "No, no, no, no, keep it there for another couple of seconds, I'm almost done." It's just that it's gone. So, linking the census: it's one of the few times we actually have information on every person in Australia,

well, just about everyone; there's a couple of people always seen them yourself. And identifiers: we carefully store them, like I said before, on a dedicated server away from the idea. So there's a server down in the service center down in Melbourne, which is separate

from our corporate infrastructure, and only certain people are allowed to look at it. It's not connected to our corporate infrastructure, so if you try and go to the server, you can't copy and paste, you can't connect to file services, so

it's quite locked down. Okay, so what do we do with linking? Why would we do that? One of the examples in linking is that we want to check on school leavers and how they're progressing. So we've

got a survey of school leavers and we want to see what happened to those people in the census. So what we do is we go and get the name and address from the census and the name and address from the school, and we match them up, and then we can see where that school

leaver was when the census was undertaken, and we can sort of make a lot of information about that. Okay, so this is the interesting part. Did you know that pythons actually caused 250,000 deaths in Australia last year? I'm sure you didn't know

that. I thought you might have guessed redback spiders or something like that, but it's actually pythons. And the reason for that is what we call the Post Enumeration Survey. So after the census we go back and we have another survey, which is

like the census but more intensive. So we pick people who are representative of the census and we survey them, and then we go back and we try to match each person in this survey to the census, and this gives us a good indication of how accurate the census

[ ... ]
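
The name-and-address matching described in the excerpt (a school-leaver survey or the Post Enumeration Survey matched against census records) can be illustrated with a small sketch. This is not the ABS's method, which the excerpt does not describe; it is a minimal illustration that assumes records are plain (name, address) tuples, scores them with the standard library's `difflib.SequenceMatcher`, and uses a hypothetical threshold of 0.85. The function names and all sample records are made up for illustration.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link_records(survey, census, threshold=0.85):
    """Pair each survey record with its best-scoring census record.

    Records are (name, address) tuples. The score is the mean of the name
    and address similarities; a survey record whose best score is below
    `threshold` (an assumed cut-off) is left unmatched (None).
    """
    links = []
    for s_name, s_addr in survey:
        best, best_score = None, 0.0
        for c_name, c_addr in census:
            score = (similarity(s_name, c_name) + similarity(s_addr, c_addr)) / 2
            if score > best_score:
                best, best_score = (c_name, c_addr), score
        links.append(((s_name, s_addr), best if best_score >= threshold else None))
    return links


# Made-up records for illustration only.
census = [
    ("John Malkovich", "12 Example St, Canberra"),
    ("Jane Citizen", "4 Sample Rd, Melbourne"),
]
survey = [
    ("Jon Malkovich", "12 Example Street, Canberra"),
    ("Alex Nobody", "99 Nowhere Ave, Hobart"),
]

for survey_record, census_record in link_records(survey, census):
    print(survey_record, "->", census_record)
```

Comparing every survey record against every census record is quadratic, so a sketch like this only suits small inputs; at census scale the candidate pairs would need to be narrowed first, for example by only comparing records that share a postcode.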

Note: the remaining 2,635 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.