Presentation
Video
Transcript
Excerpt from the automatic transcript of the video generated by YouTube.
Please take a seat, we're about to make a start. Thank you, everyone, for coming along to this second session after lunch here at PyCon 2014 in Montreal. Our next presenter has a PhD in developmental biology from Caltech, and he says he likes Python a lot. He currently works at Michigan State University, and today he is going to talk about data-intensive biology in the cloud: instrumenting all the things. Please welcome Titus Brown.

All right, thank you all for coming. I wanted to start with a few upfront definitions.
First of all, there's a lot of confusion about what big data means, and I'd like to point out that, as far as I'm concerned, it means whatever is still inconvenient to compute upon. If you can do the computation easily, it's no longer big data. A data scientist, you may have seen this before, is a statistician who lives in San Francisco. And a professor, as was defined earlier today by Fernando Pérez, is someone who writes grants to fund the people who do the work. So I am a professor and not a data scientist, because I live in Michigan and I write grants so that others can do data-intensive biology.

I'd also like to dedicate this talk to Terry Peppers. Terry is a friend who helps run the Testing in Python Birds of a Feather session; he couldn't be here this year. Over the last five years, as I've taken on my faculty position, he's progressively grown less understanding of what it is that I do, and so every year it's, "I understood even less of your talk this year than I did last year."
So it struck me this winter, I don't know how many of you live in the North, in the frozen wastes, but this year we had a lot of snow in Michigan, and so I spent a lot of time indoors with my six-year-old doing puzzles. It struck me that she was asking the same question that Terry was asking me, and I figured that if I could explain what I do at work to my six-year-old, Terry might probably understand it also. So what I do is: I actually assemble puzzles for a living. I told this to my six-year-old while we were working on a puzzle, and she looked at me with sort of wide-eyed wonder and said, "You actually get paid to do that?" I said, well, it's a little more complicated: I strategize about solving multi-dimensional puzzles with billions of pieces and no picture on the box. But it's still solving puzzles.

If I give you many, many, many pieces of a puzzle and ask you to come up with a strategy, there are roughly three basic strategies that you could use; these are all strategies used in genome assembly, which is what I actually work on. The three strategies are: a greedy strategy, where you say, hey, this piece sort of fits this piece, let's mash them together, which has some obvious flaws, as my six-year-old has found out; N squared, do these two pieces match, how about these two, how about these two; and then the Dutch approach. I figured that for PyCon the Dutch approach was obviously going to be the right one, so I thought I'd try and explain it this way.
The Dutch approach is also known as de Bruijn assembly, and the idea behind it is that you decompose each puzzle piece down into small patches and then look for similarities among those patches. Essentially, the patches that you decompose things into can be easily hashed and compared in a hash table, which ends up turning everything into a linear problem as you find these similarities within your puzzle pieces. Algorithmically it's pretty awesome: it's linear in time with the number of pieces, which is way better than N squared, right? However, it's also linear in memory with the volume of the data, and this is largely due to errors in the digitization process. If you have small errors when you're reading the pieces into computer memory, those errors cause problems with the hashing, and you end up having to keep track of all of the different little patterns that you have. This is basically the problem that we've spent five or six years in my lab trying to solve. Just to show you the practical effects: for about five hundred dollars of sequencing data today, we would require about a hundred gigabytes of RAM in order to put that puzzle together, to put those sequences back together. This was a real problem a couple of years back, and it's still a real problem within the field, although we've provided some approaches that help deal with it.
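To make the "decompose each piece into small patches and hash them" idea concrete, here is a minimal Python sketch of k-mer decomposition and hash-table indexing. It is not code from the talk; the k value, function names, and toy reads are illustrative assumptions.

```python
# Minimal sketch of k-mer decomposition and hashing (illustrative only).
# Each read is broken into overlapping substrings of length k ("patches"),
# which are stored in a hash table mapping k-mer -> reads containing it.
# Building the index is linear in the total amount of sequence read in.

from collections import defaultdict


def kmers(read, k):
    """Yield all overlapping length-k substrings of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_kmer_index(reads, k=4):
    """Map each k-mer to the set of read ids it appears in."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


if __name__ == "__main__":
    # Toy "puzzle pieces": short overlapping DNA fragments.
    reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
    index = build_kmer_index(reads, k=4)
    # Reads that share a k-mer are candidates for overlapping each other.
    shared = {kmer: ids for kmer, ids in index.items() if len(ids) > 1}
    print(shared)  # {'GCGT': {0, 1}, 'TACC': {1, 2}}
```

Because every piece is only hashed and looked up, the comparison work grows linearly with the data rather than quadratically, which is the advantage the talk describes; the memory cost of storing every distinct k-mer, including erroneous ones, is the downside discussed above.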
So, the research challenges that we tackle in my lab: right now it only costs about ten thousand dollars and takes about a week to generate enough sequence that really no commodity computer can handle it, and even very few supercomputers can handle that amount of sequence, in terms of doing this puzzle-piece work of putting the sequences back together. The other problem is that hundreds to thousands of such data sets are being generated by biologists on a weekly to monthly basis, so this is a really vast data analysis problem, sort of the inverse of the particle accelerator problem: there you have one big particle accelerator generating massive volumes of data and thousands of people looking at that data; here you have many thousands of people generating data that only a few people can analyze. The nice thing is that over the last five or six years we've basically solved, or at least addressed, this top issue, and what I'm going to tell you about today is some of the outcomes of our attempts to address the bottom issue, which is that we're generating lots of these data sets.
On the computer science side of our research, we've built a streaming lossy compression approach. Basically, what we can do is read in each piece one at a time and ask: have we seen this piece before or not? If we have seen the piece before, we can discard it and decrease the total number of pieces we're looking at. It turns out to be a single-pass algorithm, and it's really nice and low-memory and all of that. We've also invested heavily in low-memory probabilistic data structures, which I talked about last year.
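Here is a toy Python sketch of that single-pass "have we seen this piece before?" filtering idea. It is not the talk's actual algorithm: it uses an exact Python set where the talk describes low-memory probabilistic data structures, and the subset test, k value, and names are my own illustrative choices.

```python
# Toy single-pass streaming filter (illustrative, not the talk's code).
# Each incoming read ("puzzle piece") is kept only if it contributes k-mers
# we have not seen yet; reads whose k-mers are all already known are dropped.
# An exact set stands in here for the low-memory probabilistic structures
# mentioned in the talk.

def kmers(read, k):
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def streaming_filter(reads, k=4):
    """Single pass over reads, yielding only those that add new k-mers."""
    seen = set()
    for read in reads:
        read_kmers = set(kmers(read, k))
        if read_kmers <= seen:      # nothing new: discard this piece
            continue
        seen |= read_kmers          # remember its patches
        yield read                  # keep the informative piece


if __name__ == "__main__":
    reads = ["ATGGCGT", "ATGGCGT", "GCGTACC", "ATGGCG"]
    kept = list(streaming_filter(reads))
    print(kept)  # ['ATGGCGT', 'GCGTACC'] -- redundant pieces dropped
```

The point of the sketch is the shape of the computation: one pass over the stream, a membership question per piece, and a shrinking working set of pieces to assemble afterwards.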
We're now reaching the point with our computer science research where our memory scales considerably better: it scales with the amount of information in the puzzle, which is basically the size of the picture, which is always much smaller than the number of pieces we have. This is sample dependent, but typically it's
[ ... ]
Note: the remaining 3,351 words of the full transcript have been omitted to comply with YouTube's "fair use" guidelines.