PyCon 2014

Character encoding and Unicode in Python

Esther Nam, Travis Fischer


Excerpt from the automatic transcription of the video generated by YouTube.

I'd like to introduce Esther and Travis, who will talk about Unicode and character encoding this morning.

Hello PyCon! I'm Travis, this is Esther. We're very excited to be with you this morning. I dreamt last night that we couldn't figure out how to get those slides up, so I'm feeling really good about PyCon so far. Esther and I have worked at newcars.com together for the last couple of years: Python shop, PyCon sponsor, users of Pyramid. Esther's still with the great team there; I'm with another Los Angeles Python shop, Shift.com.

So I want to introduce you to Cody. Cody's experiencing pain, he's experiencing frustration, confusion. So why is Cody experiencing these things? Well, he's dealing with the Heartbleed bug on his website, but on top of that he's trying to write software that deals with character encoding and Unicode. And if you are anything like Esther, myself, or Cody, it brings about these emotions when you're trying to write software that deals with character encoding, and it may result in this: the legendary double table flip. So we're hoping this talk will help alleviate some of that pain for you.

Speaking of flipping tables, we have a confession to make. We chose the title of our talk not just because it is sort of silly and fun, but because we're evil, as the PyCon program committee found out when our talk title crashed the IRC bots that they use when they review proposals. We really drove home the point that this is a topic that's very relevant to anyone who has ever had to deal with text, character encoding, and Unicode, which is all of us at some point in our careers.

Before continuing, I have to mention that we're not going to talk about Python 3 in our talk. We do want to cover the fundamentals of text, characters, character encoding, Unicode, and strings, so that you can understand at a fundamental level how those things work, and that will make it easier for you to understand the differences between the way Python 3 and Python 2 handle text and character encoding. But we just don't have time to talk about this today.

So we're gonna start off with the fundamentals, and I'm gonna mention that Ned Batchelder gave a fantastic talk at PyCon 2012 on the fundamentals of Unicode and encoding. It helped Esther and I understand it a lot, so I recommend you go check that talk out, the video, as soon as you can if you haven't seen it. We're gonna try and build on that and take you a little further.

So, fundamentals. Humans use text to communicate, right? We use written language, we use characters. Computers speak bytes; computers on a fundamental level just deal with binary. So how do we deal with that? Well, we as programmers need to translate our written text into binary. We use something called character encoding. You guys know this, right? We take a character like the lowercase Latin a and we just map it to some binary representation: we assign it some bits. This is the ASCII encoding for the lowercase a. So there's a bunch of different character encodings out there. They're all unique; they do things differently.
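
The slide with the ASCII bits for the lowercase a is not part of this transcript, so here is a minimal sketch of that mapping. It is written in Python 3 syntax, which the talk itself does not cover, and `ascii_bits` is a hypothetical helper name used only for illustration.

```python
def ascii_bits(char):
    """Return the 8-bit ASCII pattern for one character as a string of 0s and 1s."""
    # encode() turns text into bytes; the "ascii" codec raises
    # UnicodeEncodeError for any character outside the ASCII range.
    encoded = char.encode("ascii")
    return format(encoded[0], "08b")


print(ascii_bits("a"))   # 01100001
print(hex(ord("a")))     # 0x61
```

Passing a character that ASCII cannot represent, such as the Euro symbol, raises `UnicodeEncodeError` instead of returning bits.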

They do try to agree and have some compatibility for, like, your basic A to Z that's represented in ASCII. But because they're unique encodings, they differ: a character like the Euro symbol is represented with different binary representations in these various encodings, and the result of that is that the same binary sequence may represent two different characters depending on what encoding you are using. And this is the cause of a lot of pain.

Let's think about all the characters that we like to use in our written communication that don't fall under the standard A-to-Z, zero-to-nine range that ASCII presents us. We like to use mathematical symbols, punctuation marks, and even the Unicode snowman. And of course there are languages other than English that have their own scripts; some of them have dozens or even hundreds of characters that also need to be encoded. So Unicode was a way to standardize the way that we can encode any character that anyone might ever want to encode.

The way Unicode does this is by providing a universally recognized mapping between a character and a Unicode code point. This is a universally recognized identifier for a character. It's denoted by a capital U and a plus sign, followed by a unique hexadecimal value assigned uniquely to that character. What you should note is that Unicode does not provide a binary mapping for a character. Code points are purely an abstraction layer: it's limited to the character itself, it's wholly separate from any binary representation.
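
As a rough illustration of the last few points, the sketch below prints the code points of the Euro sign and the snowman, then shows how the Euro sign's bytes differ between encodings and how one byte decodes to two different characters. It assumes Python 3; `code_point` is a hypothetical helper, and the three encodings were picked for the example rather than taken from the talk.

```python
import unicodedata


def code_point(char):
    """Format a character's code point in the U+XXXX notation."""
    return "U+{:04X}".format(ord(char))


EURO = "\u20ac"      # EURO SIGN
SNOWMAN = "\u2603"   # SNOWMAN

print(code_point(EURO), unicodedata.name(EURO))        # U+20AC EURO SIGN
print(code_point(SNOWMAN), unicodedata.name(SNOWMAN))  # U+2603 SNOWMAN

# One code point, different bytes depending on the encoding chosen.
euro_bytes = {
    name: EURO.encode(name)
    for name in ("cp1252", "iso-8859-15", "utf-8")
}

# The same byte decodes to different characters under different encodings.
as_latin1 = b"\xa4".decode("latin-1")       # generic currency sign
as_latin9 = b"\xa4".decode("iso-8859-15")   # Euro sign
```

The code point stays the same throughout. Only the bytes change, which is why a byte sequence is ambiguous until you know its encoding.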

It's purely a description for a character. Of course, ultimately computers need bytes to deal with text and transmit it and store it, so what we need is a way to map from the Unicode code point for every character to that binary representation. So what we need is a Unicode-aware encoding, a Unicode Transformation Format, or UTF, of which there are several. But the one that you should use wherever you can is UTF-8. For many reasons it's become the de facto standard, used across the internet, used by email clients, internet browsers, and operating systems.

So Esther just explained Unicode and UTF-8, right? These are highly related terms, but I just want to drive home the point: they are unique ideas, and you shouldn't just blindly use those terms interchangeably.
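
One way to see the distinction is to encode a single code point with several UTFs and compare the bytes. This is only a sketch in Python 3; `encoded_forms` is a hypothetical helper, and the big-endian codec variants are used so that no byte order mark appears in the output.

```python
def encoded_forms(char):
    """Encode one character with several Unicode transformation formats."""
    return {
        name: char.encode(name)
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }


snowman = "\u2603"
forms = encoded_forms(snowman)

for name, raw in forms.items():
    # Same code point every time, but a different byte sequence per encoding.
    print(name, len(raw), raw.hex())

# Decoding with the matching encoding always recovers the same character.
round_trips = [raw.decode(name) for name, raw in forms.items()]
```

The snowman takes three bytes in UTF-8, two in UTF-16, and four in UTF-32, while its code point U+2603 never changes.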

Unicode is code points: unique identifiers that are universally known and accepted for characters. UTF-8, UTF-16: these are binary encodings for those code points. So just keep those separate in your mind. So, oh sorry, let's talk layers of abstraction.

Let's put this all in a little bit of context, right? On a high level... I'm gonna use the computer, my clicker stopped working. On a high level you deal with the display layer, right? You show characters on a screen, you print them. The stylized character is called a glyph; you render it using font rendering, etc. Below that you have the text layer of abstraction. This deals with the elements of our written languages. This is what we as humans deal with, right? This is what Unicode tries to address. It gives unique

[ ... ]

Note: the other 3,003 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.