Saturday UTF-8 Reading

Reading about UTF-8 on a Saturday afternoon? Yeah, I didn't expect that either. But I was having trouble yesterday at work sending HTML email: I was getting some mojibake. That's what you get when text is encoded one way and interpreted another way, so some of the characters turn into garbage. For example, in Microsoft Outlook some special characters show up as a black diamond with a question mark.
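Here is a minimal C# sketch of the two ways this mismatch tends to show up. It is an illustration, not the code from my email job. Latin-1 stands in for a legacy Windows encoding because it is built into every .NET runtime, and the sample word is made up. The black diamond is most likely U+FFFD, the Unicode replacement character, which a UTF-8 decoder emits for bytes it cannot interpret.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // ISO-8859-1 (Latin-1) needs no extra encoding provider.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "cafés", written with an escape so the result does not depend
        // on how this source file itself is saved.
        string original = "caf\u00E9s";

        // Case 1: bytes written as UTF-8, then read back as Latin-1.
        // The two UTF-8 bytes of é (C3 A9) turn into two separate characters.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        string misread = latin1.GetString(utf8Bytes);
        Console.WriteLine(misread == "caf\u00C3\u00A9s"); // True

        // Case 2: bytes written as Latin-1, then read back as UTF-8.
        // The lone byte E9 is not valid UTF-8, so it is replaced by U+FFFD.
        byte[] latin1Bytes = latin1.GetBytes(original);
        string replaced = Encoding.UTF8.GetString(latin1Bytes);
        Console.WriteLine(replaced == "caf\uFFFDs"); // True
    }
}
```

Case 2 is the one that matches the black diamond symptom: the original character is gone and only the replacement character is left.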

I'm learning that UTF-8 is the dominant standard mostly because it can encode every Unicode character while staying backward compatible with good old ASCII: any valid ASCII text is also valid UTF-8. UTF-8 can do these tricks because it encodes characters with a variable number of bytes. ASCII characters take just one byte. Everything else takes two to four bytes.
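A quick way to see the variable width is to ask .NET how many bytes a few characters need. This is a small sketch with sample characters I picked for illustration (a letter, an accented letter, the euro sign, and an emoji).

```csharp
using System;
using System.Text;

class Utf8Lengths
{
    static void Main()
    {
        // Escapes keep the sample independent of the source file's own encoding.
        string[] samples = { "A", "\u00E9", "\u20AC", "\U0001F600" };

        foreach (string s in samples)
        {
            int codePoint = char.ConvertToUtf32(s, 0);
            int byteCount = Encoding.UTF8.GetByteCount(s);
            // Prints 1, 2, 3 and 4 bytes for the four samples, in that order.
            Console.WriteLine($"U+{codePoint:X4}: {byteCount} byte(s)");
        }

        // Plain ASCII text produces identical bytes in both encodings.
        byte[] ascii = Encoding.ASCII.GetBytes("Hello");
        byte[] utf8 = Encoding.UTF8.GetBytes("Hello");
        Console.WriteLine(BitConverter.ToString(ascii) == BitConverter.ToString(utf8)); // True
    }
}
```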

UTF-8 is also being used more and more for Internet documents in general, and modern email clients generally support HTML encoded in UTF-8. Read more about it at Wikipedia.

This is a good read too, from Email on Acid.

I learned that most email clients IGNORE the content type declared in the HTML and use the Content-Type from the email header instead. That was certainly tripping me up yesterday.
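If the mail is built in .NET, the header charset can be set on the message itself. This is only a sketch assuming System.Net.Mail, which may not be the mail API involved here. The addresses, subject, and body are placeholders.

```csharp
using System;
using System.Net.Mail;
using System.Text;

class HtmlMailSketch
{
    static void Main()
    {
        // Placeholder addresses, used purely for illustration.
        using (var message = new MailMessage("sender@example.com", "recipient@example.com"))
        {
            message.Subject = "UTF-8 test";
            message.SubjectEncoding = Encoding.UTF8;

            message.Body = "<html><body><p>caf\u00E9</p></body></html>";
            message.IsBodyHtml = true;

            // BodyEncoding sets the charset field of the Content-Type header,
            // which is the value mail clients actually look at.
            message.BodyEncoding = Encoding.UTF8;

            Console.WriteLine(message.BodyEncoding.WebName); // utf-8

            // Sending is left out: it would need an SmtpClient and a real server.
        }
    }
}
```

The point of the sketch is that the string has to be decoded correctly before it gets here. The header can only describe the text it is given.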

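The problem turned out to be File.ReadAllText. Without an encoding argument it assumes UTF-8 unless it finds a byte order mark, and my file was in Windows-1252 (code page 1252). The solution was to pass the correct encoding explicitly: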

File.ReadAllText(tempFilePath, System.Text.Encoding.GetEncoding(1252))
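To see the difference side by side, here is a self-contained sketch that writes a tiny Windows-1252 file and reads it both ways. The file contents are made up. The RegisterProvider line is my assumption for .NET Core and .NET 5 or later, where code page 1252 is not available until the code pages provider is registered. On .NET Framework that line is not needed.

```csharp
using System;
using System.IO;
using System.Text;

class ReadAllTextDemo
{
    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // must be registered before GetEncoding(1252) will succeed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        string tempFilePath = Path.GetTempFileName();
        try
        {
            // "cafés" as Windows-1252 bytes, no byte order mark: 63 61 66 E9 73.
            File.WriteAllBytes(tempFilePath, windows1252.GetBytes("caf\u00E9s"));

            // No encoding argument: no BOM is found, so the bytes are decoded
            // as UTF-8 and the invalid byte E9 becomes U+FFFD.
            string wrong = File.ReadAllText(tempFilePath);
            Console.WriteLine(wrong == "caf\uFFFDs"); // True

            // Explicit encoding: the text comes back intact.
            string right = File.ReadAllText(tempFilePath, windows1252);
            Console.WriteLine(right == "caf\u00E9s"); // True
        }
        finally
        {
            File.Delete(tempFilePath);
        }
    }
}
```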

See also this Stack Overflow question.
