computer science, math, programming and other stuff
a blog by Christopher Swenson

# Painful sed script of Wordpress doom

So, I figured out most of the problems, and how to solve them.

Problem one: pasting text from Word stores the document in Windows-1252 encoding.

Problem two: this is stored in MySQL, by default, in the deceptively-similar-to-but-incompatible-with-1252 ISO-8859-1 encoding.

Problem three: Wordpress is smart, and automatically converts 1252 into HTML entities, which hides the problem until you dump the database.

Problem four: UTF-8 is freely intermingled with 1252 text in the entries.

Problem five: Windows does really weird things to encode non-1252 characters (like smart quotes) in a 1252 character stream.

Problem six: Some piece of software went wrong at some point (probably PHP or phpMyAdmin) , and converted the 1252, non-1252 weird smart quotes, ISO-8859-1, and UTF-8 codes to UTF-8 (by converting the byte literal values instead of the underlying characters, which for UTF-8 codes means that you have to decode twice to get the real answer), but MySQL keeps the text marked as ISO-8859-1.

Problem seven: you are me, and you have 2,300 of these invalid characters in a MySQL dump that needs to be imported.

Problem eight: Mac OS X, by default, uses a sub-standard version of sed that does not support replacement of literal values ("\xC2", for example), making it impossible to replace any non-ASCII characters.

Solution: install sed from Fink, and run the following sed script on your MySQL dump:

s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g
s/\xc3\xa2\xe2\x82\xac\xe2\x84\xa2/\&rsquo;/g
s/\xc3\xa2\xe2\x82\xac\xe2\x80\x9d/\&mdash;/g
s/\xE2\x80\x9C/\&ndash;/g
s/\xc3\xa2\xe2\x82\xac\xc2\xa2/\&bull;/g
s/\xC3\xA2\xE2\x82\xAC\xC5\x93/\&ldquo;/g
s/\xC3\xA2\xE2\x82\xAC\xC2\x9D/\&rdquo;/g
s/\xC3\xA2\xE2\x82\xAC\xC2\x9C/\&lsquo;/g
s/\xC3\xA2\xE2\x82\xAC/\&rdquo;/g
s/\xC3\x85\xE2\x80\x99/\&lsquo;/g
s/\xc3\x82\xc2\xba/\&deg;/g
s/\xC3\x82\xC2\xA9/\&copy;/g
s/\xc3\xa2\xC2\xA6/\&hellip;/g
s/\xC3\x82\xC2\xA3/\&pound;/g
s/\xC3\x83\xC2\xA3/\&aacute;/g
s/\xc3\x83\xc2\xa1/\&aacute;/g
s/\xC3\x83\xC2\xA8/\&egrave;/g
s/\xC3\x83\xC2\xA9/\&eacute;/g
s/\xC3\x83\xC2\xA7/\&ccedil;/g
s/\xc3\x82\xc2\xae/\&reg;/g
s/\xc3\x83\xc2\x81/\&Aacute;/g
s/\xc3\x83\xc2\xab/\&euml;/g
s/\xc3\x82\xc2\xbd/\&frac12;/g
s/\xc3\x82\xc2\xbc/\&frac14;/g