Monday, February 15, 2010

Perl, UTF-8 Email Messages, MIME::Entity and Quoted-Printable encoding

Some findings after much bashing of head:


#!/usr/bin/perl -w
use strict;

use MIME::Entity;
use Encode;

# My UTF-8 string -
# ¡™£¢∞§¶•ªº
# Basically using Mac OS X, just hold down the alt/option key and hit the 1 through 0 keys, in succession:
#
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Build the message, using MIME::Entity.
# MAKE SURE TO ALWAYS encode('UTF-8', 'string') BEFORE ADDING
# Always.

my $pt_entity = MIME::Entity->build(
Type => 'text/plain',
Data => Encode::encode('UTF-8', $content),
Encoding => 'quoted-printable',
);



# MAKE SURE TO ALWAYS decode('UTF-8', 'string') BEFORE WORKING WITH STRING
# Always.
my $new_content = $pt_entity->bodyhandle->as_string;
$new_content = Encode::decode('UTF-8', $new_content);

# For example, we're just going to reverse it:
$new_content = reverse($new_content);




my $io = $pt_entity->bodyhandle->open('w');

# YES. You will need to encode the content before writing it through the bodyhandle.
# Always.

$new_content = Encode::encode('UTF-8', $new_content);
$io->print($new_content);
$io->close;
$pt_entity->sync_headers(
'Length' => 'COMPUTE',
'Nonstandard' => 'ERASE'
);


# And, that's it.



# Before using the content, decode
# Always.
my $result = $pt_entity->bodyhandle->as_string;
$result = Encode::decode('UTF-8', $result);


# Always encode, before printing.
# Always.
#
# prints, ºª•¶§∞¢£™¡
print Encode::encode('UTF-8', $result);


The trick is to always, always, always encode your data before creating any sort of entity using MIME::Entity, and to always, always, always decode the data you get back using bodyhandle().

This workflow is strange, since you're usually told not to encode data until you're ready to print it. I suspect there's some weird IO::File stuff going on inside MIME::Entity (and friends), or that these modules think in terms of saving binary data, rather than characters, when creating MIME stuff. I don't know.
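The distinction between characters and bytes is easy to see with length(). This is a minimal sketch that uses only the core Encode module and the same test string as the script above; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# The same ten-character test string used in the script above.
my $chars = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Encoding turns the characters into UTF-8 bytes (octets).
my $bytes = Encode::encode('UTF-8', $chars);

# length() counts characters on the decoded string and bytes on the encoded one.
my $char_count = length($chars);
my $byte_count = length($bytes);

# Decoding the bytes gives back the original characters.
my $round_trip = Encode::decode('UTF-8', $bytes);

print "characters: $char_count, bytes: $byte_count\n";
print "round trip ok\n" if $round_trip eq $chars;
```

The seven characters in the U+00A1 to U+00BA range take two bytes each in UTF-8 and the other three take three bytes each, so the encoded string is longer than the decoded one.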

If you do not encode first, MIME::Entity will barf when using the quoted-printable encoding, but will probably be just fine with the "8bit" encoding.
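Quoted-printable is defined in terms of bytes, which is probably why the encode has to come first. The sketch below shows the encode-then-quote order using the core MIME::QuotedPrint module directly, not MIME::Entity; that MIME::Entity behaves the same way internally is an assumption here, not something checked against its source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use MIME::QuotedPrint qw(encode_qp decode_qp);

# Three of the test characters: U+00A1, U+2122, U+00A3.
my $chars = "\x{a1}\x{2122}\x{a3}";

# Encode to UTF-8 bytes first, then apply quoted-printable to the bytes.
my $bytes = Encode::encode('UTF-8', $chars);
my $qp    = encode_qp($bytes);

# Reverse the steps: undo quoted-printable, then decode the UTF-8 bytes.
my $decoded_bytes = decode_qp($qp);
my $decoded_chars = Encode::decode('UTF-8', $decoded_bytes);

print $qp, "\n";
print "round trip ok\n" if $decoded_chars eq $chars;
```

The quoted-printable text is plain ASCII, with each UTF-8 byte written as an equals sign followed by two hex digits.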

This was a huge headache to figure out.

Here is what happens if you don't do that first encode:


#!/usr/bin/perl -w
use strict;

use lib qw(/Users/justin/Documents/DadaMail/git/dada-mail/dada/DADA/perllib);

use MIME::Entity;
use Encode;

# My UTF-8 string -
# ¡™£¢∞§¶•ªº
# Basically using Mac OS X, just hold down the alt/option key and hit the 1 through 0 keys, in succession:
#
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Build the message, using MIME::Entity.
# MAKE SURE TO ALWAYS encode('UTF-8', 'string') BEFORE ADDING
# Always.

my $pt_entity = MIME::Entity->build(
Type => 'text/plain',
# Data => Encode::encode('UTF-8', $content),
Data => $content,
Encoding => 'quoted-printable',
);

my $s = $pt_entity->bodyhandle->as_string;

$s = Encode::decode('UTF-8', $s);

# Let's do a little string manip:
$s = reverse($s);

$s = Encode::encode('UTF-8', $s);

print $s;


Cannot decode string with wide characters at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.

exception_handler::die in Encode.pm at line 162
Encode::decode in test7.pl at line 28
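The failure above can be reproduced without MIME::Entity at all. This is a minimal sketch using only the core Encode module. The exact error text differs between Encode versions, so the example only checks that decode dies, and traps the error with eval rather than letting the script abort.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Already a character string; both characters are above 0xFF.
my $chars = "\x{2122}\x{221e}";

# Decoding a string that already holds wide characters dies,
# so trap the error with eval.
my $decoded = eval { Encode::decode('UTF-8', $chars) };
my $error   = $@;

print "decode failed: $error" if $error;

# Encoding first and then decoding works as expected.
my $bytes = Encode::encode('UTF-8', $chars);
my $ok    = Encode::decode('UTF-8', $bytes);
print "round trip ok\n" if $ok eq $chars;
```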






So do that first encode, please. If you skip the formula entirely, your program may appear to work right up until the final print:


 
#!/usr/bin/perl -w
use strict;

use lib qw(/Users/justin/Documents/DadaMail/git/dada-mail/dada/DADA/perllib);

use MIME::Entity;
use Encode;

# My UTF-8 string -
# ¡™£¢∞§¶•ªº
# Basically using Mac OS X, just hold down the alt/option key and hit the 1 through 0 keys, in succession:
#
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Build the message, using MIME::Entity.
# MAKE SURE TO ALWAYS encode('UTF-8', 'string') BEFORE ADDING
# Always.

my $pt_entity = MIME::Entity->build(
Type => 'text/plain',
# Data => Encode::encode('UTF-8', $content),
Data => $content,
Encoding => 'quoted-printable',
);

my $s = $pt_entity->bodyhandle->as_string;

# NAW, we don't need that
# $s = Encode::decode('UTF-8', $s);

# Let's do a little string manip:
$s = reverse($s);

# Well, that's silly! We don't need that one, either!
# $s = Encode::encode('UTF-8', $s);

print $s;



Wide character in print at /Users/justin/Desktop/test7.pl line 37.


And you will do what I did, and bang your head some more.

I couldn't find any info on how to handle things like MIME::Entity and UTF-8 encoding in the excellent articles available, such as this one:

http://ahinea.com/en/tech/perl-unicode-struggle.html

and,

http://perlgeek.de/en/article/encodings-and-unicode

and,

http://juerd.nl/site.plp/perluniadvice


I have this article labeled as "do not trust":

http://kbinstuff.googlepages.com/perl,unicodeutf8,cgi.pm,apache,mod_perla

Because it states,

6.1. Encode::encode/decode

For start, you should avoid using Encode::encode/decode/from_to to the greatest possible extent in your scripts. This will only lead to great confusion later. You may think you have gotten everything to work, but then a week later, you shall only add a little more functionality to your work and suddenly, everything falls apart and doodles will appear on your web pages.



I guess I understand what they mean, but you'll need to encode your UTF-8 stuff before it exits your program. Always. And you have to decode UTF-8 data that comes into your program. Always. How do you do this? Uh-huh, the Encode module.

Like it says in the perldoc for Unicode. So, I don't know what this page is yabbering about. I'm sure that, behind the scenes, Encode is used when you open files with a specific encoding:

http://perldoc.perl.org/perluniintro.html#Unicode-I/O

Which, as features go, is a pretty rad one.
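Here is a small sketch of that feature: opening a filehandle with an :encoding(UTF-8) layer, so the encode and decode happen at the I/O boundary. It writes to an in-memory scalar instead of a real file to stay self-contained; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three of the test characters: U+00A1, U+2122, U+00A3.
my $chars = "\x{a1}\x{2122}\x{a3}";

# Write through an encoding layer to an in-memory file.
# The layer encodes characters to UTF-8 bytes on output.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "Cannot open for write: $!";
print {$out} $chars;
close($out);

# Read back through the same layer, which decodes bytes to characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "Cannot open for read: $!";
my $read_back = do { local $/; <$in> };
close($in);

print "layer wrote ", length($buffer), " bytes\n";
print "round trip ok\n" if $read_back eq $chars;
```

With a layer on the handle there is no explicit Encode call and no "Wide character in print" warning, because the handle knows which encoding to write.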
