Tuesday, February 23, 2010

Experimental Dada Mail w/unicode ¡Support! Released

(this is a repost from here, 'cause I'm pretty stoked on it)


This is the first step in the localization project, since we can't very well translate Dada Mail if Dada Mail can't use the translations available.

I have to let this project rest for a little bit (and collect my wits - it was a very difficult step!) but any and all feedback is welcome, if you'd like to give this a spin - bug reports/problems of any kind are very much appreciated.

This version of Dada Mail should basically be able to support any language that can be represented in the Unicode character set and the UTF-8 encoding. Which should be, well, a lot of them. It probably doesn't (Dada Mail, that is), but where does it fail? I don't know - but it's a good time to test and see where it goes wrong.

For simple Euro-centric stuff, like this:

Je peux manger du verre, ça ne me fait pas mal.

It should be fine. For something a little more wild:

أنا قادر على أكل الزجاج و هذا لا يؤلمني.

(which should be Arabic)

Well, I can only go by whether something visually looks correct :) Even this email is sort of a test - I don't know if it's going to work or not - so, fingers crossed! If it does, we're on a good track: having Dada Bridge take a random email, run it through a system that's mostly been tested with one very specific way of creating emails, and have it come out readable on the other side is a great big step - not even talking about the online archive, rss/atom feeds, twitter thingie, etc, etc, etc.

Here's the download to the version I'm now running at the Dada Mail support site:

http://github.com/downloads/justingit/dada-mail/dada-4_0_2-unicode.zip

http://github.com/downloads/justingit/dada-mail/dada-4_0_2-unicode.tar.gz

If you want to check it out via github, the branch is at:

http://github.com/justingit/dada-mail/tree/charset_work

To grab it with git, you have to do this:


git clone git://github.com/justingit/dada-mail.git
cd dada-mail
git fetch
git checkout --track -b your_local_branch_name origin/charset_work


Here's the explanation of all that:

http://groups.google.com/group/github/browse_thread/thread/71f944b925467ab6

There's a guide of what to expect with Dada Mail and unicode/UTF-8 you can read here:

http://dadamailproject.com/support/documentation-4_0_2-unicode/features-UTF-8.pod.html

Which I'll paste the contents of at the end of this message - but you may also want to compare it to the version of this doc for 4.0.2 STABLE:

http://dadamailproject.com/support/documentation-4_0_2/features-UTF-8.pod.html

(Long story short: 4.0.2 UTF-8/Unicode Support: "uhh...")

And, that's about it. This was a hard part of the project, since this is a 10+ y/o codebase - it very much pre-dates even unicode/UTF-8 support in Perl itself, so there's a reason, I guess, why the program was in such bad shape when it came to supporting it. Many, many, many bugs showed themselves once this feature was asked for. I think a great majority of them have been solved.

Give it a spin if this interests you and if I can help out with anything, let me know,


--
Introduction

Dada Mail can speak UTF-8 and almost expects that everything else around it does, too.

That means:

• It treats everything it handles as UTF-8
• Everything it returns is in UTF-8

How To Have a Pleasant Experience

If you're installing Dada Mail for the first time, there's nothing you'll need to do, but below are some great guidelines on how to keep your lists configured, so you continue to have a good experience.

If you're upgrading, make sure your configuration reflects the advice below.

It's heavily advised to keep everything in Dada Mail speaking UTF-8 without any real exceptions.

Config Variable: $HTML_CHARSET

By default, the config variable $HTML_CHARSET is set to UTF-8.

Keep it that way, same case (UTF-8) - same everything.

Dada Mail is only tested with the charset set this way.

Advanced Sending Preferences

Default Character Set

Set this to UTF-8.

Default Plain Text/HTML Message Encoding

There are really only two choices recommended for Dada Mail.

• 8bit
Should work.

• quoted-printable

If you have any trouble with 8bit, try quoted-printable. Because of the number of times Dada Mail creates, tweaks, formats and templates out email messages, the encoding can potentially get mucked up.

This potential mucking-up is mitigated when Dada Mail uses quoted-printable encoding internally. This should be the default for email messages.
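Here's a rough sketch of why quoted-printable is the safer bet: once the UTF-8 bytes are quoted-printable encoded, the message body is plain 7-bit ASCII, so there's not much left to muck up in transit. This isn't Dada Mail's own code, just an illustration using the Encode and MIME::QuotedPrint modules that ship with Perl; the sample string is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
use MIME::QuotedPrint qw(encode_qp decode_qp);

# A made-up character string with non-ASCII characters, ending in a newline.
my $text = "\x{a1}Hola! \x{2122}\n";

# Characters -> UTF-8 bytes -> quoted-printable (plain ASCII on the wire).
my $qp = encode_qp( Encode::encode( 'UTF-8', $text ) );

# Reverse the trip: quoted-printable -> UTF-8 bytes -> characters.
my $round_trip = Encode::decode( 'UTF-8', decode_qp($qp) );

print $qp;
print "only 7-bit bytes: ", ( $qp =~ /[^\x00-\x7f]/ ? "no" : "yes" ), "\n";
print "round trip ok: ",    ( $round_trip eq $text  ? "yes" : "no" ), "\n";
```

With 8bit, the raw UTF-8 bytes travel as-is, which is fine as long as nothing along the way reinterprets them.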

Encode Message Headers

Have this option checked.

SQL Backends

Database

PostgreSQL

Encoding for PostgreSQL databases is set when the database is created - make sure to create your database with a UTF-8 encoding, like so:

CREATE DATABASE dadamail WITH ENCODING 'UTF-8'

MySQL

Nothing you'll have to do.

SQLite

Nothing you'll have to do.

DBM Files

DBM Files have no encoding support, but Dada Mail knows this and compensates.

Schema

MySQL

The MySQL schemas are set to create tables with an encoding of UTF-8.

PostgreSQL

Nothing has changed.

SQLite

Nothing has changed.

Drivers

The currently supported SQL backends, mysql (MySQL), Pg (PostgreSQL) and SQLite, all have different ways to somewhat "enable" their UTF-8 support.

• MySQL

mysql_enable_utf8 => 1,

has been added to the $DBI_PARAMS hashref.

• PostgreSQL

pg_enable_utf8 => 1,

has been added to the $DBI_PARAMS hashref.

• SQLite

sqlite_unicode => 1

has been added to the $DBI_PARAMS hashref.

No explicit encoding/decoding is done in Dada Mail when saving/retrieving data. Hopefully, the drivers are UTF-8-aware enough.
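For anyone wiring this up by hand, here's a minimal sketch of how those three attributes could end up in a connection hashref. The helper name dbi_params_for, the RaiseError setting and the database filename in the comment are my own illustration, not how Dada Mail builds $DBI_PARAMS internally; only the three UTF-8 attribute names come from the list above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The UTF-8 connection attribute for each DBI driver, as listed above.
my %UTF8_ATTR = (
    mysql  => { mysql_enable_utf8 => 1 },
    Pg     => { pg_enable_utf8    => 1 },
    SQLite => { sqlite_unicode    => 1 },
);

# Hypothetical helper: build the attribute hashref for a given driver.
sub dbi_params_for {
    my ($driver) = @_;
    my $utf8 = $UTF8_ATTR{$driver}
        or die "No UTF-8 attribute known for driver '$driver'\n";

    # RaiseError is a standard DBI attribute, included only for illustration.
    return { RaiseError => 1, %{$utf8} };
}

my $params = dbi_params_for('SQLite');
print join( ', ', map { "$_ => $params->{$_}" } sort keys %{$params} ), "\n";

# With DBI and DBD::SQLite installed, the hashref would go in as the
# fourth argument to connect (the filename here is made up):
#   my $dbh = DBI->connect( 'dbi:SQLite:dbname=dada.db', '', '', $params );
```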

Plugins/Extensions

The Plugins and Extensions that come with Dada Mail have not been as thoroughly tested as the main program. There are still warts.

Dada Bridge

Dada Bridge is in a unique position: it needs to handle a lot of different stuff thrown at it and deal with it gracefully. Dada Mail does, in fact, handle any realistic character set/encoding you throw at it, but Dada Mail will convert messages it receives to its internal format before it resends them out to your list.

This means the encoding of your choice (8bit or quoted-printable) and the charset of your choice (as long as your charset is UTF-8).
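The conversion step boils down to something like the sketch below: decode the incoming bytes using whatever charset the message declares, then encode the result as UTF-8. The helper name body_to_utf8 is hypothetical and this is not Dada Bridge's actual code, just the general idea done with the Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: re-express a message body as UTF-8 bytes,
# given the charset the incoming message claims to use.
sub body_to_utf8 {
    my ( $octets, $charset ) = @_;
    my $chars = Encode::decode( $charset, $octets );    # bytes -> characters
    return Encode::encode( 'UTF-8', $chars );           # characters -> UTF-8 bytes
}

# "c-cedilla, a" as it would arrive in an ISO-8859-1 message:
# one byte per character.
my $latin1 = "\xe7a";
my $utf8   = body_to_utf8( $latin1, 'ISO-8859-1' );

printf "%d byte(s) in, %d byte(s) out\n", length($latin1), length($utf8);
```

The c-cedilla takes one byte in ISO-8859-1 and two in UTF-8, which is why the byte count grows.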

Upgrading

You are potentially going to have problems.

It's possible that, since List Settings were never decoded/encoded correctly in past versions, they'll show up in the program incorrectly once you've upgraded. The easiest thing to do is to edit the mistakes and resave the information. For most of the program, you're going to have to manually export the information and re-import it with the correct encoding, sadly. Dada Mail will probably fail gracefully with old information, but it's possible that you'll see squiggly characters instead of what you want to see. There's nothing in Dada Mail that will stop this from happening. If you experience it (from old information), we're not going to count it as a bug, but rather as a known issue.
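To give a feel for where the squiggly characters come from, here's a sketch of one common way old data gets mangled (UTF-8 bytes mistaken for Latin-1 and then encoded a second time) and how one layer of that can be peeled back off. This is an assumption about the kind of damage you might see, not a tool that ships with Dada Mail; the helper name undo_double_encoding is made up, and it only makes sense to run it on data you know went through exactly this mistake.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

my $original = "\x{a1}\x{2122}\x{a3}";    # the characters we meant to save

# What a correct save looks like: UTF-8 bytes.
my $good_bytes = Encode::encode( 'UTF-8', $original );

# The old mistake: those bytes get taken for Latin-1 characters
# and are encoded as UTF-8 a second time.
my $double =
    Encode::encode( 'UTF-8', Encode::decode( 'ISO-8859-1', $good_bytes ) );

# Hypothetical repair: undo exactly one layer of encoding.
sub undo_double_encoding {
    my ($octets) = @_;
    my $chars = Encode::decode( 'UTF-8', $octets );    # one char per original byte
    return Encode::encode( 'ISO-8859-1', $chars );     # back to the original bytes
}

my $repaired = undo_double_encoding($double);

printf "good: %d bytes, mangled: %d bytes\n",
    length($good_bytes), length($double);
print "repaired: ", ( $repaired eq $good_bytes ? "yes" : "no" ), "\n";
```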

Problems?

Please let us know via the Support Boards:

http://dadamailproject.com/support/boards/

Or the developer mailing list:

http://dadamailproject.com/cgi-bin/dada/mail.cgi/list/dadadev/

Thanks!

See Also:

• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html

• perlunitut - Perl Unicode Tutorial

http://perldoc.perl.org/perlunitut.html

• perlunifaq - Perl Unicode FAQ

http://perldoc.perl.org/perlunifaq.html




Saturday, February 20, 2010

Perl HTML::Template and UTF-8 Unicode

HTML::Template does not support file encoding:


#!/usr/bin/perl -w
use strict;
use Encode;
use HTML::Template;
my $template = HTML::Template->new(
filehandle => *DATA,
);
print Encode::encode('UTF-8', $template->output);
__DATA__
¡™£¢∞§¶•ªº

prints, ¡™£¢∞§¶•ªº (or something like that!)

In the example above, this makes sense, since we're reading from an open filehandle (even if it's only our magical DATA) that we never put an encoding layer (the file layer filter thingy) on. That's easy to fix:



#!/usr/bin/perl -w
use strict;
use Encode;

binmode DATA, ':encoding(UTF-8)';

use HTML::Template;
my $template = HTML::Template->new(
filehandle => *DATA,
);
print Encode::encode('UTF-8', $template->output);
__DATA__
¡™£¢∞§¶•ªº

prints, ¡™£¢∞§¶•ªº, yay!


This also works if we want to just pass a reference to a scalar to HTML::Template:


#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";
use HTML::Template;
my $template = HTML::Template->new(
scalarref => \$content,
);
print Encode::encode('UTF-8', $template->output);

prints, ¡™£¢∞§¶•ªº, yay!

This doesn't work if we just want to give it the name of a template file. That's a really useful thing to be able to do, since HTML::Template has a feature that lets you search through a file structure (or at least an array of directories) looking for the file.

And this is where encoding madness begins.

Cause I know what you're thinking, just treat HTML::Template's output like information that's coming from outside your program (since, if you're using a template *file*, it kinda is).

So, all you need to do is decode (this is the WRONG WAY to solve the problem, but let's just make that mistake...) the return value of ->output, like this:


#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

my $filename = 'utf8string.tmpl';

open my $fh, '>:encoding(UTF-8)', $filename or die $!;
print $fh $content;
close $fh;

use HTML::Template;
my $template = HTML::Template->new(
filename => $filename,
);

my $output = $template->output;
$output = Encode::decode('UTF-8', $output);

print Encode::encode('UTF-8', $output);


prints, ¡™£¢∞§¶•ªº. Yes.

But... what if you have a variable (it is a templating system) and the variable in the param() you pass has UTF-8 strings? MUAHAHA!


#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "
<!-- tmpl_var one -->
\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}
";

my $filename = 'utf8string.tmpl';

open my $fh, '>:encoding(UTF-8)', $filename or die $!;
print $fh $content;
close $fh;


use HTML::Template;
my $template = HTML::Template->new(
filename => $filename,
);
$template->param(
one => "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}",
);

my $output = $template->output;
$output = Encode::decode('UTF-8', $output);

print Encode::encode('UTF-8', $output);

Cannot decode string with wide characters at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.


Bahahaha!

Take out those decode/encode lines (I know it looks strange to have one right after the other) and you'll still get weird output:


¡™£¢∞§¶•ªº
¡™£¢∞§¶•ªº


Darned if you do/don't. Those two lines should have the same string. They don't. No amount of encoding/decoding is going to help.
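My best understanding of why the two lines come out different: the param is a decoded character string, while the template file was read as raw bytes, and when Perl joins the two it treats each raw byte as its own Latin-1 character. Here's a small sketch of just that effect, using only Encode and no HTML::Template at all; the variable names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{2122}";                              # decoded, like the param
my $bytes = Encode::encode( 'UTF-8', "\x{2122}" );   # raw bytes, like the file contents

# Joining them makes Perl treat each of the 3 raw bytes as a character.
my $mixed = $chars . $bytes;
my $out   = Encode::encode( 'UTF-8', $mixed );

# Decoding the bytes first keeps both halves as the same single character.
my $clean     = $chars . Encode::decode( 'UTF-8', $bytes );
my $clean_out = Encode::encode( 'UTF-8', $clean );

printf "mixed: %d characters, %d bytes out\n", length($mixed), length($out);
printf "clean: %d characters, %d bytes out\n", length($clean), length($clean_out);
```

The mixed version ends up with four characters instead of two, and the extra ones are the squiggles.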


The trick, other than tweaking HTML::Template's source to include file filter layer thingamabobs, is to decode the contents of the file it opens up.

How to do that.

Trolling through the HTML::Template mailing list archives leads to the idea of using an HTML::Template filter that matches everything and then does our decoding:



#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "
<!-- tmpl_var one -->
\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}
";

my $filename = 'utf8string.tmpl';

open my $fh, '>:encoding(UTF-8)', $filename or die $!;
print $fh $content;
close $fh;


use HTML::Template;
my $template = HTML::Template->new(
filename => $filename,
filter => [
{ sub => \&decode_str, format => 'scalar' },
],
);
$template->param(
one => "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}",
);

my $output = $template->output;


print Encode::encode('UTF-8', $output);



sub decode_str {
my $ref = shift;
${$ref} = Encode::decode('UTF-8', ${$ref});
}

This sort of lines up all the data to be UTF-8 encoded and aware and all that stuff that the unicodefaqthingy perldoc tells you to do.

But, oh, it gets better.

DON'T use that filter trick thing if you're using a scalarref, or a properly encoded file handle! You'll get a nice error, like this:

HTML::Template->new() : fatal error occured during filter call: Cannot decode string with wide characters at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
at /Library/Perl/5.10.0/HTML/Template.pm line 1697
HTML::Template::_init_template('HTML::Template=HASH(0x1008aafb8)') called at /Library/Perl/5.10.0/HTML/Template.pm line 1238
HTML::Template::_init('HTML::Template=HASH(0x1008aafb8)') called at /Library/Perl/5.10.0/HTML/Template.pm line 1124



Brilliant.


So I don't know what the best advice to give is. If you're passing the template as a scalarref, DON'T use that filter, unless you want to, perhaps, encode your template beforehand (which makes little sense?)
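One possible middle ground, sketched below, is a filter that only decodes when the text doesn't already look like characters. The name decode_str_guarded is made up, and the check leans on utf8::is_utf8, which reports Perl's internal flag rather than a true "has this been decoded?" answer: a character string containing only Latin-1 range characters may not carry the flag and would still get (wrongly) decoded. So treat this as a hedge against the wide-character error, not a complete fix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical variant of decode_str: leave text alone if Perl already
# flags it as a character string, otherwise decode it from UTF-8 bytes.
sub decode_str_guarded {
    my $ref = shift;
    return if utf8::is_utf8( ${$ref} );
    ${$ref} = Encode::decode( 'UTF-8', ${$ref} );
    return;
}

my $from_file   = Encode::encode( 'UTF-8', "\x{2122}" );   # bytes, as read from a file
my $from_scalar = "\x{2122}";                              # characters, as in a scalarref

decode_str_guarded( \$from_file );
decode_str_guarded( \$from_scalar );

print "same text: ", ( $from_file eq $from_scalar ? "yes" : "no" ), "\n";
```

It would be passed to HTML::Template the same way as the filter above, with format => 'scalar'.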

If it's a filename, use that filter trick perhaps (or edit the source code of HTML::Template).
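Another option, if you can live without HTML::Template's directory searching, is to read the file yourself through an encoding layer and hand over the already-decoded text. A minimal sketch, assuming a helper I'm calling slurp_utf8 (not part of any module) and a throwaway temp file standing in for the template:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 template file to work with.
my ( $out_fh, $filename ) = tempfile( SUFFIX => '.tmpl', UNLINK => 1 );
binmode $out_fh, ':encoding(UTF-8)';
print {$out_fh} "<!-- tmpl_var one -->\n\x{a1}\x{2122}\n";
close $out_fh or die "Can't close $filename: $!";

# Hypothetical helper: slurp a file as characters, not bytes.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Can't open $path: $!";
    local $/;    # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

my $template_text = slurp_utf8($filename);
print "decoded ok: ", ( $template_text =~ /\x{a1}\x{2122}/ ? "yes" : "no" ), "\n";
```

The decoded text can then go in as scalarref => \$template_text, with no filter needed, same as the scalarref example earlier.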

Monday, February 15, 2010

Perl, UTF-8 Email Messages, MIME::Entity and Quoted-Printable encoding

Some findings after much bashing of head:


#!/usr/bin/perl -w
use strict;

use MIME::Entity;
use Encode;

# My UTF-8 string -
# ¡™£¢∞§¶•ªº
# Basically using Mac OS X, just hold down the alt/option key and hit the 1 through 0 keys, in succession:
#
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Build the message, using MIME::Entity.
# MAKE SURE TO ALWAYS encode('UTF-8', 'string') BEFORE ADDING
# Always.

my $pt_entity = MIME::Entity->build(
Type => 'text/plain',
Data => Encode::encode('UTF-8', $content),
Encoding => 'quoted-printable',
);



# MAKE SURE TO ALWAYS decode('UTF-8', 'string') BEFORE WORKING WITH STRING
# Always.
my $new_content = $pt_entity->bodyhandle->as_string;
$new_content = Encode::decode('UTF-8', $new_content);

# For example, we're just going to reverse it:
$new_content = reverse($new_content);




my $io = $pt_entity->bodyhandle->open('w');

# YES. You will need to encode content before printing it to the bodyhandle.
# Always.

$new_content = Encode::encode('UTF-8', $new_content);
$io->print($new_content);
$io->close;
$pt_entity->sync_headers(
'Length' => 'COMPUTE',
'Nonstandard' => 'ERASE'
);


# And, that's it.



# Before using the content, decode
# Always.
my $result = $pt_entity->bodyhandle->as_string;
$result = Encode::decode('UTF-8', $result);


# Always encode, before printing.
# Always.
#
# prints, ºª•¶§∞¢£™¡
print Encode::encode('UTF-8', $result);


The trick is to always, always, always encode your data before creating any sort of entity using MIME::Entity, and to always, always, always decode the data you get using bodyhandle().

This workflow is strange, since you're told not to encode data until you're ready to print it. I suspect there's some weird IO::File stuff going on with MIME::Entity (and friends), or that it wants to think of itself as saving binary data, instead of characters, when creating MIME stuff. I don't know.

If you do not encode beforehand, MIME::Entity will barf when using the quoted-printable encoding, but will probably be just fine with "8bit" encoding.
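The barfing can be seen without MIME::Entity at all. Quoted-printable is defined on bytes, so the encoder refuses a string that holds characters above 255, and is happy once the string has been encoded to UTF-8. This sketch calls MIME::QuotedPrint directly; I'm assuming, but haven't confirmed, that this is the same thing tripping up MIME::Entity under the hood.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
use MIME::QuotedPrint qw(encode_qp);

my $content = "\x{a1}\x{2122}\x{a3}\n";

# Characters above 255 are not bytes, so the encoder dies on them.
my $ok_chars = eval { encode_qp($content); 1 };
print "characters: ", ( $ok_chars ? "encoded" : "refused" ), "\n";

# The same text as UTF-8 bytes goes through fine.
my $ok_bytes = eval { encode_qp( Encode::encode( 'UTF-8', $content ) ); 1 };
print "bytes: ", ( $ok_bytes ? "encoded" : "refused" ), "\n";
```

"8bit" does no such transformation, which would explain why it's more forgiving.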

This was a huge headache to figure out.

It may all seem to work out if you don't do that first encode, right up until you decode:


#!/usr/bin/perl -w
use strict;

use lib qw(/Users/justin/Documents/DadaMail/git/dada-mail/dada/DADA/perllib);

use MIME::Entity;
use Encode;

# My UTF-8 string -
# ¡™£¢∞§¶•ªº
# Basically using Mac OS X, just hold down the alt/option key and hit the 1 through 0 keys, in succession:
#
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Build the message, using MIME::Entity.
# MAKE SURE TO ALWAYS encode('UTF-8', 'string') BEFORE ADDING
# Always.

my $pt_entity = MIME::Entity->build(
Type => 'text/plain',
# Data => Encode::encode('UTF-8', $content),
Data => $content,
Encoding => 'quoted-printable',
);

my $s = $pt_entity->bodyhandle->as_string;

$s = Encode::decode('UTF-8', $s);

# Let's do a little string manip:
$s = reverse($s);

$s = Encode::encode('UTF-8', $s);

print $s;


Cannot decode string with wide characters at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.

exception_handler::die in Encode.pm at line 162
Encode::decode in test7.pl at line 28






So do that first encode, please. If you don't follow this formula, your prog may work, until that last encode:


 
#!/usr/bin/perl -w
use strict;

use lib qw(/Users/justin/Documents/DadaMail/git/dada-mail/dada/DADA/perllib);

use MIME::Entity;
use Encode;

# My UTF-8 string -
# ¡™£¢∞§¶•ªº
# Basically using Mac OS X, just hold down the alt/option key and hit the 1 through 0 keys, in succession:
#
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

# Build the message, using MIME::Entity.
# MAKE SURE TO ALWAYS encode('UTF-8', 'string') BEFORE ADDING
# Always.

my $pt_entity = MIME::Entity->build(
Type => 'text/plain',
# Data => Encode::encode('UTF-8', $content),
Data => $content,
Encoding => 'quoted-printable',
);

my $s = $pt_entity->bodyhandle->as_string;

# NAW, we don't need that
# $s = Encode::decode('UTF-8', $s);

# Let's do a little string manip:
$s = reverse($s);

# Well, that's silly! We don't need that one, either!
# $s = Encode::encode('UTF-8', $s);

print $s;



Wide character in print at /Users/justin/Desktop/test7.pl line 37.


And, you will do what I do, and bang your head, some more.

I couldn't find any info on how to handle things like MIME::Entity and UTF-8 encoding in the excellent articles available, such as this one:

http://ahinea.com/en/tech/perl-unicode-struggle.html

and,

http://perlgeek.de/en/article/encodings-and-unicode

and,

http://juerd.nl/site.plp/perluniadvice


I have this article labeled as, "do not trust"

http://kbinstuff.googlepages.com/perl,unicodeutf8,cgi.pm,apache,mod_perla

Because it states,

6.1. Encode::encode/decode

For start, you should avoid using Encode::encode/decode/from_to to the greatest possible extent in your scripts. This will only lead to great confusion later. You may think you have gotten everything to work, but then a week later, you shall only add a little more functionality to your work and suddenly, everything falls apart and doodles will appear on your web pages.



I guess I understand what they mean - but you'll need to encode your UTF-8 stuff before it exits your program. Always. And, you have to decode UTF-8 info that comes into your program. Always. How do you do this? Uh-huh, the Encode module.

Like it says in the perldoc for unicode. So, I don't know what this page is yabbering about. I'm sure, behind the scenes, Encode is used when you open files with a specific encoding:

http://perldoc.perl.org/perluniintro.html#Unicode-I/O

Which, as features go, is a pretty rad one.
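A quick way to convince yourself that the layer and the module do the same job: write a string through an :encoding(UTF-8) layer into an in-memory filehandle, encode the same string by hand, and compare the bytes. This is just a sketch with a made-up sample string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

my $content = "\x{a1}\x{2122}\x{a3}";

# Write through an encoding layer into an in-memory filehandle.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "Can't open buffer: $!";
print {$fh} $content;
close $fh or die "Can't close buffer: $!";

# Encode the same characters by hand.
my $by_hand = Encode::encode( 'UTF-8', $content );

print "same bytes: ", ( $buffer eq $by_hand ? "yes" : "no" ), "\n";
```

So the layer saves you from sprinkling encode calls around, but it's the same encode step either way.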