Upgrade charset conversion

Started by emanuele · August 09, 2014, 10:43:22 am

Upgrade charset conversion - I'm stuck

August 09, 2014, 10:43:22 am

Yesterday I tracked the conversion of the database to UTF-8 during the upgrade as a release blocker.

Yesterday I started playing with it.
It's doable; there is just one problem I'm not sure how to fix: determining the "source" charset.
In SMF-world, this is done by using the language files (which is quite unreliable if you ask me, but okay). Unfortunately, during an upgrade from SMF to ElkArte there is basically no way to be sure 1) that the language files are present and 2) where the language files could be.

Any idea how to solve this?
If it is solvable at all.
Should we just rely on the column collation? (Even though that's not reliable either, as far as I understood... :-\ )

Re: Upgrade charset conversion - I'm stuck

Reply #1 – August 09, 2014, 09:13:31 pm

I can't immediately think of any "fool proof" way to do this, but I've not really looked into the issues before.

Seems like you could only trust the column collation if it's utf8; for anything else you know you are converting. I'm trying to think of the downside of just using the existing column collation: if for some reason it's not correct for the language, I'd think there are (or could be, depending on the code point correlation) errors already present, and we are not trying to fix those.
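To make that rule concrete, here is a minimal sketch in Python (not the upgrade script's language, just for illustration). It assumes the usual MySQL naming scheme where a collation name starts with its character set name; `needs_conversion` is a hypothetical helper, not something from SMF or ElkArte.

```python
# Collation prefixes that already mean "stored as UTF-8" in MySQL.
UTF8_CHARSETS = ("utf8", "utf8mb3", "utf8mb4")


def needs_conversion(collation):
    """Return True when the collation points at a non-UTF-8 charset.

    Hypothetical helper: it only looks at the declared collation and
    cannot tell whether the stored bytes really match that declaration.
    """
    if collation is None:  # non-text columns report no collation
        return False
    # MySQL collation names begin with the charset name, e.g. latin1_swedish_ci.
    charset = collation.split("_", 1)[0].lower()
    return charset not in UTF8_CHARSETS
```

With this, `needs_conversion("latin1_swedish_ci")` says "convert", while any `utf8*` collation is left alone.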

What did SMF do with the language files, I mean in terms of the conversion?

Re: Upgrade charset conversion - I'm stuck

Reply #2 – August 09, 2014, 09:48:52 pm

I read the OP incorrectly with my initial response.
You can query the collation via MySQL; SHOW TABLE STATUS should work fine for that.
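As a rough sketch of what could be done with that output: SHOW TABLE STATUS returns one row per table with, among others, a `Name` and a `Collation` column. The Python below assumes the rows were already fetched as dictionaries (for example through a dict-style cursor); the function name and the sample table names are made up for illustration.

```python
from collections import defaultdict


def group_tables_by_collation(status_rows):
    """Group table names by the Collation value from SHOW TABLE STATUS rows."""
    groups = defaultdict(list)
    for row in status_rows:
        collation = row.get("Collation")
        if collation:  # skip rows that report no collation
            groups[collation].append(row["Name"])
    return dict(groups)


# Hypothetical rows, shaped like a dict cursor would return them.
sample_rows = [
    {"Name": "smf_messages", "Collation": "latin1_swedish_ci"},
    {"Name": "smf_members", "Collation": "latin1_swedish_ci"},
    {"Name": "smf_settings", "Collation": "utf8_general_ci"},
]
print(group_tables_by_collation(sample_rows))
```

More than one key in the result would mean the database has mixed collations, which is worth knowing before converting anything.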

This little project may interest you: https://github.com/neitanod/forceutf8
The developer claims his routine(s) will convert any mixed collation chars to utf-8.

Re: Upgrade charset conversion - I'm stuck

Reply #3 – August 10, 2014, 12:21:01 pm

Collation and charset are two different things, and in theory (I think) the two can be different for the same column.
I read around a bit, and by querying the information_schema it's possible to grab the charset of the column as well.
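For reference, a sketch of what such a query could look like, wrapped in Python. `information_schema.COLUMNS` exposes both `CHARACTER_SET_NAME` and `COLLATION_NAME` per column. The `%s` placeholder assumes a driver using that parameter style (the common Python MySQL drivers do), and `charsets_in_use` is a hypothetical helper, not part of any upgrade code.

```python
# Per-column charset and collation for one schema; non-text columns
# have a NULL charset and are filtered out.
COLUMN_CHARSET_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s
  AND CHARACTER_SET_NAME IS NOT NULL
ORDER BY TABLE_NAME, ORDINAL_POSITION
"""


def charsets_in_use(rows):
    """Count text columns per character set.

    `rows` are (table, column, charset, collation) tuples, i.e. the
    result of running COLUMN_CHARSET_QUERY.
    """
    counts = {}
    for _table, _column, charset, _collation in rows:
        counts[charset] = counts.get(charset, 0) + 1
    return counts
```

Usage would be something like `cursor.execute(COLUMN_CHARSET_QUERY, (db_name,))` followed by `charsets_in_use(cursor.fetchall())`, where `cursor` and `db_name` come from whatever connection is already open.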

SMF simply trusts the $txt['lang_character_set'] in index.{default_language}.php. Whatever it is, it is supposed to be the "correct" one, but then the user is given the option to choose another one...

I guess I can do the same, just grabbing the collation instead of the $txt, filling a <select> with the possible (source) charsets and hoping for the best... right?
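Roughly, that idea could be sketched like this (Python, for illustration only). The candidate list is a short, arbitrary sample of MySQL charset names, and the field name `src_charset` and the function `charset_select` are invented here; the guess is pre-selected but the user can still override it.

```python
from html import escape

# Illustrative subset of MySQL charset names offered as possible sources.
CANDIDATE_CHARSETS = ["latin1", "latin2", "cp1251", "cp1256", "greek", "utf8"]


def charset_select(detected_collation, candidates=CANDIDATE_CHARSETS):
    """Build a <select> of source charsets, pre-selecting the detected one."""
    # The charset name is the collation's prefix, e.g. latin1_swedish_ci -> latin1.
    guess = detected_collation.split("_", 1)[0]
    options = []
    for name in candidates:
        selected = ' selected="selected"' if name == guess else ""
        options.append(
            '<option value="{0}"{1}>{0}</option>'.format(escape(name), selected)
        )
    return '<select name="src_charset">\n' + "\n".join(options) + "\n</select>"
```

If the detected collation is not in the list, nothing is pre-selected and the browser simply shows the first entry.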

Re: Upgrade charset conversion - I'm stuck

Reply #4 – August 10, 2014, 03:34:48 pm

Since the owner of said SMF package can manipulate the charset setting, then yes, I would say that gathering the data from a MySQL query would be best.

I worked with something like that [here], but now that you point out that collation and charset may not match up, I suppose I did that incorrectly.
I see the proper way of doing it is noted [here].

Code: [Select]

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

Regards.

Re: Upgrade charset conversion - I'm stuck

Reply #5 – August 11, 2014, 11:51:57 am

Okay, FWIW, I just decided to rely on the collation and give users a dropdown to pick the charset from.
I'm testing it right now and it seems to work properly, at least with the dataset I'm working on.
I forced the backup (better safe than sorry, sorry), and then ran the conversion on the database. Tables without any text field are not converted (as expected, I would think); all the others are converted to UTF8 and I can't see anything wrong in the result.
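For anyone wanting to reason about that behaviour without reading the commit, here is a minimal sketch of the selection step in Python. It is not the code from the commit: it only assumes that column types are available (for example from `information_schema.COLUMNS.DATA_TYPE`) and uses MySQL's `ALTER TABLE ... CONVERT TO CHARACTER SET`. The helper name and the table names in the usage note are hypothetical.

```python
# MySQL column types that carry a character set.
TEXT_TYPES = {
    "char", "varchar", "tinytext", "text", "mediumtext", "longtext",
    "enum", "set",
}


def conversion_statements(columns, target="utf8"):
    """Build one ALTER TABLE per table that has at least one text column.

    `columns` is an iterable of (table, column, data_type) tuples.
    Tables with no text column produce no statement at all.
    """
    tables = []
    for table, _column, data_type in columns:
        if data_type.lower() in TEXT_TYPES and table not in tables:
            tables.append(table)
    return [
        "ALTER TABLE `{}` CONVERT TO CHARACTER SET {}".format(table, target)
        for table in tables
    ]
```

Given columns for a hypothetical `smf_messages` (with a `text` body) and a purely numeric `smf_log_online`, only `smf_messages` gets a statement. Note that a plain CONVERT TO assumes the stored bytes really are in the declared charset; when they are not, the MySQL manual describes going through a binary column type first.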

I'm pushing the changes, if anyone has time, feel free to test it! It needs testing indeed.

Here is the commit: https://github.com/emanuele45/Dialogo/commit/3649c2cb1a9d14a86f68411edc6c12d5a58ffc6e