• Stars
    star
    110
  • Rank 316,770 (Top 7 %)
  • Language
    PHP
  • License
    MIT License
  • Created almost 13 years ago
  • Updated about 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Helps convert incorrect charset latin1 columns to utf8

Copyright (c) 2013 Nic Jansma http://nicj.net

See nicj.net for a description of the problem and how this script aids in correcting the issue.

Introduction

This script automates the conversion of any UTF-8 data stored in MySQL latin1 columns to proper UTF-8 columns.

I've modified fabio's script to automate the conversion for all of the latin1 columns for whatever database you configure it to look at. It converts the columns first to the proper BINARY cousin, then to utf8_general_ci, while retaining the column lengths, defaults and NULL attributes.

Warning: This script assumes you know you have UTF-8 characters in a latin1 column. Please test your changes before blindly running the script!

Here are the steps you should take to use the script:

Determine Which Columns Need Updating

If you're like me, you may have a mixture of latin1 and UTF-8 columns in your databases. Not all of the columns in my database needed to be updated from latin1 to UTF-8. For example, some of the tables belonged to other PHP apps on the server, and I only wanted to update the columns that I knew had to be fixed. The script will currently convert all of the tables for the specified database - you could modify the script to change specific tables or columns if you need.

Additionally, the script will only update appropriate text-based columns. Character sets are only appropriate for some types of data: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT. Other column types such as numeric (INT) and BLOBs do not have a "character set".

ENUM and SET column types can be converted only if all of the enum possibilities only use characters in the 0-127 ASCII character set. If you have ENUMs or SETs that satisfy this criteria, look for the relevant TODO: in the script.

You can see what character sets your columns are using via the MySQL Administration tool, phpMyAdmin, or even using a SQL query against the information_schema:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable" AND COLLATION_NAME LIKE "latin1%";
...
115 rows in set (0.03 sec)

Test Convert the Columns

You should test all of the changes before committing them to your database.

The first thing to test is that the SQL generated from the conversion script is correct. To do this, you can dump the structure of your database:

server> mysqldump --no-data -h localhost -u dbuser -p mydatabase > structure.sql

And import this structure to another test MySQL database:

server> mysql -u dbuser -p mydatabase_test < structure.sql

Next, run the conversion script (below) against your temporary database:

server> php -f mysql-convert-latin1-to-utf8.php

The script will spit out "!!! ERROR" statements if a change fails. If you encounter ERRORs, modifications may be needed based on your requirements. Some of the common problems are listed in Step 3.

After you run the script against your temporary database, check the information_schema tables to ensure the conversion was successful:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable";

As long as you see all of your columns in UTF8, you should be all set!

Problems You May Encounter

Some of the issues you may encounter:

FULLTEXT indexes

I have several columns with FULLTEXT indexes on them. The ALTER TABLE to BINARY command for a column that has a FULLTEXT index will cause an error:

mysql> ALTER TABLE MyTable MODIFY MyColumn BLOB;
ERROR 1283 (HY000): Column 'MyColumn' cannot be part of FULLTEXT index

The simple solution I came up with was to modify the script to drop the index prior to the conversion, and restore it afterward:

ALTER TABLE MyTable DROP INDEX `mycolumn_fulltext`

... (convert all columns) ...

ALTER TABLE MyTable ADD FULLTEXT KEY `mycolumn_fulltext` (`MyColumn`)

There are TODOs listed in the script where you should make these changes.

Invalid UTF-8 data

Since my database was over 5 years old, it had acquired some cruft over time. I'm not sure exactly how this happened, but some of the columns had data that are not valid UTF-8 encodings, though they were valid latin1 characters. I believe this occurred before I hardened my PHP application to reject non-UTF-8 data, but I'm not sure. I found this out when initially trying to do the conversion:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(3000) NOT NULL DEFAULT '';
Query OK, 21171 rows affected (0.66 sec)

mysql> ALTER TABLE MyTable MODIFY MyColumn varchar(3000) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '';
ERROR 1366 (HY000): Incorrect string value: '\xE2\x80? fl...' for column 'MyColumn' at row 128

What's going on?

At some point, a character sequence that contained invalid UTF-8 characters was entered into the database, and now MySQL refuses to call the column VARCHAR (as UTF-8) because it has these invalid character sequences.

I checked the HTML representation of this column in my PHP website, and sure enough, the garbage shows up there too:

?? flown

The ? is the actual character that your browser shows. Not the best user experience, and definitely not the correct character.

I fixed that single row (via phpMyAdmin), and ran the ALTER TABLE MODIFY command again, and hit the same issue, another row. Looks like there is more than a single corrupt row.

I found a good way of rooting out all of the columns that will cause the conversion to fail. If you SELECT CONVERT (MyColumn USING utf8) as a new column, any NULL columns returned are columns that would cause the ALTER TABLE to fail.

For example:

mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
       FROM MyTable
       WHERE CONVERT(MyColumn USING utf8) IS NULL
...
5 rows in set, 10 warnings (0.05 sec)

This showed me the specific rows that contained invalid UTF-8, so I hand-edited to fix them. You could manually NULL them out using an UPDATE if you're not afraid of losing data. I had to do this for 6 columns out of the 115 columns that were converted. Only 30 rows in total were corrupt.

You may also want to use utf8mb4 instead of utf8 as your collation. This means you would set $defaultCollation='utf8mb4_unicode_ci';.

Usage

First, read over the script and make sure you understand what it does. If you don't understand what it's doing, you probably shouldn't run it.

Next, check all of the TODO:s in the script. You will need to make some changes to get it to work.

At this point, it may take some guts for you to hit the go button on your live database.

php -f mysql-convert-latin1-to-utf8.php

Personally, I ran the script against a test (empty) database, then a copy of my live data, then a staging server before finally executing it on the live data.

Warning: Please be careful when using the script and test, test, test before committing to it!

Version History

  • v1.0 - 2011-04-17: Initial release
  • v1.1 - 2013-01-25: Added possible ENUM support via patrick-mcdougle
  • v1.2 - 2013-03-26: Added SET support and the ability to convert from multiple collations, as well as bulk-doing conversion in one statement for quicker changes via Synchro
  • v1.3 - 2017-05-06: Allows for config.php separate from script via bderubinat

Credits

Initially based on fabio's script.

Modified by Nic Jansma

Contributions by:

More Repositories

1

usertiming.js

UserTiming polyfill
JavaScript
181
star
2

node-handlebars-precompiler

Share your Handlebars templates between Node and the browser with minimal effort.
JavaScript
69
star
3

rename-regex

Windows command-line regex file renamer
C#
65
star
4

resourcetiming-compression.js

ResourceTiming compression and decompression
JavaScript
59
star
5

usertiming-compression.js

UserTiming Compression
JavaScript
37
star
6

saltthepass.js

SaltThePass.com algorithm to generated salted passwords
JavaScript
30
star
7

dht-logger

Particle (Spark) Photon/Electron DHT Temperature and Humidity Logger
C++
29
star
8

synology-windows-ssh-rsync-backup

Synology Windows computer backup via rsync over SSH
Shell
29
star
9

deploy-aws-docker-openvpn

Deploys OpenVPN in a container to AWS
HCL
23
star
10

plex-poster-optimizer

Ensures Plex Posters are high-quality, e.g. above a minimum height/width threshold.
JavaScript
15
star
11

cdn-detector.js

Detects CDN usage from HTTP hostname and response headers (NodeJS/browser)
JavaScript
15
star
12

spark-core-water-sensor

Spark Core Water (+Temp + Humidity) Sensor with SMS Alerts Using IFTTT and SmartThings integration
C++
12
star
13

amazon-cloud-backup

DIY Cloud Backup using Amazon EC2 and EBS
Shell
10
star
14

smart-things

SmartThings Devices and Apps
Groovy
10
star
15

checksum-verifier

Windows command-line tool that validates the integrity of your files.
C#
9
star
16

todoist-backup

Todoist.com backup
C#
8
star
17

puppetarazzi

Takes screenshots of your website and annoys you by pointing out your flaws (mixed-content, uncompressed content, 404s, etc)
JavaScript
8
star
18

talks

Talks and presentations
JavaScript
7
star
19

breakup.js

Yielding enumeration replacement functions for async.forEachSeries() and jQuery.each()
JavaScript
6
star
20

tisktasks

TiskTasks for Todoist Android app
Java
5
star
21

adblock-detector.js

Detects ad-blockers
JavaScript
5
star
22

mediawiki-Extension-SpecialUserScore

MediaWiki SpecialUserScore extension
PHP
5
star
23

resourcetiming-visibility

Crawls the Top N Alexa sites, comparing how many resources are visible from ResourceTiming vs. what the browser downloads
JavaScript
5
star
24

mymovies2plex

Converts My Movies Watched/Unwatched for Movies and TV Shows to Plex
JavaScript
4
star
25

economist-to-kindle

Economist to Kindle
PHP
4
star
26

mediawiki-ConfirmEdit-AreYouAHuman

NO LONGER RECOMMENDED: AreYouAHuman ConfirmEdit extension for MediaWiki
PHP
4
star
27

mediawiki-decompress

Decompresses a MediaWiki text table that previously had $wgCompressRevisions set to true
PHP
3
star
28

cpuprofile-filter

.cpuprofile filter
JavaScript
3
star
29

reliably-measuring-responsiveness-in-the-wild

JavaScript
2
star
30

migrate-b2evolution-to-wordpress

Migrates a b2evolution blog to WordPress
PHP
2
star
31

com.nicjansma.library

Nic's Android / Java utilities library
Java
1
star
32

perfbar.js

Floating performance toolbar, powered by Boomerang
JavaScript
1
star
33

tampermonkey

Tampermonkey Scripts
JavaScript
1
star
34

third-party-lighthouse-audit

Runs a Lighthouse Audit with third-parties added or removed
JavaScript
1
star