Monday, April 16, 2012

Solutions to invalid byte sequence in UTF-8 (ArgumentError)

Let's keep this short and simple. You're reading this because you too are tired of all the new character set issues in Ruby 1.9. This page lists some possible solutions and links to other pages with helpful info on solving "invalid byte sequence in UTF-8 (ArgumentError)".  

The Cause
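This problem occurs when your application is expecting one character encoding and gets a different one. Ruby 1.9 tags every string with an encoding, and the error is raised when a string tagged as UTF-8 contains bytes that are not valid UTF-8.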
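To make the cause concrete, here is a minimal sketch that tags ISO-8859-1 bytes as UTF-8 without converting them. The sample bytes and variable names are made up for illustration.

```ruby
# Bytes for "café" in ISO-8859-1: the final 0xE9 is not valid UTF-8 on its own.
latin1_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")

# Tag the bytes as UTF-8 without converting them, which is what effectively
# happens when data in another encoding is read by code that assumes UTF-8.
mislabeled = latin1_bytes.force_encoding("UTF-8")

puts mislabeled.encoding          # UTF-8
puts mislabeled.valid_encoding?   # false

begin
  mislabeled =~ /caf/             # regexp matching needs valid bytes
rescue ArgumentError => e
  puts "#{e.class}: #{e.message}"
end
```

Calling valid_encoding? on suspect input is a cheap way to find out where the bad bytes enter your application.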


Possible Solutions:

1) Explicitly specify the encodings that you expect in the top-level Ruby Encoding class. Put this in a file that will be loaded (e.g., boot.rb for Rails):
Encoding.default_external = 'UTF-8'
Encoding.default_internal = 'UTF-8'
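As a rough illustration of what these defaults change, the sketch below writes a small temporary file and checks how Ruby tags the contents when reading it back. The helper name encoding_of_file_contents is hypothetical, and Tempfile.create assumes Ruby 2.1 or later. Note that the defaults only control how incoming data is tagged; they do not repair bytes that are already invalid.

```ruby
require "tempfile"

Encoding.default_external = "UTF-8"
Encoding.default_internal = "UTF-8"

# Hypothetical helper: write a small file and report how Ruby tags its contents.
def encoding_of_file_contents
  Tempfile.create("encoding_demo") do |file|
    file.write("plain text")
    file.flush
    # File.read without an explicit encoding falls back to the defaults above.
    File.read(file.path).encoding
  end
end

puts encoding_of_file_contents   # UTF-8
```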

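2) Convert the character encoding at a lower level within your application, at the point where the data comes in. encode converts from whatever encoding the string is currently tagged with: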
ascii_str.encode("UTF-8")
More info: http://blog.grayproductions.net/articles/ruby_19s_string
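A minimal sketch of such a conversion, assuming the incoming bytes are ISO-8859-1 (the sample data is made up). The string has to carry its real encoding tag before encode can transcode it correctly:

```ruby
# "café" as ISO-8859-1 bytes, tagged with the encoding they really use.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding("ISO-8859-1")

# encode transcodes from the tagged encoding to the target encoding.
utf8 = latin1.encode("UTF-8")

puts utf8.encoding          # UTF-8
puts utf8.valid_encoding?   # true
puts utf8.bytes.length      # 5 (the é now takes two bytes)
```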


3) Finally, if the text you are working with mixes several character encodings (this occasionally happens when data is aggregated improperly), you may need to decide which character set is the most common and convert from that to the new character set (e.g., UTF-8), ignoring any bytes that cannot be converted.

You can do this using iconv from within Ruby or from the command line.

within Ruby:
require 'iconv'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)
(source: http://smyck.net/2011/05/13/files-with-mixed-and-invalid-encodings-in-ruby/)
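Iconv was deprecated in Ruby 1.9.3 and dropped from the standard library in Ruby 2.0 (it lives on as the separate iconv gem). On Ruby 2.1 and later, String#scrub can stand in for the UTF-8 to UTF-8 cleanup shown above. This is a sketch with made-up sample input, not a drop-in replacement for every Iconv use:

```ruby
# Sample input: "abc" and "def" with one stray byte (0xFF) in between.
untrusted_string = [0x61, 0x62, 0x63, 0xFF, 0x64, 0x65, 0x66]
                   .pack("C*").force_encoding("UTF-8")

# scrub("") removes invalid byte sequences; with no argument it would
# substitute the Unicode replacement character instead.
valid_string = untrusted_string.scrub("")

puts valid_string                   # abcdef
puts valid_string.valid_encoding?   # true
```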

command line:
iconv -f WINDOWS-1252 -t "UTF-8//IGNORE" some_text.txt > some_text.utf8.txt
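The same conversion can be sketched in plain Ruby with String#encode. The helper name to_utf8 is hypothetical, and Windows-1252 is only an assumed source encoding here, so substitute whichever character set is most common in your data:

```ruby
# Hypothetical helper: treat the raw bytes as the assumed encoding and
# transcode to UTF-8, dropping anything that cannot be converted.
def to_utf8(raw, assumed_encoding = "Windows-1252")
  raw.dup.force_encoding(assumed_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

# "hi" wrapped in Windows-1252 curly quotes (bytes 0x93 and 0x94).
raw = [0x93, 0x68, 0x69, 0x94].pack("C*")

puts to_utf8(raw)
puts to_utf8(raw).valid_encoding?   # true
```

The invalid:, undef:, and replace: options play the role of //IGNORE: bytes without a UTF-8 equivalent are replaced with an empty string rather than raising an error.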

4) Catch the error and handle it more gracefully:
begin
  # your code
rescue ArgumentError => e
  print "error: #{e}"
end
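As one possible shape for this, the sketch below wraps a regexp match in a hypothetical helper, safe_match?, that reports the error and returns nil instead of crashing. The helper name and sample strings are assumptions for illustration:

```ruby
# Hypothetical helper: returns true or false for a match, or nil when the
# text cannot be matched because its bytes are invalid for its encoding.
def safe_match?(text, pattern)
  !(text =~ pattern).nil?
rescue ArgumentError => e
  warn "error: #{e.message}"
  nil
end

good = "hello world"
bad  = [0x68, 0x69, 0xFF].pack("C*").force_encoding("UTF-8")

p safe_match?(good, /world/)   # true
p safe_match?(bad, /hi/)       # nil, with the error written to stderr
```

Rescuing ArgumentError is broad, since other bugs raise it too. Keep the rescued block as small as possible so unrelated errors are not swallowed.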