Hpricot and utf-8

I tried to use Hpricot to parse a page with special characters in a utf-8 encoding. The docs tell you to do this:

require 'rubygems'
require 'open-uri'
require 'hpricot'
 
doc = Hpricot(open("http://url/"))

However, this won’t give you the output you want. The open method from open-uri returns the page body in whatever character set the page was served in, without transcoding it. If you want utf-8, you need to convert it yourself with the iconv library, using the charset that open-uri reports for the response:

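require 'rubygems'
require 'iconv'
require 'open-uri'
require 'hpricot'

f = open("http://url")
# Make sure we read from the start of the response body.
f.rewind
# f.charset is the charset named in the response's Content-Type header.
# f.read returns the body as-is; readlines.join("\n") would double every newline.
doc = Hpricot(Iconv.conv('utf-8', f.charset, f.read))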
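
Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer Rubies the same conversion can be sketched with String#encode instead. This is only a minimal sketch, not the method used above: to_utf8 is a hypothetical helper name, and the Latin-1 sample bytes stand in for a page body so that the example runs without a network connection.

```ruby
# Hypothetical helper (not part of Hpricot or open-uri): transcode raw
# bytes from a known source charset to UTF-8 using String#encode.
def to_utf8(bytes, charset)
  # dup so the caller's string is not modified by force_encoding
  bytes.dup.force_encoding(charset).encode('UTF-8')
end

# Sample input: "café" as ISO-8859-1 bytes (0xE9 is "é" in Latin-1).
latin1 = "caf\xE9".b
utf8 = to_utf8(latin1, 'ISO-8859-1')

puts utf8.encoding
puts utf8.valid_encoding?
```

With open-uri, the charset argument would presumably come from f.charset, as in Hpricot(to_utf8(f.read, f.charset)).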

8 Responses to “Hpricot and utf-8”

  1. Peter Abrahamsen 10. Apr, 2008 at 8:29 pm #

    Thanks for this!

    N.B. you’re missing a close parenthesis on the end of the last line.

  2. Justus 25. Jul, 2008 at 1:14 pm #

    Fantastic! Thanks so much. This solved my problem – on which I have been researching the whole day – within 5 seconds!
    Thanks!

  3. Abel 06. Aug, 2008 at 9:57 am #

    Thanks!
    Now I wonder why open-uri doesn’t have a straightforward way of doing this.

  4. Dirceu Jr. 26. Sep, 2008 at 9:12 pm #

    Thanks so much! ;D

  5. Albert 16. Jul, 2009 at 10:49 pm #

    You are ma saviour. Thanks a lot.

  6. Albert 17. Jul, 2009 at 3:49 am #

    You are ma saviour. Thanks a lot.

  7. Claus 25. Jul, 2010 at 9:29 pm #

    Just wondering, what does the .rewind method do? Can’t really find it in the open-uri doc.

    Thanks for posting this
