Ruby: reading a web page encoded as `GB2312`, how do I check whether the content contains a keyword?

I'm reading a web page with Ruby, and its content is:

<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GB2312" />
</HEAD>
<BODY>
中文
</BODY>
</HTML>

From the meta tag, we can see that it uses the GB2312 encoding.

My code is:

require 'net/http'
require 'uri'

res = Net::HTTP.post_form(URI.parse("http://xxx/check"),
                          {:query => 'xxx'})

Then I use:

res.body.include?("中文")

to check whether the content contains that word. But it returns false.

I don't know why it is false. What should I do? Which encoding does Ruby 1.8.7 use? If I need to convert the encoding, how do I do it?

Answers


Ruby 1.8 strings have no encoding; they are plain byte strings. If you want the byte string in your program to match the byte string in the web page, you'd have to save the .rb file in the same encoding the web page uses (GB2312) so that Ruby sees the same bytes.

It is probably better to write the byte string explicitly, which avoids any dependence on the encoding of the .rb file:

res.body.include?("\xD6\xD0\xCE\xC4")

However, matching byte strings doesn't match characters reliably when multibyte encodings are in use (except for UTF-8, which is deliberately designed to allow it). If the web page had the string:

兄形男

in it, that would be encoded as "\xD0\xD6\xD0\xCE\xC4\xD0", which contains the byte sequence "\xD6\xD0\xCE\xC4" starting at the second byte, so include? would return true even though the characters 中文 are not present.
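The overlap described above can be reproduced directly. This is only a small sketch, and it assumes Ruby 2.0 or later (for `String#b`, which tags a string as raw bytes) rather than the 1.8.7 from the question; the variable names are made up for illustration.

```ruby
# GB2312 bytes for the three characters 兄形男 (two bytes per character).
page = "\xD0\xD6\xD0\xCE\xC4\xD0".b

# GB2312 bytes for the keyword 中文.
keyword = "\xD6\xD0\xCE\xC4".b

# A plain byte search finds the keyword starting at byte offset 1,
# which is in the middle of the first character, not on a character boundary.
puts page.include?(keyword)  # true
puts page.index(keyword)     # 1
```

An odd byte offset is the giveaway here: every character in this sample is two bytes wide, so a genuine match could only start at an even offset.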

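If you need to handle non-ASCII characters fully reliably, you need string handling that is aware of encodings, which Ruby 1.8 does not have. Converting the page to UTF-8 before searching (for example with Iconv from the standard library) avoids the false match above, since byte matching is safe in UTF-8.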
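As a rough illustration of character-based matching, here is a sketch that assumes Ruby 1.9 or later, where strings carry an encoding and `String#encode` can transcode GB2312 to UTF-8. The helper name `contains_keyword?` and its parameters are hypothetical, not part of any library. On 1.8.7 itself, `Iconv.conv('UTF-8', 'GB2312', body)` would be the closest equivalent for the conversion step.

```ruby
# Hypothetical helper: decode the raw page bytes from the given charset
# into UTF-8, then search by character instead of by byte.
def contains_keyword?(raw_bytes, keyword, charset = "GB2312")
  # Bytes that are invalid or unmapped in the charset are replaced
  # rather than raising, so a slightly malformed page can still be searched.
  text = raw_bytes.encode("UTF-8", charset, :invalid => :replace, :undef => :replace)
  text.include?(keyword.encode("UTF-8"))
end

keyword = "\u4E2D\u6587"  # the characters 中文, written with escapes

puts contains_keyword?("\xD6\xD0\xCE\xC4".b, keyword)          # page is 中文: true
puts contains_keyword?("\xD0\xD6\xD0\xCE\xC4\xD0".b, keyword)  # page is 兄形男: false
```

With the response from the question, the call would look like `contains_keyword?(res.body, "中文")`, provided the .rb file is saved as UTF-8. Taking the charset from the page's meta tag or Content-Type header instead of hard-coding it is left out of this sketch.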

