            xiaoguozi's Blog
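            Pay it forward - I'm not proud of myself; everything I tried has failed... People who are used to the life they have do not change easily. Even when things are bad, they find it hard to change, and along the way they give up... Once they give up, everyone loses... Paying it forward is hard and unpredictable. People need to watch others more carefully and stay alert in order to protect them, because others may not know what they need...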

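            When searching for Chinese characters in PHP, there are two cases, depending on the encoding of the text being searched.

            1. The Chinese text is GBK (GB2312)

            There are two ways to handle it.

            Method 1:

            Save the PHP file in ANSI encoding (that is, GBK on a Simplified Chinese Windows system), then search with strpos, for example:

            strpos($curl_res, '哈哈')

            Method 2:

            Save the PHP file as UTF-8 without BOM, convert the string to UTF-8, then search. Passing the encoding to mb_strpos explicitly keeps the result independent of mb_internal_encoding(). For example:

            $curl_res = mb_convert_encoding($curl_res, 'utf-8', 'gbk');

            mb_strpos($curl_res, '哈哈', 0, 'UTF-8');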

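            2. The Chinese text is UTF-8

            There are two ways to handle it.

            Method 1:

            Save the PHP file as UTF-8 without BOM, then search with strpos, for example:

            strpos($curl_res, '哈哈')

            Method 2:

            Save the PHP file in ANSI (GBK) encoding, convert the string to GBK, then search, for example:

            $curl_res = mb_convert_encoding($curl_res, 'gbk', 'utf-8');

            mb_strpos($curl_res, '哈哈', 0, 'GBK');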

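            A pattern emerges: a Chinese string literal passed to these functions is encoded in whatever encoding the PHP file was saved in, so the string being searched must be in that same encoding. Keep this in mind whenever you use these functions!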
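The same rule can be illustrated outside PHP. Below is a minimal sketch in Java (the language used later in this post), not a port of the PHP functions. The class name `EncodingMatch` and the helper `indexOf` are made up for illustration, and the sketch assumes the runtime provides the GBK charset, as full JDK distributions normally do. It searches raw bytes, much like the byte-oriented `strpos`, to show that the needle is only found when it is encoded the same way as the haystack.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMatch {
    // Naive byte-level search, comparable to a byte-oriented strpos.
    // Returns the byte offset of needle in haystack, or -1 if absent.
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length && match; j++) {
                match = haystack[i + j] == needle[j];
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        // Unicode escapes keep this source file independent of its own encoding.
        String word = "\u54c8\u54c8";         // the two characters searched for above
        String text = "abc" + word;

        byte[] gbkPage = text.getBytes(gbk);  // stands in for a GBK response body

        // Needle encoded like the haystack: found at byte offset 3.
        System.out.println(indexOf(gbkPage, word.getBytes(gbk)));
        // Needle in UTF-8, haystack in GBK: the byte sequences differ, so -1.
        System.out.println(indexOf(gbkPage, word.getBytes(StandardCharsets.UTF_8)));
        // Convert the haystack to UTF-8 first (the mb_convert_encoding step), then search.
        byte[] utf8Page = new String(gbkPage, gbk).getBytes(StandardCharsets.UTF_8);
        System.out.println(indexOf(utf8Page, word.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In PHP the needle's bytes come from the saved source file rather than from an explicit `getBytes` call, which is why the file encoding matters there.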


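                 The HTML file I generated was identified by EmEditor as UTF-8 with Signature, while the HTML file that worked was identified as UTF-8 without Signature.
                To convert between these two UTF-8 formats, I looked online. In the Save As dialog of Notepad, EmEditor, and other text editors, choosing UTF-8 as the encoding enables the option "Add a Unicode Signature (BOM)"; with it checked, the file is saved as UTF-8 with Signature. The question, though, was how to make my Java code write the file directly as UTF-8 with Signature.
                I started searching Google for keywords such as UTF-8 with Signature, BOM, and Add a Unicode Signature.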
            http://www.unicode.org/unicode/faq/utf_bom.html#BOM
            From this I got a rough idea of the difference between the two.
            Q: What is a BOM?

            A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
            http://mindprod.com/jgloss/bom.html
            BOM
            Byte Order Marks are special characters at the beginning of a Unicode file to indicate whether it is big or little endian, in other words whether the high or low order byte comes first. These codes also tell whether the encoding is 8, 16 or 32 bit. You can recognise Unicode files by their starting byte order marks, and by the fact that, for mostly ASCII text, UTF-16 files are about half zero bytes and UTF-32 files about three-quarters zero bytes.

            Unicode Endian Markers
            Byte-order mark    Description
            EF BB BF           UTF-8
            FF FE              UTF-16 aka UCS-2, little endian
            FE FF              UTF-16 aka UCS-2, big endian
            FF FE 00 00        UTF-32 aka UCS-4, little endian
            00 00 FE FF        UTF-32 aka UCS-4, big endian
            There are also variants of these encodings that have an implied endian marker.
            Unfortunately, often applications, even Javac.exe, choke on these byte order marks. Java Readers don't automatically filter them out. There is not much you can do but manually remove them.
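Since Java readers leave the mark in place, one way to deal with it is to check for it by hand. The sketch below is only an illustration: the class `BomSniffer` and its methods are hypothetical names, not part of any library. It recognizes the byte-order marks by their leading bytes and drops a leading U+FEFF from text that has already been decoded. It assumes the standard UTF-32 little-endian mark FF FE 00 00, and it cannot tell that mark apart from a UTF-16 little-endian file whose first character is U+0000.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns the encoding named by a leading BOM, or null if there is none.
    // UTF-32 is checked before UTF-16 because the UTF-32 LE mark
    // (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
    static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Drops a leading U+FEFF from text that has already been decoded.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom));          // UTF-8
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());         // 3: the BOM survives decoding
        System.out.println(stripBom(decoded));        // hi
    }
}
```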


            http://cache.baidu.com/c?word=java%2Cbom&url=http%3A//tgdem530%2Eblogchina%2Ecom/&b=0&a=1&user=baidu
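            c. UTF byte order and the BOM
            UTF-8 uses a single byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must first know the byte order of each code unit. For example, the Unicode code point of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?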
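The ambiguity is easy to reproduce. In this small Java sketch (the class and method names are illustrative only), the same two bytes are decoded once as big-endian and once as little-endian UTF-16, and the resulting code unit is printed in hex.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Decodes the bytes with the given charset and returns the first
    // UTF-16 code unit of the result as a hex string.
    static String firstUnitHex(byte[] bytes, Charset charset) {
        String text = new String(bytes, charset);
        return Integer.toHexString(text.charAt(0));
    }

    public static void main(String[] args) {
        byte[] stream = {0x59, 0x4E};
        System.out.println(firstUnitHex(stream, StandardCharsets.UTF_16BE)); // 594e
        System.out.println(firstUnitHex(stream, StandardCharsets.UTF_16LE)); // 4e59
    }
}
```

Without some indication of the byte order, the receiver has no way to choose between the two readings.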

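            The method the Unicode standard recommends for marking byte order is the BOM. Here BOM does not mean "Bill Of Material" but Byte Order Mark. The BOM is a rather clever idea:

            UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in a real transmission. The UCS specification suggests sending the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream itself.

            If the receiver then gets FEFF, the byte stream is big-endian; if it gets FFFE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

            UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (this can be verified with the UTF-8 encoding rules). So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8.

            Windows uses the BOM in exactly this way to mark the encoding of text files.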
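The EF BB BF claim can be checked by applying the three-byte UTF-8 pattern by hand. The following Java sketch is an illustration only: `Utf8ByHand`, `encode3`, and `hex` are made-up names, and `encode3` covers just the range U+0800..U+FFFF without rejecting surrogates. It packs the bits of U+FEFF into the pattern 1110xxxx 10xxxxxx 10xxxxxx and compares the result with what the standard library produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes a code point in U+0800..U+FFFF with the three-byte
    // UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encode3(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // low 6 bits
        };
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode3(0xFEFF);
        System.out.println(hex(manual)); // EF BB BF
        byte[] library = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```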


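            So a BOM is just a few bytes added at the start of a file as a marker. With this marker in place, certain protocols and systems can recognize the encoding. Now, how do I add these bytes?
            I finally found the answer here: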
            http://mindprod.com/jgloss/encoding.html 
            UTF-8 
            8-bit encoded Unicode. née UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can't read the file without this marker. Now the question is, how do you get that marker in there? You can't just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF.
            DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java DataOutputStream and ObjectOutputStream uses a slight variant of kosher UTF-8. To aid with compatibility with C in JNI, the null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte, and 3-byte formats are used. Supplementary characters, (above 0xffff), are represented in the form of surrogate pairs (a pair of encoded 16 bit characters in a special range), rather than directly encoding the character.
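The differences between this variant and standard UTF-8 can be observed directly. The sketch below is illustrative (`ModifiedUtf8Demo`, `writeUtf`, and `hex` are made-up names). It captures the bytes produced by `DataOutputStream.writeUTF` and compares them with `String.getBytes` for the null character and for one supplementary character, U+1F600, chosen arbitrarily.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Returns everything writeUTF emits, including the 2-byte length prefix.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Null character: length prefix 00 02, then C0 80 instead of a single 00.
        System.out.println(hex(writeUtf("\u0000")));                              // 00 02 C0 80
        System.out.println(hex("\u0000".getBytes(StandardCharsets.UTF_8)));       // 00

        // Supplementary character: two 3-byte surrogates instead of one 4-byte sequence.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(writeUtf(supplementary).length);                       // 8 (2 prefix + 6)
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

This variant applies to `writeUTF` and object serialization; an `OutputStreamWriter` created with UTF-8 writes standard UTF-8.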
             
            prw.write( '\ufeff' ); is exactly what I was looking for.
            So my code became:
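            // Assumes java.io.* and java.nio.charset.StandardCharsets are imported and
            // that outFileName and res are String fields of the enclosing class.
            // charsetName is not used: the output is always UTF-8, since U+FEFF only
            // works as a signature in a Unicode encoding.
            public void htmlWrite(String charsetName) {
                // try-with-resources flushes and closes the writer even if a write fails
                try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(outFileName), StandardCharsets.UTF_8))) {
                    out.write('\ufeff'); // BOM; the encoder emits it as EF BB BF
                    out.write(res);
                } catch (IOException e) {
                    System.err.println("write error: " + e);
                }
            }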
            Problem solved.
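To confirm the result without opening an editor, the file can be read back as raw bytes. This is a self-contained sketch of that check, not the code from the post: `BomFileWriter` and `writeWithBom` are hypothetical names, and a temporary file stands in for the real output path.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomFileWriter {
    // Writes text as UTF-8 preceded by a BOM ("UTF-8 with Signature").
    static void writeWithBom(Path file, String text) throws IOException {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8))) {
            out.write('\uFEFF'); // the encoder turns this into EF BB BF
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".html"); // placeholder output file
        try {
            writeWithBom(tmp, "<html></html>");
            byte[] bytes = Files.readAllBytes(tmp);
            System.out.printf("%02X %02X %02X%n",
                    bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // EF BB BF
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```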

            Posted on 2013-02-04 15:38 by 小果子. Category: Study notes