伊人色综合久久天天,国内精品久久久久影院亚洲,欧美精品一区二区精品久久

UTF-8 BOM

在PHP中查找中文字符，有兩種方案。

1、中文字符是gbk（gb2312）

有兩種解決方法

第一種：

將PHP保存為ASCII編碼，然后使用strpos查找，如：

strpos($curl_res, ‘哈哈’)

第二種：

將PHP保存為UTF-8無(wú)BOM編碼，然后轉(zhuǎn)換字符串編碼為UTF-8，再查找，如：

$curl_res = mb_convert_encoding($curl_res, ‘utf-8′, ‘gbk’);

mb_strpos($curl_res, ‘哈哈’);

2、中文字符是UTF-8

有兩種解決方法

第一種：

將PHP保存為UTF-8無(wú)BOM編碼，然后使用strpos查找，如：

strpos($curl_res, ‘哈哈’)

第二種：

將PHP保存為ASCII編碼，然后轉(zhuǎn)換字符串編碼為gbk，再查找，如：

$curl_res = mb_convert_encoding($curl_res, ‘gbk’, ‘utf-8′);

mb_strpos($curl_res, ‘哈哈’);

應(yīng)該可以看出一些規(guī)律，就是：函數(shù)中的中文字符串參數(shù)的編碼和PHP文件保存格式的編碼一致，在使用函數(shù)時(shí)要考慮到！

   我生成的那個(gè)html文件被EmEditor認(rèn)為UTF-8 with Signature。而好用的那個(gè)html文件被EmEditor認(rèn)為UTF-8 without Signature.
    對(duì)于這兩種UTF－8格式的轉(zhuǎn)換，我查看了網(wǎng)上信息，點(diǎn)擊記事本，EmEditor等文本編輯器的另存為，當(dāng)選擇了UTF-8的編碼格式時(shí)，Add a Unicode Signature(BOM)這個(gè)選項(xiàng)被激活，只要選擇上，我的文件就可以存為UTF-8 with Signature的格式。可是，問(wèn)題就在于，我用java怎么讓我的文件直接生成為 UTF-8 with Signature的格式。
    開(kāi)始上google搜索UTF-8 with Signature,BOM,Add a Unicode Signature等關(guān)鍵字。
http://www.unicode.org/unicode/faq/utf_bom.html#BOM
我大致了解了他們兩個(gè)的區(qū)別。
Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
http://mindprod.com/jgloss/bom.html
BOM
Byte Order Marks are special characters at the beginning of a Unicode file to indicate whether it is big or little endian, in other words does the high or low order byte come first. These codes also tell whether the encoding is 8, 16 or 32 bit. You can recognise Unicode files by their starting byte order marks, and by the way Unicode-16 files are half zeroes and Unicode-32 files are three-quarters zeros. Unicode Endian Markers
Byte-order mark Description
EF BB BF UTF-8
FF FE UTF-16 aka UCS-2, little endian
FE FF UTF-16 aka UCS-2, big endian
00 00 FF FE UTF-32 aka UCS-4, little endian.
00 00 FE FF UTF-32 aka UCS-4, big-endian.
There are also variants of these encodings that have an implied endian marker.
Unfortunately, often applications, even Javac.exe, choke on these byte order marks. Java Readers don't automatically filter them out. There is not much you can do but manually remove them.

http://cache.baidu.com/c?word=java%2Cbom&url=http%3A//tgdem530%2Eblogchina%2Ecom/&b=0&a=1&user=baidu
c、UTF的字節(jié)序和BOM
UTF-8以字節(jié)為編碼單元，沒(méi)有字節(jié)序的問(wèn)題。UTF-16以?xún)蓚€(gè)字節(jié)為編碼單元，在解釋一個(gè)UTF-16文本前，首先要弄清楚每個(gè)編碼單元的字節(jié)序。例如收到一個(gè)“奎”的Unicode編碼是594E，“乙”的Unicode編碼是4E59。如果我們收到UTF-16字節(jié)流“594E”，那么這是 “奎”還是“乙”？

Unicode規(guī)范中推薦的標(biāo)記字節(jié)順序的方法是BOM。BOM不是“Bill Of Material”的BOM表，而是Byte Order Mark。BOM是一個(gè)有點(diǎn)小聰明的想法：

在UCS編碼中有一個(gè)叫做"ZERO WIDTH NO-BREAK SPACE"的字符，它的編碼是FEFF。而FFFE在UCS中是不存在的字符，所以不應(yīng)該出現(xiàn)在實(shí)際傳輸中。UCS規(guī)范建議我們?cè)趥鬏斪止?jié)流前，先傳輸字符"ZERO WIDTH NO-BREAK SPACE"。

這樣如果接收者收到FEFF，就表明這個(gè)字節(jié)流是Big-Endian的；如果收到FFFE，就表明這個(gè)字節(jié)流是Little-Endian的。因此字符"ZERO WIDTH NO-BREAK SPACE"又被稱(chēng)作BOM。

UTF-8不需要BOM來(lái)表明字節(jié)順序，但可以用BOM來(lái)表明編碼方式。字符"ZERO WIDTH NO-BREAK SPACE"的UTF-8編碼是EF BB BF（讀者可以用我們前面介紹的編碼方法驗(yàn)證一下）。所以如果接收者收到以EF BB BF開(kāi)頭的字節(jié)流，就知道這是UTF-8編碼了。

Windows就是使用BOM來(lái)標(biāo)記文本文件的編碼方式的。

原來(lái)BOM是在文件的開(kāi)始加了幾個(gè)字節(jié)作為標(biāo)記。有了這個(gè)標(biāo)記，一些協(xié)議和系統(tǒng)才能識(shí)別。好，看看怎么加上這寫(xiě)字節(jié)。
終于在這里找到了
http://mindprod.com/jgloss/encoding.html
UTF-8
8-bit encoded Unicode. neé UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can't read the file without this marker. Now the question is, how do you get that marker in there? You can't just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF.
DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java DataOutputStream and ObjectOutputStream uses a slight variant of kosher UTF-8. To aid with compatibility with C in JNI, the null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte, and 3-byte formats are used. Supplementary characters, (above 0xffff), are represented in the form of surrogate pairs (a pair of encoded 16 bit characters in a special range), rather than directly encoding the character.

prw.write( '\ufeff' );就是這個(gè)。
于是我的代碼變?yōu)椋?br /> public void htmlWrite(String charsetName) {
        try {
            out = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(outFileName), "UTF-8"));
            out.write('\ufeff');
            out.write(res);
            out.flush();

            if (out != null) {
                out.close();
            }
        } catch (Exception e) {
            try {
                if (out != null) {
                    out.close();
                }
            } catch (IOException e1) {
                System.out.print("write errors!" + e);
            }

            System.out.print("write errors!" + e);
        }
    }
問(wèn)題解決。

posted on 2013-02-04 15:38 小果子閱讀(2939) 評(píng)論(0) 編輯收藏引用所屬分類(lèi): 學(xué)習(xí)筆記

只有注冊(cè)用戶(hù)登錄后才能發(fā)表評(píng)論。
【推薦】100%開(kāi)源！大型工業(yè)跨平臺(tái)軟件C++源碼提供，建模，組態(tài)！

相關(guān)文章: Nagios插件編寫(xiě)及調(diào)試方法 Linux下Nagios的安裝與配置 A collection of color schemes for some famous websites in China 工具集 Meteor：讓實(shí)時(shí)Web App成為主流 SEH異常處理學(xué)習(xí)總結(jié) 九種引人矚目的開(kāi)源大數(shù)據(jù)技術(shù) linux 維護(hù) phonegap js 和本地代碼調(diào)用原理(轉(zhuǎn)) 瀏覽器探究——執(zhí)行網(wǎng)頁(yè)跳轉(zhuǎn) (轉(zhuǎn))

網(wǎng)站導(dǎo)航: 博客園 IT新聞 BlogJava 博問(wèn) Chat2DB 管理

常用鏈接

隨筆分類(lèi)

Blog

Company

Friends&Acmers

QT

搜索

最新評(píng)論

閱讀排行榜