<?xml version="1.0" encoding="utf-8" standalone="yes"?>
http://www.shnenglu.com/woaidongmao/category/8755.html
The articles here are all collected from other people's blogs. I do not prefix the titles with "[Repost]" because I find it ugly; please forgive me.
Thu, 10 Sep 2009 18:47:18 GMT
<item><title>How to learn to use libiconv?</title><link>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95869.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:52:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95869.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/95869.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95869.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/95869.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/95869.html</trackback:ping><description><![CDATA[
<p class="MsoNormal" style="margin-bottom: 12pt; line-height: 150%"><b>libiconv</b> is an open-source library released under a GNU license. It mainly solves the problem of converting between character encodings in multilingual applications.<br>
How do you learn to use the libiconv library? If you are new to it, this article is worth a read. If you have already used the library and ran into problems along the way, we can work through them together; you can reach me at <a href="mailto:cnangel@gmail.com"><span style="color: black">cnangel@gmail.com</span></a>.</p>
<p class="MsoNormal" style="line-height: 150%">A few function prototypes:</p>
<p class="MsoNormal" style="line-height: 150%">iconv_t iconv_open(const char *tocode, const char *fromcode);<br>
size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);<br>
int iconv_close(iconv_t cd);</p>
<p class="MsoNormal" style="line-height: 150%">Here:<br>
iconv_open opens a conversion descriptor, much like opening an encoding pipe (channel); on error it returns (iconv_t)-1.<br>
iconv performs the actual conversion of the input; on error it returns (size_t)-1, otherwise the number of characters that were converted in a non-reversible way (0 for a clean conversion).<br>
iconv_close closes that pipe (channel).<br>
An example:</p>
<p class="MsoNormal" style="line-height: 150%">#include &lt;stdio.h&gt;<br>
#include &lt;string.h&gt;<br>
#include &lt;stdlib.h&gt;<br>
#include &lt;iconv.h&gt;<br>
<br>
#define OUTLEN 255<br>
int covert(const char *, const char *, char *, size_t , char *, size_t );<br>
<br>
int main(int argc, char *argv[])<br>
{<br>
    /* the bytes of this literal are GBK only if the source file is saved as GBK */<br>
    char input[] = "中国";<br>
    size_t len = strlen(input);<br>
    char *output = (char *)malloc(OUTLEN);<br>
    if (output == NULL) return 1;<br>
    /* target encoding first, source encoding second: GBK to UTF-8 */<br>
    if (covert("UTF-8", "GBK", input, len, output, OUTLEN) == 0)<br>
        printf("%s\n", output);<br>
    free(output);<br>
    return 0;<br>
}<br>
<br>
/* desc: target encoding, src: source encoding, olen: capacity of output in bytes */<br>
int covert(const char *desc, const char *src, char *input, size_t ilen, char *output, size_t olen)<br>
{<br>
    char **pin = &amp;input;<br>
    char **pout = &amp;output;<br>
    if (olen == 0) return -1;<br>
    iconv_t cd = iconv_open(desc, src);<br>
    if (cd == (iconv_t)-1)<br>
    {<br>
        return -1;<br>
    }<br>
    memset(output, 0, olen);<br>
    olen -= 1; /* keep the last byte as the terminating NUL */<br>
    size_t rc = iconv(cd, pin, &amp;ilen, pout, &amp;olen);<br>
    iconv_close(cd); /* close the descriptor on the error path as well */<br>
    return rc == (size_t)-1 ? -1 : 0;<br>
}</p>
<p class="MsoNormal" style="line-height: 150%">The covert function here performs the encoding conversion. The points to watch are the arguments passed to iconv:<br>
1. iconv takes 5 arguments.<br>
2. The 3rd and 5th arguments point to the number of input bytes still to convert and the number of bytes still free in the output buffer. They normally start out as the sizes actually allocated for input and output, typically sizeof(type)*strlen(string) for the input.<br>
3. The 4th argument (and likewise the 2nd) must not be the address of your original buffer pointer, because iconv changes the value of the pointer as it converts. Pass the address of a copy of the pointer variable instead.<br>
For large amounts of data the covert function above is not a good fit. First, memory limits mean the conversion cannot be done in a single call. Second, calling it many times opens a conversion descriptor over and over, which wastes resources. The best approach is to split the function up and use the parts as the situation requires.<br>
Here is some supplementary code:<br>
translateSP.h:</p>
<p class="MsoNormal" style="line-height: 150%"> #ifndef TRANSLATESP_H_<br>
 #define TRANSLATESP_H_<br>
 #include &lt;stdio.h&gt;<br>
 #include &lt;iconv.h&gt;<br>
 <br>
 class TranslateSP<br>
 {<br>
     public:<br>
         // (iconv_t)-1 marks "no descriptor open"<br>
         TranslateSP():i_cd((iconv_t)-1){}<br>
         TranslateSP(const char *from_charset,const char *to_charset)<br>
         {<br>
             i_cd = iconv_open(to_charset, from_charset);<br>
             if ((iconv_t)-1 == i_cd) printf("iconv open error!\n");<br>
         }<br>
         ~TranslateSP()<br>
         {<br>
             // only close a descriptor that was opened successfully<br>
             if ((iconv_t)-1 != i_cd)<br>
                 iconv_close(i_cd);<br>
         }<br>
 <br>
     public:<br>
         size_t translate(char *src, size_t srcLen, char *desc, size_t descLen);<br>
         size_t convert(const char *from_charset, const char *to_charset, <br>
                 char *src, size_t srcLen, char *desc, size_t descLen);<br>
 <br>
     private:<br>
         iconv_t i_cd;<br>
 };<br>
 <br>
 #endif</p>
<p class="MsoNormal" style="line-height: 150%">translateSP.cpp:</p>
<p class="MsoNormal" style="line-height: 150%"> #include &lt;string.h&gt;<br>
 #include "translateSP.h"<br>
 <br>
 #define MAX_LEN 200<br>
 <br>
 // reuses the descriptor opened by the constructor<br>
 size_t TranslateSP::translate(char *src, size_t srcLen, char *desc, size_t descLen)<br>
 {<br>
     if ((iconv_t)-1 == i_cd || descLen == 0) return (size_t)-1;<br>
     char **inbuf = &amp;src;<br>
     char **outbuf = &amp;desc;<br>
     memset(desc, 0, descLen);<br>
     descLen -= 1; // keep the terminating NUL<br>
     return iconv(i_cd, inbuf, &amp;srcLen, outbuf, &amp;descLen);<br>
 }<br>
 <br>
 // one-shot conversion: opens and closes its own descriptor<br>
 size_t TranslateSP::convert(const char *from_charset, const char *to_charset, <br>
         char *src, size_t srcLen, char *desc, size_t descLen)<br>
 {<br>
     if (descLen == 0) return (size_t)-1;<br>
     char **inbuf = &amp;src;<br>
     char **outbuf = &amp;desc;<br>
     iconv_t cd = iconv_open(to_charset, from_charset);<br>
     if ((iconv_t)-1 == cd) return (size_t)-1;<br>
     memset(desc, 0, descLen);<br>
     descLen -= 1; // keep the terminating NUL<br>
     size_t n = iconv(cd, inbuf, &amp;srcLen, outbuf, &amp;descLen);<br>
     iconv_close(cd);<br>
     return n;<br>
 }<br>
 <br>
 int main(int argc, char *argv[])<br>
 {<br>
     // the literals below are UTF-8 only if this source file is saved as UTF-8<br>
     char str[] = "我爱zhong?! Q#Q#";<br>
     char str1[] = "i大量需要转换的编码";<br>
     char str2[] = "函数是用于?hello进行转换";<br>
     char newstr[MAX_LEN];<br>
     TranslateSP tsp;<br>
     tsp.convert("utf-8", "gbk", str, strlen(str), newstr, MAX_LEN);<br>
     printf("%s\n", newstr);<br>
     TranslateSP newtsp("UTF-8", "GBK");<br>
     newtsp.translate(str1, strlen(str1), newstr, MAX_LEN);<br>
     printf("%s\n", newstr);<br>
     newtsp.translate(str2, strlen(str2), newstr, MAX_LEN);<br>
     printf("%s\n", newstr);<br>
     return 0;<br>
 }</p>
<p class="MsoNormal" style="line-height: 150%">Compile and run:</p>
<p class="MsoNormal" style="line-height: 150%">g++ translateSP.cpp -o test<br>
./test<br>
我爱zhong?! Q#Q#<br>
i大量需要转换的编码<br>
函数是用于?hello进行转换</p>
<p class="MsoNormal" style="line-height: 150%">(The output above is GBK-encoded.)</p>
<br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2009-09-10 23:52 <a href="http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95869.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item>
<item><title>unicode utf-8 gb18030 gb2312 gbk: the various encodings</title><link>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95868.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:42:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95868.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/95868.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95868.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/95868.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/95868.html</trackback:ping><description><![CDATA[Read more<br><br>
肥仔 2009-09-10 23:42 Post a comment
]]></description></item>
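The libiconv post above advises against the one-shot `covert` function for large inputs and recommends opening the descriptor once and calling `iconv` repeatedly. A minimal sketch of that idea in C follows. The helper name `convert_chunked` and the 8-byte scratch buffer are assumptions made for illustration, not part of libiconv, and the sketch assumes the POSIX signature with `char **inbuf` (some older iconv implementations declare it as `const char **`).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libiconv): convert inlen bytes from `in`
 * through an already opened descriptor `cd`, appending the result to `out`
 * (capacity outcap bytes). Returns the number of bytes written, or
 * (size_t)-1 on failure. The descriptor stays open so the caller can reuse it.
 */
size_t convert_chunked(iconv_t cd, char *in, size_t inlen,
                       char *out, size_t outcap)
{
    char tmp[8];                /* deliberately small: forces several rounds */
    size_t written = 0;

    while (inlen > 0) {
        char *tp = tmp;
        size_t tleft = sizeof tmp;
        size_t rc = iconv(cd, &in, &inlen, &tp, &tleft);
        int err = errno;        /* save before any other call can change it */
        size_t produced = sizeof tmp - tleft;

        if (produced > outcap - written)
            return (size_t)-1;  /* caller's buffer is too small */
        memcpy(out + written, tmp, produced);
        written += produced;

        if (rc == (size_t)-1) {
            /* E2BIG only means tmp filled up: flush and go round again.
             * EILSEQ/EINVAL, or E2BIG with no progress, are real failures. */
            if (err != E2BIG || produced == 0)
                return (size_t)-1;
        }
    }
    return written;
}
```

A caller would open the descriptor once with `iconv_open`, call `convert_chunked` for each block of input, and call `iconv_close` at the end. For stateful target encodings, a final `iconv(cd, NULL, NULL, &outbuf, &outbytesleft)` call is also needed to flush the shift state; the sketch leaves that out.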
<item><title>A study of the GB18030 encoding and the mappings between GBK, GB18030 and Unicode</title><link>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95867.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:37:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95867.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/95867.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95867.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/95867.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/95867.html</trackback:ping><description><![CDATA[     Summary: GB18030 exists in two versions, GB18030-2000 and GB18030-2005. In this article, GB18030 without a stated version means GB18030-2005. The article discusses the following questions: 1. GB2312 has 682 graphic symbols, all placed in area 1. Area 1 of GBK has 717 graphic symbols and area 5 has 166 graphic symbols, in total ..  <a href='http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95867.html'>Read more</a><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2009-09-10 23:37 <a href="http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95867.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item>
<item><title>Converting between GBK, UCS and UTF8</title><link>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95864.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:13:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95864.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/95864.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2009/09/10/95864.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/95864.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/95864.html</trackback:ping><description><![CDATA[Read more<br><br>
肥仔 2009-09-10 23:13 Post a comment
]]></description></item>
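The post that follows derives the GB2312 rule "internal code byte = zone (or position) number + A0H" and works it through for zone-position code 1601, which gives B0A1H. A minimal sketch of that arithmetic in C is given here. The function names `quwei_to_internal` and `internal_to_quwei` are hypothetical, chosen only for this illustration.

```c
/*
 * Hypothetical helpers illustrating the GB2312 arithmetic:
 *   GB code byte   = zone/position number + 0x20
 *   internal byte  = GB code byte + 0x80 = zone/position number + 0xA0
 * Zone and position numbers both run from 1 to 94.
 */

/* Zone-position code -> two-byte internal code. Returns 0 on success, -1 on bad input. */
int quwei_to_internal(int zone, int pos, unsigned char out[2])
{
    if (zone < 1 || zone > 94 || pos < 1 || pos > 94)
        return -1;
    out[0] = (unsigned char)(zone + 0xA0);  /* high byte */
    out[1] = (unsigned char)(pos + 0xA0);   /* low byte */
    return 0;
}

/* Two-byte internal code -> zone-position code. Returns 0 on success, -1 on bad input. */
int internal_to_quwei(const unsigned char in[2], int *zone, int *pos)
{
    if (in[0] < 0xA1 || in[0] > 0xFE || in[1] < 0xA1 || in[1] > 0xFE)
        return -1;
    *zone = in[0] - 0xA0;
    *pos = in[1] - 0xA0;
    return 0;
}
```

Calling `quwei_to_internal(16, 1, buf)` fills `buf` with the bytes B0 and A1, which matches the worked example in the post.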
<item><title>Implementing Chinese character internal codes and GB codes in a C program</title><link>http://www.shnenglu.com/woaidongmao/archive/2008/11/08/66314.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Sat, 08 Nov 2008 04:17:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2008/11/08/66314.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/66314.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2008/11/08/66314.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/66314.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/66314.html</trackback:ping><description><![CDATA[
<p class="MsoNormal" style="text-align: left" align="left">// HZEncode.cpp : Defines the entry point for the console application.</p>
<p class="MsoNormal" style="text-align: left" align="left">//</p>
<p class="MsoNormal" style="text-align: left" align="left">/*</p>
<p class="MsoNormal" style="text-align: left" align="left">Reference:</p>
<p class="MsoNormal" style="text-align: left" align="left">Encoding and representation of Chinese characters</p>
<p class="MsoNormal" style="text-align: left" align="left">1) Chinese character interchange code (GB code). The interchange code (GB code, the national standard code) is mainly used for exchanging Chinese character information.</p>
<p class="MsoNormal" style="text-align: left" align="left">GB code: the interchange code defined in the "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" (designation GB2312 80), issued by the National Bureau of Standards in 1980, serves as the national standard encoding for Chinese characters. GB2312 80 contains 7445 characters and symbols in total: 6763 Chinese characters, made up of 3755 level-1 characters (ordered by Hanyu Pinyin) and 3008 level-2 characters (ordered by radical and stroke count), plus 682 non-Chinese symbols. GB2312 80 arranges all of its characters and symbols in a 94 × 94 grid. Each row of the grid is called a "zone" and each column a "position". The grid therefore forms a character set of 94 zones (numbered 01 to 94), each with 94 positions (numbered 01 to 94). The zone number and position number of a character, written one after the other, form its "zone-position code": the two high digits are the zone number and the two low digits are the position number. A zone-position code thus identifies exactly one character or symbol, and conversely every character or symbol has exactly one zone-position code, with no duplicates.</p>
<p class="MsoNormal" style="text-align: left" align="left">The zones are allocated as follows:</p>
<div align="center"> <table class="MsoNormalTable" style="width: 95%" cellspacing="0" cellpadding="0" width="95%" border="0"> <tbody>
<tr><td>Zone</td><td>Contents</td></tr>
<tr><td>1</td><td>Symbols not found on the keyboard</td></tr>
<tr><td>2</td><td>Serial numbers of various kinds</td></tr>
<tr><td>3</td><td>Keyboard symbols (in their Chinese forms)</td></tr>
<tr><td>4-5</td><td>Japanese kana</td></tr>
<tr><td>6</td><td>Greek letters</td></tr>
<tr><td>7</td><td>Russian letters</td></tr>
<tr><td>8</td><td>Pinyin vowels with tone marks and phonetic letter names</td></tr>
<tr><td>9</td><td>Box-drawing symbols</td></tr>
<tr><td>10-15</td><td>Unused</td></tr>
<tr><td>16-55</td><td>Level-1 Chinese characters (ordered by Pinyin)</td></tr>
<tr><td>56-87</td><td>Level-2 Chinese characters (ordered by radical and stroke count)</td></tr>
<tr><td>88-94</td><td>User-defined Chinese characters</td></tr>
</tbody></table></div>
<p class="MsoNormal" style="text-align: left" align="left">From this it follows that the 94 zones fall into four groups:</p>
<p class="MsoNormal" style="text-align: left" align="left">(1) Zones 1-15: graphic symbols. Zones 1 to 9 hold the standard symbols; zones 10 to 15 are reserved for user-defined symbols.</p>
<p class="MsoNormal" style="text-align: left" align="left">(2) Zones 16-55: level-1 Chinese characters, 3755 in all. The characters in these zones are sorted by Hanyu Pinyin, and characters with the same pronunciation are listed in stroke order.</p>
<p class="MsoNormal" style="text-align: left" align="left">(3) Zones 56-87: level-2 Chinese characters, 3008 in all. The characters in these zones are sorted by radical and stroke count.</p>
<p class="MsoNormal" style="text-align: left" align="left">(4) Zones 88-94: user-defined Chinese characters.</p>
<p class="MsoNormal" style="text-align: left" align="left">The GB code represents every Chinese character (and the non-Chinese symbols) with a 2-byte code. The highest bit of each byte is 0 and only the low 7 bits are used. Of the 128 values of those 7 bits, 34 are set aside for control purposes, so each byte has only 2<sup>7</sup> - 34 = 94 values available for characters, and 2 bytes give 94 × 94 = 8836 character codes. Of the 2 bytes that represent a character, the high byte corresponds to the row of the code table, called the zone number, and the low byte corresponds to the column, called the position number.</p>
<p class="MsoNormal" style="text-align: left" align="left">In binary, the GB code ranges from 00100001 00100001 to 01111110 01111110, that is, from (1+32)<sub>10</sub> (1+32)<sub>10</sub> to (94+32)<sub>10</sub> (94+32)<sub>10</sub>. The 7-bit ASCII code is a character set of 128 characters. Code values 0 to 31 (00000000 to 00011111) do not correspond to any printable character; they are usually called control characters and are used for communication control or for controlling functions of computer devices. Code value 32 (00100000) is the space character SP. Code value 127 (01111111) is the delete character DEL.</p>
<p class="MsoNormal" style="text-align: left" align="left">The GB code starts at binary 00100001, that is (33)<sub>10</sub>, in order to skip the 32 ASCII control characters and the space character. The high and low bytes of the GB code are therefore each larger than the corresponding zone and position number by (32)<sub>10</sub>, that is (00100000)<sub>2</sub>, that is (20)H:<br>
GB code high byte = zone number + 20H (H denotes hexadecimal)<br>
GB code low byte = position number + 20H</p>
<p class="MsoNormal" style="text-align: left" align="left">2) Chinese character internal code (machine code, storage code)</p>
<p class="MsoNormal" style="text-align: left" align="left">The internal code gives all the different Chinese input codes a single representation inside the computer. Whatever input code is used for typing, it is converted to the internal code for storage, which simplifies character processing inside the machine. The internal code is the code that is stored and processed inside the computer. A computer has to handle both Chinese and English text, so it must be able to tell Chinese characters from English characters. The internal code of an English character is an 8-bit ASCII code whose highest bit is 0. To avoid clashing with 7-bit ASCII, the highest bit of each byte of the GB code is changed from 0 to 1, with the remaining bits unchanged, and the result is used as the internal code of the Chinese character.</p>
<p class="MsoNormal" style="text-align: left" align="left">In binary, the internal code ranges from 10100001 10100001 to 11111110 11111110. The high and low bytes of the internal code are larger than the corresponding bytes of the GB code by (128)<sub>10</sub>, that is (10000000)<sub>2</sub>, that is (80)H:<br>
internal code high byte = GB code high byte + 80H<br>
internal code low byte = GB code low byte + 80H<br>
And since:<br>
GB code high byte = zone number + 20H<br>
GB code low byte = position number + 20H<br>
it follows that:<br>
internal code high byte = zone number + A0H<br>
internal code low byte = position number + A0H<br>
In other words, the high and low bytes of the internal code are larger than the corresponding zone and position numbers by (160)<sub>10</sub>, that is (10100000)<sub>2</sub>, that is (A0)H. For example, the zone-position code of the character "啊" is "1601": the zone number is (16)<sub>10</sub>, that is (10)H, and the position number is (01)<sub>10</sub>, that is (01)H. Then:<br>
internal code high byte = 10H + A0H = B0H<br>
internal code low byte = 01H + A0H = A1H<br>
so the internal code is B0A1H.</p>
<p class="MsoNormal" style="text-align: left" align="left">3) Chinese character input code (external code)</p>
<p class="MsoNormal" style="text-align: left" align="left">An input code (external code) is an encoding designed for entering Chinese characters into the computer through keyboard characters. When typing English, you press the key of the character you want, so the input code and the internal code coincide. When typing Chinese, several keystrokes may be needed for one character. There are hundreds or even thousands of Chinese input schemes, but however different these external codes are, they are all converted to the same internal code once entered. Input schemes fall roughly into the following 4 types:</p>
<p class="MsoNormal" style="text-align: left" align="left">(1) Phonetic codes: for example Quanpin, Shuangpin, Microsoft Pinyin</p>
<p class="MsoNormal" style="text-align: left" align="left">(2) Shape codes: for example Wubi, Zhengma, Biaoxingma</p>
<p class="MsoNormal" style="text-align: left" align="left">(3) Phonetic-shape codes: for example Zhineng ABC, Ziranma</p>
<p class="MsoNormal" style="text-align: left" align="left">(4) Numeric codes: for example the zone-position code, the telegraph code</p>
<p class="MsoNormal" style="text-align: left" align="left">4) Chinese character glyph code (output code)</p>
<p class="MsoNormal" style="text-align: left" align="left">The glyph code (output code) is used to display and print Chinese characters; it is the digitized form of the character's shape. The internal code represents a character with a numeric code, but for people to see the character on output, its glyph has to be produced. Chinese systems generally use a dot matrix to represent glyphs. [Figure: a 16 × 16 dot-matrix character] A character in a 16 × 16 dot-matrix font needs 32 bytes of storage (16 × 16 / 8 = 32), and a character in a 24 × 24 dot-matrix font needs 72 bytes (24 × 24 / 8 = 72).</p>
<p class="MsoNormal" style="text-align: left" align="left">In general, the larger the dot matrix used to render a character, the better the glyph looks, and the more storage each character needs.</p>
<p class="MsoNormal" style="text-align: left" align="left">5) Chinese character address code</p>
<p class="MsoNormal" style="text-align: left" align="left">The address code is the logical address at which a character's glyph data is stored in a font library (here mainly a dot-matrix library of whole-character glyphs). In a font library the glyph data is stored contiguously on the storage medium in a fixed order, in most cases the order of the characters in the standard interchange code. Address codes are therefore mostly contiguous and ordered as well, and they have a simple correspondence with internal codes, which keeps the conversion from internal code to address code simple.</p>
<p class="MsoNormal" style="text-align: left" align="left">*/<br>
<br>
#include "stdafx.h" <br>
#include "HZEncode.h" <br>
<br>
#ifdef _DEBUG <br>
#define new DEBUG_NEW <br>
#undef THIS_FILE <br>
static char THIS_FILE[] = __FILE__; <br>
#endif <br>
///////////////////////////////////////////////////////////////////////////// <br>
// The one and only application object <br>
<br>
CWinApp theApp; <br>
<br>
using namespace std; <br>
unsigned short* ptr; <br>
const char* pszHZ = "?"; <br>
byte bt[] = {0xc4,0xe3,0xBA,0xC3}; // internal code of "你好" <br>
int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) <br>
{ <br>
       int nRetCode = 0; <br>
<br>
       // initialize MFC and print an error on failure <br>
       if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0)) <br>
       { <br>
              // TODO: change error code to suit your needs <br>
              cerr &lt;&lt; _T("Fatal Error: MFC initialization failed") &lt;&lt; endl; <br>
              nRetCode = 1; <br>
       } <br>
       else <br>
       { <br>
              // zones 16 to 55 hold the level-1 Chinese characters <br>
              for (int i = 16;i &lt;= 55; i++) <br>
              { <br>
                     byte Temp[3]; <br>
                     Temp[2] = 0; <br>
                     Temp[0] = i + 0xA0; // internal code high byte = zone number + A0H <br>
                     // positions run from 01 to 94 inclusive <br>
                     for (int j = 1;j &lt;= 94;j++) <br>
                     { <br>
                            Temp[1] = j + 0xA0; // internal code low byte = position number + A0H <br>
                            // the bytes are GB2312 text, so print them as a narrow string <br>
                            cout &lt;&lt; (const char*) Temp; <br>
                     } <br>
                     cout &lt;&lt; endl; <br>
              } <br>
       } <br>
<br>
       system("pause"); <br>
       return nRetCode; <br>
} </p>
<br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-08 12:17 <a href="http://www.shnenglu.com/woaidongmao/archive/2008/11/08/66314.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>The three character encoding schemes in C++</title><link>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66259.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 15:27:00 
GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66259.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/66259.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66259.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/66259.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/66259.html</trackback:ping><description><![CDATA[<p></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">c++</span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">通常使用的是三种~码方式Q分别是<span lang="EN-US">SBCS(single byte character set),MBCS(multi-byte characterset)</span>?span lang="EN-US">Unicode</span>字符集?span lang="EN-US">SBCS</span>是一个字节一个字W,<span lang="EN-US">MBCS</span>是几个字节一个字W,可能是一个,两个Q三个不{,但是实际上,l大多数时候用两个字节的Q所以有时候看?span lang="EN-US">DBCS(double-byte character set)</span>代替<span lang="EN-US">MBCS</span>也不奇怪;<span lang="EN-US">Unicode</span>一律是两个字节~码。在<span lang="EN-US">windows nt</span>内核中,<span lang="EN-US">API</span>一律用的?span lang="EN-US">unicode</span>~码Q所以如果你在编写Y件过E中使用?span lang="EN-US">unicode</span>~码方式Q系l也会自动{换成<span lang="EN-US">unicode</span>执行Q然后返回的l构再{换ؓ你用的cd。单字节表示?span lang="EN-US">char</span>Q?span lang="EN-US">unicode</span>使用<span lang="EN-US">wchar_t.</span>我们是在单字节的光芒下成长v来的Q一旉完全抛弃单字节未免难以接受,但是有些时候我们又不可避免的需要?span lang="EN-US">unicode</span>字符集合Q那?span lang="EN-US">ms</span>提供的解军_法是泛_Q?span lang="EN-US">TChar<?xml:namespace prefix = o /><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">我们看看他的定义Q?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 
150%; font-family: 宋体">#ifdef UNICODE<br>typedef wchar_t TCHAR;<br>#else<br>typedef char TCHAR;<br>#endif<o:p></o:p></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">ok</span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">Q一切问题都解决了,我们只需要定?span lang="EN-US">UNICODE</span>׃样?span lang="EN-US">wchar_t,</span>是很方便。另外,?span lang="EN-US">windows</span>?span lang="EN-US">COM</span>中用的一律是<span lang="EN-US">unicode</span>Q但?span lang="EN-US">MFC</span>默认的确?span lang="EN-US">MBCS</span>Q所以你?span lang="EN-US">MFC</span>写的cd如果攑ֈ?span lang="EN-US">COM</span>下,有些字符的格式化方式或者返回值错误的Q原因就?span lang="EN-US">com</span>一律?span lang="EN-US">unicode</span>Q?span lang="EN-US">unicode</span>使用<span lang="EN-US">wchar_t('00')</span>l尾Q?span lang="EN-US">char</span>却是使用<span lang="EN-US">'0'</span>l尾的。一般情况下Q普通字W需要加?span lang="EN-US">_T</span>宏才能正常运行,比如<span lang="EN-US">MFC</span>中你写道<span lang="EN-US">S = "FSDFSDF",</span>那么该类转到<span lang="EN-US">COM</span>下,需要写<span lang="EN-US">S = _T("FSDFSDF")</span>Q才可以。我们可以想象宏<span lang="EN-US">_T</span>?span lang="EN-US">TCHAr</span>的功能一P如果使用<span lang="EN-US">UNICODE</span>p动在<span lang="EN-US">constant string</span>前面加上<span lang="EN-US">L</span>Q否则就直接使用?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">我们说一些小问题Q?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">VC6</span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">生成?span lang="EN-US">console application</span>?span lang="EN-US"><br>int main(int argc, char* argv[])<o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" 
style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">VS C++ 2005</span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">生成的是<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">int _tmain(int argc, _TCHAR* argv[])<o:p></o:p></span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">昄Q用<span lang="EN-US">_tmain</span>更好Q?span lang="EN-US">why?<o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">You can also use <b>_tmain</b>, which is defined in TCHAR.h. <b>_tmain</b> will resolve to <b>main</b> unless _UNICODE is defined, in which case <b>_tmain</b> will resolve to <b>wmain</b>.(<a ><span style="color: black">http://msdn2.microsoft.com/en-us/library/6wd819wh.aspx</span></a>#).<o:p></o:p></span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">我们也会常常看到如下一些字W类型,<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">WCHAR wchar_t wchar_t <br>LPSTR zero-terminated string of char (char*) zero-terminated string of char (char*) <br>LPCSTR constant zero-terminated string of char (const char*) constant zero-terminated string of char (const char*) <br>LPWSTR zero-terminated Unicode string (wchar_t*) zero-terminated Unicode string (wchar_t*) <br>LPCWSTR constant zero-terminated Unicode string (const wchar_t*) constant zero-terminated Unicode string (const wchar_t*) <br>TCHAR char wchar_t <br>LPTSTR zero-terminated string of TCHAR (TCHAR*) zero-terminated string of TCHAR (TCHAR*) 
<br>LPCTSTR constant zero-terminated string of TCHAR (const TCHAR*) constant zero-terminated string of TCHAR (const TCHAR*) <br>C </span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">一般代?span lang="EN-US">constant</span>Q?span lang="EN-US">P</span>代表指针Q?span lang="EN-US">LP</span>代表长指?span lang="EN-US">,W</span>代表宽字W,也就?span lang="EN-US">UNICODE</span>Q这下是不是都能明白q些是干什么的了?<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">我们也会常常看到<span lang="EN-US">_mbsstr()</span>q样的函敎ͼq就?span lang="EN-US">MBCS</span>字符~码的函敎ͼ当然可以处理<span lang="EN-US">SBCS</span>~码Q但是反之却不行。所以ؓ了保险v见,我们可以使用<span lang="EN-US">_mbsstr</span>代替<span lang="EN-US">strstr,</span>但是如果E序只是处理<span lang="EN-US">SBCS</span>Q那么显然又影响效率Q所以到底用什么方式同时满x率和可移植性,自己掂量着办吧?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">以后使用<span lang="EN-US">C++</span>~写E序Q如果出Cؕ码,首先?span lang="EN-US">C++</span>的编码类型,而且一般情况下都是l束W号没有弄对Q?span lang="EN-US">SBCS</span>?span lang="EN-US">MBCS</span>都是以单字节<span lang="EN-US">0</span>l尾Q?span lang="EN-US">UNICODE</span>是以双字?span lang="EN-US">00</span>l尾的?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><b style="mso-bidi-font-weight: normal"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial"><o:p> </o:p></span></b></p></span><img src ="http://www.shnenglu.com/woaidongmao/aggbug/66259.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 23:27 <a href="http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66259.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Character Encoding Basics</title>
http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66252.html
肥仔, Fri, 07 Nov 2008 14:43:00 GMT

ASCII: the basic character set has 128 common characters and the extended set adds another 128, 256 in total, each represented in one byte.
GB2312: more than 6,000 commonly used Chinese characters.
GBK: more than 20,000 Chinese characters.
GB18030: more still. Most Chinese characters are still encoded in two bytes; the extensions use four.
The three GB* encodings above are commonly lumped together as "ANSI" encodings, and in a two-byte character the first of the 16 bits (the top bit of the lead byte) is always 1.
BIG5: a Traditional Chinese character set, used in Taiwan.

Unicode: a universal code, originally represented in two bytes; a text file stored this way starts with a two-byte header.
UTF-8: a representation of Unicode in 8-bit units; a text file stored this way may start with a three-byte header.
UTF-16: a representation of Unicode in 16-bit units.

Full names:
ASCII: American Standard Code for Information Interchange
ANSI: American National Standards Institute
GB: Guo Biao (Chinese for "national standard")
UTF: Unicode Transformation Format

========================================================
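"Character" is the general term for all kinds of letters and symbols, including the scripts of every country, punctuation marks, graphic symbols and digits. A character set is a collection of characters. There are many character sets, each containing a different number of characters; common ones include ASCII, GB2312, BIG5, GB 18030 and Unicode. For a computer to process text in these character sets accurately, the characters must be encoded so that the computer can recognize and store them.

Chinese has a very large number of characters and is further divided into Simplified and Traditional Chinese, which follow different writing conventions, whereas computers were originally designed around single-byte English characters. Encoding Chinese characters is therefore the technical foundation of Chinese information exchange. This article discusses several typical character sets in chronological order, including a few representative Chinese ones, and looks at their history, characteristics and technical features.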

ASCII character set

1. Origin of the name

ASCII (American Standard Code for Information Interchange) is a computer character-encoding system based on the Latin alphabet.

2. Characteristics

It is mainly used to display modern English and other Western European languages. It is the most widely used single-byte encoding system today and is equivalent to the international standard ISO 646.

3. Contents

Control characters: carriage return, backspace, line feed and so on.

Printable characters: upper- and lower-case English letters, Arabic digits and Western punctuation and symbols.

4. Technical characteristics

Each character is represented in 7 bits, for 128 characters in total.

5. Extended ASCII character sets

A 7-bit character set can hold only 128 characters. To represent more of the characters commonly used in Europe, ASCII was extended: extended ASCII uses 8 bits per character, for 256 characters in total.

The symbols added by extended ASCII include box-drawing (table) symbols, mathematical symbols, Greek letters and special Latin characters.

GB2312 character set

1. Origin of the name

GB2312, also called the GB2312-80 character set, has the full title "Code of Chinese Graphic Character Set for Information Interchange, Primary Set". It was issued by the former National Bureau of Standards of China and took effect on 1 May 1981.

2. Characteristics

GB2312 is the Chinese national standard character set for Simplified Chinese. The characters it includes cover 99.75% of usage by frequency, which basically meets the needs of processing Chinese on computers. It is widely used in mainland China and Singapore.

3. Contents

GB2312 contains simplified Chinese characters plus general symbols, ordinal marks, digits, Latin letters, Japanese kana, Greek letters, Russian letters, Hanyu Pinyin symbols and Zhuyin (Bopomofo) letters, 7,445 graphic characters in total. Of these, 6,763 are Chinese characters (3,755 level-1 and 3,008 level-2) and 682 are full-width characters such as Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

4. Technical characteristics

(1) Rows and cells

GB2312 arranges its characters in "rows" (区, qu), each holding 94 characters or symbols. A character is identified by its row number and its position within the row; this representation is called the row-cell code (区位码, quwei code).

The rows are assigned as follows: rows 01-09 hold special symbols; rows 16-55 hold level-1 Chinese characters, ordered by pinyin; rows 56-87 hold level-2 Chinese characters, ordered by radical and stroke count; rows 10-15 and 88-94 are unassigned.

(2) Two-byte representation

Of the two bytes, the one that comes first is the first byte and the one that follows is the second byte. By convention the first byte is called the "high byte" and the second the "low byte".

The high byte uses 0xA1-0xF7 (the row numbers 01-87 plus 0xA0); the low byte uses 0xA1-0xFE (the cell numbers 01-94 plus 0xA0).

5. Encoding example

Take "啊", the first Chinese character in GB2312. Its row number is 16 and its cell number is 01, so its row-cell code is 1601. Most programs add 0xA0 to both the high byte and the low byte to obtain the code actually used for processing, 0xB0A1. The calculation is: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1.
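
The calculation above can be sketched in a few lines of C++. This is only an illustration: the function name `quwei_to_gb2312` and the convention of returning 0 for out-of-range input are assumptions made for this example, not part of any standard library.

```cpp
#include <cstdint>

// Converts a GB2312 row-cell (quwei) code such as 1601 into the two-byte
// internal code by adding 0xA0 to the row number and to the cell number.
// Returns 0 when the row or the cell is outside 1..94 (an illustrative
// convention chosen for this sketch).
std::uint16_t quwei_to_gb2312(int quwei) {
    const int row = quwei / 100;   // "qu":  01..94
    const int cell = quwei % 100;  // "wei": 01..94
    if (row < 1 || row > 94 || cell < 1 || cell > 94) {
        return 0;
    }
    const unsigned high = 0xA0u + static_cast<unsigned>(row);
    const unsigned low = 0xA0u + static_cast<unsigned>(cell);
    return static_cast<std::uint16_t>((high << 8) | low);
}
```

For the example in the text, `quwei_to_gb2312(1601)` yields 0xB0A1.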

BIG5 character set

1. Origin of the name

Also called the "Big Five code" (大五码 or 五大码). It was created in 1984 by Taiwan's Institute for Information Industry together with five software companies, Acer (宏碁), MiTAC (神通), 佳佳, Zero One (零壹) and FIC (大众), hence the name.

Big5 came about because vendors in Taiwan had each introduced their own encodings, such as the ETen code, IBM PS55 and the Wang code, which were mutually incompatible. At the same time, the Taiwanese government had not yet issued an official Chinese character encoding, and mainland China's GB2312 did not include Traditional Chinese characters.

2. Characteristics

Big5 contains 13,053 Chinese characters and is used in Taiwan. Curiously, it includes two characters twice: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).

3. Encoding method

Big5 is a double-byte encoding: each character is encoded in two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte ranges over 0xA1-0xF9; the low byte ranges over 0x40-0x7E and 0xA1-0xFE.

The code ranges map to character types as follows: 0xA140-0xA3BF holds punctuation, Greek letters and special symbols, and 0xA259-0xA261 additionally holds the two-syllable unit-of-measure characters 兙兛兞兝兡兣嗧瓩糎; 0xA440-0xC67E holds frequently used Chinese characters, ordered by stroke count and then by radical; 0xC940-0xF9D5 holds less frequently used Chinese characters, also ordered by stroke count and then by radical.
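
The byte ranges and blocks listed above translate directly into range checks. The sketch below is a hedged illustration: the names `Big5Block` and `classify_big5` are made up for this example, and code points that are valid byte pairs but fall outside the three listed blocks are simply reported as `Other`.

```cpp
#include <cstdint>

// Blocks named in the text; "Other" covers valid byte pairs outside them.
enum class Big5Block { Invalid, Symbols, FrequentHanzi, LessFrequentHanzi, Other };

// High byte: 0xA1-0xF9.
bool is_big5_lead(std::uint8_t b) { return b >= 0xA1 && b <= 0xF9; }

// Low byte: 0x40-0x7E or 0xA1-0xFE.
bool is_big5_trail(std::uint8_t b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

// Classifies a Big5 byte pair using only the ranges quoted in the article.
Big5Block classify_big5(std::uint8_t lead, std::uint8_t trail) {
    if (!is_big5_lead(lead) || !is_big5_trail(trail)) {
        return Big5Block::Invalid;
    }
    const unsigned code = (static_cast<unsigned>(lead) << 8) | trail;
    if (code >= 0xA140 && code <= 0xA3BF) return Big5Block::Symbols;
    if (code >= 0xA440 && code <= 0xC67E) return Big5Block::FrequentHanzi;
    if (code >= 0xC940 && code <= 0xF9D5) return Big5Block::LessFrequentHanzi;
    return Big5Block::Other;
}
```

With these ranges, the two copies of "兀" land in different blocks: 0xA461 is classified as a frequently used character and 0xC94A as a less frequently used one.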

4. Limitations of Big5

Big5 contains more than ten thousand characters, but it does not account for characters used in personal and place names, dialect characters, or characters used in chemistry and biology, and it does not include Japanese hiragana and katakana.

For example, Taiwan treats "着" as a variant form of "著", so "着" was not included. Some radical characters from the Kangxi Dictionary and some characters commonly used in personal names were likewise left out of Big5.

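GB18030 character set

1. Origin of the name

The full name of GB 18030 is GB18030-2000, "Chinese Ideograms Coded Character Set for Information Interchange, Extension for the Basic Set". It is the new national standard for Chinese character encoding, released by the Chinese government on 17 March 2000. Software released on the Chinese market after 31 August 2001 must conform to it.

2. Characteristics

The GB 18030 standard was drawn up with broad participation and review, including well-known information technology companies from China and abroad, and is implemented jointly by the Ministry of Information Industry and the former State Bureau of Quality and Technical Supervision.

GB 18030 solves the problem of encoding a large character set made up of Chinese characters, Japanese kana, Korean and the scripts of China's ethnic minorities. Its total code space exceeds 1.5 million code positions and it includes 27,484 Chinese characters, covering Chinese, Japanese, Korean and Chinese minority scripts. It meets the requirements of information exchange in East Asia (mainland China, Hong Kong, Taiwan, Japan and Korea) for multiple scripts, a large repertoire, multiple uses and a unified encoding format. It is compatible with Unicode 3.0 and covers the Unicode extension block "CJK Unified Ideographs Extension A". It is also compatible with the earlier national character encoding standards (GB2312, GB13000.1).

3. Encoding method

GB 18030 encodes characters in one, two or four bytes. The single-byte part uses 0x00 to 0x7F (matching the corresponding ASCII codes). In the two-byte part, the first byte runs from 0x81 to 0xFE and the second byte from 0x40 to 0x7E or from 0x80 to 0xFE. The four-byte part uses 0x30 to 0x39, which GB/T 11383 leaves unused, as the suffix that extends the two-byte encoding, so four-byte codes range from 0x81308130 to 0xFE39FE39. The first and third bytes are 0x81 to 0xFE; the second and fourth bytes are 0x30 to 0x39.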
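
Because the second byte tells two-byte and four-byte codes apart, a decoder can find the length of a sequence from at most two bytes. The following is a minimal sketch under the byte ranges stated above; the function name `gb18030_sequence_length` and the "return 0 on error" convention are assumptions of this example.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the length (1, 2 or 4) of the GB 18030 sequence that starts at p,
// or 0 if the bytes do not match the ranges given in the text or the buffer
// ends too early. n is the number of bytes available at p.
std::size_t gb18030_sequence_length(const std::uint8_t* p, std::size_t n) {
    if (n == 0) return 0;
    if (p[0] <= 0x7F) return 1;                      // single byte (ASCII)
    if (p[0] < 0x81 || p[0] > 0xFE || n < 2) return 0;

    const std::uint8_t second = p[1];
    if ((second >= 0x40 && second <= 0x7E) || (second >= 0x80 && second <= 0xFE)) {
        return 2;                                    // two-byte code
    }
    if (second >= 0x30 && second <= 0x39) {          // four-byte code
        if (n < 4) return 0;
        const bool third_ok = p[2] >= 0x81 && p[2] <= 0xFE;
        const bool fourth_ok = p[3] >= 0x30 && p[3] <= 0x39;
        return (third_ok && fourth_ok) ? 4 : 0;
    }
    return 0;
}
```

The check only validates byte ranges; it does not say whether a code position is actually assigned a character.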

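4. Contents

The two-byte part mainly contains all 20,902 CJK ideographs of GB13000.1, the related punctuation, 13 ideographic description characters, 80 additional ideographs and radicals/components, and the two-byte euro sign. The four-byte part contains all the characters of GB 13000.1 that are not among the two-byte characters above, including CJK Unified Ideographs Extension A.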

Unicode character set

1. Origin of the name

The Unicode character encoding corresponds to the Universal Multiple-Octet Coded Character Set (UCS). It is a character encoding system defined by an organization called the Unicode Consortium, and it supports the interchange, processing and display of written text in the world's languages. Development began in 1990 and it was formally published in 1994; the latest version at the time of writing was Unicode 4.1.0, released on 31 March 2005.

2. Characteristics

Unicode is a character encoding used on computers. It assigns every character of every language a single, unique binary code, to meet the needs of converting and processing text across languages and platforms.

3. Encoding notation

The Unicode standard always writes codes in hexadecimal, prefixed with "U+". For example, the letter "A" has the code 0041 (hex) and the euro sign "€" has the code 20AC (hex), so the code for "A" is written "U+0041".

4. UTF-8 encoding

UTF-8 is one way of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of transforming Unicode into some particular format.

UTF-8 makes it easy for different computers to exchange text in different languages and encodings over a network, and lets double-byte Unicode pass correctly through existing systems built to handle single bytes.

UTF-8 stores Unicode characters in a variable number of bytes. ASCII letters still take 1 byte; accented letters, Greek letters, Cyrillic letters and the like take 2 bytes; commonly used Chinese characters take 3 bytes. Characters in the supplementary planes take 4 bytes.
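
The byte counts above follow from fixed code point thresholds. A minimal sketch, assuming the modern four-byte limit of UTF-8 (code points up to U+10FFFF); the function name `utf8_length` is chosen for this example only.

```cpp
#include <cstdint>

// Number of bytes UTF-8 needs for one Unicode code point, or 0 if the value
// cannot be encoded (a surrogate, or beyond U+10FFFF).
int utf8_length(std::uint32_t code_point) {
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0;  // surrogates
    if (code_point <= 0x7F) return 1;      // ASCII
    if (code_point <= 0x7FF) return 2;     // accented Latin, Greek, Cyrillic, ...
    if (code_point <= 0xFFFF) return 3;    // rest of the BMP, incl. common Chinese
    if (code_point <= 0x10FFFF) return 4;  // supplementary planes
    return 0;
}
```

For instance, `utf8_length(0x41)` is 1 for "A" and `utf8_length(0x6C49)` is 3 for "汉".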

5. UTF-16 and UTF-32 encoding

UTF-32, UTF-16 and UTF-8 are the character encoding schemes for the coded character set of the Unicode standard. UTF-16 encodes a Unicode code point as a sequence of one or two 16-bit code units, where the two-unit form uses values reserved for that purpose and never assigned to characters. UTF-32 represents each Unicode code point as a 32-bit integer of the same value.
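
The one-or-two-unit rule of UTF-16 can be sketched as follows. The two-unit form is the surrogate pair defined by the Unicode standard (0xD800-0xDBFF followed by 0xDC00-0xDFFF); the function name `encode_utf16` and the "empty vector on invalid input" convention are assumptions of this example.

```cpp
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as one or two UTF-16 code units.
// Returns an empty vector for values that cannot be encoded.
std::vector<std::uint16_t> encode_utf16(std::uint32_t code_point) {
    std::vector<std::uint16_t> units;
    if (code_point > 0x10FFFF) return units;
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return units;  // reserved

    if (code_point <= 0xFFFF) {
        units.push_back(static_cast<std::uint16_t>(code_point));
        return units;
    }
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    const std::uint32_t offset = code_point - 0x10000;
    units.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));
    units.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)));
    return units;
}
```

UTF-32 needs no such function: the code point value itself is the 32-bit unit.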
========================================================
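What are Unicode, GB2312, GBK, ANSI and UTF?

Development: ASCII -> GB2312 (BIG5) -> GBK -> GB18030

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. In the Chinese character area the high byte runs from B0 to F7 and the low byte from A1 to FE, which gives 72*94 = 6768 code positions. Five of them, D7FA-D7FE, are unassigned.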
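
The arithmetic (72 rows of 94 positions, minus five empty positions, gives 6,763) can be checked by brute force. This is just an illustrative sketch; `is_gb2312_hanzi` and `count_gb2312_hanzi` are names invented for the example.

```cpp
#include <cstdint>

// True if (high, low) lies in the GB2312 Chinese character area described in
// the text: high byte B0-F7, low byte A1-FE, excluding the gap D7FA-D7FE.
bool is_gb2312_hanzi(std::uint8_t high, std::uint8_t low) {
    if (high < 0xB0 || high > 0xF7) return false;
    if (low < 0xA1 || low > 0xFE) return false;
    if (high == 0xD7 && low >= 0xFA) return false;  // five unassigned positions
    return true;
}

// Counts every byte pair accepted by is_gb2312_hanzi.
int count_gb2312_hanzi() {
    int count = 0;
    for (int high = 0; high <= 0xFF; ++high) {
        for (int low = 0; low <= 0xFF; ++low) {
            if (is_gb2312_hanzi(static_cast<std::uint8_t>(high),
                                static_cast<std::uint8_t>(low))) {
                ++count;
            }
        }
    }
    return count;
}
```

The count matches the 6,763 Chinese characters quoted for GB2312.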

GB2312 supports too few Chinese characters. GBK 1.0, the Chinese character extension specification of 1995, contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area holds 21,003 characters.

From ASCII through GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in each of them, and the later standards support more characters. In these encodings English and Chinese can be handled uniformly. A Chinese character is recognized by the fact that the top bit of its high byte is not 0. In programmers' terms, GB2312 and GBK are both double-byte character sets (DBCS).

GB18030, issued in 2000, is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters and also includes the major minority scripts such as Tibetan, Mongolian and Uyghur. In terms of Chinese character repertoire, GB18030 adds the 6,582 characters of CJK Extension A (Unicode 0x3400-0x4DB5) to the 20,902 characters of GB13000.1, for 27,484 Chinese characters in total.

CJK stands for Chinese, Japanese and Korean. To save code positions, Unicode encodes the characters of the three languages in a unified way. GB13000.1 is the Chinese edition of ISO/IEC 10646-1 and corresponds to Unicode 1.1.

GB18030 uses a one-, two- and four-byte scheme. The one- and two-byte parts are fully compatible with GBK. The four-byte code positions are where the 6,582 characters of CJK Extension A are placed. For example, UCS 0x3400 is encoded in GB18030 as 8139EF30, and UCS 0x3401 as 8139EF31.
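
The two example codes differ only in the last byte, which reflects how four-byte codes are laid out in order: the fourth byte counts 0x30-0x39, then the third byte advances, and so on. A hedged sketch of that ordering, assuming the byte ranges given for GB 18030 (first and third byte 0x81-0xFE, second and fourth byte 0x30-0x39); the function name `gb18030_four_byte_index` is made up for this example, and mapping the index to an actual Unicode value would additionally need the standard's tables.

```cpp
#include <cstdint>

// Position of a four-byte GB 18030 code in the ordered four-byte code space,
// starting at 0 for 0x81308130. Returns -1 if any byte is out of range.
long gb18030_four_byte_index(std::uint8_t b1, std::uint8_t b2,
                             std::uint8_t b3, std::uint8_t b4) {
    const bool leads_ok = b1 >= 0x81 && b1 <= 0xFE && b3 >= 0x81 && b3 <= 0xFE;
    const bool digits_ok = b2 >= 0x30 && b2 <= 0x39 && b4 >= 0x30 && b4 <= 0x39;
    if (!leads_ok || !digits_ok) return -1;
    // Mixed radix: 126 values for bytes 1 and 3, 10 values for bytes 2 and 4.
    return (((b1 - 0x81) * 10L + (b2 - 0x30)) * 126L + (b3 - 0x81)) * 10L
           + (b4 - 0x30);
}
```

With this ordering, 8139EF31 comes immediately after 8139EF30, just as U+3401 follows U+3400.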

Microsoft provides a GB18030 upgrade package, but it only supplies a new font, SimSun-18030 (新宋体-18030), that supports the 6,582 characters of CJK Extension A; it does not change the internal code. The internal code of Windows is still GBK.

The encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible, whereas Unicode is compatible only with ASCII.

Unicode is also a character encoding method, but it was designed by an international organization and can accommodate the scripts of all the world's languages. Unicode is the bridge for encoding conversion in Java, which uses a set of stream filters to bridge Unicode text and text in the local operating system's encoding (the internal code, such as GBK on Windows). All of these classes derive from the abstract classes Reader and Writer. This will be looked at further later.

Because a huge number of existing programs and documents use an encoding specific to one language, such as GBK, Windows cannot drop support for existing encodings and switch everything to Unicode. We call GBK the internal code of Windows. Windows uses code pages to adapt to different countries and regions; a code page can be thought of as an internal code. The code page for GBK is CP936.

What is UCS?

The formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read informally as "Unicode Character Set".

UCS comes in two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes in two bytes and UCS-4 in four bytes (in practice only 31 bits are used; the top bit must be 0).

What is UTF?

UTF is short for UCS (Unicode) Transformation Format. For 16-bit characters, UTF-8 is defined as follows:

(1) If the first 9 bits of the 16-bit Unicode character are 0, it is represented in one byte. The first bit of that byte is "0" and the remaining 7 bits are the same as the last 7 bits of the original character. For example, "\u0034" (0000 0000 0011 0100) is represented as "34" (0011 0100), the same as the original Unicode character.

(2) If the first 5 bits of the 16-bit Unicode character are 0, it is represented in two bytes. The first byte starts with "110", followed by the 5 highest bits that remain after dropping the five leading zeros; the second byte starts with "10", followed by the low 6 bits of the original character. For example, "\u025d" (0000 0010 0101 1101) becomes "c99d" (1100 1001 1001 1101).

(3) If neither rule applies, it is represented in three bytes. The first byte starts with "1110", followed by the high four bits of the original character; the second byte starts with "10", followed by the middle six bits; the third byte starts with "10", followed by the low six bits. For example, "\u9da7" (1001 1101 1010 0111) becomes "e9b6a7" (1110 1001 1011 0110 1010 0111).
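
The three rules can be written down almost literally with shifts and masks. A minimal sketch for 16-bit characters only; the name `encode_utf8_bmp` is chosen for this example, and the function deliberately does not reject surrogate code units or handle characters beyond 16 bits.

```cpp
#include <cstdint>
#include <vector>

// Encodes one 16-bit Unicode character as 1 to 3 UTF-8 bytes, following the
// three rules in the text. Surrogate values are not treated specially here.
std::vector<std::uint8_t> encode_utf8_bmp(std::uint16_t c) {
    std::vector<std::uint8_t> out;
    if (c <= 0x007F) {
        // Rule 1: top 9 bits are zero -> 0xxxxxxx
        out.push_back(static_cast<std::uint8_t>(c));
    } else if (c <= 0x07FF) {
        // Rule 2: top 5 bits are zero -> 110xxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xC0 | (c >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    } else {
        // Rule 3: everything else -> 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xE0 | (c >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    }
    return out;
}
```

Applied to the three examples in the text, 0x0034 gives 34, 0x025D gives C9 9D and 0x9DA7 gives E9 B6 A7.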

How UCS and UTF are related

UTF-8 encodes UCS in 8-bit units.

UTF-16 encodes UCS in 16-bit units.

Big endian and little endian

Big endian and little endian are two different ways in which CPUs handle multi-byte numbers. For example, the Unicode code of the character "汉" is 6C49. When it is written to a file, should 6C be written first, or 49? If 6C is written first, that is big endian. If 49 is written first, that is little endian.
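
Writing the bytes out explicitly makes the difference concrete and avoids depending on the byte order of the machine the program runs on. A small sketch; the function names `to_big_endian` and `to_little_endian` are assumptions of this example.

```cpp
#include <array>
#include <cstdint>

// Splits a 16-bit value into two bytes, most significant byte first.
std::array<std::uint8_t, 2> to_big_endian(std::uint16_t value) {
    return {{static_cast<std::uint8_t>(value >> 8),
             static_cast<std::uint8_t>(value & 0xFF)}};
}

// Splits a 16-bit value into two bytes, least significant byte first.
std::array<std::uint8_t, 2> to_little_endian(std::uint16_t value) {
    return {{static_cast<std::uint8_t>(value & 0xFF),
             static_cast<std::uint8_t>(value >> 8)}};
}
```

For 0x6C49, `to_big_endian` produces 6C 49 and `to_little_endian` produces 49 6C.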

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput began over whether an egg should be cracked open at the big end (Big-Endian) or the little end (Little-Endian). Six rebellions broke out over it; one emperor lost his life and another lost his crown.

We usually translate "endian" as "byte order" (字节序), and call big endian and little endian "big-tail" (大尾) and "little-tail" (小尾).
=================================================
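GB2312 is a subset of GBK, and GBK is a subset of GB18030.
GBK is a large character set that includes Chinese, Japanese and Korean characters.
For a Chinese website GB2312 is recommended; GBK still causes problems now and then.
To avoid garbled-text problems altogether, use UTF-8; it also makes future internationalization much easier.
UTF-8 can be regarded as a large character set that covers the encoding of most scripts.
One advantage of UTF-8 is that users in other regions (such as Hong Kong and Taiwan) can read your text correctly, without garbled characters, even if they have not installed Simplified Chinese support.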

Entry: UTF8
UTF8 is not a character code in its own right but a format for storage and transmission. As noted earlier, every Unicode/UCS character is stored in 2 or 4 bytes. Compare the following:

  Taking "I am Chinese" as an example
   stored as ANSI: 12 bytes
   stored as Unicode/UCS2: 24 bytes + 2 bytes (header)
   stored as UCS4: 48 bytes + 4 bytes (header)

  Taking "我是中国人" as an example
   stored as ANSI: 10 bytes
   stored as Unicode/UCS2: 10 bytes + 2 bytes (header)
   stored as UCS4: 20 bytes + 4 bytes (header)

  Clearly, storing text in the raw Unicode/UCS form is very wasteful and is also bad for transmission over the Internet (Chinese comes off slightly better ^_^).

  For this reason a compressed form of Unicode/UCS, UTF8, appeared. To quote the first sentence of the official site: "UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit) lossless encoding of Unicode characters." Because UTF also applies to encoding UCS, it can equally be called "UCS transformation formats (UTF)".

  UTF8 uses 8 bits, that is 1 byte, as its basic unit of encoding. There are also forms based on 16 bits and 32 bits, called UTF16 and UTF32 respectively, but they are not used much at present, whereas UTF8 is widely used for file storage and network transmission.


Encoding principle

First look at this template:

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx

The five- and six-byte forms belong to the original definition; RFC 3629 later restricted UTF-8 to at most four bytes (code points up to U+10FFFF).

Encoding steps:
1) First determine how many octets (8-bit bytes) are needed.
2) Fill in the high bits of each octet according to the template above.
3) Fill the bits of the character into the x positions. The character's bits are taken from low to high; the UTF8 positions are filled from the last x of the last octet to the first x of the first octet.
4) Decoding works on the same principle.

Examples:

UCS-4 (hex)  UCS-4 (binary)       Bytes  UTF-8 (binary)                UTF-8 (hex)  Bytes
0000 000A    00001010             4      00001010                      0A           1
0000 0099    10011001             4      11000010 10011001             C2 99        2
0000 8D99    10001101 10011001    4      11101000 10110110 10011001    E8 B6 99     3
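
Decoding reverses the template: the leading bits of the first byte give the length, and each continuation byte contributes six bits. The sketch below handles sequences of up to four bytes; the name `decode_utf8` is invented for this example, and for brevity it does not reject overlong forms or surrogate values, which a production decoder should.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one UTF-8 sequence starting at p (n bytes available) into
// code_point. Returns the number of bytes consumed, or 0 on malformed or
// truncated input. Overlong encodings are not detected in this sketch.
std::size_t decode_utf8(const std::uint8_t* p, std::size_t n,
                        std::uint32_t& code_point) {
    if (n == 0) return 0;
    std::size_t length = 0;
    if (p[0] < 0x80) {                    // 0xxxxxxx
        code_point = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {   // 110xxxxx
        length = 2;
        code_point = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {   // 1110xxxx
        length = 3;
        code_point = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {   // 11110xxx
        length = 4;
        code_point = p[0] & 0x07;
    } else {
        return 0;                         // stray continuation or invalid lead
    }
    if (n < length) return 0;
    for (std::size_t i = 1; i < length; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;        // must be 10xxxxxx
        code_point = (code_point << 6) | (p[i] & 0x3F);
    }
    return length;
}
```

Fed the bytes E8 B6 99 from the last row of the table, it recovers 0x8D99.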

  If this is not entirely clear, it does not matter much: you do not have to work it out by hand, because programs do it for you.

  A file stored in UTF8 format is marked at its start with the bytes EF BB BF.
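
Checking for that three-byte signature is straightforward. A minimal sketch; the function name `has_utf8_bom` is an assumption of this example, and note that many UTF-8 files carry no such marker, so its absence proves nothing.

```cpp
#include <cstddef>
#include <cstdint>

// True if the buffer begins with the UTF-8 signature EF BB BF.
bool has_utf8_bom(const std::uint8_t* data, std::size_t size) {
    return size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}
```

A reader would typically skip those three bytes before decoding the rest of the file.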


Efficiency

  The encoding principle above leads to these conclusions:
   1. Each English letter or digit takes 1 byte;
   2. Letters of the other European scripts, including the Slavic alphabets, take 2 bytes;
   3. Chinese characters take 3 bytes.

  UTF8 is therefore a very attractive scheme for English, but less economical for Chinese: a Chinese character takes only 2 bytes in ANSI or in Unicode/UCS2, but 3 bytes in UTF8.

  Here are some statistics on the average number of bytes per character when files are stored in UTF8:
   1. Latin-script languages average about 1.1 bytes;
   2. Greek, Russian, Arabic and Hebrew average about 1.7 bytes;
   3. Most other scripts, such as Chinese, Japanese, Korean and Hindi, use about 3 bytes;
   4. Only very rarely used characters and symbols take 4 bytes or more.

Entry: GB2312

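GB2312 supports too few Chinese characters. GBK 1.0, the Chinese character extension specification of 1995, contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area holds 21,003 characters. GB18030, issued in 2000, is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters and also includes the major minority scripts such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030, while embedded products are exempt for the time being, so mobile phones and MP3 players generally support only GB2312.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same code in each of them, and the later standards support more characters. In these encodings English and Chinese can be handled uniformly. A Chinese character is recognized by the fact that the top bit of its high byte is not 0. In programmers' terms, GB2312, GBK and GB18030 (apart from its four-byte codes) all belong to the double-byte character sets (DBCS).

The default internal code of some Chinese versions of Windows is still GBK; it can be upgraded to GB18030 with the GB18030 upgrade package. However, ordinary users will rarely need the characters that GB18030 adds to GBK, so we usually still use GBK to refer to the internal code of Chinese Windows.

A few more details:

The GB2312 standard itself is written in terms of row-cell codes. To get from a row-cell code to the internal code, add A0 to both the high byte and the low byte.

In DBCS, GB internal codes are always stored big endian, that is, high byte first.

In GB2312 the top bit of both bytes is 1, but only 128*128 = 16384 code positions satisfy that condition. In GBK and GB18030, therefore, the top bit of the low byte may not be 1. This does not affect the parsing of a DBCS character stream: when reading the stream, as soon as a byte with its top bit set is encountered, that byte and the one after it can be treated as one double-byte code, whatever the top bit of the low byte is.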
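
That parsing rule can be sketched as a simple scan. This is an illustration for GB2312/GBK-style streams only: the name `count_dbcs_chars` is made up for the example, four-byte GB18030 codes are not handled, and a lead byte left dangling at the end of the string is simply counted as one character.

```cpp
#include <cstddef>
#include <string>

// Counts characters in a GB2312/GBK-encoded byte string: a byte with its top
// bit clear is one single-byte character; a byte with its top bit set starts
// a two-byte character, whatever the next byte's top bit is.
std::size_t count_dbcs_chars(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        i += (b & 0x80) ? 2 : 1;  // skip the low byte of a double-byte code
        ++count;
    }
    return count;
}
```

For the four bytes C4 E3 BA C3 (the internal code of "你好") the function reports two characters.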

 



肥仔 2008-11-07 22:43 发表评论
]]>
VC/C++的中文字W处理方?/title><link>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66250.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:39:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66250.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/66250.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66250.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/66250.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/66250.html</trackback:ping><description><![CDATA[<p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">怎样?a name="baidusnap2"></a><b>汉字</b>转换?a name="baidusnap0"></a><b>整数</b>Q又怎样把该<b>整数</b>q原?b>汉字</b><span lang="EN-US"><?xml:namespace prefix = o /><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">char * str="</span><b><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">汉字</span></b><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">";BYTE *pstr=(BYTE*)str;BYTE B=pstr[i];B </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">是<b>整数</b><span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 
6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">一 引入问题</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">代码<span lang="EN-US"> wchar_t a[3]=L?/span>中国<span lang="EN-US">?/span>Q编译时出错Q出错信息ؓQ数l越界。但<span lang="EN-US">wchar_t </span>是一个宽字节cdQ数l?span lang="EN-US">a</span>的大应?span lang="EN-US">6</span>个字节,而两个汉字的?span lang="EN-US">unicode</span>码占<span lang="EN-US">4</span>个字节,再加上一个结束符Q最?span lang="EN-US">6</span>个字节,所以应该不会越界?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">N是编译器出问题了Q?span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">?解决引入问题所需的知?/span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: 
widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">   </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Two areas of background are needed. The first is how characters, Chinese characters in particular, are encoded, and how well languages and tools support those encodings. The second is how VC/C++ allocates memory under the Multi-Byte Character Set and the Wide Character Set.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">III. Chinese character encodings and their handling in VC/C++</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">1.</span><span style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Introduction to Chinese character encodings</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">For English text, the characters of the 7-bit ASCII set are sufficient, and entering and displaying English characters on a computer is simple. Input, storage, internal processing, and output of English text can therefore all use one and the same encoding (ASCII, for example).<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Chinese, by contrast, is a logographic script with an enormous character inventory (six to seven thousand characters in common modern use alone, and more than 50,000 in total). The glyphs are complex, every character carries three elements (sound, shape, and meaning), and homophones and variant forms abound. All of this makes computer processing of Chinese difficult. To process Chinese characters on a computer, several problems must be solved. First, input: how to get structurally complex square characters into the computer, which is the crux of Chinese text processing. Second, representation: how characters are represented and stored inside the computer, and how they coexist with Western text. Finally, output: how the processed results get back out of the computer.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">To that end, Chinese characters must be given codes, that is, encoded. Corresponding to the three main stages above (input, internal processing, and output), each character has an input code, an interchange code, an internal code, and a glyph code. A Chinese information-processing system performs the following conversions when handling a character: input code → interchange code → internal code → glyph code.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(1)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Input code: lets Chinese characters be entered with an existing standard Western keyboard. It is also called the external code. Input codes fall into four main classes:<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: 
break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">a)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">      </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Numeric codes: each character is assigned a fixed-length string of digits, and that number serves as its input code. The row-cell code (区位码) and the telegraph code are examples.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">b)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">      </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Phonetic (pinyin) codes: input methods based on a character's pronunciation.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">c)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">      </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Shape-based codes: input codes based on a character's graphical structure, for example the Wubi code (also known as Wang Ma).<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: 
-21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">d)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">      </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Phonetic-shape codes: input codes that combine a character's pronunciation with its shape.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(2)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Interchange code: used to convert between external (input) codes and internal codes. The national standard that defines the interchange code is GB2312-80.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(3)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Internal code: the basic representation of a Chinese character inside the computer, used to recognize, store, process, and transmit it. The internal code is also a two-byte code: setting the high bit of both bytes of the national-standard (GB) code to "1" yields the internal code. For example, 中 has GB code 0x5650 and internal code 0xD6D0.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; 
font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(4)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Glyph code: encodes a character's shape information (its structure, form, strokes, and so on) and is used to output the character (display and printing).<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">2.</span><span style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Chinese character encoding in VC</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">   VC/C++</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> on a Simplified Chinese system uses the GB2312 internal code to encode Chinese characters. Its input and output facilities (cin/wcin, cout/wcout, scanf/wscanf, printf/wprintf, ...) are therefore all based on GB2312, and if Chinese text arrives in some other internal code, these facilities will not interpret it correctly.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Look closely at the extended ASCII table: the characters from position 161 onward are seldom used, and negative values are not used either. GB2312 exploits this by taking the byte values 161 to 254 (0xA1 to 0xFE, or -95 to -2 as a signed char) as markers for Chinese characters. Those 94 values alone cannot cover the Chinese character inventory, so bytes are combined in pairs (one Chinese character occupies two bytes); 94 × 94 = 8836 code positions is basically enough for the characters in common use. When the computer encounters two consecutive bytes greater than 160, it treats the pair as one Chinese character. The demo program below simulates how VC/C++ recognizes Chinese characters when outputting text.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">    </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">unsigned</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">char</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> input[50];<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; text-indent: 24pt; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">cin>>input;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; 
margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">    </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">int</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> flag=0;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">    </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">for</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(</span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">int</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> i =0 ;i < 50 ;i++)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">    {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; 
line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">if</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(input[i] > 0xa0 && input[i] != 0)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">if</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(flag == 1)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 
0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">              cout<<"chinese character"<<endl;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">              flag = 0;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">else</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           
{<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">              flag++;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> <o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">else</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; 
mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">if</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(input[i] == 0)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">break</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; 
mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">else</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> <o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">           cout<<"english character"<<endl;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">       }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">}<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: 
widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Input: Hello中国 (the GB2312 internal codes of "中国" are 214 208 and 185 250)<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Output: <span lang="EN-US">english character<o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; 
word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">chinese character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">chinese character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">VC/C++</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> still encodes English characters as ASCII. One can imagine that when programmers in other countries write VC/C++ programs that take input in their own script, VC/C++ handles those characters using that country's character encoding.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 
0pt">    </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">This raises another problem: when a VC/C++ program written in Korea runs on a Chinese VC/C++ setup that lacks the matching code-page tables, Korean text may display as garbage. My personal guess is that the VC installer ships code-page tables for many countries, which must take up a lot of space. How much better it would be if every country used one unified encoding and every programming language and development tool supported it! In fact such an encoding already exists, and many newer languages such as Java and C# support it. It is Unicode, described next.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">3.</span><span style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> A new internal-code standard: Unicode</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Unicode</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> (统一码, 万国码, or 单一码 in Chinese) is a character encoding used on computers. It assigns every character of every language a single, unique binary code, so that text can be converted and processed across languages and platforms. Development began in 1990, and version 1.0 was published in 1991. As computers grew more capable, Unicode spread widely over the following decade and more. When this article was written, the latest version was Unicode 4.1.0, released on March 31, 2005; a 5.0 beta followed on December 12, 2005 for review by consortium members.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Unicode </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">can be divided into two layers: the coded character set (which number each character gets) and the implementation (how that number is stored as bytes).<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Coded character set: Unicode's character set corresponds to the Universal Character Set (UCS) of ISO 10646. The form in common practical use corresponds to UCS-2, which has a 16-bit code space, so each character occupies 2 bytes. In theory this can represent at most 2^16 = 65,536 characters, which basically covers the needs of the world's languages. In practice Unicode does not fill the 16-bit space; large ranges are reserved for special uses or future extension. Unicode has also been extended beyond 16 bits, with code points up to U+10FFFF, which UTF-16 represents using surrogate pairs.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Implementation: the implementation layer is distinct from the coded character set. A character's Unicode code is fixed, but in actual transmission, because platforms are designed differently and space needs to be saved, that code is serialized in different ways. These serializations are called Unicode Transformation Formats (UTF). UTF-8, for example, is a variable-length encoding: the basic 7-bit ASCII characters keep their 7-bit codes and occupy one byte (with the high bit set to 0), while other Unicode characters are converted by a fixed algorithm into multi-byte sequences, 1 to 3 bytes per character within the 16-bit range and 4 bytes beyond it. The high bit of each byte (0 or 1) tells the two cases apart.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Java</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> and C# both use Unicode: a character defined in either language is stored in memory as its two-byte Unicode code (a UTF-16 code unit), as shown below:<span 
lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">char</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> a='</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">?span lang="EN-US">';    => </span>内存中存攄<span lang="EN-US">Unicode</span>码ؓQ?span lang="EN-US">25105</span></span></p><img src ="http://www.shnenglu.com/woaidongmao/aggbug/66250.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:39 <a href="http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66250.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Win32 字符~码http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66246.html肥仔肥仔Fri, 07 Nov 2008 14:33:00 GMThttp://www.shnenglu.com/woaidongmao/archive/2008/11/07/66246.htmlhttp://www.shnenglu.com/woaidongmao/comments/66246.htmlhttp://www.shnenglu.com/woaidongmao/archive/2008/11/07/66246.html#Feedback0http://www.shnenglu.com/woaidongmao/comments/commentRss/66246.htmlhttp://www.shnenglu.com/woaidongmao/services/trackbacks/66246.html阅读全文

肥仔 2008-11-07 22:33
]]>
The three character encodings used in C++</title><link>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66247.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:33:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66247.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/66247.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66247.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/66247.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/66247.html</trackback:ping><description><![CDATA[<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">C++ code on Windows typically works with three kinds of character encoding: SBCS (single-byte character set), MBCS (multi-byte character set), and Unicode. In SBCS every character is one byte. In MBCS a character takes a varying number of bytes (one, two, three, and so on), but in practice two bytes is by far the most common case, so it is no surprise to see DBCS (double-byte character set) used in place of MBCS. Unicode, in the Windows sense, means UTF-16: every code unit is two bytes, and characters outside the Basic Multilingual Plane take two code units. Inside the Windows NT kernel the APIs work in Unicode throughout, so if your program uses a non-Unicode encoding, the system converts the arguments to Unicode, executes the call, and converts the result back to the type you use. Single-byte characters are represented by char, Unicode characters by wchar_t. We grew up with single bytes, and abandoning them overnight is hard to accept, yet sometimes Unicode cannot be avoided. Microsoft's answer is a generic character type: TCHAR.</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">Let us look at its definition:</span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">#ifdef UNICODE<br>typedef wchar_t TCHAR;<br>#else<br>typedef char TCHAR;<br>#endif</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">With that, the problem is solved: define UNICODE and TCHAR becomes wchar_t, which is very convenient. Note also that Windows COM uses Unicode throughout, whereas MFC defaults to MBCS, so a class written with MFC may format some strings incorrectly or return wrong values once it is moved into COM. The reason is that COM always uses Unicode: a Unicode string ends with a two-byte terminator (L'\0'), while a char string ends with a single-byte '\0'. In general, string literals need to be wrapped in the _T macro to work in both builds. For example, if you wrote S = "FSDFSDF" in MFC, then after moving the class to COM you need to write S = _T("FSDFSDF"). You can think of the _T macro as working the same way as TCHAR: in a Unicode build it automatically puts L in front of the constant string, otherwise the string is used as is.</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">A few smaller points:</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">The console application generated by VC6 has the entry point<br>int main(int argc, char* argv[])</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">while the one generated by VS C++ 2005 has</span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">int _tmain(int argc, _TCHAR* argv[])</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">Clearly _tmain is the better choice. Why?</span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">You can also use <b>_tmain</b>, which is defined in TCHAR.h. <b>_tmain</b> will resolve to <b>main</b> unless _UNICODE is defined, in which case <b>_tmain</b> will resolve to <b>wmain</b>. (<a >http://msdn2.microsoft.com/en-us/library/6wd819wh.aspx</a>)</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">We will also often run into the following character types:</span></p>
<table border="1"> <tbody>
<tr> <td>Type</td> <td>Meaning in an MBCS build</td> <td>Meaning in a Unicode build</td> </tr>
<tr> <td>WCHAR</td> <td>wchar_t</td> <td>wchar_t</td> </tr>
<tr> <td>LPSTR</td> <td>zero-terminated string of char (char*)</td> <td>zero-terminated string of char (char*)</td> </tr>
<tr> <td>LPCSTR</td> <td>constant zero-terminated string of char (const char*)</td> <td>constant zero-terminated string of char (const char*)</td> </tr>
<tr> <td>LPWSTR</td> <td>zero-terminated Unicode string (wchar_t*)</td> <td>zero-terminated Unicode string (wchar_t*)</td> </tr>
<tr> <td>LPCWSTR</td> <td>constant zero-terminated Unicode string (const wchar_t*)</td> <td>constant zero-terminated Unicode string (const wchar_t*)</td> </tr>
<tr> <td>TCHAR</td> <td>char</td> <td>wchar_t</td> </tr>
<tr> <td>LPTSTR</td> <td>zero-terminated string of TCHAR (TCHAR*)</td> <td>zero-terminated string of TCHAR (TCHAR*)</td> </tr>
<tr> <td>LPCTSTR</td> <td>constant zero-terminated string of TCHAR (const TCHAR*)</td> <td>constant zero-terminated string of TCHAR (const TCHAR*)</td> </tr>
</tbody> </table>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">In these names C generally stands for constant, P for pointer, LP for long pointer, and W for wide character, that is, Unicode. With that, it should be clear what each of these types is for.</span></p>
<p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">We will also often see functions such as _mbsstr(). These are the functions for MBCS-encoded strings; they can of course handle SBCS strings too, but the reverse is not true. To be on the safe side we can therefore use _mbsstr in place of strstr. If the program only ever handles SBCS, however, that obviously costs some efficiency, so which approach best balances efficiency and portability is something you have to weigh for yourself.</span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">From now on, if garbled text shows up in a C++ program, first check which character encoding the code is built with. In most cases the string terminator has been handled wrongly: SBCS and MBCS strings both end with a single zero byte ('\0'), while Unicode strings end with two zero bytes (L'\0').</span></p></span><img src ="http://www.shnenglu.com/woaidongmao/aggbug/66247.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:33 <a href="http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66247.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Wikipedia----UTF-16 http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66245.html肥仔肥仔Fri, 07 Nov 2008 14:31:00 GMThttp://www.shnenglu.com/woaidongmao/archive/2008/11/07/66245.htmlhttp://www.shnenglu.com/woaidongmao/comments/66245.htmlhttp://www.shnenglu.com/woaidongmao/archive/2008/11/07/66245.html#Feedback0http://www.shnenglu.com/woaidongmao/comments/commentRss/66245.htmlhttp://www.shnenglu.com/woaidongmao/services/trackbacks/66245.html Wikipedia, the free encyclopedia


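UTF-16 is one of the encoding forms of Unicode. UTF stands for Unicode/UCS Transformation Format, that is, a way of transforming Unicode into a particular format.

It is defined in Annex Q of ISO/IEC 10646-1; RFC 2781 defines a similar scheme.

Characters defined in the Unicode Basic Multilingual Plane (whether Latin letters, Chinese characters, or other scripts and symbols) are always stored in 2 bytes. Characters defined in the supplementary planes are stored as a surrogate pair, that is, as two 2-byte values.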
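
The split into one 2-byte value or a surrogate pair can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name `utf16_code_units` is made up for this example, and the arithmetic follows the algorithm given in RFC 2781.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (16-bit integers) for one code point.

    Hypothetical helper for illustration; follows the RFC 2781 algorithm.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit, equal to the code point.
        return [code_point]
    # Supplementary planes: spread the 20-bit offset over a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x6731)])    # a BMP character
print([hex(u) for u in utf16_code_units(0x1D11E)])   # a supplementary-plane character
```

Byte order is deliberately left out here: the function returns 16-bit numbers, and how each number is written as two bytes is a separate decision.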

Compared with UTF-8, the advantage of UTF-16 is that most characters are stored in a fixed length (2 bytes); on the other hand, UTF-16 is not compatible with ASCII.

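UTF-16 encoding schemes

Both big-endian and little-endian storage forms of UTF-16 are in use. Generally speaking, text created or saved on a Macintosh uses big-endian order, while text created or saved on Microsoft or Linux systems uses little-endian order.

To make the byte order of a UTF-16 file clear, a U+FEFF character is placed at the very start of the file as a Byte Order Mark (it appears as the bytes FF FE in UTF-16LE and as FE FF in UTF-16BE), which also signals that the text file is encoded in UTF-16. In Unicode, U+FEFF is ZERO WIDTH NO-BREAK SPACE: as the name suggests, a space that has no width and does not allow a line break.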
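
As a rough sketch of how a reader might use the mark, the Python function below looks at the first two bytes of a buffer. The name `detect_utf16_byte_order` is invented for this example; `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are the standard-library constants for the two byte sequences.

```python
import codecs


def detect_utf16_byte_order(data):
    """Guess the UTF-16 byte order from a leading BOM.

    Hypothetical helper: returns "little-endian", "big-endian",
    or None when no UTF-16 BOM is present.
    """
    # Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch ignores UTF-32.
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    return None


print(detect_utf16_byte_order(b"\xff\xfe\x31\x67"))
print(detect_utf16_byte_order(b"\xfe\xff\x67\x31"))
print(detect_utf16_byte_order(b"\x67\x31"))
```

When no BOM is found the byte order has to come from somewhere else (a protocol header, a file-format rule, or a heuristic), which is why the function returns `None` rather than guessing.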

The following example uses three characters: 「朱」 (U+6731), a half-width comma (U+002C), and 「聿」 (U+807F).

Example of UTF-16 encoding

Encoding name   Byte order                BOM      "朱"     ","      "聿"
UTF-16LE        little-endian             (none)   31 67    2C 00    7F 80
UTF-16BE        big-endian                (none)   67 31    00 2C    80 7F
UTF-16          little-endian, with BOM   FF FE    31 67    2C 00    7F 80
UTF-16          big-endian, with BOM      FE FF    67 31    00 2C    80 7F
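
The byte sequences in this example can be reproduced with Python's built-in codecs. The snippet is only a check of the values above; `hex_bytes` is a small helper written for this example. The explicit `utf-16-le` and `utf-16-be` codecs never write a BOM, so the BOM is prepended by hand for the last two rows.

```python
import codecs

text = "\u6731,\u807f"   # 朱 , 聿


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join("{:02X}".format(b) for b in data)


le = hex_bytes(text.encode("utf-16-le"))
be = hex_bytes(text.encode("utf-16-be"))
le_with_bom = hex_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
be_with_bom = hex_bytes(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))

for label, value in [("UTF-16LE", le), ("UTF-16BE", be),
                     ("UTF-16 LE + BOM", le_with_bom),
                     ("UTF-16 BE + BOM", be_with_bom)]:
    print(label, value)
```

Python's plain `utf-16` codec is avoided on purpose: it writes a BOM followed by the machine's native byte order, so its output differs between platforms.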

 

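The relationship between UTF-16 and UCS-2

UTF-16 can be regarded as a superset of UCS-2. Before supplementary-plane characters existed, UTF-16 and UCS-2 meant the same thing. Once supplementary-plane characters were introduced, the encoding was simply called UTF-16. Today, when software claims to support the UCS-2 encoding, that is in effect a euphemism for saying it cannot handle supplementary-plane characters.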
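
One way to make the difference concrete is to test whether a piece of text stays inside the 16-bit range. The sketch below is an illustration only; `fits_in_ucs2` is a name invented for this example.

```python
def fits_in_ucs2(text):
    """Return True if every character of text is in the BMP (U+0000..U+FFFF).

    Hypothetical helper: such text has the same bytes in UCS-2 and UTF-16.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_text = "\u6731,\u807f"     # 朱 , 聿
clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, a supplementary-plane character

print(fits_in_ucs2(bmp_text), len(bmp_text.encode("utf-16-le")))
print(fits_in_ucs2(clef), len(clef.encode("utf-16-le")))
```

Text for which the check returns `False` needs surrogate pairs, which is exactly the part that UCS-2-only software cannot handle.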

 

 



肥仔 2008-11-07 22:31
]]>
On Unicode encoding: a brief explanation of UCS, UTF, BMP, BOM and related terms</title><link>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66242.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:14:00 GMT</pubDate><guid>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66242.html</guid><wfw:comment>http://www.shnenglu.com/woaidongmao/comments/66242.html</wfw:comment><comments>http://www.shnenglu.com/woaidongmao/archive/2008/11/07/66242.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/woaidongmao/comments/commentRss/66242.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/woaidongmao/services/trackbacks/66242.html</trackback:ping><description><![CDATA[<p> </p>
<p class=MsoNormal>This is a piece of light reading written by a programmer for programmers. "Light" here means that it lets you pick up, fairly painlessly, some concepts you were not clear about before, a bit like levelling up in an RPG. Two questions prompted me to put it together:</p>
<p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">Question 1:</span></p>
<p style="MARGIN-LEFT: 36pt">With "Save As" in Windows Notepad you can convert a file between GBK, Unicode, Unicode big endian and UTF-8. They are all txt files, so how does Windows tell which encoding is in use?</p>
<p style="MARGIN-LEFT: 36pt">I noticed long ago that txt files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on?</p>
<p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">Question 2:</span></p>
<p class=MsoNormal style="MARGIN-LEFT: 36pt"><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">I recently came across a file called ConvertUTF.c on the web that converts between UTF-32, UTF-16 and UTF-8. I already knew about Unicode (UCS2), GBK and UTF-8, but this program confused me: I could not remember how UTF-16 relates to UCS2.</span></p>
<p>After looking things up I finally got these questions straight, and learned some Unicode details along the way. I have written them up for anyone who has had similar doubts. I have tried to keep the text easy to follow, but the reader is expected to know what a byte is and what hexadecimal is.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span style="FONT-SIZE: 12pt">0. Big endian and little endian</span></h3> </div>
<p>Big endian and little endian are two ways in which a CPU handles multi-byte numbers. For example, the Unicode code of the character "汉" is 6C49. When it is written to a file, should 6C come first or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian.</p>
<p>The word "endian" comes from Gulliver's Travels. The civil war in Lilliput arose over whether an egg should be cracked at the big end (Big-Endian) or the little end (Little-Endian); it led to six rebellions, in which one emperor lost his life and another his crown.</p>
<p>In Chinese, endian is usually translated as "byte order", and big endian and little endian are called "big tail" and "little tail".</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span style="FONT-SIZE: 12pt">1. Character encodings, internal codes, and a note on Chinese encodings</span></h3> </div>
<p>Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII; to handle Chinese, programmers designed GB2312 for Simplified Chinese and big5 for Traditional Chinese.</p>
<p>GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. In internal-code form, the Chinese character area uses high bytes B0-F7 and low bytes A1-FE, which gives 72*94=6768 code positions; the 5 unused positions are D7FA-D7FE.</p>
<p>GB2312 covers too few Chinese characters. The 1995 extension specification GBK 1.0 contains 21886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area holds 21003 characters. GB18030 (2000) is the official national standard that replaced GBK 1.0. It contains 27484 Chinese characters and also covers the major minority scripts such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030, while embedded products are exempt for the time being, which is why mobile phones and MP3 players generally support only GB2312.</p>
<p>From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same code in each of them, and the later standards simply support more characters. English and Chinese can be handled uniformly in these encodings; a Chinese character is recognized by the fact that the top bit of its high byte is not 0. In programmers' terms, GB2312, GBK and GB18030 all belong to the double-byte character sets (DBCS), although GB18030 additionally defines four-byte codes.</p>
<p>The default internal code of some Chinese versions of Windows is still GBK; it can be upgraded to GB18030 with the GB18030 upgrade package. The characters GB18030 adds over GBK are rarely needed by ordinary users, however, so we usually still say GBK when we mean the internal code of Chinese Windows.</p>
<p>A few more details:</p>
<p style="MARGIN-LEFT: 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1; tab-stops: list 36.0pt">· The original text of GB2312 is expressed in row-cell (区位) codes; to get from a row-cell code to the internal code, add A0 to both the high byte and the low byte.</p>
<p style="MARGIN-LEFT: 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1; tab-stops: list 36.0pt">· In DBCS, GB internal codes are always stored big endian, that is, high byte first.</p>
<p style="MARGIN-LEFT: 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1; tab-stops: list 36.0pt">· In GB2312 the top bit of both bytes is 1, but only 128*128=16384 code positions satisfy that condition. In GBK and GB18030 the top bit of the low byte may therefore be 0. This does not affect the parsing of a DBCS stream: while reading, whenever a byte with the top bit set is encountered, that byte and the one after it can be taken together as one double-byte code, regardless of the top bit of the low byte.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span style="FONT-SIZE: 12pt">2. Unicode, UCS and UTF</span></h3> </div>
<p>As mentioned above, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, by contrast, is compatible only with ASCII (more precisely, with ISO-8859-1) and not with the GB codes. For example, the Unicode code of "汉" is 6C49, whereas its GB code is BABA.</p>
<p>Unicode is also a character encoding, but one designed by international bodies to accommodate all the writing systems of the world. Its ISO counterpart is formally named "Universal Multiple-Octet Coded Character Set", abbreviated UCS; informally, UCS can be read as "Unicode Character Set".</p>
<p>According to Wikipedia (http://zh.wikipedia.org/wiki/), two organizations historically tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software makers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.</p>
<p>Around 1991 both sides recognized that the world does not need two incompatible character sets. They began merging their work and cooperating on a single code table. From Unicode 2.0 onward, the Unicode project has used the same repertoire and code points as ISO 10646-1.</p>
<p>Both projects still exist and publish their standards independently. When this article was written, the Unicode Consortium's latest version was Unicode 4.1.0 (2005), and the latest ISO standard was ISO/IEC 10646:2003.</p>
<p>UCS specifies how to represent the various scripts with multiple bytes. How these codes are transmitted is specified by the UTF (UCS Transformation Format) specifications; the common ones are UTF-8, UTF-7 and UTF-16.</p>
<p>The IETF's RFC2781 and RFC3629 describe the UTF-16 and UTF-8 encodings in the usual RFC style: clear, brisk and yet rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs maintained by the IETF are the foundation of every specification on the Internet.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span style="FONT-SIZE: 12pt">3. UCS-2, UCS-4 and BMP</span></h3> </div>
<p>UCS comes in two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with 4 bytes (of which only 31 bits are used; the top bit must be 0). Let us do a little arithmetic:</p>
<p>UCS-2 has 2^16=65536 code positions, and UCS-4 has 2^31=2147483648 code positions.</p>
<p>UCS-4 is divided by its highest byte (whose top bit is 0) into 2^7=128 groups. Each group is divided by the next byte into 256 planes. Each plane is divided by the third byte into 256 rows, and each row contains 256 cells. Cells in the same row differ only in the last byte.</p>
<p>Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. Put differently, the UCS-4 code positions whose two high bytes are 0 make up the BMP.</p>
<p>Dropping the two leading zero bytes from the UCS-4 BMP gives UCS-2; prefixing two zero bytes to the two bytes of UCS-2 gives the UCS-4 BMP. Note that characters have been assigned outside the BMP since Unicode 3.1 (2001), so UCS-2 cannot represent every assigned character.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span style="FONT-SIZE: 12pt">4. The UTF encodings</span></h3> </div>
<p>UTF-8 encodes UCS in units of 8 bits. The mapping from UCS-2 to UTF-8 is as follows:</p>
<table class=MsoNormalTable style="WIDTH: 75%; mso-cellspacing: 1.5pt" cellPadding=0 width="75%" border=1> <tbody>
<tr style="mso-yfti-irow: 0; mso-yfti-firstrow: yes"> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">UCS-2 code (hexadecimal)</span></p> </td> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">UTF-8 byte stream (binary)</span></p> </td> </tr>
<tr style="mso-yfti-irow: 1"> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0000 - 007F</span></p> </td> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0xxxxxxx</span></p> </td> </tr>
<tr style="mso-yfti-irow: 2"> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0080 - 07FF</span></p> </td> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">110xxxxx 10xxxxxx</span></p> </td> </tr>
<tr style="mso-yfti-irow: 3; mso-yfti-lastrow: yes"> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0800 - FFFF</span></p> </td> <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt"> <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">1110xxxx 10xxxxxx 10xxxxxx</span></p> </td> </tr>
</tbody> </table>
<p>For example, the Unicode code of "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template must be used: <span style="COLOR: blue">1110</span>xxxx <span style="COLOR: blue">10</span>xxxxxx <span style="COLOR: blue">10</span>xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting these bits for the x's in the template in order gives <span style="COLOR: blue">1110</span>0110 <span style="COLOR: blue">10</span>110001 <span style="COLOR: blue">10</span>001001, that is, E6 B1 89.</p>
<p>Readers can check with Notepad whether this encoding is correct.</p>
<p>UTF-16 encodes UCS in units of 16 bits. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined (surrogate pairs). Since UCS-2, and the BMP of UCS-4, always lie below 0x10000, UTF-16 and UCS-2 can be treated as essentially the same for BMP text. But UCS-2 is only an encoding scheme, whereas UTF-16 is meant for actual transmission, so the question of byte order has to be considered.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span style="FONT-SIZE: 12pt">5. UTF byte order and the BOM</span></h3> </div>
<p>UTF-8 uses the byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting a UTF-16 text we must first know the byte order of each code unit. For example, the Unicode code of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?</p>
<p>The method recommended by the Unicode specification for marking byte order is the BOM. This BOM is not the "Bill Of Material" kind but the Byte Order Mark. It is a slightly clever idea:</p>
<p>UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF. FFFE, on the other hand, is not a character in UCS (it is a noncharacter), so it should never appear in real transmission. The UCS specification suggests sending the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.</p>
<p>If the receiver then reads FEFF, the byte stream is Big-Endian; if it reads FFFE, the byte stream is Little-Endian. This is why "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.</p>
<p>UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8.</p>
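<p>The UTF-8 template table and the BOM bytes discussed above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name <code>utf8_encode_bmp</code> is invented here, and it covers just the 16-bit range shown in the table (code points above FFFF need a 4-byte form that the table does not list).</p>

```python
def utf8_encode_bmp(code_point):
    """Encode one code point in 0000-FFFF using the 1-, 2- and 3-byte templates.

    Hypothetical helper for illustration; not a full UTF-8 encoder.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers 0000-FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x6C49, 0xFEFF):
    encoded = utf8_encode_bmp(cp)
    print("U+{:04X}".format(cp), " ".join("{:02X}".format(b) for b in encoded))
```

<p>The two code points tried here are the ones worked through in the text: 6C49 for the 3-byte template example and FEFF for the UTF-8 form of the BOM.</p>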
<p>Windows uses the BOM to mark the encoding of text files.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt"> <h3><span lang=EN-US style="FONT-SIZE: 12pt">6</span><span style="FONT-SIZE: 12pt">. Further reading</span></h3> </div>
<p>The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).</p>
<p>I also found two other resources that look good, but since my original questions had all been answered, I have not read them:</p>
<ol type=1> <li id="e8uuwii" class=MsoNormal style="TEXT-ALIGN: left; mso-list: l1 level1 lfo2; tab-stops: list 36.0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; mso-pagination: widow-orphan"><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)</span></li> <li id="2smqi0g" class=MsoNormal style="TEXT-ALIGN: left; mso-list: l1 level1 lfo2; tab-stops: list 36.0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; mso-pagination: widow-orphan"><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)</span></li> </ol>
<p>I have written a package for converting among UTF-8, UCS-2 and GBK, in one version that uses the Windows API and one that does not use the <span lang=EN-US>Windows 
API</span>. When I have time, I will tidy it up and put it on my personal home page (http://fmddlmyy.home4u.china.com).</p>
<p>I only started writing this article after I had thought all the questions through, and I expected to finish it quickly. I did not expect that choosing the wording and checking the details would take so long: I ended up writing from 1:30 in the afternoon until 9:00. I hope some readers will benefit from it.</p>
<p class=MsoNormal>Appendix 1: More on row-cell (区位) codes, GB2312, internal codes and code pages<br>Some readers still had questions about this sentence in the article:<br>"The original GB2312 text still uses row-cell codes; to get from a row-cell code to the internal code, A0 must be added to both the high byte and the low byte."<br><br>Let me explain in more detail:<br><br>"The original GB2312 text" refers to a national standard issued in 1980, "National Standard of the People's Republic of China: Code of Chinese Graphic Character Set for Information Interchange, Primary Set, GB 2312-80". This standard encodes Chinese characters and Chinese symbols with two numbers. The first number is called the "row" (区) and the second the "cell" (位), hence the name row-cell code. Rows 1-9 hold Chinese symbols, rows 16-55 hold level-1 Chinese characters, and rows 56-87 hold level-2 Chinese characters. Windows still ships a row-cell input method; for example, typing 1601 produces "啊". (This input method automatically recognises both hexadecimal GB2312 codes and decimal row-cell codes, so typing B0A1 also produces "啊".)<br><br>The internal code is the character encoding used inside the operating system. In early operating systems the internal code depended on the language. Today's Windows supports Unicode internally and uses code pages to adapt to individual languages, so the notion of an "internal code" has become rather blurred. Microsoft generally describes the encoding specified by the default code page as the internal code.<br><br>The term "internal code" has no official definition, and "code page" is merely Microsoft's name for the concept. As programmers we only need to know what these things are; there is no need to research the terminology at length.<br><br>A code page is a character encoding for one language's script. For example, the code page of GBK is CP936, the code page of BIG5 is CP950, and the code page of GB2312 is CP20936.<br><br>Windows has the concept of a default code page, that is, the encoding used by default to interpret characters. Suppose Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?<br><br>Should it be interpreted as Unicode, as GBK, as BIG5, or as ISO8859-1? Interpreted as GBK, it yields the two characters "汉字". Interpreted under other encodings, it may map to no character at all, or to the wrong characters. "Wrong" here means different from what the author of the text intended, and that is how garbled text arises.<br><br>The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through Regional Options in the Control Panel. The ANSI option in Notepad's Save As dialog simply saves the file using the encoding of the default code page.<br><br>The internal code of Windows is Unicode, so technically it can support several code pages at once. As long as a file declares which encoding it uses and the user has the corresponding code page installed, Windows can display it correctly. An HTML file, for example, can specify its charset.<br><br>Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset. If such an author uses characters in the range 0x80-0xff and a Chinese edition of Windows interprets them with its default GBK, the text comes out garbled. The fix is to add a statement specifying the charset to the html file, for example:<br>&lt;meta http-equiv="Content-Type" content="text/html; charset=ISO8859-1"&gt;<br>If the code page the original author used is compatible with ISO8859-1, the garbling disappears.<br><br>Back to row-cell codes: the row-cell code of "啊" is 1601, which in hexadecimal is 0x10, 0x01. This conflicts with the ASCII encoding that computers use everywhere. To stay compatible with ASCII in the range 00-7f, A0 is added to both the high byte and the low byte of the row-cell code, so the encoding of "啊" becomes B0A1. The encoding with A0 added to both bytes is also called GB2312 encoding, even though the original GB2312 text never mentions this step.</p>
<br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:14</div>]]></description></item></channel></rss> </body>