
可冰 — Thu, 29 Sep 2005 12:34:00 GMT

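Last week I put a lot of effort into writing a UTF-8/Unicode conversion facility with templates (see the file code.rar). At first it felt fine, but over the past few days I have started to wonder: why not simply provide two functions? Wouldn't that be simpler and more convenient? What extra benefit does my design actually bring? My original aim was code that is convenient to use and easy to extend and maintain, but now I feel it offers little over plain C-style functions. Perhaps a feature this simple does not need code this complex at all. It is just as Eric Raymond said of C++: it makes programmers inclined to write complex code.
I would like everyone to take a look at my code and give me some comments and suggestions.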

Posted by 可冰, 2005-09-29 20:34

Designing a UTF-8 decoding module

可冰 — Thu, 22 Sep 2005 15:24:00 GMT

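I want to write an "engine" that decodes UTF-8 files into Unicode, and it has to be convenient and comfortable to use.
But after several days of thinking I still have no suitable design for it.
Sigh...
Today I made a first attempt just to get a feel for it; I will keep thinking from there...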

Posted by 可冰, 2005-09-22 23:24

How does std::wfstream support wide characters?

可冰 — Thu, 22 Sep 2005 14:47:00 GMT

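std::wfstream is defined as:
typedef basic_fstream<wchar_t, char_traits<wchar_t> > wfstream;
When reading a character:
wfstream wfile( "wcharfile.txt" );
wchar_t wch = wfile.get();
Semantically this ought to read two bytes of content. But when I printed the result, I found that it read only one byte. So how is this any different from fstream?
How should wide-character streams actually be used when processing Unicode-encoded files?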
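
A likely explanation, offered as a sketch rather than a definitive answer: a wide file stream does not read raw two-byte units. Its file buffer passes the file's bytes through the codecvt facet of the stream's locale, and with the default "C" locale each byte is widened to one wchar_t. The helper name read_back_wide and the file name used in the usage note are made up for this illustration.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: writes raw bytes to `path`, then reads them back
// through std::wifstream and returns the wide characters it produced.
std::wstring read_back_wide(const std::string& path, const std::string& bytes) {
    {
        std::ofstream out(path, std::ios::binary);
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    }   // the file is closed here, before it is reopened for reading

    // No locale is imbued, so the stream uses the global locale ("C" unless
    // the program changed it) and its codecvt facet for byte-to-wchar_t conversion.
    std::wifstream in(path, std::ios::binary);
    std::wstring result;
    wchar_t wch;
    while (in.get(wch)) {
        result.push_back(wch);
    }
    return result;
}
```

Calling read_back_wide("wfstream_demo.txt", "AB") yields two wide characters, one per byte, not a single character built from two bytes. To read UTF-8 or UTF-16 files, the usual approach is to imbue the stream with a locale whose codecvt facet performs that conversion before the first read.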

Posted by 可冰, 2005-09-22 22:47

可冰 — Tue, 20 Sep 2005 12:39:00 GMT

Posted by 可冰, 2005-09-20 20:39

Summary of the UTF-8 encoding format

可冰 — Mon, 19 Sep 2005 12:03:00 GMT

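[The following is only my personal summary; if anything is wrong, please correct me. Thanks!]
The following byte sequences are used to represent a character. Which sequence is used depends on the character's code number in Unicode.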

U+00000000 - U+0000007F:	0 xxxxxxx	0x - 7x
U+00000080 - U+000007FF:	110 xxxxx 10 xxxxxx	Cx 8x - Dx Bx
U+00000800 - U+0000FFFF:	1110 xxxx 10 xxxxxx 10 xxxxxx	Ex 8x 8x - Ex Bx Bx
U+00010000 - U+001FFFFF:	11110 xxx 10 xxxxxx 10 xxxxxx 10 xxxxxx	F0 8x 8x 8x - F7 Bx Bx Bx	rarely used
U+00200000 - U+03FFFFFF:	111110 xx 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx	F8 8x 8x 8x 8x - FB Bx Bx Bx Bx
U+04000000 - U+7FFFFFFF:	1111110 x 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx	FC 8x 8x 8x 8x 8x - FD Bx Bx Bx Bx Bx

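* The bytes FE and FF never appear in the encoding.
* Every byte except the first lies in the range 0x80 to 0xBF. The start of each multi-byte character can be recognized from its lead byte (0xC0-0xDF, 0xE0-0xEF, 0xF0-0xF7 and so on, by checking the high four or eight bits); a byte below 0x80 is a single-byte character. Any byte in the range 0x80 to 0xBF is a continuation byte and must be skipped when counting characters.
* Unicode is a code table: it assigns each character to a number (Unicode actually does more than that, for example it provides algorithms for comparison, display and so on);
UTF-8 is an encoding form: it defines how those code numbers are represented and stored.
* Converting UTF-8 to a Unicode code number: strip all the marker bits; if the remaining bits are too few, pad with zeros at the high end to make up 32 bits.
* Converting a Unicode code number to UTF-8: starting from the low end, take 6 bits at a time and prefix each group with the two bits 10; the final group of fewer than 6 bits (not counting leading zeros) goes into the first byte, prefixed with the marker for the sequence length, such as 110 or 1110.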
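
The two conversion rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from code.rar: the names utf8_encode and utf8_decode are made up here, the six-row table above is taken as the definition (code numbers up to U+7FFFFFFF), and overlong sequences are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code number as UTF-8 per the six-row table.
// Returns an empty string if cp does not fit in 31 bits.
std::string utf8_encode(std::uint32_t cp) {
    if (cp > 0x7FFFFFFFu) return std::string();
    if (cp < 0x80) return std::string(1, static_cast<char>(cp));
    const std::size_t len = cp < 0x800 ? 2 : cp < 0x10000 ? 3
                          : cp < 0x200000 ? 4 : cp < 0x4000000 ? 5 : 6;
    std::string out(len, '\0');
    // From the low end: take 6 bits at a time and prefix them with "10".
    for (std::size_t i = len - 1; i > 0; --i) {
        out[i] = static_cast<char>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    // Lead byte: `len` one-bits, a zero bit, then the remaining high bits.
    out[0] = static_cast<char>(((0xFF << (8 - len)) & 0xFF) | cp);
    return out;
}

// Hypothetical helper: decodes the character starting at s[pos].
// On success stores the code number in cp and the byte count in len.
bool utf8_decode(const std::string& s, std::size_t pos,
                 std::uint32_t& cp, std::size_t& len) {
    if (pos >= s.size()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (lead < 0x80) { cp = lead; len = 1; return true; }
    if (lead < 0xC0 || lead > 0xFD) return false;   // continuation byte, FE or FF
    std::size_t n = 2;
    while (lead & (0x80u >> n)) ++n;                // count the leading one-bits
    if (pos + n > s.size()) return false;           // truncated sequence
    std::uint32_t value = lead & (0x7Fu >> n);      // strip the marker bits
    for (std::size_t i = 1; i < n; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return false;       // not a continuation byte
        value = (value << 6) | (b & 0x3Fu);
    }
    cp = value;
    len = n;
    return true;
}
```

For instance, utf8_encode(0xA9) gives the two bytes C2 A9, and decoding E2 89 A0 gives U+2260 with a length of 3.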

Posted by 可冰, 2005-09-19 20:03

UTF types

可冰 — Mon, 19 Sep 2005 07:38:00 GMT

UTF	Estimated average storage required per page (3000 characters)	Notes
UTF-8	3 KB (1999) 5 KB (2003)	On average, English takes slightly over one unit per code point. Most Latin-script languages take about 1.1 bytes. Greek, Russian, Arabic and Hebrew take about 1.7 bytes, and most others (including Japanese, Chinese, Korean and Hindi) take about 3 bytes. Characters in surrogate space take 4 bytes, but as a proportion of all world text they will always be very rare.
UTF-16	6 KB	All of the most common characters in use for all modern writing systems are already represented with 2 bytes. Characters in surrogate space take 4 bytes, but as a proportion of all world text they will always be very rare.
UTF-32	12 KB	All take 4 bytes

[Source: http://icu.sourceforge.net/docs/papers/forms_of_unicode/]
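
The per-character costs in the table can be checked with a short calculation. The sketch below is only an illustration: the names Sizes and storage_for are invented here, and the input is assumed to be valid code points in the Unicode range U+0000 to U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: total bytes needed in each encoding form.
struct Sizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sums the storage needed for `text` in UTF-8, UTF-16 and UTF-32.
Sizes storage_for(const std::u32string& text) {
    Sizes total{0, 0, 0};
    for (char32_t cp : text) {
        total.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        total.utf16 += cp < 0x10000 ? 2 : 4;   // surrogate pairs take 4 bytes
        total.utf32 += 4;                      // always 4 bytes
    }
    return total;
}
```

A page of 3000 ASCII letters comes to 3000, 6000 and 12000 bytes respectively, while 3000 Chinese characters from the BMP come to 9000, 6000 and 12000 bytes.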

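UTF-8 (ISO 10646-1) has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded as the bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is identical under the ASCII and UTF-8 encodings.
All UCS characters > U+007F are encoded as a sequence of two or more bytes, each of which has its high bit set. Therefore an ASCII byte (0x00-0x7F) can never be part of any other character.
The first byte of a multi-byte sequence representing a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes the character contains. All remaining bytes of a multi-byte sequence are in the range 0x80 to 0xBF. This makes resynchronization very easy, makes the encoding stateless, and makes it robust against missing bytes.
All 2³¹ possible UCS codes can be encoded.
UTF-8 encoded characters can in theory be up to 6 bytes long, but 16-bit BMP characters take at most 3 bytes.
The sorting order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.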
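
The resynchronization property can be illustrated with a small sketch. Because continuation bytes always lie in 0x80 to 0xBF, a reader can land anywhere in a byte string and find the next character boundary without any state. The function names below are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx (0x80 to 0xBF).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Counts characters by counting every byte that is not a continuation byte.
std::size_t utf8_count(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(static_cast<unsigned char>(c))) ++n;
    }
    return n;
}

// From an arbitrary byte offset, skips forward to the start of the next
// character (or to the end of the string).
std::size_t utf8_next_boundary(const std::string& s, std::size_t pos) {
    while (pos < s.size() && is_continuation(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
    return pos;
}
```

Neither function validates the input; they rely only on the byte ranges listed above.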

The following byte sequences are used to represent a character. Which sequence is used depends on the character's code number in Unicode.

U-00000000 - U-0000007F:	0xxxxxxx
U-00000080 - U-000007FF:	110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:	1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

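The xxx bit positions are filled with the bits of the character's code number in binary. The further right an x is, the less significant it is. Only the shortest multi-byte sequence that can represent a character's code number may be used. Note that in a multi-byte sequence, the number of leading "1" bits in the first byte equals the number of bytes in the whole sequence.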
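
Both rules in the paragraph above, the leading-ones count and the shortest-form requirement, can be sketched as code. The names utf8_seq_len, utf8_min_len and utf8_is_shortest are hypothetical and serve only this illustration.

```cpp
#include <cstdint>

// Sequence length announced by a first byte: 1 for 0xxxxxxx, otherwise the
// number of leading one-bits. Returns 0 for bytes that cannot start a
// character (10xxxxxx, 0xFE, 0xFF).
int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    int ones = 0;
    while (ones < 8 && (lead & (0x80 >> ones))) ++ones;
    return (ones >= 2 && ones <= 6) ? ones : 0;
}

// Shortest sequence length able to hold cp (cp assumed to fit in 31 bits).
int utf8_min_len(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3
         : cp < 0x200000 ? 4 : cp < 0x4000000 ? 5 : 6;
}

// True if a sequence starting with `lead` that decodes to cp is the shortest form.
bool utf8_is_shortest(unsigned char lead, std::uint32_t cp) {
    return utf8_seq_len(lead) == utf8_min_len(cp);
}
```

The byte pair C0 80 would decode to U+0000, but U+0000 fits in one byte, so utf8_is_shortest(0xC0, 0) is false and a strict decoder should reject that sequence.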

For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0

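The official name of this encoding is spelled UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in any documentation in other ways (such as utf8 or UTF_8), unless of course you are referring to a variable name and not the encoding itself.

Which programming languages support Unicode?

Most modern programming languages developed after around 1993 have a special data type for Unicode/ISO 10646-1 characters. In Ada95 it is called Wide_Character, in Java it is called char.

ISO C also specifies mechanisms for handling multi-byte encodings and wide characters, and more were added when Amendment 1 to ISO C was published in September 1994. These mechanisms were designed mainly with the various East Asian encodings in mind, and they are far more elaborate than what handling UCS requires. UTF-8 is an example of what the ISO C standard calls a multi-byte encoding, and the wchar_t type can be used to hold Unicode characters.
[Source: http://www.linuxforum.net/books/UTF-8-Unicode.html]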

Posted by 可冰, 2005-09-19 15:38

UTF serializations

可冰 — Mon, 19 Sep 2005 07:23:00 GMT

UTF-8	Initial `EF BB BF` is a signature, indicating that the rest of the file is UTF-8. Any `EF BF BE` is an error. A real ZWNBSP at the start of a file requires a signature first.
UTF-8N	All of the text is normal UTF-8; there is no signature. Initial `EF BB BF` is a ZWNBSP. Any `EF BF BE` is an error.
UTF-16	Initial `FE FF` is a signature indicating the rest of the text is big endian UTF-16. Initial `FF FE` is a signature indicating the rest of the text is little endian UTF-16. If neither of these are present, all of the text is big endian. A real ZWNBSP at the start of a file requires a signature first.
UTF-16BE	All of the text is big endian: there is no signature. Initial `FE FF` is a ZWNBSP. Any `FF FE` is an error.
UTF-16LE	All of the text is little endian: there is no signature. Initial `FF FE` is a ZWNBSP. Any `FE FF` is an error.
UTF-32	Initial `00 00 FE FF` is a signature indicating the rest of the text is big endian UTF-32. Initial `FF FE 00 00` is a signature indicating the rest of the text is little endian UTF-32. If neither of these are present, all of the text is big endian. A real ZWNBSP at the start of a file requires a signature first.
UTF-32BE	All of the text is big endian: there is no signature. Initial `00 00 FE FF` is a ZWNBSP. Any `FF FE 00 00` is an error.
UTF-32LE	All of the text is little endian: there is no signature. Initial `FF FE 00 00` is a ZWNBSP. Initial `00 00 FE FF` is an error.

Note: The italicized names are not yet registered, but are useful for reference.

[from: http://icu.sourceforge.net/docs/papers/forms_of_unicode/]
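
The signature rules in the table can be turned into a small detection routine. This is a sketch under assumptions: the enum and function names are made up, and the check order reflects one design choice, namely testing the 4-byte UTF-32 signatures first because `FF FE 00 00` also begins with the UTF-16 little-endian signature `FF FE`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for the signatures listed in the table.
enum class Signature { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Inspects the first bytes of a buffer and reports which signature, if any,
// it starts with.
Signature detect_signature(const std::string& bytes) {
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // UTF-32 first: its little-endian signature has FF FE as a prefix.
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Signature::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Signature::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Signature::Utf8;
    if (starts_with("\xFE\xFF", 2))         return Signature::Utf16BE;
    if (starts_with("\xFF\xFE", 2))         return Signature::Utf16LE;
    return Signature::None;
}
```

A buffer starting with `FF FE 00 00` is ambiguous in principle, since it could also be little-endian UTF-16 text whose first character is U+0000; this sketch treats it as UTF-32. When no signature is found, the table's rule applies: plain UTF-16 and UTF-32 default to big endian.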

Posted by 可冰, 2005-09-19 15:23