<?xml version="1.0" encoding="utf-8" standalone="yes"?>http://www.shnenglu.com/true/category/3966.html zh-cn Fri, 25 Sep 2009 13:58:55 GMT Fri, 25 Sep 2009 13:58:55 GMT 60 One approach to C++ struct serialization</title><link>http://www.shnenglu.com/true/archive/2009/09/24/97087.html</link><dc:creator>true</dc:creator><author>true</author><pubDate>Wed, 23 Sep 2009 19:03:00 GMT</pubDate><guid>http://www.shnenglu.com/true/archive/2009/09/24/97087.html</guid><wfw:comment>http://www.shnenglu.com/true/comments/97087.html</wfw:comment><comments>http://www.shnenglu.com/true/archive/2009/09/24/97087.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.shnenglu.com/true/comments/commentRss/97087.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/true/services/trackbacks/97087.html</trackback:ping><description><![CDATA[     Summary: C++ struct serialization, libprotobuf  <a href='http://www.shnenglu.com/true/archive/2009/09/24/97087.html'>Read the full article</a><img src ="http://www.shnenglu.com/true/aggbug/97087.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/true/" target="_blank">true</a> 2009-09-24 03:03 <a href="http://www.shnenglu.com/true/archive/2009/09/24/97087.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Protocol design, part 2: encoding struct types http://www.shnenglu.com/true/archive/2009/09/11/95949.html true true Fri, 11 Sep 2009 11:35:00 GMT http://www.shnenglu.com/true/archive/2009/09/11/95949.html http://www.shnenglu.com/true/comments/95949.html http://www.shnenglu.com/true/archive/2009/09/11/95949.html#Feedback 0 http://www.shnenglu.com/true/comments/commentRss/95949.html http://www.shnenglu.com/true/services/trackbacks/95949.html Read the full article

true 2009-09-11 19:35
]]>
Protocol design, part 1: encoding basic types</title><link>http://www.shnenglu.com/true/archive/2009/09/11/95873.html</link><dc:creator>true</dc:creator><author>true</author><pubDate>Thu, 10 Sep 2009 20:12:00 GMT</pubDate><guid>http://www.shnenglu.com/true/archive/2009/09/11/95873.html</guid><wfw:comment>http://www.shnenglu.com/true/comments/95873.html</wfw:comment><comments>http://www.shnenglu.com/true/archive/2009/09/11/95873.html#Feedback</comments><slash:comments>5</slash:comments><wfw:commentRss>http://www.shnenglu.com/true/comments/commentRss/95873.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/true/services/trackbacks/95873.html</trackback:ping><description><![CDATA[     Summary: The basics of protocol design: encoding basic types, with reference to libprotobuf  <a href='http://www.shnenglu.com/true/archive/2009/09/11/95873.html'>Read the full article</a><img src ="http://www.shnenglu.com/true/aggbug/95873.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/true/" target="_blank">true</a> 2009-09-11 04:12 <a href="http://www.shnenglu.com/true/archive/2009/09/11/95873.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Character conversion (repost) http://www.shnenglu.com/true/archive/2007/11/18/36895.html true true Sun, 18 Nov 2007 11:48:00 GMT http://www.shnenglu.com/true/archive/2007/11/18/36895.html http://www.shnenglu.com/true/comments/36895.html http://www.shnenglu.com/true/archive/2007/11/18/36895.html#Feedback 0 http://www.shnenglu.com/true/comments/commentRss/36895.html http://www.shnenglu.com/true/services/trackbacks/36895.html

I. Converting between string and wstring in C++

Method 1:

string WideToMutilByte(const wstring& _src)
{
int nBufSize = WideCharToMultiByte(GetACP(), 0, _src.c_str(),-1, NULL, 0, 0, FALSE);

char *szBuf = new char[nBufSize];

WideCharToMultiByte(GetACP(), 0, _src.c_str(),-1, szBuf, nBufSize, 0, FALSE);

string strRet(szBuf);

delete []szBuf;
szBuf = NULL;

return strRet;
}

wstring MutilByteToWide(const string& _src)
{
// Compute how many wchar_t elements the converted string needs (including the terminating null)
int nBufSize = MultiByteToWideChar(GetACP(),0,_src.c_str(),-1,NULL,0);

// Allocate nBufSize wide characters for wsBuf
wchar_t *wsBuf = new wchar_t[nBufSize];

// Convert to a Unicode wide string
MultiByteToWideChar(GetACP(),0,_src.c_str(),-1,wsBuf,nBufSize);

wstring wstrRet(wsBuf);

delete []wsBuf;
wsBuf = NULL;

return wstrRet;
}

 


Reposted from: CSDN

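In this article I present several ways to convert between C++ std::string and std::wstring.
Method 1: call WideCharToMultiByte() and MultiByteToWideChar(). The code is below (for a detailed explanation, see the book "Windows核心编程" (Windows Core Programming)):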

#include <string>
#include <windows.h>
using namespace std;
//Converting a WChar string to an ANSI string
std::string WChar2Ansi(LPCWSTR pwszSrc)
{
         int nLen = WideCharToMultiByte(CP_ACP, 0, pwszSrc, -1, NULL, 0, NULL, NULL);
 
         if (nLen<= 0) return std::string("");
 
         char* pszDst = new char[nLen];
         if (NULL == pszDst) return std::string("");
 
         WideCharToMultiByte(CP_ACP, 0, pwszSrc, -1, pszDst, nLen, NULL, NULL);
         pszDst[nLen -1] = 0;
 
         std::string strTemp(pszDst);
         delete [] pszDst;
 
         return strTemp;
}

 
string ws2s(const wstring& inputws)
{
        return WChar2Ansi(inputws.c_str());
}

 

 
//Converting an ANSI string to a WChar string


std::wstring Ansi2WChar(LPCSTR pszSrc, int nLen)
{
    int nSize = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)pszSrc, nLen, 0, 0);
    // Return an empty string on failure: constructing std::wstring from NULL is undefined behavior
    if(nSize <= 0) return std::wstring();
 
    WCHAR *pwszDst = new WCHAR[nSize+1];
    if( NULL == pwszDst) return std::wstring();
 
    MultiByteToWideChar(CP_ACP, 0,(LPCSTR)pszSrc, nLen, pwszDst, nSize);
    pwszDst[nSize] = 0;
 
    if( pwszDst[0] == 0xFEFF)                    // skip the byte-order mark 0xFEFF
        for(int i = 0; i < nSize; i ++)
            pwszDst[i] = pwszDst[i+1];
 
    wstring wcharString(pwszDst);
    delete [] pwszDst;                           // allocated with new[], so release with delete[]
 
    return wcharString;
}

 
std::wstring s2ws(const string& s)
{
     return Ansi2WChar(s.c_str(),s.size());
}


 
 
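Method 2: go through the _bstr_t wrapper class (declared in <comutil.h>) as an intermediate. Note: _bstr_t is Microsoft-specific, so the code below compiles under VS2005 but is not portable: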


#include <string>
#include <comutil.h>
using namespace std;
#pragma comment(lib, "comsuppw.lib")
 
string ws2s(const wstring& ws);
wstring s2ws(const string& s);
 
string ws2s(const wstring& ws)
{
         _bstr_t t = ws.c_str();
         char* pchar = (char*)t;
         string result = pchar;
         return result;
}

 
wstring s2ws(const string& s)
{
         _bstr_t t = s.c_str();
         wchar_t* pwchar = (wchar_t*)t;
         wstring result = pwchar;
         return result;
}


 
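Method 3: use the CRT functions mbstowcs() and wcstombs(). The functions themselves are platform-independent, but the locale must be set first (the locale name "chs" used below is the Microsoft CRT's name for Simplified Chinese).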


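#include <string>
#include <locale.h>
#include <stdlib.h>   // wcstombs, mbstowcs
#include <string.h>   // memset
#include <wchar.h>    // wmemset
using namespace std;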
string ws2s(const wstring& ws)
{
         string curLocale = setlocale(LC_ALL, NULL);        // curLocale = "C";
 
         setlocale(LC_ALL, "chs");
 
         const wchar_t* _Source = ws.c_str();
         size_t _Dsize = 2 * ws.size() + 1;
         char *_Dest = new char[_Dsize];
         memset(_Dest,0,_Dsize);
         wcstombs(_Dest,_Source,_Dsize);
         string result = _Dest;
         delete []_Dest;
 
         setlocale(LC_ALL, curLocale.c_str());
 
         return result;
}

 
wstring s2ws(const string& s)
{
         string curLocale = setlocale(LC_ALL, NULL);        // save the current locale
 
         setlocale(LC_ALL, "chs");
 
         const char* _Source = s.c_str();
         size_t _Dsize = s.size() + 1;
         wchar_t *_Dest = new wchar_t[_Dsize];
         wmemset(_Dest, 0, _Dsize);
         mbstowcs(_Dest,_Source,_Dsize);
         wstring result = _Dest;
         delete []_Dest;
 
         setlocale(LC_ALL, curLocale.c_str());              // restore the saved locale instead of forcing "C"
 
         return result;
}
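
The two functions above size their buffers by hand: 2 * ws.size() + 1 assumes at most two bytes per character, which holds for the "chs" code page but not for every locale. A minimal sketch of the same conversion, assuming only the standard library, sizes the buffer from MB_CUR_MAX and lets std::vector own the memory. The names ws2s_sketch and s2ws_sketch are hypothetical, and the sketch converts in whatever locale is currently active rather than setting one.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: wide string -> multibyte string in the current C locale.
std::string ws2s_sketch(const std::wstring& ws)
{
    // Each wide character needs at most MB_CUR_MAX bytes in the current locale.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("wcstombs: character not representable in this locale");
    return std::string(buf.data(), n);
}

// Hypothetical helper: multibyte string -> wide string in the current C locale.
std::wstring s2ws_sketch(const std::string& s)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("mbstowcs: invalid multibyte sequence");
    return std::wstring(buf.data(), n);
}
```

Because the vector releases its storage automatically, the buffer is not leaked if the std::string or std::wstring constructor throws, and the error return of the CRT functions is checked instead of ignored.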


II. Converting between UTF-8, UTF-16, and UTF-32

See ConvertUTF.c and ConvertUTF.h on Unicode.org (download: http://www.unicode.org/Public/PROGRAMS/CVTUTF/).

Implementation file ConvertUTF.c (the .h file is omitted):
/*
 * Copyright 2001-2004 Unicode, Inc.
 *
 * Disclaimer
 *
 * This source code is provided as is by Unicode, Inc. No claims are
 * made as to fitness for any particular purpose. No warranties of any
 * kind are expressed or implied. The recipient agrees to determine
 * applicability of information provided. If this file has been
 * purchased on magnetic or optical media from Unicode, Inc., the
 * sole remedy for any claim will be exchange of defective media
 * within 90 days of receipt.
 *
 * Limitations on Rights to Redistribute This Code
 *
 * Unicode, Inc. hereby grants the right to freely use the information
 * supplied in this file in the creation of products supporting the
 * Unicode Standard, and to make copies of this file in any form
 * for internal or external distribution as long as this notice
 * remains attached.
 */

/* ---------------------------------------------------------------------

    Conversions between UTF32, UTF-16, and UTF-8. Source code file.
    Author: Mark E. Davis, 1994.
    Rev History: Rick McGowan, fixes & updates May 2001.
    Sept 2001: fixed const & error conditions per
    mods suggested by S. Parent & A. Lillich.
    June 2002: Tim Dodd added detection and handling of incomplete
    source sequences, enhanced error detection, added casts
    to eliminate compiler warnings.
    July 2003: slight mods to back out aggressive FFFE detection.
    Jan 2004: updated switches in from-UTF8 conversions.
    Oct 2004: updated to use UNI_MAX_LEGAL_UTF32 in UTF-32 conversions.

    See the header file "ConvertUTF.h" for complete documentation.

------------------------------------------------------------------------ */


#include "ConvertUTF.h"
#ifdef CVTUTF_DEBUG
#include <stdio.h>
#endif

static const int halfShift  = 10; /**//* used for shifting by 10 bits */

static const UTF32 halfBase = 0x0010000UL;
static const UTF32 halfMask = 0x3FFUL;

#define UNI_SUR_HIGH_START  (UTF32)0xD800
#define UNI_SUR_HIGH_END    (UTF32)0xDBFF
#define UNI_SUR_LOW_START   (UTF32)0xDC00
#define UNI_SUR_LOW_END     (UTF32)0xDFFF
#define false       0
#define true        1

/* --------------------------------------------------------------------- */

ConversionResult ConvertUTF32toUTF16 (
    const UTF32** sourceStart, const UTF32* sourceEnd,
    UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags) {
    ConversionResult result = conversionOK;
    const UTF32* source = *sourceStart;
    UTF16* target = *targetStart;
    while (source < sourceEnd) {
    UTF32 ch;
    if (target >= targetEnd) {
        result = targetExhausted; break;
    }
    ch = *source++;
    if (ch <= UNI_MAX_BMP) { /**//* Target is a character <= 0xFFFF */
        /**//* UTF-16 surrogate values are illegal in UTF-32; 0xffff or 0xfffe are both reserved values */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END) {
        if (flags == strictConversion) {
            --source; /**//* return to the illegal value itself */
            result = sourceIllegal;
            break;
        } else {
            *target++ = UNI_REPLACEMENT_CHAR;
        }
        } else {
        *target++ = (UTF16)ch; /**//* normal case */
        }
    } else if (ch > UNI_MAX_LEGAL_UTF32) {
        if (flags == strictConversion) {
        result = sourceIllegal;
        } else {
        *target++ = UNI_REPLACEMENT_CHAR;
        }
    } else {
        /**//* target is a character in range 0xFFFF - 0x10FFFF. */
        if (target + 1 >= targetEnd) {
        --source; /**//* Back up source pointer! */
        result = targetExhausted; break;
        }
        ch -= halfBase;
        *target++ = (UTF16)((ch >> halfShift) + UNI_SUR_HIGH_START);
        *target++ = (UTF16)((ch & halfMask) + UNI_SUR_LOW_START);
    }
    }
    *sourceStart = source;
    *targetStart = target;
    return result;
}
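
The last branch of ConvertUTF32toUTF16 is the surrogate-pair arithmetic. As a rough sketch of just that step, assuming the input is already known to lie in U+10000..U+10FFFF, the hypothetical helper below (toSurrogatePair is not part of ConvertUTF.c) repeats it with the same halfBase, halfShift, and halfMask constants:

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into a UTF-16 surrogate pair {high, low}. No range checking is done here.
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(std::uint32_t cp)
{
    const std::uint32_t halfBase = 0x10000;  // same values as in the listing
    const std::uint32_t halfMask = 0x3FF;
    const int halfShift = 10;

    cp -= halfBase;  // leaves a 20-bit value
    std::uint16_t high = static_cast<std::uint16_t>((cp >> halfShift) + 0xD800);
    std::uint16_t low  = static_cast<std::uint16_t>((cp & halfMask) + 0xDC00);
    return {high, low};
}
```

For example, U+1F600 minus 0x10000 is 0xF600; its top ten bits are 0x3D and its low ten bits are 0x200, so the pair is 0xD83D, 0xDE00.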

/* --------------------------------------------------------------------- */

ConversionResult ConvertUTF16toUTF32 (
    const UTF16** sourceStart, const UTF16* sourceEnd,
    UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags) {
    ConversionResult result = conversionOK;
    const UTF16* source = *sourceStart;
    UTF32* target = *targetStart;
    UTF32 ch, ch2;
    while (source < sourceEnd) {
    const UTF16* oldSource = source; /**//*  In case we have to back up because of target overflow. */
    ch = *source++;
    /**//* If we have a surrogate pair, convert to UTF32 first. */
    if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
        /**//* If the 16 bits following the high surrogate are in the source buffer */
        if (source < sourceEnd) {
        ch2 = *source;
        /**//* If it's a low surrogate, convert to UTF32. */
        if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
            ch = ((ch - UNI_SUR_HIGH_START) << halfShift)
            + (ch2 - UNI_SUR_LOW_START) + halfBase;
            ++source;
        } else if (flags == strictConversion) { /**//* it's an unpaired high surrogate */
            --source; /**//* return to the illegal value itself */
            result = sourceIllegal;
            break;
        }
        } else { /**//* We don't have the 16 bits following the high surrogate. */
        --source; /**//* return to the high surrogate */
        result = sourceExhausted;
        break;
        }
    } else if (flags == strictConversion) {
        /**//* UTF-16 surrogate values are illegal in UTF-32 */
        if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END) {
        --source; /**//* return to the illegal value itself */
        result = sourceIllegal;
        break;
        }
    }
    if (target >= targetEnd) {
        source = oldSource; /**//* Back up source pointer! */
        result = targetExhausted; break;
    }
    *target++ = ch;
    }
    *sourceStart = source;
    *targetStart = target;
#ifdef CVTUTF_DEBUG
if (result == sourceIllegal) {
    fprintf(stderr, "ConvertUTF16toUTF32 illegal seq 0x%04x,%04x\n", ch, ch2);
    fflush(stderr);
}
#endif
    return result;
}
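
The surrogate branch of ConvertUTF16toUTF32 is the inverse calculation. A minimal sketch of that one expression, assuming the caller has already checked that the two units are a high surrogate followed by a low surrogate (fromSurrogatePair is a hypothetical name):

```cpp
#include <cstdint>

// Hypothetical helper: combine a high and a low surrogate into one code point.
// The caller is assumed to have checked the ranges (D800..DBFF and DC00..DFFF).
std::uint32_t fromSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    return ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00)
         + 0x10000;
}
```

With the pair 0xD83D, 0xDE00 this gives (0x3D << 10) + 0x200 + 0x10000 = 0x1F600.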

/* --------------------------------------------------------------------- */

/*
 * Index into the table below with the first byte of a UTF-8 sequence to
 * get the number of trailing bytes that are supposed to follow it.
 * Note that *legal* UTF-8 values can't have 4 or 5-bytes. The table is
 * left as-is for anyone who may want to do such conversion, which was
 * allowed in earlier algorithms.
 */
static const char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

/*
 * Magic values subtracted from a buffer value during UTF8 conversion.
 * This table contains as many values as there might be trailing bytes
 * in a UTF-8 sequence.
 */
static const UTF32 offsetsFromUTF8[6] = { 0x00000000UL, 0x00003080UL, 0x000E2080UL,
             0x03C82080UL, 0xFA082080UL, 0x82082080UL };

/*
 * Once the bits are split out into bytes of UTF-8, this is a mask OR-ed
 * into the first byte, depending on how many bytes follow.  There are
 * as many entries in this table as there are UTF-8 sequence types.
 * (I.e., one byte sequence, two byte etc.). Remember that sequences
 * for *legal* UTF-8 will be 4 or fewer bytes total.
 */
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

/* --------------------------------------------------------------------- */

/* The interface converts a whole buffer to avoid function-call overhead.
 * Constants have been gathered. Loops & conditionals have been removed as
 * much as possible for efficiency, in favor of drop-through switches.
 * (See "Note A" at the bottom of the file for equivalent code.)
 * If your compiler supports it, the "isLegalUTF8" call can be turned
 * into an inline function.
 */

/* --------------------------------------------------------------------- */

ConversionResult ConvertUTF16toUTF8 (
    const UTF16** sourceStart, const UTF16* sourceEnd,
    UTF8** targetStart, UTF8* targetEnd, ConversionFlags flags) {
    ConversionResult result = conversionOK;
    const UTF16* source = *sourceStart;
    UTF8* target = *targetStart;
    while (source < sourceEnd) {
    UTF32 ch;
    unsigned short bytesToWrite = 0;
    const UTF32 byteMask = 0xBF;
    const UTF32 byteMark = 0x80;
    const UTF16* oldSource = source; /**//* In case we have to back up because of target overflow. */
    ch = *source++;
    /**//* If we have a surrogate pair, convert to UTF32 first. */
    if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
        /**//* If the 16 bits following the high surrogate are in the source buffer */
        if (source < sourceEnd) {
        UTF32 ch2 = *source;
        /**//* If it's a low surrogate, convert to UTF32. */
        if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
            ch = ((ch - UNI_SUR_HIGH_START) << halfShift)
            + (ch2 - UNI_SUR_LOW_START) + halfBase;
            ++source;
        } else if (flags == strictConversion) { /**//* it's an unpaired high surrogate */
            --source; /**//* return to the illegal value itself */
            result = sourceIllegal;
            break;
        }
        } else { /**//* We don't have the 16 bits following the high surrogate. */
        --source; /**//* return to the high surrogate */
        result = sourceExhausted;
        break;
        }
    } else if (flags == strictConversion) {
        /**//* UTF-16 surrogate values are illegal in UTF-32 */
        if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END) {
        --source; /**//* return to the illegal value itself */
        result = sourceIllegal;
        break;
        }
    }
    /**//* Figure out how many bytes the result will require */
    if (ch < (UTF32)0x80) {         bytesToWrite = 1;
    } else if (ch < (UTF32)0x800) {     bytesToWrite = 2;
    } else if (ch < (UTF32)0x10000) {   bytesToWrite = 3;
    } else if (ch < (UTF32)0x110000) {  bytesToWrite = 4;
    } else {                bytesToWrite = 3;
                        ch = UNI_REPLACEMENT_CHAR;
    }

    target += bytesToWrite;
    if (target > targetEnd) {
        source = oldSource; /**//* Back up source pointer! */
        target -= bytesToWrite; result = targetExhausted; break;
    }
    switch (bytesToWrite) { /**//* note: everything falls through. */
        case 4: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 3: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 2: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 1: *--target =  (UTF8)(ch | firstByteMark[bytesToWrite]);
    }
    target += bytesToWrite;
    }
    *sourceStart = source;
    *targetStart = target;
    return result;
}

/* --------------------------------------------------------------------- */

/*
 * Utility routine to tell whether a sequence of bytes is legal UTF-8.
 * This must be called with the length pre-determined by the first byte.
 * If not calling this from ConvertUTF8to*, then the length can be set by:
 *  length = trailingBytesForUTF8[*source]+1;
 * and the sequence is illegal right away if there aren't that many bytes
 * available.
 * If presented with a length > 4, this returns false.  The Unicode
 * definition of UTF-8 goes up to 4-byte sequences.
 */

static Boolean isLegalUTF8(const UTF8 *source, int length) {
    UTF8 a;
    const UTF8 *srcptr = source+length;
    switch (length) {
    default: return false;
    /**//* Everything else falls through when "true" */
    case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
    case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
    case 2: if ((a = (*--srcptr)) > 0xBF) return false;

    switch (*source) {
        /**//* no fall-through in this inner switch */
        case 0xE0: if (a < 0xA0) return false; break;
        case 0xED: if (a > 0x9F) return false; break;
        case 0xF0: if (a < 0x90) return false; break;
        case 0xF4: if (a > 0x8F) return false; break;
        default:   if (a < 0x80) return false;
    }

    case 1: if (*source >= 0x80 && *source < 0xC2) return false;
    }
    if (*source > 0xF4) return false;
    return true;
}

/* --------------------------------------------------------------------- */

/*
 * Exported function to return whether a UTF-8 sequence is legal or not.
 * This is not used here; it's just exported.
 */
Boolean isLegalUTF8Sequence(const UTF8 *source, const UTF8 *sourceEnd) {
    int length = trailingBytesForUTF8[*source]+1;
    if (source+length > sourceEnd) {
    return false;
    }
    return isLegalUTF8(source, length);
}

/* --------------------------------------------------------------------- */

ConversionResult ConvertUTF8toUTF16 (
    const UTF8** sourceStart, const UTF8* sourceEnd,
    UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags) {
    ConversionResult result = conversionOK;
    const UTF8* source = *sourceStart;
    UTF16* target = *targetStart;
    while (source < sourceEnd) {
    UTF32 ch = 0;
    unsigned short extraBytesToRead = trailingBytesForUTF8[*source];
    if (source + extraBytesToRead >= sourceEnd) {
        result = sourceExhausted; break;
    }
    /**//* Do this check whether lenient or strict */
    if (! isLegalUTF8(source, extraBytesToRead+1)) {
        result = sourceIllegal;
        break;
    }
    /*
     * The cases all fall through. See "Note A" below.
     */
    switch (extraBytesToRead) {
        case 5: ch += *source++; ch <<= 6; /**//* remember, illegal UTF-8 */
        case 4: ch += *source++; ch <<= 6; /**//* remember, illegal UTF-8 */
        case 3: ch += *source++; ch <<= 6;
        case 2: ch += *source++; ch <<= 6;
        case 1: ch += *source++; ch <<= 6;
        case 0: ch += *source++;
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    if (target >= targetEnd) {
        source -= (extraBytesToRead+1); /**//* Back up source pointer! */
        result = targetExhausted; break;
    }
    if (ch <= UNI_MAX_BMP) { /**//* Target is a character <= 0xFFFF */
        /**//* UTF-16 surrogate values are illegal in UTF-32 */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END) {
        if (flags == strictConversion) {
            source -= (extraBytesToRead+1); /**//* return to the illegal value itself */
            result = sourceIllegal;
            break;
        } else {
            *target++ = UNI_REPLACEMENT_CHAR;
        }
        } else {
        *target++ = (UTF16)ch; /**//* normal case */
        }
    } else if (ch > UNI_MAX_UTF16) {
        if (flags == strictConversion) {
        result = sourceIllegal;
        source -= (extraBytesToRead+1); /**//* return to the start */
        break; /**//* Bail out; shouldn't continue */
        } else {
        *target++ = UNI_REPLACEMENT_CHAR;
        }
    } else {
        /**//* target is a character in range 0xFFFF - 0x10FFFF. */
        if (target + 1 >= targetEnd) {
        source -= (extraBytesToRead+1); /**//* Back up source pointer! */
        result = targetExhausted; break;
        }
        ch -= halfBase;
        *target++ = (UTF16)((ch >> halfShift) + UNI_SUR_HIGH_START);
        *target++ = (UTF16)((ch & halfMask) + UNI_SUR_LOW_START);
    }
    }
    *sourceStart = source;
    *targetStart = target;
    return result;
}

/* --------------------------------------------------------------------- */

ConversionResult ConvertUTF32toUTF8 (
    const UTF32** sourceStart, const UTF32* sourceEnd,
    UTF8** targetStart, UTF8* targetEnd, ConversionFlags flags) {
    ConversionResult result = conversionOK;
    const UTF32* source = *sourceStart;
    UTF8* target = *targetStart;
    while (source < sourceEnd) {
    UTF32 ch;
    unsigned short bytesToWrite = 0;
    const UTF32 byteMask = 0xBF;
    const UTF32 byteMark = 0x80;
    ch = *source++;
    if (flags == strictConversion ) {
        /**//* UTF-16 surrogate values are illegal in UTF-32 */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END) {
        --source; /**//* return to the illegal value itself */
        result = sourceIllegal;
        break;
        }
    }
    /*
     * Figure out how many bytes the result will require. Turn any
     * illegally large UTF32 things (> Plane 17) into replacement chars.
     */
    if (ch < (UTF32)0x80) {         bytesToWrite = 1;
    } else if (ch < (UTF32)0x800) {     bytesToWrite = 2;
    } else if (ch < (UTF32)0x10000) {   bytesToWrite = 3;
    } else if (ch <= UNI_MAX_LEGAL_UTF32) {  bytesToWrite = 4;
    } else {                bytesToWrite = 3;
                        ch = UNI_REPLACEMENT_CHAR;
                        result = sourceIllegal;
    }
   
    target += bytesToWrite;
    if (target > targetEnd) {
        --source; /**//* Back up source pointer! */
        target -= bytesToWrite; result = targetExhausted; break;
    }
    switch (bytesToWrite) { /**//* note: everything falls through. */
        case 4: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 3: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 2: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 1: *--target = (UTF8) (ch | firstByteMark[bytesToWrite]);
    }
    target += bytesToWrite;
    }
    *sourceStart = source;
    *targetStart = target;
    return result;
}
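
The bytesToWrite decision and the fall-through switch in ConvertUTF32toUTF8 can be hard to follow at first. The sketch below spells out the same bit layout for a single code point, one branch per sequence length. It is only an illustration: encodeUtf8 is a hypothetical helper, and replacing surrogates as well as out-of-range values with U+FFFD is a choice made for this sketch, not the listing's behavior.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as a UTF-8 byte string.
// Sketch-only policy: surrogates and values above U+10FFFF become U+FFFD.
std::string encodeUtf8(std::uint32_t cp)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        cp = 0xFFFD;

    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+6C49 the three bytes are 0xE0 | 0x6 = E6, 0x80 | 0x31 = B1, and 0x80 | 0x09 = 89.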

/* --------------------------------------------------------------------- */

ConversionResult ConvertUTF8toUTF32 (
    const UTF8** sourceStart, const UTF8* sourceEnd,
    UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags) {
    ConversionResult result = conversionOK;
    const UTF8* source = *sourceStart;
    UTF32* target = *targetStart;
    while (source < sourceEnd) {
    UTF32 ch = 0;
    unsigned short extraBytesToRead = trailingBytesForUTF8[*source];
    if (source + extraBytesToRead >= sourceEnd) {
        result = sourceExhausted; break;
    }
    /**//* Do this check whether lenient or strict */
    if (! isLegalUTF8(source, extraBytesToRead+1)) {
        result = sourceIllegal;
        break;
    }
    /*
     * The cases all fall through. See "Note A" below.
     */
    switch (extraBytesToRead) {
        case 5: ch += *source++; ch <<= 6;
        case 4: ch += *source++; ch <<= 6;
        case 3: ch += *source++; ch <<= 6;
        case 2: ch += *source++; ch <<= 6;
        case 1: ch += *source++; ch <<= 6;
        case 0: ch += *source++;
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    if (target >= targetEnd) {
        source -= (extraBytesToRead+1); /**//* Back up the source pointer! */
        result = targetExhausted; break;
    }
    if (ch <= UNI_MAX_LEGAL_UTF32) {
        /*
         * UTF-16 surrogate values are illegal in UTF-32, and anything
         * over Plane 17 (> 0x10FFFF) is illegal.
         */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END) {
        if (flags == strictConversion) {
            source -= (extraBytesToRead+1); /**//* return to the illegal value itself */
            result = sourceIllegal;
            break;
        } else {
            *target++ = UNI_REPLACEMENT_CHAR;
        }
        } else {
        *target++ = ch;
        }
    } else { /**//* i.e., ch > UNI_MAX_LEGAL_UTF32 */
        result = sourceIllegal;
        *target++ = UNI_REPLACEMENT_CHAR;
    }
    }
    *sourceStart = source;
    *targetStart = target;
    return result;
}

/* ---------------------------------------------------------------------

    Note A.
    The fall-through switches in UTF-8 reading code save a
    temp variable, some decrements & conditionals.  The switches
    are equivalent to the following loop:
    {
        int tmpBytesToRead = extraBytesToRead+1;
        do {
        ch += *source++;
        --tmpBytesToRead;
        if (tmpBytesToRead) ch <<= 6;
        } while (tmpBytesToRead > 0);
    }
    In UTF-8 writing code, the switches on "bytesToWrite" are
    similarly unrolled loops.

   --------------------------------------------------------------------- */
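
Note A can be tried in isolation. The following sketch wraps the loop from the note in a hypothetical function, decodeUtf8Sequence, and subtracts the matching entry of the offsets table. It assumes the sequence length (1 to 4) is already known and the bytes are valid, since the listing runs isLegalUTF8 before this step.

```cpp
#include <cstddef>
#include <cstdint>

// Offsets copied from the first four entries of offsetsFromUTF8 in the listing.
static const std::uint32_t kOffsets[4] = {
    0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL
};

// Hypothetical helper: decode one UTF-8 sequence of `length` bytes (1..4).
// No validation is done here; the input is assumed to be legal UTF-8.
std::uint32_t decodeUtf8Sequence(const unsigned char* source, std::size_t length)
{
    std::uint32_t ch = 0;
    std::size_t remaining = length;
    do {
        ch += *source++;               // add the raw byte, marker bits included
        --remaining;
        if (remaining) ch <<= 6;       // make room for the next six payload bits
    } while (remaining > 0);
    return ch - kOffsets[length - 1];  // strip all marker bits in one subtraction
}
```

The subtraction works because the marker bits of every byte end up at fixed positions after the shifts, so their sum is a constant per sequence length. For E6 B1 89 the accumulated value is 0xE8CC9, and 0xE8CC9 - 0xE2080 = 0x6C49.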

 

III. Converting between C++ strings and C# strings

1. Converting System::String to a C++ string:
// convert_system_string.cpp
// compile with: /clr
#include <string>
#include <iostream>
using namespace std;
using namespace System;

void MarshalString ( String ^ s, string& os ) {
   using namespace Runtime::InteropServices;
   const char* chars =
      (const char*)(Marshal::StringToHGlobalAnsi(s)).ToPointer();
   os = chars;
   Marshal::FreeHGlobal(IntPtr((void*)chars));
}

void MarshalString ( String ^ s, wstring& os ) {
   using namespace Runtime::InteropServices;
   const wchar_t* chars =
      (const wchar_t*)(Marshal::StringToHGlobalUni(s)).ToPointer();
   os = chars;
   Marshal::FreeHGlobal(IntPtr((void*)chars));
}

int main() {
   string a = "test";
   wstring b = L"test2";
   String ^ c = gcnew String("abcd");

   cout << a << endl;
   MarshalString(c, a);
   c = "efgh";
   MarshalString(c, b);
   cout << a << endl;
   wcout << b << endl;
}


2. Converting System::String to char* or wchar_t*
// convert_string_to_wchar.cpp
// compile with: /clr
#include <stdio.h>
#include <stdlib.h>
#include <vcclr.h>

using namespace System;

int main() {
   String ^str = "Hello";

   // Pin memory so GC can't move it while native function is called
   pin_ptr<const wchar_t> wch = PtrToStringChars(str);
   printf_s("%S\n", wch);

   // Conversion to char* :
   // Can just convert wchar_t* to char* using one of the
   // conversion functions such as:
   // WideCharToMultiByte()
   // wcstombs_s()
   //  etc
   size_t convertedChars = 0;
   size_t  sizeInBytes = ((str->Length + 1) * 2);
   errno_t err = 0;
   char    *ch = (char *)malloc(sizeInBytes);

   err = wcstombs_s(&convertedChars,
                    ch, sizeInBytes,
                    wch, sizeInBytes);
   if (err != 0)
      printf_s("wcstombs_s  failed!\n");
   else
      printf_s("%s\n", ch);

   free(ch);   // release the buffer allocated with malloc
}



true 2007-11-18 19:48
]]>
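The UTF-8 encoding algorithm</title><link>http://www.shnenglu.com/true/archive/2007/04/05/21335.html</link><dc:creator>true</dc:creator><author>true</author><pubDate>Thu, 05 Apr 2007 09:23:00 GMT</pubDate><guid>http://www.shnenglu.com/true/archive/2007/04/05/21335.html</guid><wfw:comment>http://www.shnenglu.com/true/comments/21335.html</wfw:comment><comments>http://www.shnenglu.com/true/archive/2007/04/05/21335.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.shnenglu.com/true/comments/commentRss/21335.html</wfw:commentRss><trackback:ping>http://www.shnenglu.com/true/services/trackbacks/21335.html</trackback:ping><description><![CDATA[<p align=center><font size=5>The UTF-8 encoding algorithm</font><br>Author: reposted    Reposted from: reposted    Views: 827    Entered by: zhaizl </p> <p> <blockquote></blockquote><br><br> <p dir=ltr style="MARGIN-RIGHT: 0px"><br>          <br>For example, the Unicode code point of the character "汉" is 6C49. Treat this code point as one large integer and convert it to a multibyte encoding. In binary it is 110110001001001:<br>          <br>Look at the binary sequence of this integer (110, 110001, 001001)<br>          and take bits starting from the end.<br>          <br>If the binary sequence fits in the low 7 bits (less than 128, that is, an ASCII character), take those 7 bits directly to form one UTF-8 byte.<br>          <br>The binary sequence of "汉" above is longer than 7 bits, so take the last 6 bits (001001) and prefix them with 10 to form one UTF-8 byte (10 001001, hexadecimal 89).<br>          <br>From the remaining sequence (110, 110001), again take the last 6 bits and prefix them with 10 to form one UTF-8 byte (10 110001, hexadecimal B1).<br>          <br>The remaining sequence (110) has fewer than 6 bits, so OR it with 11100000 to get the byte 11100110, hexadecimal E6.<br>          <br>The resulting UTF-8 encoding, written in hexadecimal, is E6B189.</p> <br> <div id="h1hj3pj" class=tit twffan="done">Understanding UTF-8 encoding</div> <div id="n3dp333" class=date twffan="done">2007-01-19 10:40</div> <table style="TABLE-LAYOUT: fixed"> <tbody> <tr> <td> <div id="bn3fxpj" class=cnt twffan="done"> <p>UTF-8 is used in many places on the network. I had to write a program that works with a mail server, and the mail server uses UTF-8 in some places, so I gained a basic understanding of it.<br><br>It is really the same kind of thing as Unicode; only the way of encoding differs.<br>First, the size of UTF-8-encoded text varies, unlike "Unicode" (here meaning the fixed two-byte form), where every character has the same encoded size.<br>Look at the Unicode encoding first: the English letter "a" and a single Chinese character take the same space after encoding, two bytes each.<br><br>In UTF-8, the English letter "a" and a single Chinese character take different amounts of space: the former is one byte and the latter is three bytes.<br><br>Now let us look at how UTF-8 encoding works.<br>  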
Letters plus the symbols on a keyboard can all be represented in seven binary digits, and a byte has eight bits, so UTF-8 uses a single byte for letters and keyboard symbols. But when we receive one encoded byte, how do we know what it belongs to? It might be the single byte of an English letter, or one of the three bytes of a Chinese character. That is why UTF-8 has flag bits.<br><br>  When the value fits in 7 bits, one byte is used: 0*******  The leading 0 is the flag bit, and the remaining space is exactly enough for ASCII 0 to 127.<br><br>  When the value needs 8 to 11 bits, two bytes are used: 110***** 10******  The 110 of the first byte and the 10 of the second byte are the flag bits.<br><br>  When the value needs 12 to 16 bits, three bytes are used: 1110**** 10****** 10******    As above, the 1110 of the first byte and the 10 of the second and third bytes are flag bits, and the remaining space is enough to represent Chinese characters.<br><br>  And so on:<br>four bytes: 11110*** 10****** 10****** 10****** <br>  five bytes: 111110** 10****** 10****** 10****** 10****** <br>  six bytes: 1111110* 10****** 10****** 10****** 10****** 10****** <br>  (the five- and six-byte forms come from the original design; UTF-8 as currently defined stops at four bytes)<br>  .............................................<br> ..............................................<br><br>Is that clear?<br>The encoding proceeds from the low-order bits to the high-order bits.<br><br>Now let us look at some examples.<br><br>Flag bits are highlighted in yellow.<br>The other colors show where each group of bits ends up after encoding.<br></p> <p> <table height=138 cellSpacing=0 cellPadding=0 width=765 border=1> <tbody> <tr> <td> <p align=center>Unicode (hex)</p> </td> <td><br> <p align=center>Unicode (binary)</p> </td> <td><br> <p align=center>UTF-8 (binary)</p> </td> <td><br> <p align=center>UTF-8 (hex)</p> </td> <td><br> <p align=center>UTF-8 byte count</p> </td> </tr> <tr> <td><br> <p align=center>B</p> </td> <td><br> <p align=center><font style="BACKGROUND-COLOR: #ffc0cb">00001011</font></p> </td> <td><br> <p align=center><font style="BACKGROUND-COLOR: #ffff00">0</font><font style="BACKGROUND-COLOR: #ffc0cb">0001011</font></p> </td> <td><br> <p align=center>B</p> </td> <td><br> <p align=center>1</p> </td> </tr> <tr> <td><br> <p align=center>9D</p> </td> <td><br> <p align=center><font style="BACKGROUND-COLOR: #ffc0cb">00010</font><font style="BACKGROUND-COLOR: #808080">011101</font></p> </td> <td><br> <p align=center><font style="BACKGROUND-COLOR: #ffff00">110</font><font style="BACKGROUND-COLOR: #ffc0cb">00010</font> <font style="BACKGROUND-COLOR: #ffff00">10</font><font style="BACKGROUND-COLOR: #808080">011101 </font></p> </td> <td><br> <p align=center>C2 9D</p> </td> <td><br> <p align=center>2</p> </td> </tr> <tr> <td><br> <p align=center>A89E</p> 
</td> <td><br> <p align=center><font style="BACKGROUND-COLOR: #ffc0cb">1010</font><font style="BACKGROUND-COLOR: #808080">1000 </font><font style="BACKGROUND-COLOR: #808080">10</font><font style="BACKGROUND-COLOR: #7fffd4">011110</font></p> </td> <td><br> <p align=center><font style="BACKGROUND-COLOR: #ffff00">1110</font><font style="BACKGROUND-COLOR: #ffc0cb">1010</font> <font style="BACKGROUND-COLOR: #ffff00">10</font><font style="BACKGROUND-COLOR: #808080">100010</font> <font style="BACKGROUND-COLOR: #ffff00">10</font><font style="BACKGROUND-COLOR: #7fffd4">011110</font></p> </td> <td><br> <p align=center>EA A2 9E</p> </td> <td><br> <p align=center>3</p> </td> </tr> </tbody> </table> </p> </div> </td> </tr> </tbody> </table> <img src ="http://www.shnenglu.com/true/aggbug/21335.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.shnenglu.com/true/" target="_blank">true</a> 2007-04-05 17:23 <a href="http://www.shnenglu.com/true/archive/2007/04/05/21335.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Characters and Encoding: The Basics http://www.shnenglu.com/true/archive/2007/04/05/21334.htmltruetrueThu, 05 Apr 2007 09:14:00 GMThttp://www.shnenglu.com/true/archive/2007/04/05/21334.htmlhttp://www.shnenglu.com/true/comments/21334.htmlhttp://www.shnenglu.com/true/archive/2007/04/05/21334.html#Feedback0http://www.shnenglu.com/true/comments/commentRss/21334.htmlhttp://www.shnenglu.com/true/services/trackbacks/21334.html Characters, Bytes, and Encoding

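[Original article. If you repost it, please keep or cite the source: http://www.regexlab.com/zh/encoding.htm]

Level: Intermediate

Abstract: This article traces how characters and encodings developed and explains the related concepts. It shows, with examples, how encoding is implemented in practice. It then describes several common misunderstandings about characters and encodings, explains how they produce garbled text, and shows how to get rid of it. The material covers both the "Chinese-text problem" and the "garbled-text problem".

The key to mastering encoding problems is understanding the concepts correctly; the techniques involved are actually simple. So read this article slowly and think it through.

Introduction

"Characters and encodings" is a frequently discussed topic, yet garbled text still troubles everyone. We have plenty of recipes for removing it, but we do not necessarily understand why they work. Some garbling is even caused by bugs in the low-level code itself. So it is not only beginners who find character encoding fuzzy; some low-level developers also lack an accurate understanding of it.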

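1. Where encoding problems come from, and the concepts involved

1.1 The development of characters and encodings

In terms of how computers support multiple languages, the history falls into roughly three stages:

Stage 1
  Internal encoding: ASCII
  Description: At first computers supported only English; other languages could not be stored or displayed.
  Systems: English DOS

Stage 2
  Internal encoding: ANSI encodings (localization)
  Description: To support more languages, a character is usually represented by 2 bytes in the range 0x80~0xFF. For example, the Chinese character '中' is stored as the two bytes [0xD6, 0xD0] on a Chinese operating system.
  Different countries and regions defined different standards, which produced GB2312, BIG5, JIS, and so on. These extended encodings that use 2 bytes per character are collectively called ANSI encodings. On a Simplified Chinese system "ANSI encoding" means GB2312; on a Japanese system it means JIS.
  The different ANSI encodings are incompatible with one another. When information is exchanged internationally, text in two languages cannot be stored in the same ANSI-encoded text.
  Systems: Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98

Stage 3
  Internal encoding: UNICODE (internationalization)
  Description: To make international exchange easier, an international body defined the UNICODE character set, which assigns every character in every language a single unique number. This makes it possible to convert and process text across languages and platforms.
  Systems: Windows NT/2000/XP, Linux, Java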

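How strings are stored in memory:

In the ASCII stage, a single-byte string stores one character per byte (SBCS). For example, "Bob123" in memory is:

42 6F 62 31 32 33 00
B  o  b  1  2  3  \0

In the stage where ANSI encodings supported multiple languages, each character is represented by one or more bytes (MBCS), so characters stored this way are also called multi-byte characters. For example, on Chinese Windows 95 the content of "中文123" takes 7 bytes in memory: 2 bytes per Chinese character and 1 byte per letter or digit, followed by a one-byte terminator:

D6 D0  CE C4  31 32 33 00
中     文     1  2  3  \0

Once UNICODE was adopted, computers store each character's number in the UNICODE character set instead. A number is generally stored in 2 bytes (16 bits; the original labels this DBCS), so characters stored this way are also called wide characters. Characters outside the 16-bit range need two such units. For example, on Windows 2000 the string "中文123" is actually stored as 5 numbers:

2D 4E  87 65  31 00  32 00  33 00  00 00      ← on x86 CPUs the low byte comes first
中     文     1      2      3      \0

The content takes 10 bytes in total.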
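The two layouts above can be reproduced with a short Java sketch. It is only an illustration: the class name `MemoryLayoutSketch` and the helper `hex` are made up for this example, and it assumes the runtime ships the GB2312 charset (standard JDK builds do). UTF-16LE stands in for the low-byte-first wide-character layout.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MemoryLayoutSketch {
    // Formats bytes as space-separated upper-case hex, e.g. "D6 D0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "中文123", written with escapes so the source file encoding does not matter.
        String text = "\u4E2D\u6587123";

        // Multi-byte (ANSI) form: 2 bytes per Chinese character, 1 per ASCII character.
        System.out.println(hex(text.getBytes(Charset.forName("GB2312"))));

        // Wide-character form: 2 bytes per character, low byte first.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Neither output includes the terminating zero shown in the tables, because a Java String carries its length instead of a terminator.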

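1.2 Characters, bytes, strings

The key to understanding encodings is to get the concepts of "character" and "byte" exactly right. The two are easy to confuse, so let us distinguish them here:

Character
  Description: A mark that people use; a symbol in the abstract sense.
  Examples: '1', '中', 'a', '$', '¥', ...

Byte
  Description: The unit in which a computer stores data; an 8-bit binary number, a very concrete piece of storage.
  Examples: 0x01, 0x45, 0xFA, ...

ANSI string
  Description: If the "characters" exist in memory in an ANSI-encoded form, where one character may take one or more bytes, we call the string an ANSI string or a multi-byte string.
  Example: "中文123" (7 bytes)

UNICODE string
  Description: If the "characters" exist in memory as their numbers in UNICODE, we call the string a UNICODE string or a wide-character string.
  Example: L"中文123" (10 bytes)

Because the different ANSI encodings define different rules, for any given multi-byte string we must know which encoding it uses before we can tell which "characters" it contains. For a UNICODE string, by contrast, the "characters" it represents are the same in every environment.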

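1.3 Character sets and encodings

Each of the ANSI encoding standards defined by a country or region covers only the "characters" its own language needs. For example, the Chinese standard (GB2312) does not say how Korean characters are stored. What such a standard specifies has two parts:

  1. Which characters are used, that is, which Chinese characters, letters, and symbols are included in the standard. The collection of included "characters" is called the "character set".
  2. Whether each "character" is stored in one byte or several, and which bytes are used. This rule is called the "encoding".

When countries and regions define their standards, the "set of characters" and the "encoding" are usually defined together. So the "character sets" we normally talk about, such as GB2312, GBK, and JIS, mean not only a "set of characters" but also an "encoding".

The "UNICODE character set" contains all the "characters" used in every language. There are many standards for encoding the UNICODE character set, for example UTF-8, UTF-7, UTF-16, UnicodeLittle, and UnicodeBig.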

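1.4 A brief introduction to common encodings

Here is a short introduction to the common encoding rules, in preparation for the later sections. Based on how the rules work, we divide all encodings into three categories:

Single-byte character encoding
  Standards: ISO-8859-1
  Description: The simplest encoding rule: each byte is taken directly as one UNICODE character. For example, converting the two bytes [0xD6, 0xD0] to a string through iso-8859-1 yields the two UNICODE characters [0x00D6, 0x00D0], that is, "ÖÐ".
  In the other direction, converting a UNICODE string to bytes through iso-8859-1 works only for characters in the range 0~255.

ANSI encodings
  Standards: GB2312, BIG5, Shift_JIS, ISO-8859-2, ...
  Description: When a UNICODE string is converted to a "byte string" through an ANSI encoding, one UNICODE character may become one byte or several, depending on the rules of that encoding.
  In the other direction, several bytes may be converted into one character. For example, converting the two bytes [0xD6, 0xD0] to a string through GB2312 yields the single character [0x4E2D], which is '中'.
  Characteristics of "ANSI encodings":
  1. Each "ANSI encoding standard" can handle only the UNICODE characters within its own language's range.
  2. The relationship between a "UNICODE character" and the "bytes it converts to" is defined by convention (a lookup table).

UNICODE encodings
  Standards: UTF-8, UTF-16, UnicodeBig, ...
  Description: As with "ANSI encodings", when a string is converted to a "byte string" through a UNICODE encoding, one UNICODE character may become one byte or several.
  Unlike "ANSI encodings":
  1. These "UNICODE encodings" can handle all UNICODE characters.
  2. The "bytes a character converts to" can be computed from the "UNICODE character".

There is no real need to study exactly which bytes each encoding produces for a given character. It is enough to know that "encoding" means converting "characters" into "bytes". For the "UNICODE encodings", because the bytes can be computed, we can look into the rules of a particular one when a special situation calls for it.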
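As an illustration of "can be computed", the following Java sketch derives the UTF-8 bytes of a code point with nothing but bit arithmetic and compares the result with the library encoder. The class name `Utf8Sketch` and the method `encode` are hypothetical names for this example; the bit patterns are the standard UTF-8 ones (at most four bytes).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one Unicode code point into UTF-8 by bit arithmetic.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0x9D, 0x4E2D, 0x1F600 };
        for (int cp : samples) {
            byte[] manual = encode(cp);
            byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " matches library encoder: " + Arrays.equals(manual, library));
        }
    }
}
```

No table is consulted anywhere, which is exactly what separates a "UNICODE encoding" from an "ANSI encoding" such as GB2312, where the byte values have to be looked up.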

2. Characters and encodings in programs

2.1 Characters and bytes in programs

The data types that represent "characters" and "bytes" in C++ and Java, and the functions that convert between them:

Type or operation          C++                                  Java
Character                  wchar_t                              char
Byte                       char                                 byte
ANSI string                char[]                               byte[]
UNICODE string             wchar_t[]                            String
Byte string → string       mbstowcs(), MultiByteToWideChar()    string = new String(bytes, "encoding")
String → byte string       wcstombs(), WideCharToMultiByte()    bytes = string.getBytes("encoding")

A few points to note:

  1. A char in Java represents one "UNICODE character (wide character)", while a char in C++ represents one byte.
  2. MultiByteToWideChar() and WideCharToMultiByte() are Windows API functions.

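2.2 Implementation in C++

Declaring string constants:

// ANSI string, content length 7 bytes
char sz[20] = "中文123";

// UNICODE string, content length 5 wchar_t (10 bytes where wchar_t is 16 bits, as on Windows)
wchar_t wsz[20] = L"\x4E2D\x6587\x0031\x0032\x0033";

I/O on UNICODE strings, and conversion between characters and bytes:

#include <clocale>   // setlocale
#include <cstdio>    // FILE
#include <cstdlib>   // wcstombs, mbstowcs
#include <cwchar>    // fwprintf

// Set the current ANSI encoding at run time, Visual C++ format
setlocale(LC_ALL, ".936");

// GCC format
setlocale(LC_ALL, "zh_CN.GBK");

// In Visual C++ use lowercase %s to write to a file in the encoding set by setlocale;
// in GCC use uppercase %S (the standard %ls is accepted by both)
fwprintf(fp, L"%s\n", wsz);

// Convert the UNICODE string to bytes in the encoding set by setlocale
wcstombs(sz, wsz, 20);
// Convert the byte string to a UNICODE string in the encoding set by setlocale
mbstowcs(wsz, sz, 20);

In Visual C++ there is a simpler way to write UNICODE string constants. If the encoding of the source file differs from the current default ANSI encoding, use #pragma setlocale to tell the compiler which encoding the source file uses:

// This line is needed if the source file's encoding differs from the current
// default ANSI encoding. It tells the compiler which encoding the source uses.
#pragma setlocale(".936")

// UNICODE string constant, content length 10 bytes
wchar_t wsz[20] = L"中文123";

Note that #pragma setlocale and setlocale(LC_ALL, "") do different things: #pragma setlocale takes effect at compile time, while setlocale() takes effect at run time.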

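2.3 Implementation in Java

The content of the String class is a UNICODE string:

// Java code, with the Chinese text written directly
String string = "中文123";

// The length is 5, because there are 5 characters
System.out.println(string.length());

String I/O and conversion between characters and bytes. In the Java package java.io.*, classes whose names end in "Stream" generally operate on "byte strings", and classes whose names end in "Reader" or "Writer" generally operate on "character strings".

import java.io.*;

// Converting between strings and byte strings

// Get the bytes according to GB2312 (a multi-byte string)
byte[] bytes = string.getBytes("GB2312");

// Get a UNICODE string from the bytes according to GB2312
string = new String(bytes, "GB2312");

// There are two ways to write a String to a text file in a given encoding:

// First way: use a Stream class to write bytes that were already converted with the chosen encoding
OutputStream os = new FileOutputStream("1.txt");
os.write(bytes);
os.close();

// Second way: construct a Writer with the chosen encoding and write the string
Writer ow = new OutputStreamWriter(new FileOutputStream("2.txt"), "GB2312");
ow.write(string);
ow.close();

/* The resulting 1.txt and 2.txt are both 7 bytes long */

If the encoding of a Java source file differs from the current default ANSI encoding, the source encoding must be given at compile time. For example:

E:\>javac -encoding BIG5 Hello.java

Be careful to distinguish the source file's encoding from the encoding used for I/O: the former matters at compile time, the latter at run time.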

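3. Common misunderstandings, why garbled text appears, and how to fix it

3.1 Easy mistakes

Misunderstandings about encoding:

Misunderstanding 1
  When converting a "byte string" into a "UNICODE string", for example when reading a text file or receiving text over the network, it is easy to treat the "byte string" as a single-byte string and convert it on the basis that every "one byte" is "one character".
  In a non-English environment, however, the "byte string" should be treated as an ANSI string and converted with the appropriate encoding, and it may take "several bytes" to produce "one character".
  Programmers who have always worked in an English environment are especially prone to this mistake.

Misunderstanding 2
  In non-UNICODE environments such as DOS and Windows 98, strings exist as ANSI-encoded bytes. A string in byte form can be used correctly only if we know which encoding it is in. This has given us the habit of thinking in terms of "the encoding of a string".
  Once UNICODE is supported, a Java String stores the "numbers" of its characters, not "bytes in some encoding", so the notion of "the encoding of a string" no longer exists. Encoding comes into play only when converting between a "string" and a "byte string", or when a "byte string" is treated as an ANSI string.
  Quite a few people have this misunderstanding.

The first misunderstanding is often what causes garbled text. The second often turns an easily fixed garbling problem into a more complicated one.

Notice that "Misunderstanding 1", converting on the basis that every "one byte" is "one character", is in effect the same as converting with iso-8859-1. That is why we often use bytes = string.getBytes("iso-8859-1") to reverse the operation and recover the original "byte string", and then apply the correct ANSI encoding, for example string = new String(bytes, "GB2312"), to obtain the correct "UNICODE string".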
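The round trip described above can be written out as a small Java sketch. The class `MojibakeRepairSketch` and its `repair` method are illustrative names only, and the example assumes the runtime provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepairSketch {
    // Undoes a wrong ISO-8859-1 decode, then decodes the recovered bytes with the real charset.
    static String repair(String garbled, Charset actual) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // The bytes of "中" in GB2312, as they might arrive from a file or socket.
        byte[] wire = { (byte) 0xD6, (byte) 0xD0 };

        // Misunderstanding 1: every byte becomes one character.
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println("garbled length: " + garbled.length());

        // Recover the bytes and decode them properly.
        String fixed = repair(garbled, Charset.forName("GB2312"));
        System.out.println("repaired equals U+4E2D: " + fixed.equals("\u4E2D"));
    }
}
```

The repair works only because ISO-8859-1 maps each of the 256 byte values to a distinct character, so no information is lost in the wrong decode. A wrong decode through a charset that rejects or replaces some byte sequences generally cannot be undone this way.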

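3.2 Garbled text when a non-UNICODE program moves between language environments

Strings in a non-UNICODE program always exist in some ANSI-encoded form. If the language environment the program runs in differs from the one it was developed in, its ANSI strings will fail to display correctly.

For example, a non-UNICODE Japanese user interface developed in a Japanese environment shows garbled text when run in a Chinese environment. If the interface is changed to store its strings as UNICODE, it displays proper Japanese when run in a Chinese environment.

Sometimes, for practical reasons, we have to run non-UNICODE Japanese software on a Chinese operating system. In that case tools such as NJStar or AppLocale can temporarily emulate a different language environment.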

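3.3 Submitting strings from web pages

When a form on a page submits a string, the string is first converted to a byte string using the encoding of the current page. Each byte is then sent to the Web server in the form "%XX". For example, when a page encoded in GB2312 submits the string "中", the content sent to the server is "%D6%D0".

On the server side, the Web server converts the received "%D6%D0" into the two bytes [0xD6, 0xD0] and then applies the GB2312 encoding rules to obtain the character "中".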
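Both directions can be tried with java.net.URLEncoder and java.net.URLDecoder, as in the sketch below. The wrapper class `FormEncodingSketch` and its methods `submit` and `receive` are hypothetical names for this example; a real browser and server do this work internally. The sketch also decodes the same wire text with ISO-8859-1 to show what a server sees if it assumes the wrong charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingSketch {
    // What the browser does: encode with the page charset, then write each byte as %XX.
    static String submit(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + pageCharset, e);
        }
    }

    // What the server does: turn %XX back into bytes, then decode with a charset.
    static String receive(String wire, String charset) {
        try {
            return URLDecoder.decode(wire, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String wire = submit("\u4E2D", "GB2312");
        System.out.println(wire);

        // Decoding with the page charset recovers the character.
        System.out.println(receive(wire, "GB2312").equals("\u4E2D"));

        // Decoding with ISO-8859-1 yields two characters, U+00D6 and U+00D0.
        System.out.println(receive(wire, "ISO-8859-1").equals("\u00D6\u00D0"));
    }
}
```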

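In a Tomcat server, garbled results from request.getParameter() are often caused by "Misunderstanding 1" above. By default, when "%D6%D0" is submitted to Tomcat, request.getParameter() returns the two UNICODE characters [0x00D6, 0x00D0] instead of the single character "中". We therefore use bytes = string.getBytes("iso-8859-1") to recover the original byte string, and then string = new String(bytes, "GB2312") to obtain the correct string "中".

3.4 Reading strings from a database

When a database client (such as ODBC or JDBC) reads a string from a database server, the client needs to learn from the server which ANSI encoding is in use. When the server sends a byte stream to the client, the client is responsible for converting it into a UNICODE string with the correct encoding.

If strings read from the database come out garbled while the data stored in the database is correct, the cause is usually "Misunderstanding 1" again. The fix is the same: string = new String(string.getBytes("iso-8859-1"), "GB2312") recovers the original byte string and converts it to a string with the correct encoding.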

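3.5 Strings in e-mail

When a piece of text or HTML is sent by e-mail, the content is first converted to a "byte string" using a specified character encoding, and that "byte string" is then converted to another "byte string" using a specified transfer encoding (Content-Transfer-Encoding). For example, the source of an e-mail contains something like this:

Content-Type: text/plain;
        charset="gb2312"
Content-Transfer-Encoding: base64

sbG+qcrQuqO17cf4yee74bGjz9W7+b3wudzA7dbQ0MQNCg0KvPKzxqO6uqO17cnnsaPW0NDEDQoNCg==

The two most common values of Content-Transfer-Encoding are Base64 and Quoted-Printable. For binary files or Chinese text, Base64 produces a shorter "byte string" than Quoted-Printable. For English text, Quoted-Printable produces a shorter "byte string" than Base64.

The subject of a message uses a shorter format to state the "character encoding" and the "transfer encoding". For example, if the subject is "中", it appears in the message source as:

// Correct subject format
Subject: =?GB2312?B?1tA=?=

Here:

  • The part between the first "=?" and the following "?" gives the character encoding, GB2312 in this example.
  • The "B" between the next pair of "?" stands for Base64. A "Q" would stand for Quoted-Printable.
  • The part between the last "?" and "?=" is the subject text, converted to a byte string with GB2312 and then converted with Base64.

If the "transfer encoding" is changed to Quoted-Printable, the same subject "中" becomes:

// Correct subject format
Subject: =?GB2312?Q?=D6=D0?=

If a message shows garbled text when read, the usual cause is that the "character encoding" or "transfer encoding" was specified incorrectly or not at all. For example, some mail components send the subject "中" as:

// Incorrect subject format
Subject: =?ISO-8859-1?Q?=D6=D0?=

This explicitly declares the subject to be [0x00D6, 0x00D0], that is, "ÖÐ", not "中".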
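The two subject formats can be generated with a few lines of Java. This is a minimal sketch, not a complete mail header encoder: `EncodedWordSketch`, `bEncode`, and `qEncode` are names invented for the example, every byte in the Q form is written as =XX (valid, though longer than necessary for ASCII text), and the length limit on encoded words is ignored.

```java
import java.nio.charset.Charset;
import java.util.Base64;

class EncodedWordSketch {
    // "=?charset?B?...?=": convert to bytes with the charset, then apply Base64.
    static String bEncode(String text, String charset) {
        byte[] bytes = text.getBytes(Charset.forName(charset));
        return "=?" + charset + "?B?" + Base64.getEncoder().encodeToString(bytes) + "?=";
    }

    // "=?charset?Q?...?=": convert to bytes with the charset, then write each byte as =XX.
    static String qEncode(String text, String charset) {
        StringBuilder sb = new StringBuilder("=?" + charset + "?Q?");
        for (byte b : text.getBytes(Charset.forName(charset))) {
            sb.append(String.format("=%02X", b & 0xFF));
        }
        return sb.append("?=").toString();
    }

    public static void main(String[] args) {
        String subject = "\u4E2D"; // "中"
        System.out.println("Subject: " + bEncode(subject, "GB2312"));
        System.out.println("Subject: " + qEncode(subject, "GB2312"));
    }
}
```

The charset name placed in the header must be the one actually used for getBytes. Labeling GB2312 bytes as ISO-8859-1 produces exactly the incorrect subject shown above.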

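4. Correcting some mistaken beliefs

Misunderstanding: "Is ISO-8859-1 an international encoding?"

No. iso-8859-1 is simply the simplest of the single-byte character sets, the one whose "byte values" coincide with the "UNICODE character numbers". When we need to convert a "byte string" into a "string" but do not know which ANSI encoding it uses, temporarily converting "each byte" into "one character" loses no information. Afterwards bytes = string.getBytes("iso-8859-1") recovers the original byte string.

Misunderstanding: "In Java, how do I find out the internal encoding of a string?"

In Java, the string class java.lang.String handles UNICODE strings, not ANSI strings. We only need to treat a string as "a sequence of abstract symbols", so the question of a string's internal encoding does not arise.

true 2007-04-05 17:14 Post a comment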
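]]>Base64 encoding http://www.shnenglu.com/true/archive/2007/04/05/21330.htmltruetrueThu, 05 Apr 2007 08:43:00 GMThttp://www.shnenglu.com/true/archive/2007/04/05/21330.htmlhttp://www.shnenglu.com/true/comments/21330.htmlhttp://www.shnenglu.com/true/archive/2007/04/05/21330.html#Feedback0http://www.shnenglu.com/true/comments/commentRss/21330.htmlhttp://www.shnenglu.com/true/services/trackbacks/21330.html
Base64 encoding converts three 8-bit bytes into four 6-bit units (3*8 = 4*6 = 24). Each of the four 6-bit units is still stored in 8 bits, with the top two bits set to 0. With only 6 significant bits, a unit can range from 0 to 2^6 - 1 = 63, so every unit of the Base64 output takes a value in 0~63.

Many of the ASCII codes between 0 and 63 are non-printable characters, so one more mapping is applied. The mapping table is:
values 0 ~ 25 : 'A' ~ 'Z'
values 26 ~ 51 : 'a' ~ 'z'
values 52 ~ 61 : '0' ~ '9'
value 62 : '+'
value 63 : '/'
In this way three 8-bit bytes become four printable characters.
The bytes are split as follows (the drawing is rough, but you get the idea :-))
aaaaaabb ccccdddd eeffffff
~~~~~~~~ ~~~~~~~~ ~~~~~~~~
 byte 1   byte 2   byte 3
||
\/
00aaaaaa 00bbcccc 00ddddee 00ffffff

Note: the three bytes on top are the original data; the four bytes below are the Base64 units, each with its top two bits set to 0.
For this split to work, the number of original bytes should be a multiple of 3. When it is not, the input is padded with all-zero bytes, and the output units that come entirely from padding are written as '='. This is why some Base64 text ends in one or two equals signs. There are never more than two, for the following reason. Let F(origin) be the number of original bytes and F(remain) the remainder, so that
F(remain) = F(origin) MOD 3
holds. The possible values of F(remain) are therefore 0, 1, and 2.
Let n = [F(origin) - F(remain)] / 3.
When F(remain) = 0, the output is exactly 4*n Base64 characters.
When F(remain) = 1, the one leftover byte spreads across two Base64 units, so two equals signs are added to make the output length a multiple of 4.
When F(remain) = 2, the two leftover bytes spread across three Base64 units, so by the same reasoning one equals sign is added.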
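The splitting and padding rules can be sketched directly in Java and checked against the library encoder java.util.Base64. The class `Base64Sketch` and its `encode` method are illustrative names, not an existing API; production code should simply call java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Sketch {
    private static final String TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < data.length; i += 3) {
            int remaining = Math.min(3, data.length - i);

            // Pack up to three bytes into one 24-bit group; missing bytes stay zero.
            int group = 0;
            for (int j = 0; j < 3; j++) {
                group <<= 8;
                if (j < remaining) {
                    group |= data[i + j] & 0xFF;
                }
            }

            // Emit four 6-bit units; units that hold only padding are written as '='.
            for (int k = 0; k < 4; k++) {
                if (k <= remaining) {
                    out.append(TABLE.charAt((group >> (18 - 6 * k)) & 0x3F));
                } else {
                    out.append('=');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] samples = { "M", "Ma", "Man", "Base64" };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
            String manual = encode(bytes);
            String library = Base64.getEncoder().encodeToString(bytes);
            System.out.println(s + " -> " + manual + " (matches library: " + manual.equals(library) + ")");
        }
    }
}
```

The condition `k <= remaining` encodes the padding rule from the text: one leftover byte fills two units and leaves two '=' signs, two leftover bytes fill three units and leave one.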




true 2007-04-05 16:43 Post a comment
]]>