URL Encoding
Author: Chandrasekhar Vuppalapati
Translation: eastvc
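The goal of this article is to design a C++ class that performs URL encoding. In a past project I needed to POST data from a VC++ 6.0 application, and that data had to be URL-encoded. I searched MSDN for a class or API that produces the URL encoding of a given string but found none, so I had to write my own URLEncode C++ class.
URLEncoder.exe is an MFC dialog application that uses the URLEncode class.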
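How it works
Some special characters are troublesome to transmit over the Internet. URL encoding handles them specially so that every character can be sent safely.
For example, the carriage return has ASCII value 13, and when FORM data is sent it is taken as the end of a line of data.
Typically, applications use HTTP or HTTPS to transfer data between client and server. The server can receive data from the client in two basic ways:
1. The data can be sent with the HTTP request itself, either as cookies in the headers or as posted FORM data.
2. The data can be included in the query part of the URL.
When data is included in a URL, it must be encoded according to URL syntax. On the web server, the data is decoded automatically. Consider the following URL, in which data is passed as a query parameter.
Example: http://WebSite/ResourceName?Data=Data
WebSite is the site name in the URL.
ResourceName can be the name of an ASP page or a servlet.
Data is the data to be sent. If the MIME type is Content-Type: application/x-www-form-urlencoded, encoding is required.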
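To make the server-side decoding step concrete, here is a minimal sketch in standard C++. It is not part of the article's class: the names hexDigitValue and percentDecode are hypothetical, and std::string stands in for the MFC CString used elsewhere in this article. It assumes form-style data, where '+' stands for a space.

```cpp
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
static int hexDigitValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode a form-urlencoded value.
// "%XX" becomes the byte with hex value XX and '+' becomes a space.
// A '%' that is not followed by two hex digits is copied unchanged.
std::string percentDecode(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hexDigitValue(in[i + 1]);
            int lo = hexDigitValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;          // skip the two hex digits just consumed
                continue;
            }
        }
        out += (in[i] == '+') ? ' ' : in[i];
    }
    return out;
}
```

With this sketch, a query value received as Data=a%20b would be handed to the application as "a b". Real web servers and frameworks do this for you; the code only shows what "decoded automatically" amounts to.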
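RFC 1738
RFC 1738 specifies that the characters in Uniform Resource Locators (URLs) must come from a subset of the US-ASCII character set. HTML, on the other hand, allows the entire ISO-8859-1 (ISO-Latin) character set in documents. This means that in data POSTed from an HTML FORM (or sent as part of a query string), every character outside the allowed subset must be encoded.
ISO-8859-1 (ISO-Latin) character set
The table below covers the complete ISO-8859-1 (ISO-Latin) character set. For each character range (in decimal) it gives the type, the values, and whether the characters in that range are safe.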
Character range (decimal) | Type | Values | Safe/Unsafe
0-31 | ASCII Control Characters | These characters are not printable | Unsafe
32-47 | Reserved Characters | (space) ! " # $ % & ' ( ) * + , - . / | Unsafe
48-57 | ASCII Characters and Numbers | 0-9 | Safe
58-64 | Reserved Characters | : ; < = > ? @ | Unsafe
65-90 | ASCII Characters | A-Z | Safe
91-96 | Reserved Characters | [ \ ] ^ _ ` | Unsafe
97-122 | ASCII Characters | a-z | Safe
123-126 | Reserved Characters | { '|' } ~ | Unsafe
127 | Control Characters | Not printable (DEL) | Unsafe
128-255 | Non-ASCII Characters | Characters above the US-ASCII range | Unsafe
All unsafe ASCII characters must be encoded, for example those in the ranges 32-47, 58-64, 91-96, and 123-126.
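The range table can be read as a simple predicate. The following is a minimal sketch, assuming the Safe/Unsafe column is applied literally; the name isSafeByRange is hypothetical. Note that the CURLEncode class shown later in this article takes a different approach and checks an explicit list of unsafe characters.

```cpp
// Hypothetical helper: true when the range table marks the code as Safe.
// Only digits and ASCII letters fall into a Safe range; control characters,
// reserved characters, and everything above 126 are Unsafe.
bool isSafeByRange(unsigned char code)
{
    return (code >= 48 && code <= 57)     // 0-9
        || (code >= 65 && code <= 90)     // A-Z
        || (code >= 97 && code <= 122);   // a-z
}
```

For instance, isSafeByRange('A') is true, while a space (32), a carriage return (13), and any code above 127 all yield false.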
The table below describes why these characters are unsafe.
Character | Unsafe Reason | Character Encode
< | Delimiters around URLs in free text | %3C
> | Delimiters around URLs in free text | %3E
" | Delimits URLs in some systems | %22
# | It is used in the World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. | %23
{ | Gateways and other transport agents are known to sometimes modify such characters | %7B
} | Gateways and other transport agents are known to sometimes modify such characters | %7D
'|' (vertical bar) | Gateways and other transport agents are known to sometimes modify such characters | %7C
\ | Gateways and other transport agents are known to sometimes modify such characters | %5C
^ | Gateways and other transport agents are known to sometimes modify such characters | %5E
~ | Gateways and other transport agents are known to sometimes modify such characters | %7E
[ | Gateways and other transport agents are known to sometimes modify such characters | %5B
] | Gateways and other transport agents are known to sometimes modify such characters | %5D
` | Gateways and other transport agents are known to sometimes modify such characters | %60
+ | Indicates a space (spaces cannot be used in a URL) | %2B
/ | Separates directories and subdirectories | %2F
? | Separates the actual URL and the parameters | %3F
& | Separator between parameters specified in the URL | %26
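Implementation
URL-encoding a character means writing its 8-bit value as two hexadecimal digits and prefixing them with '%'. For example, in the US-ASCII character set the space character is decimal 32, or hexadecimal 20, so its URL encoding is %20.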
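As a worked example of this rule, the sketch below encodes a single byte with the standard library's snprintf. The function name percentEncodeByte is hypothetical, and the use of snprintf is a substitution for illustration; the article's own class builds the hexadecimal digits by hand.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: percent-encode one byte as '%' followed by
// two uppercase hexadecimal digits.
std::string percentEncodeByte(unsigned char byte)
{
    char buf[4];   // '%', two hex digits, and the terminating '\0'
    std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(byte));
    return std::string(buf);
}
```

Calling percentEncodeByte(' ') gives "%20", matching the space example, and percentEncodeByte('<') gives "%3C".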
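URLEncode: URLEncode is a C++ class that implements URL encoding of strings. The CURLEncode class contains the following functions:
isUnsafe
decToHex
convert
URLEncode
The URLEncode() function performs the encoding. It checks each character to see whether it is safe. An unsafe character is converted to '%' followed by its hexadecimal value, and that sequence takes its place in the resulting string.
Code snippet: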
class CURLEncode
{
private:
    // Characters that must always be encoded
    static CString csUnsafeString;
    CString decToHex(char num, int radix);
    bool isUnsafe(char compareChar);
    CString convert(char val);
public:
    CURLEncode() { };
    virtual ~CURLEncode() { };
    CString URLEncode(CString vData);
};
// Returns true if compareChar must be encoded.
bool CURLEncode::isUnsafe(char compareChar)
{
    // Look for the character in the explicit list of unsafe characters
    bool bcharfound = false;
    int m_strLen = csUnsafeString.GetLength();
    for (int ichar_pos = 0; ichar_pos < m_strLen; ichar_pos++)
    {
        if (csUnsafeString.GetAt(ichar_pos) == compareChar)
        {
            bcharfound = true;
            break;
        }
    }
    // Outside the list, only codes 33 through 122 are treated as safe.
    // Codes above 127 fail this test whether char is signed or unsigned.
    int char_ascii_value = (int) compareChar;
    if (bcharfound == false && char_ascii_value > 32 && char_ascii_value < 123)
    {
        return false;   // safe: no encoding needed
    }
    return true;        // unsafe: must be encoded
}
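// Converts num to a string of digits in the given radix, at least two digits wide.
// hexVals is not shown in this snippet; it is assumed to be a 16-entry
// lookup table of the digit characters '0'-'9' and 'A'-'F'.
CString CURLEncode::decToHex(char num, int radix)
{
    int temp = 0;
    CString csTmp;
    int num_char = (int) num;
    // Map negative values (signed char) to 128-255
    if (num_char < 0)
        num_char = 256 + num_char;
    // Collect the digits, least significant first
    while (num_char >= radix)
    {
        temp = num_char % radix;
        num_char = num_char / radix;   // integer division, floor() is not needed
        csTmp += hexVals[temp];        // append, so earlier digits are kept
    }
    csTmp += hexVals[num_char];
    // Pad to two digits; the '0' becomes the leading digit after the reversal
    if (csTmp.GetLength() < 2)
    {
        csTmp += '0';
    }
    CString strdecToHex(csTmp);
    // Reverse the string so the most significant digit comes first
    strdecToHex.MakeReverse();
    return strdecToHex;
}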
// Returns the URL encoding of val: '%' followed by its two-digit hexadecimal value.
CString CURLEncode::convert(char val)
{
    CString csRet;
    csRet += "%";
    csRet += decToHex(val, 16);
    return csRet;
}
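The snippet does not include URLEncode() itself. Following the description in the implementation section (check each character, and replace unsafe ones with their %-hexadecimal form), the whole flow might look like the sketch below. It is a standard C++ approximation, not the author's code: std::string replaces CString, the function names are hypothetical, and the contents of the unsafe-character list are an assumption, since the article does not show the value of csUnsafeString.

```cpp
#include <string>

// Assumption: illustrative list only; the article does not show
// the actual contents of csUnsafeString.
static const std::string kUnsafeChars = "\"<>%\\^[]`+$,@:;/!#?=&";

// Mirrors the isUnsafe logic: a character is unsafe if it is in the list
// or its code is outside 33-122.
bool isUnsafeChar(char c)
{
    unsigned char code = static_cast<unsigned char>(c);
    if (kUnsafeChars.find(c) != std::string::npos) return true;
    return !(code > 32 && code < 123);
}

// '%' followed by the two uppercase hex digits of the byte.
std::string toPercentHex(char c)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    unsigned char code = static_cast<unsigned char>(c);
    std::string out = "%";
    out += hexDigits[code >> 4];      // high nibble
    out += hexDigits[code & 0x0F];    // low nibble
    return out;
}

// Builds the encoded string: safe characters are copied,
// unsafe ones are replaced by their %XX form.
std::string urlEncodeSketch(const std::string& data)
{
    std::string result;
    for (char c : data) {
        if (isUnsafeChar(c)) result += toPercentHex(c);
        else                 result += c;
    }
    return result;
}
```

Under these assumptions, urlEncodeSketch("a b") yields "a%20b" and urlEncodeSketch("x=1&y=2") yields "x%3D1%26y%3D2". Building a separate result string, rather than modifying the input in place, keeps the loop index simple because each unsafe character grows from one byte to three.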
References:
URL encoding:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
RFC 1866: The HTML 2.0 specification (plain text). The appendix contains the character table:
http://www.rfc-editor.org/rfc/rfc1866.txt
Web version of HTML 2.0 (RFC 1866):
http://www.w3.org/MarkUp/html-spec/html-spec_13.html
The HTML 3.2 (Wilbur) recommendation:
http://www.w3.org/MarkUp/Wilbur/
The HTML 4.0 recommendation:
http://www.w3.org/TR/REC-html40/
W3C HTML internationalization area:
http://www.w3.org/International/O-HTML.html