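[Repost] Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles

Today we look at a related problem. Sometimes, besides extracting keywords, we also want to find other articles that are similar to a given one. Google News, for example, lists several similar stories beneath each main story.

Finding similar articles relies on "cosine similarity". The example below explains what that means.

To keep things simple, we start with sentences.

　　Sentence A: 我喜歡看電視，不喜歡看電影。 (I like watching TV, but I don't like watching movies.)
　　Sentence B: 我不喜歡看電視，也不喜歡看電影。 (I don't like watching TV, and I don't like watching movies either.)

How can we measure how similar these two sentences are?

The basic idea: the more alike the words used in two sentences, the more alike their content should be. So we can start from word frequencies.

Step 1: segment the sentences into words.

　　Sentence A: 我/喜歡/看/電視，不/喜歡/看/電影。
　　Sentence B: 我/不/喜歡/看/電視，也/不/喜歡/看/電影。

Step 2: list all the words.

　　我 (I), 喜歡 (like), 看 (watch), 電視 (TV), 電影 (movie), 不 (not), 也 (also).

Step 3: count the word frequencies.

　　Sentence A: 我 1, 喜歡 2, 看 2, 電視 1, 電影 1, 不 1, 也 0.
　　Sentence B: 我 1, 喜歡 2, 看 2, 電視 1, 電影 1, 不 2, 也 1.

Step 4: write out the word-frequency vectors.

　　Sentence A: [1, 2, 2, 1, 1, 1, 0]
　　Sentence B: [1, 2, 2, 1, 1, 2, 1]

The problem has now become: how do we measure the similarity of these two vectors?

Picture them as two line segments in space, both starting at the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. An angle of 0 degrees means they point the same way and coincide; 90 degrees means they are at right angles and completely dissimilar in direction; 180 degrees means they point in exactly opposite directions. So the size of the angle tells us how similar the vectors are: the smaller the angle, the more similar.

Take the two-dimensional case: a and b are two vectors and we want the angle θ between them. Writing a and b for the lengths of the two vectors and c for the length of the side joining their endpoints, the law of cosines gives:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}$$

If vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten as:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

Mathematicians have shown that this way of computing the cosine also holds for n-dimensional vectors. If A and B are two n-dimensional vectors, A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn], then the cosine of the angle θ between A and B is:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

With this formula we can compute the cosine of the angle between sentence A and sentence B:

$$\cos\theta = \frac{1\cdot1 + 2\cdot2 + 2\cdot2 + 1\cdot1 + 1\cdot1 + 1\cdot2 + 0\cdot1}{\sqrt{12}\,\sqrt{16}} = \frac{13}{4\sqrt{12}} \approx 0.938$$

The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. This is what "cosine similarity" means. So sentences A and B above are very similar; the angle between them is in fact only about 20.2 degrees.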
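The calculation for the two sentence vectors can be sketched in a few lines of C++. This is only an illustration: the function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made here, not part of the original article.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine of the angle between two word-frequency vectors of equal length.
// Returning 0.0 for a zero-length vector is a convention chosen for this sketch.
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("vectors must have the same length");
    }
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];       // numerator: sum of A_i * B_i
        norm_a += a[i] * a[i];    // squared length of A
        norm_b += b[i] * b[i];    // squared length of B
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Calling it with {1, 2, 2, 1, 1, 1, 0} and {1, 2, 2, 1, 1, 2, 1} gives a value of roughly 0.938, matching the hand calculation.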

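This gives us an algorithm for finding similar articles:

　　(1) Use TF-IDF to find the keywords of the two articles.
　　(2) Take a number of keywords from each article (say 20), merge them into one set, and compute each article's frequency for every word in that set (to compensate for differences in article length, use relative frequencies).
　　(3) Build a word-frequency vector for each article.
　　(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.

Cosine similarity is a very useful measure: it can be applied whenever you need to compare how similar two vectors are.

posted @ 2014-03-06 21:36 by 不會飛的鳥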

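[Repost] Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction

The title looks complicated, but the question I want to discuss is a simple one.

Given a very long article, I want a computer to extract its keywords (automatic keyphrase extraction) with no human intervention at all. How can that be done correctly?

The problem touches on data mining, text processing, information retrieval and other active areas of computer science. Surprisingly, there is a very simple classic algorithm that gives quite satisfying results. It needs no advanced mathematics, and most people can understand it in ten minutes. That algorithm is TF-IDF, the subject of this post.

Let's start with an example. Suppose we have a long article titled "Bee Farming in China" (《中國的蜜蜂養殖》) and want to extract its keywords by computer.

An obvious idea is to find the words that occur most often. If a word is important, it should appear many times in the article. So we count "term frequency" (TF).

As you have surely guessed, the most frequent words are the most common ones, such as "的", "是" and "在". These are called "stop words": words that do nothing to help find the result and must be filtered out.

Suppose we filter them all out and keep only the meaningful words. We then hit another problem: we may find that "中國" (China), "蜜蜂" (bee) and "養殖" (farming) occur equally often. Does that mean they are equally important as keywords?

Obviously not. "中國" is a very common word, while "蜜蜂" and "養殖" are less common. If the three occur equally often in one article, it is reasonable to regard "蜜蜂" and "養殖" as more important than "中國"; in a keyword ranking they should come before "中國".

So we need an importance adjustment factor that measures whether a word is common. If a word is rare in general but occurs many times in this article, it probably reflects what the article is about and is exactly the keyword we want.

In statistical terms, on top of term frequency we assign each word an importance weight. The most common words ("的", "是", "在") get the smallest weight, fairly common words ("中國") get a small weight, and rarer words ("蜜蜂", "養殖") get a larger weight. This weight is called "inverse document frequency" (IDF); it decreases as a word becomes more common.

Once we have the term frequency (TF) and the inverse document frequency (IDF), multiplying them gives the word's TF-IDF value. The more important a word is to the article, the larger its TF-IDF value. The top-ranked words are therefore the article's keywords.

Here are the details of the algorithm.

Step 1: compute the term frequency.

Because articles differ in length, the term frequency is normalized so that different articles can be compared:

$$\mathrm{TF} = \frac{\text{number of times the word occurs in the article}}{\text{total number of words in the article}}$$

or

$$\mathrm{TF} = \frac{\text{number of times the word occurs in the article}}{\text{number of occurrences of the most frequent word in the article}}$$

Step 2: compute the inverse document frequency.

This requires a corpus that models how the language is actually used.

$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right)$$

The more common a word is, the larger the denominator, and the smaller (closer to 0) the inverse document frequency. The 1 is added to the denominator to avoid dividing by zero when no document contains the word. log means taking the logarithm of the resulting value.

Step 3: compute TF-IDF.

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$$

TF-IDF grows with the number of times a word occurs in the document and shrinks as the word becomes more common across the language as a whole. The keyword extraction algorithm is now clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few.

Back to "Bee Farming in China". Suppose the article is 1000 words long and "中國", "蜜蜂" and "養殖" each occur 20 times, so each has a term frequency (TF) of 0.02. A Google search shows 25 billion pages containing "的"; assume that is the total number of Chinese web pages. There are 6.23 billion pages containing "中國", 48.4 million containing "蜜蜂" and 97.3 million containing "養殖". Using base-10 logarithms, their IDF and TF-IDF values are:

　　Word              Pages containing it (billions)   IDF     TF-IDF
　　中國 (China)      6.23                             0.603   0.0121
　　蜜蜂 (bee)        0.0484                           2.713   0.0543
　　養殖 (farming)    0.0973                           2.410   0.0482

As the table shows, "蜜蜂" has the highest TF-IDF, "養殖" comes second and "中國" is lowest. (The TF-IDF of "的" would be extremely close to 0.) So if we pick only one word, "蜜蜂" is the keyword of this article.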
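The three steps can be reproduced with a small C++ sketch. The function names are made up for this illustration, and the base-10 logarithm is an assumption chosen because it matches the figures in the worked example; the article itself only says "log".

```cpp
#include <cmath>

// Step 1: normalized term frequency (occurrences divided by article length).
double term_frequency(double occurrences, double total_words) {
    return occurrences / total_words;
}

// Step 2: inverse document frequency; the +1 avoids a zero denominator
// when no document in the corpus contains the word.
double inverse_document_frequency(double total_docs, double docs_with_word) {
    return std::log10(total_docs / (docs_with_word + 1.0));
}

// Step 3: TF-IDF is the product of the two.
double tf_idf(double occurrences, double total_words,
              double total_docs, double docs_with_word) {
    return term_frequency(occurrences, total_words) *
           inverse_document_frequency(total_docs, docs_with_word);
}
```

For example, `tf_idf(20, 1000, 25e9, 48.4e6)` corresponds to the "蜜蜂" row and evaluates to about 0.054, while `tf_idf(20, 1000, 25e9, 6.23e9)` corresponds to "中國" and gives about 0.012.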

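Besides automatic keyword extraction, TF-IDF is useful in many other places. In information retrieval, for example, you can compute the TF-IDF of each search term ("中國", "蜜蜂", "養殖") for every document and add them up to get the document's overall score. The document with the highest score is the one most relevant to the search terms.

The strengths of TF-IDF are that it is simple and fast and that its results match reality fairly well. Its weakness is that measuring a word's importance by frequency alone is not comprehensive: an important word may not occur many times. The algorithm also ignores position: words near the beginning and words near the end are treated as equally important, which is not right. (One remedy is to give more weight to the first paragraph of the article and to the first sentence of each paragraph.)

posted @ 2014-03-06 21:35 by 不會飛的鳥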

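About pthread_detach()

man pthread_detach

pthread_t

Type definition: typedef unsigned long int pthread_t;    // from /usr/include/bits/pthread.h

Purpose: pthread_t is used to declare thread IDs. sizeof(pthread_t) is 4 on 32-bit Linux (on 64-bit Linux, where unsigned long is 8 bytes, it is 8).

Threads on Linux behave differently from threads on Windows: a pthread is in one of two states, joinable or unjoinable (detached).

A thread is joinable by default. When a joinable thread's function returns or calls pthread_exit, the stack and thread descriptor it occupies (a little over 8K in total) are not released. They are released only after you call pthread_join.
For an unjoinable thread, these resources are released automatically when the thread function returns or calls pthread_exit.

The unjoinable attribute can be specified in pthread_create, or the thread can detach itself after it is created, for example with pthread_detach(pthread_self()), which switches it to the unjoinable state and ensures its resources are released. If the thread is joinable, you need to call pthread_join at a suitable point later.

pthread_self() returns the thread ID of the thread that calls it.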
NAME
pthread_self - get the calling thread ID

SYNOPSIS
#include <pthread.h>

pthread_t pthread_self(void);

DESCRIPTION
The pthread_self() function shall return the thread ID of the calling thread.

RETURN VALUE
Refer to the DESCRIPTION.

ERRORS
No errors are defined.

The pthread_self() function shall not return an error code of [EINTR].


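/* example: test.c */

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void *print_message(void *ptr);

int main(int argc, char *argv[])
{
    pthread_t thread_id;
    while (1)
    {
        /* A thread is joinable by default. */
        int rc = pthread_create(&thread_id, NULL, print_message, NULL);
        if (rc != 0)
        {
            /* pthread_create returns the error number; it does not set errno. */
            fprintf(stderr, "pthread_create: %s\n", strerror(rc));
            sleep(1);
        }
    }

    return 0;
}

void *print_message(void *ptr)
{
    /* Switch this thread to the unjoinable state so its resources are released on exit. */
    pthread_detach(pthread_self());
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int g;                   /* shared counter, protected by lock */
    pthread_mutex_lock(&lock);
    printf("%d\n", g++);
    pthread_mutex_unlock(&lock);
    pthread_exit(0);                /* resources are released automatically here */
}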
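The text above mentions that the unjoinable attribute can also be set when the thread is created. A minimal sketch of that variant follows, using the standard pthread attribute calls. The helper name `start_detached`, the worker `count_up` and the counter `g_finished` are hypothetical names introduced only for this illustration.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <atomic>

std::atomic<int> g_finished{0};   // number of worker threads that have run

// Example worker: just records that it ran.
void* count_up(void*) {
    g_finished.fetch_add(1);
    return nullptr;
}

// Start fn(arg) in a thread that is detached from the moment it is created,
// so no pthread_join or pthread_detach call is needed afterwards.
// Returns 0 on success, otherwise the pthread error number.
int start_detached(void* (*fn)(void*), void* arg) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0) {
        pthread_t tid;
        rc = pthread_create(&tid, &attr, fn, arg);
    }
    pthread_attr_destroy(&attr);
    return rc;
}
```

Compared with calling pthread_detach(pthread_self()) inside the thread function, this keeps the thread function free of lifetime management, at the cost of a few extra lines at the call site.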

posted @ 2014-02-12 22:12 by 不會飛的鳥

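A Detailed Guide to Configuring IPTABLES on Linux

If you are not yet familiar with the basics of IPTABLES, read up on them first.

Getting started

Let's configure a firewall using the filter table.

(1) View the current IPTABLES settings on this machine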

[root@tp ~]# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain RH-Firewall-1-INPUT (0 references)
target       prot opt source                 destination
ACCEPT       all    --    0.0.0.0/0              0.0.0.0/0
ACCEPT       icmp --    0.0.0.0/0              0.0.0.0/0             icmp type 255
ACCEPT       esp    --    0.0.0.0/0              0.0.0.0/0
ACCEPT       ah     --    0.0.0.0/0              0.0.0.0/0
ACCEPT       udp    --    0.0.0.0/0              224.0.0.251           udp dpt:5353
ACCEPT       udp    --    0.0.0.0/0              0.0.0.0/0             udp dpt:631
ACCEPT       all    --    0.0.0.0/0              0.0.0.0/0             state RELATED,ESTABLISHED
ACCEPT       tcp    --    0.0.0.0/0              0.0.0.0/0             state NEW tcp dpt:22
ACCEPT       tcp    --    0.0.0.0/0              0.0.0.0/0             state NEW tcp dpt:80
ACCEPT       tcp    --    0.0.0.0/0              0.0.0.0/0             state NEW tcp dpt:25
REJECT       all    --    0.0.0.0/0              0.0.0.0/0             reject-with icmp-host-prohibited
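As you can see, I chose to enable the firewall when installing Linux and opened ports 22, 80 and 25.

If you did not enable the firewall during installation, the output looks like this: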

[root@tp ~]# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

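There are no rules at all.

(2) Clear the existing rules.

Whether or not you enabled the firewall during installation, if you want to build your own firewall configuration, clear all the current rules in the filter table.

[root@tp ~]# iptables -F    # flush the rules in every chain of the default filter table
[root@tp ~]# iptables -X    # delete the user-defined chains in the default filter table

Let's look again: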

[root@tp ~]# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

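Everything is gone, just as if the firewall had never been enabled during installation. (A word in advance: like an IP address configured from the command line, these settings are lost on reboot.) To save them:

[root@tp ~]# /etc/rc.d/init.d/iptables save

This writes the rules to /etc/sysconfig/iptables. After saving, remember to restart the firewall so that the saved rules take effect.

[root@tp ~]# service iptables restart

The IPTABLES configuration is now empty, so let's start configuring.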

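(3) Set the default policies

[root@tp ~]# iptables -P INPUT DROP

[root@tp ~]# iptables -P OUTPUT ACCEPT

[root@tp ~]# iptables -P FORWARD DROP
This means that a packet matching none of the rules in the INPUT and FORWARD chains of the filter table is DROPped (discarded). This is a safe configuration: we want to control incoming packets.

For the OUTPUT chain, that is, outgoing packets, we do not need many restrictions, so the policy is ACCEPT: a packet that matches none of the rules is let through.

In other words, the INPUT and FORWARD chains list what is allowed through, while the OUTPUT chain lists what is not allowed through.

This setup is reasonable. You could set all three chains to DROP, but I consider that unnecessary, and it means writing more rules. If you only want a handful of rules, for example on a machine that is only a web server, setting all three chains to DROP is still recommended.

Note: if you are logged in remotely over SSH, your connection should drop as soon as you press Enter on the first command, because no rules have been set yet.

What now? Go and work on the machine's console!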

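(4) Add rules.

Start with the INPUT chain. Its default policy is DROP, so we write rules for what should be ACCEPTed.

To allow remote SSH logins, open port 22.

[root@tp ~]# iptables -A INPUT -p tcp --dport 22 -j ACCEPT

[root@tp ~]# iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT

(Note: the second rule is needed only if you set the OUTPUT policy to DROP. Many people forget it and then can never get SSH to work. Try connecting remotely again; it should work now.

The same goes for other ports. If you run a web server and OUTPUT is set to DROP, add a matching rule as well:

[root@tp ~]# iptables -A OUTPUT -p tcp --sport 80 -j ACCEPT

and likewise for other services.)

If the machine is a web server, open port 80.

[root@tp ~]# iptables -A INPUT -p tcp --dport 80 -j ACCEPT
If the machine is a mail server, open ports 25 and 110.

[root@tp ~]# iptables -A INPUT -p tcp --dport 110 -j ACCEPT
[root@tp ~]# iptables -A INPUT -p tcp --dport 25 -j ACCEPT
If the machine is an FTP server, open ports 21 and 20.

[root@tp ~]# iptables -A INPUT -p tcp --dport 21 -j ACCEPT

[root@tp ~]# iptables -A INPUT -p tcp --dport 20 -j ACCEPT

If the machine is a DNS server, open port 53 (DNS queries normally use UDP, so open UDP as well as TCP).

[root@tp ~]# iptables -A INPUT -p tcp --dport 53 -j ACCEPT

[root@tp ~]# iptables -A INPUT -p udp --dport 53 -j ACCEPT

If you run other services, open whichever ports they need in the same way.

The rules above are mostly for the INPUT chain; anything not covered by them is DROPped.

Allow ICMP packets, that is, allow ping:

[root@tp ~]# iptables -A OUTPUT -p icmp -j ACCEPT    (if the OUTPUT policy is DROP)

[root@tp ~]# iptables -A INPUT -p icmp -j ACCEPT    (if the INPUT policy is DROP)

Allow loopback traffic! (Otherwise you may run into problems such as DNS not shutting down properly.)

[root@tp ~]# iptables -A INPUT -i lo -p all -j ACCEPT    (if the INPUT policy is DROP)
[root@tp ~]# iptables -A OUTPUT -o lo -p all -j ACCEPT    (if the OUTPUT policy is DROP)

Next, the OUTPUT chain. Its default policy is ACCEPT, so we write rules for what should be DROPped.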

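Reduce connections on unsafe ports

[root@tp ~]# iptables -A OUTPUT -p tcp --sport 31337 -j DROP

[root@tp ~]# iptables -A OUTPUT -p tcp --dport 31337 -j DROP

Some trojans scan for services on ports 31337 to 31340 (the "elite" ports in hacker slang). Since legitimate services do not use these non-standard ports, blocking them reduces the chance that an infected machine on your network can communicate independently with its remote master server.

The same applies to other ports, such as 31335, 27444, 27665, 20034 (NetBus), 9704, 137-139 (SMB) and 2049 (NFS), which should also be blocked. My list is not complete; if you are interested, look up the relevant material.

For even tighter security you can of course set the OUTPUT chain to DROP as well. You then have more rules to add, along the lines of the SSH rule above. Just follow the same pattern.

Now for some finer-grained rules that restrict access to specific machines.

For example, to allow SSH connections only from the machine 192.168.0.3:

[root@tp ~]# iptables -A INPUT -s 192.168.0.3 -p tcp --dport 22 -j ACCEPT

To allow or restrict a range of addresses, use a form such as 192.168.0.0/24, which covers every address from 192.168.0.0 to 192.168.0.255.

The 24 is the prefix length, that is, the number of 1 bits in the netmask (255.255.255.0). Remember, though, to delete the following line from /etc/sysconfig/iptables,

-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT

because it allows logins from every address.

Or do it from the command line:

[root@tp ~]# iptables -D INPUT -p tcp --dport 22 -j ACCEPT

Then save. To repeat: anything done from the command line takes effect only for the current session. If you want it to survive a reboot, you must save it to /etc/sysconfig/iptables.

[root@tp ~]# /etc/rc.d/init.d/iptables save

Prefixing the address with ! (for example -s ! 192.168.0.3, written ! -s 192.168.0.3 in newer iptables versions) matches every IP address except 192.168.0.3.

Rules for other kinds of connections are set up the same way.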
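To make the 192.168.0.0/24 notation concrete, here is a small C++ sketch that checks whether an address falls inside a network given in address/prefix form. It only illustrates the arithmetic; it is not how iptables is implemented, and the helper names `parse_ipv4` and `in_cidr` are invented for this example.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Parse a dotted-quad IPv4 address into a 32-bit number.
// Returns false if the text is not of the form a.b.c.d with each part 0..255.
bool parse_ipv4(const std::string& text, std::uint32_t& out) {
    unsigned a, b, c, d;
    char extra;
    if (std::sscanf(text.c_str(), "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4) {
        return false;
    }
    if (a > 255 || b > 255 || c > 255 || d > 255) {
        return false;
    }
    out = (a << 24) | (b << 16) | (c << 8) | d;
    return true;
}

// True if ip lies inside network/prefix_len, e.g. 192.168.0.0/24.
bool in_cidr(const std::string& ip, const std::string& network, int prefix_len) {
    std::uint32_t ip_bits = 0, net_bits = 0;
    if (prefix_len < 0 || prefix_len > 32) {
        return false;
    }
    if (!parse_ipv4(ip, ip_bits) || !parse_ipv4(network, net_bits)) {
        return false;
    }
    // A /24 prefix keeps the top 24 bits, i.e. the netmask 255.255.255.0.
    std::uint32_t mask =
        prefix_len == 0 ? 0 : ~std::uint32_t(0) << (32 - prefix_len);
    return (ip_bits & mask) == (net_bits & mask);
}
```

With this, `in_cidr("192.168.0.3", "192.168.0.0", 24)` is true, while `in_cidr("192.168.1.3", "192.168.0.0", 24)` is false because the third octet differs inside the 24-bit prefix.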

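Next comes the FORWARD chain. Its default policy is DROP, so we write rules for what should be ACCEPTed, which lets us monitor the traffic being forwarded.

Enable forwarding (required when doing NAT with a FORWARD policy of DROP)

[root@tp ~]# iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT

[root@tp ~]# iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT

Drop bad TCP packets

[root@tp ~]# iptables -A FORWARD -p TCP ! --syn -m state --state NEW -j DROP

Limit the number of IP fragments to guard against attacks, allowing 100 per second

[root@tp ~]# iptables -A FORWARD -f -m limit --limit 100/s --limit-burst 100 -j ACCEPT

Filter ICMP packets: allow 1 packet per second, with the limit kicking in after a burst of 10 packets.

[root@tp ~]# iptables -A FORWARD -p icmp -m limit --limit 1/s --limit-burst 10 -j ACCEPT

This limit is the reason I allowed ICMP packets through earlier.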
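The interplay of --limit and --limit-burst is easier to follow with a toy model. The C++ sketch below is a simplified token-bucket model of the behaviour described above, written for illustration; it is not the kernel's actual implementation, and the class name `TokenBucket` and its interface are assumptions.

```cpp
#include <algorithm>

// Simplified model of "-m limit --limit RATE/s --limit-burst BURST":
// the bucket starts full with BURST tokens, every matching packet uses one,
// and tokens are refilled at RATE per second up to BURST.
class TokenBucket {
public:
    TokenBucket(double rate_per_sec, double burst)
        : rate_(rate_per_sec), burst_(burst), tokens_(burst), last_(0.0) {}

    // now_sec is a monotonically increasing timestamp in seconds.
    // Returns true if the packet is within the limit (the rule matches).
    bool allow(double now_sec) {
        tokens_ = std::min(burst_, tokens_ + (now_sec - last_) * rate_);
        last_ = now_sec;
        if (tokens_ >= 1.0) {
            tokens_ -= 1.0;
            return true;
        }
        return false;
    }

private:
    double rate_;    // tokens added per second
    double burst_;   // bucket capacity
    double tokens_;  // tokens currently available
    double last_;    // time of the previous call
};
```

With `TokenBucket bucket(1.0, 10.0)`, the first 10 packets arriving at the same instant are allowed and further ones are refused; one second later a single packet gets through again, which mirrors the ICMP rule of 1 packet per second with a burst of 10.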

II. Configuring a firewall using the NAT table

1. View the current NAT settings on this machine

[root@tp rc.d]# iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
SNAT all -- 192.168.0.0/24 anywhere to:211.101.46.235

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

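My NAT is already configured (it only provides the simplest proxy-style Internet access; no firewall rules have been added yet). For how to configure NAT, see my other article.

If you have not configured NAT yet, there is no need to clear any rules, because the NAT table is empty by default.

If you do want to clear it, the commands are:

[root@tp ~]# iptables -F -t nat

[root@tp ~]# iptables -X -t nat

[root@tp ~]# iptables -Z -t nat

2. Add rules

Add the basic NAT address translation (see my other article for how to configure NAT).

When adding rules we add only DROP rules, because the default policy of every chain is ACCEPT. (Note: recent iptables releases refuse the DROP target in the nat table, which is not meant for filtering; on such systems the equivalent rules belong in the filter table.)

Prevent hosts on the external network from spoofing internal IP addresses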

[root@tp sysconfig]# iptables -t nat -A PREROUTING -i eth0 -s 10.0.0.0/8 -j DROP
[root@tp sysconfig]# iptables -t nat -A PREROUTING -i eth0 -s 172.16.0.0/12 -j DROP
[root@tp sysconfig]# iptables -t nat -A PREROUTING -i eth0 -s 192.168.0.0/16 -j DROP
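If we wanted to block, say, MSN, QQ or BT, we would need to find the ports or IPs they use (personally I don't think this is really necessary).

Examples:

Block all connections to 211.101.46.253

[root@tp ~]# iptables -t nat -A PREROUTING -d 211.101.46.253 -j DROP

Block the FTP port (21)

[root@tp ~]# iptables -t nat -A PREROUTING -p tcp --dport 21 -j DROP

That is too broad; we can be more precise.

[root@tp ~]# iptables -t nat -A PREROUTING -p tcp --dport 21 -d 211.101.46.253 -j DROP

This blocks only FTP connections to 211.101.46.253; other connections, such as web (port 80), still work.

Following this pattern, once you know the IP addresses, ports and protocols used by QQ, MSN or other software, you can write the rules the same way.

Finally:

Drop invalid connections
[root@tp ~]# iptables -A INPUT -m state --state INVALID -j DROP
[root@tp ~]# iptables -A OUTPUT -m state --state INVALID -j DROP
[root@tp ~]# iptables -A FORWARD -m state --state INVALID -j DROP
Allow all established and related connections
[root@tp ~]# iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
[root@tp ~]# iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

[root@tp ~]# /etc/rc.d/init.d/iptables save

This writes the rules to /etc/sysconfig/iptables. After saving, remember to restart the firewall so that the saved rules take effect.

[root@tp ~]# service iptables restart

Don't forget to save. If in doubt, save after every step. You can save and experiment as you go to check that the result is what you want.

I have tried all of the rules above and they work.

Writing this article took me nearly a month of looking up material and experimenting. I hope it helps. If anything is missing or incomplete, please point it out.

This article focuses on configuration. I will post the IPTABLES basics and a command reference as soon as I can; you can also search online, where there is plenty of material.

posted @ 2013-12-21 22:37 by 不會飛的鳥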

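echo: printing colored text

NAME
       echo - display a line of text
SYNOPSIS
       echo [OPTION]... [STRING]...
DESCRIPTION
       Write the strings to standard output.
       -n     do not output the trailing newline
       -e     enable interpretation of backslash escapes
       -E     disable interpretation of backslash escapes (default)
       --help display help
       --version display version information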
       \a     alert (BEL)
       \b     backspace
       \c     produce no further output
       \e     escape
       \f     form feed
       \n     new line
       \r     carriage return
       \t     horizontal tab
       \v     vertical tab
       \0NNN byte with octal value NNN (1 to 3 digits)
       \xHH   byte with hexadecimal value HH (1 to 2 digits)

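-e enables escape interpretation in echo; \e or \033 outputs the <Esc> character.
Color format: \e[background;foreground;highlightm followed by the text, then \e[0m

$ echo -e '\033[31;44;1mThis is print\033[0m'
$ echo -e '\e[31;44;1mThis is print\e[0m'

Background colors: 0 transparent (use the terminal's color), 40 black, 41 red, 42 green, 43 yellow, 44 blue, 45 magenta, 46 cyan, 47 white (gray)
Foreground colors: 30 black, 31 red, 32 green, 33 yellow, 34 blue, 35 magenta, 36 cyan, 37 white (gray)
Highlight: 1 for highlighted (bold/bright), 0 for normal. Note that the text follows immediately after the m.

posted @ 2013-10-10 15:15 by 不會飛的鳥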

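Colored string output with printf on Linux

printf("\033[<background color>;<text color>m<string>\033[0m" );

printf("\033[41;32mred background, green text\033[0m\n");
41 is the background color and 32 is the text color; "red background, green text" is the string itself. The trailing \033[0m is a control code that resets the attributes.

Color codes:
QUOTE:
Background color range: 40--49    Text color range: 30--39

                40: black                        30: black
                41: red                          31: red
                42: green                        32: green
                43: yellow                       33: yellow
                44: blue                         34: blue
                45: magenta                      35: magenta
                46: cyan                         36: cyan
                47: white                        37: white

ANSI control codes:
QUOTE:
\033[0m   reset all attributes
\033[1m   highlight (bold/bright)
\033[4m   underline
\033[5m   blink
\033[7m   reverse video
\033[8m   concealed
\033[30m   --   \033[37m   set the foreground color
\033[40m   --   \033[47m   set the background color
\033[nA   move the cursor up n lines
\033[nB   move the cursor down n lines
\033[nC   move the cursor right n columns
\033[nD   move the cursor left n columns
\033[y;xH   set the cursor position
\033[2J   clear the screen
\033[K   clear from the cursor to the end of the line
\033[s   save the cursor position
\033[u   restore the cursor position
\033[?25l   hide the cursor
\033[?25h   show the cursor

With these codes, output can be updated dynamically when needed.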
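As a small illustration of how these codes compose, the following C++ sketch wraps a string in a color sequence and the reset code. The function name `colorize` and its parameter order are assumptions made for this example.

```cpp
#include <string>

// Wrap text in an ANSI color sequence: ESC[<background>;<foreground>m ... ESC[0m.
// When bold is true, the highlight attribute 1 is appended to the sequence.
std::string colorize(const std::string& text, int background, int foreground,
                     bool bold = false) {
    std::string out = "\033[" + std::to_string(background) + ";" +
                      std::to_string(foreground);
    if (bold) {
        out += ";1";
    }
    out += "m" + text + "\033[0m";  // always reset so later output is unaffected
    return out;
}
```

For instance, `printf("%s\n", colorize("warning", 41, 32).c_str());` would print "warning" in green on a red background on a terminal that understands ANSI escape sequences.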

posted @ 2013-10-07 15:39 by 不會飛的鳥

A simple regular expression module ([Repost] Fast-regular-expressions)

Original source: http://www.codeproject.com/Articles/798/Fast-regular-expressions

Fast regular expressions

By Martin Holzherr, 29 Oct 2000

Introduction

Regular expressions are a well recognized way for describing string patterns. The following regular expression defines a floating point number with a (possibly empty) integer part, a non empty fractional part and an optional exponent:

[0-9]* \.[0-9]+ ([Ee](\+|-)?[0-9]+)?

The rules for interpreting and constructing such regular expressions are explained below. A regular expression parser takes a regular expression and a source string as arguments and returns the source position of the first match. Regular expression parsers either interpret the search pattern at runtime or they compile the regular expression into an efficient internal form (known as deterministic finite automaton). The regular expression parser described here belongs to the second category. Besides being quite fast, it also supports dictionaries of regular expressions. With the definitions $Int= [0-9], $Frac= \.[0-9]+ and $Exp= ([Ee](\+|-)?[0-9]+), the above regular expression for a floating point number can be abbreviated to $Int* $Frac $Exp?.

Interface

I separated algorithmic from interface issues. The files RexAlgorithm.h and RexAlgorithm.cpp implement the regular expression parser using only standard C++ (relying on STL), whereas the file RexInterface.h and RexInterface.cpp contain the interfaces for the end user. Currently there is only one interface, implemented in the class REXI_Search. Interfaces for replace functionality and for programming language scanners are planned for future releases.

struct REXI_DefErr{
    enum{eNoErr,eErrInName,eErrInRegExp} eErrCode;
    string  strErrMsg;
    int     nErrOffset;
};
class REXI_Search : public REXI_Base
{
public:
    REXI_Search(char cEos='\0');
    REXI_DefErr AddRegDef   (string strName,string strRegExp);
    inline REXI_DefErr SetRegexp  (string strRegExp);
    bool    MatchHere   (const char*& rpcszSrc, int& nMatchLen,bool& bEos);
    bool    Find        (const char*& rpcszSrc, int& nMatchLen,bool& bEos);
private:
    bool    MatchHereImpl();
    int     m_nIdAnswer;
};

Example usage

#include <cassert>
#include <iostream>
#include "RexInterface.h"
using namespace std;

int main(int argc, char* argv[])
{
    const char szTestSrc[]= "3.1415 is the same as 31415e-4";
    const int ncOk= REXI_DefErr::eNoErr;
    REXI_Search rexs;
    REXI_DefErr err;
    // Define the named building blocks, then the full pattern.
    err= rexs.AddRegDef("$Int","[0-9]+");  assert(err.eErrCode==ncOk);
    err= rexs.AddRegDef("$Frac","\\.[0-9]+"); assert(err.eErrCode==ncOk);
    err= rexs.AddRegDef("$Exp","([Ee](\\+|-)?[0-9]+)");
    assert(err.eErrCode==ncOk);
    err= rexs.SetRegexp("($Int? $Frac $Exp?|$Int \\. $Exp?|$Int $Exp)[fFlL]?");
    assert(err.eErrCode==ncOk);
    const char*     pCur= szTestSrc;
    int             nMatchLen;
    bool            bEosFound= false;
    cout    <<  "Source text is: \""    <<  szTestSrc   << "\"" <<  endl;
    // Find advances pCur past each match, so the loop visits every match in turn.
    while(rexs.Find(pCur,nMatchLen,bEosFound)){
        cout <<  "Floating point number found  at position "
             <<  ((pCur-szTestSrc)-nMatchLen)
             <<  " having length "  <<  nMatchLen  <<  endl;
    }
    int i;
    cin >> i;   // wait for input so the console window stays open
    return 0;
}

Performance issues

A call to the member function REXI_Search::SetRegexp(strRegExp)involves quite a lot of computing. The regular expression strRegExp is analyzed and after several steps transformed into a compiled form. Because of this preprocessing work, which is not needed in the case of an interpreting regular expression parser, this regular expression parser shows its efficiency only when you apply it to large input strings or if you are searching again and again for the same regular expression. A typical application which profits from the preprocessing needed by this parser is a utility which searches all files in a directory.

Limitations

Currently Unicode is not supported. There is no fundamental reason for this limitation and I think that a later release will correct this. I just did not yet find an efficient representation of a compiled regular expression which supports Unicode.

Constructing regular expressions

Regular expressions can be built from characters and special symbols. There are some similarities between regular expressions and arithmetic expressions. The most basic elements of arithmetic expressions are numbers and expressions enclosed in parens ( ). The most basic elements of regular expressions are characters, regular expressions enclosed in parens ( ) and character sets. On the next higher level, arithmetic expressions have '*' and '/' operators, whereas regular expressions have operators indicating the multiplicity of the preceding element.

Most basic elements of regular expressions

Individual characters. e.g. "h" is a regular expression. In the string "this home" it matches the beginning of 'home'. For non printable characters, one has to use either the notation \xhh where h means a hexadecimal digit or one of the escape sequences \n \r \t \v known from "C". Because the characters * + ? . | [ ] ( ) - $ ^ have a special meaning in regular expressions, escape sequences must also be used to specify these characters literally: \* \+ \? \. \| \[ \] \( \) \- \$ \^ . Furthermore, use '\ ' to indicate a space, because this implementation skips spaces in order to support a more readable style.
Character sets enclosed in square brackets [ ]. e.g. "[A-Za-z_$]" matches any alphabetic character, the underscore and the dollar sign (the dash (-) indicates a range), e.g. [A-Za-z$_] matches "B", "b", "_", "$" and so on. A ^ immediately following the [ of a character set means 'form the inverse character set'. e.g. "[^0-9A-Za-z]" matches non-alphanumeric characters.
Expressions enclosed in round parens ( ). Any regular expression can be used on the lowest level by enclosing it in round brackets.
the dot . It means 'match any character'.
an identifier prefixed by a $. It refers to an already defined regular expression. e.g. "$Ident" stands for a user defined regular expression previously defined. Think of it as a regular expression enclosed in round parens, which has a name.

Operators indicating the multiplicity of the preceding element

Any of the above five basic regular expressions can be followed by one of the special characters * + ? \i

* meaning repetition (possibly zero times); e.g. "[0-9]*" not only matches "8" but also "87576" and even the empty string "".
+ meaning at least one occurrence; e.g. "[0-9]+" matches "8", "9185278", but not the empty string.
? meaning at most one occurrence; e.g. "[$_A-Z]?" matches "_", "U", "$", .. and ""
\i meaning ignore case

Catenation of regular expressions

The regular expressions described above can be catenated to form longer regular expressions. E.g. "[_A-Za-z][_A-Za-z0-9]*" is a regular expression which matches any identifier of the programming language "C", namely the first character must be alphabetic or an underscore and the following characters must be alphanumeric or an underscore. "[0-9]*\.[0-9]+" describes a floating point number with an arbitrary number of digits before the decimal point and at least one digit following the decimal point. (The decimal point must be preceded by a backslash, otherwise the dot would mean 'accept any character at this place'). "(Hallo (,how are you\?)?)\i" matches "Hallo" as well as "Hallo, how are you?" in a case insensitive way.

Alternative regular expressions

Finally - on the top level - regular expressions can be separated by the | character. The two regular expressions on the left and right side of the | are alternatives, meaning that either the left expression or the right expression should match the source text. E.g. "[0-9]+ | [A-Za-z_][A-Za-z_0-9]*" matches either an integer or a "C"-identifier.

A complex example

The programming language "C" defines a floating point constant in the following way: A floating point constant has the following parts: An integer part, a decimal point, a fraction, an exponential part beginning with e or E followed by an optional sign and digits and an optional type suffix formed by one the characters f, F, l, L. Either the integer part or the fractional part can be absent (but not both). Either the decimal point or the exponential part can be absent (but not both).

The corresponding regular expression is quite complex, but it can be simplified by using the following definitions:

$Int = "[0-9]+".
$Frac= "\.[0-9]+".
$Exp = "([Ee](\+|-)?[0-9]+)".

So we get the following expression for a floating point constant:

($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]?
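For readers without the REXI library at hand, the same pattern can be tried with the standard library's std::regex. This is a swapped-in technique, not the article's parser: std::regex has no $Name dictionary and does not skip spaces, so the sketch below assembles the pattern from plain strings. The names `kInt`, `kFrac`, `kExp` and `is_c_float_constant` are made up for this example.

```cpp
#include <regex>
#include <string>

// True if text as a whole is a floating point constant in the sense of
// ($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]?
bool is_c_float_constant(const std::string& text) {
    // String equivalents of the $Int, $Frac and $Exp definitions.
    static const std::string kInt = "[0-9]+";
    static const std::string kFrac = "\\.[0-9]+";
    static const std::string kExp = "(?:[Ee](?:\\+|-)?[0-9]+)";
    static const std::regex pattern(
        "(?:(?:" + kInt + ")?" + kFrac + kExp + "?" +   // $Int? $Frac $Exp?
        "|" + kInt + "\\." + kExp + "?" +               // $Int \. $Exp?
        "|" + kInt + kExp + ")" +                       // $Int $Exp
        "[fFlL]?");                                     // optional type suffix
    return std::regex_match(text, pattern);
}
```

Under this sketch "3.1415", "31415e-4", "1." and ".5f" are accepted, while a plain integer such as "42" is rejected because it has neither a decimal point nor an exponent.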

posted @ 2013-01-08 19:45 by 不會飛的鳥

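[Repost] Reference notes on using curl from C/C++ on Linux

Summary: A business requirement called for accessing the Internet through a proxy using curl. After some searching on Baidu and testing, it works as expected. Recorded here for future reference. example: 1. http://curl.haxx.se/libcurl/c/example.html 2. http://www.libcurl.org/book: 1. http://www.linuxdevcente...

posted @ 2012-11-12 23:51 by 不會飛的鳥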

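[Repost] Fetion robot with V4 protocol support released (20101205002, supports both service and command modes)

Someone on the forum has questioned whether the Fetion robot pops up CAPTCHA images in order to make money; see this thread: http://bbs.it-adv.net/viewthread.php?tid=1096 ("Fetion robot, please learn from 360").

To be clear: the CAPTCHA is required by the official Fetion server; it is not something my program does. If you don't believe it, capture and analyze the packets yourself. Please don't protest about the CAPTCHA on the forum; if you want to protest, call 10086. The command-line version of the Fetion robot is free to use forever.

China Mobile announcement (http://feixin.10086.cn/bulletin/2521/1): after November 20, support for Fetion 3.5 and earlier versions will end. This new robot release supports the V4 protocol; if you use the original 09 version of the robot, please test and upgrade promptly.

Version numbers of the new Fetion robot start at 20101113002. All earlier versions will stop working after November 20.

1. When a CAPTCHA is required, the CAPTCHA image is generated automatically and the user can type in the recognized code by hand (this solves the earlier 422 problem; see the Q&A for how to read and enter the code).
2. The behavior on a CAPTCHA challenge can be configured: exit (--exit-on-verifycode=1) or wait for manual input (if the robot runs in the background, waiting for input makes the program wait indefinitely).
3. On the first run the configuration data is cached (file name: <login account>.cache); later runs load the cache automatically, which speeds up sending.
4. The Tui3 (推立方, http://www.tui3.com) SMS protocol is integrated. To reach China Unicom or China Telecom phones, or when the Fetion server is unavailable, this client can send directly (note: this is a paid service; see http://www.tui3.com/page/tuixin for details).

As with earlier releases, reply to the thread first, then download.

>> Installation <<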
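This is a portable program: no installation is needed; download, unpack and run.
1. Download the Windows or Linux robot executable from the attachments, as needed (you must reply to the thread before downloading; sorry for the hassle), and unpack it. Note: the path of the directory holding the executable must not contain spaces, as in c:\Program files\...
Downloads:

Windows version: fetion.rar (156.81 KB)
Linux version: fetion (491.63 KB)

  2. Download the robot support libraries (skip this if you used an earlier robot version) and unpack the archive into the same directory as the executable.
Windows version: http://www.it-adv.net/fetion/win32dll_20101113.rar
Linux version: http://www.it-adv.net/fetion/linuxso_20101113.rar (built on Redhat4; users of other Linux distributions, please test)
64-bit Linux (centos5.4) version: http://www.it-adv.net/fetion/cenos54X64_20101113.rar (thanks to QQ user "走過你的風" for providing it. On 64-bit Centos5.4, using the Linux libraries above produces a Segmentation fault and the program exits abnormally)

(Why separate downloads? Because the robot executable is updated often, while the support libraries are not.)
Note for Linux users: do not copy the lib* files from the support libraries into /usr/lib. Depending on your distribution, that may overwrite core libraries on your machine and cause serious system problems. Instead, unpack the libraries into the same directory as the executable and run it as LD_LIBRARY_PATH=. ./fetion

>> Usage <<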

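   The following options supply the login account and password (three ways: mobile number + password, Fetion ID + password, or config file + index)

--mobile=[mobile number]    login mobile number
--sid=[Fetion ID]           login Fetion ID
--pwd=[password]            login password
--config=[file name]        file that stores mobile numbers and passwords
--index=[index]             index of the entry in that file

The following options specify the recipient and the message
--to=[mobile number/Fetion ID/URI]
   Mobile number, Fetion ID or URI of the recipient. If you know the recipient's URI, you only need to be in their friend list; they do not need to be in yours.
   Several recipients are supported, separated by commas
--msg-utf8=[message]
   message to send, UTF8 encoded
--msg-gb=[message]
   message to send, GB encoded
--file-utf8=[file in utf8 format]
   send the contents of a file
--file-gb=[file in gb format]
   send the contents of a file
--msg-type=[0/1/2]
   message type: 0 normal message, 1 long message, 2 smart SMS

   Utility
   --query-cmcc-no  look up China Mobile number ranges

   The following options are optional
--debug
   show debugging information
--hide
   log in as invisible
--exit-on-verifycode
   when the server requires a CAPTCHA, exit (1) or wait for the user to type in the recognized code (the default)

--proxy-ip=IP of the HTTP proxy
--proxy-port=port of the HTTP proxy
(the robot needs an HTTP CONNECT proxy; the widely used ccproxy supports this)

>> Examples <<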

  Examples on Windows:
fetion --mobile=13711123456 --pwd=mypwd --to=137xxxxxxxx --msg-gb=測試
fetion --sid=6630321 --pwd=mypwd --to=137xxxxxxxx --msg-gb=測試
fetion --config=sample.conf --index=1 --to=137xxxxxxxx --msg-gb=測試

  On Linux, use the following commands:
LD_LIBRARY_PATH=. ./fetion --mobile=13711123456 --pwd=mypwd --to=137xxxxxxxx --msg-utf8=測試
LD_LIBRARY_PATH=. ./fetion --sid=6630321 --pwd=mypwd --to=137xxxxxxxx --msg-utf8=測試
LD_LIBRARY_PATH=. ./fetion --config=sample.conf --index=1 --to=137xxxxxxxx --msg-utf8=測試


To put a line break in a message, use \n

  // Contents of sample.conf; lines starting with # are comments

  # This config file is for fetion robot tool.
# Usage demo: ./fetion --config=/etc/fetion.conf --index=1
# ID Mobile  Password
1  137xxxx  1234234

Using the paid Tui3 SMS service:
  fetion --mobile=recipient's mobile number --t3key=Tui3 APIKEY --msg-gb=message in gbk encoding (or --msg-utf8=message in utf8 encoding)
  Tui3 APIKEY: register as a member on the Tui3 website (http://www.tui3.com/); 10 free messages are granted on registration. After configuring the product you can obtain the key.

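Tips for the new version:
1. With a complex password (digits + letters + symbols), the CAPTCHA does not appear (that is what I observed in testing).
2. The first time you use the robot, a CAPTCHA may appear even with a complex password; after that it no longer does.
3. If you call this program from another program, pass --exit-on-verifycode=1, otherwise the program may wait indefinitely. With this option, when a CAPTCHA is required the program ends with exit code 29 (read it with $? on Linux, or %ERRORLEVEL% on Windows).
4. Make sure the directory is writable.

Additional Q&A:
1. It does not run on WIN2003: thanks to lvjinhua (post #32) for the fix: "For the win2003 problem, install vcredist_x86.exe from vs2008 sp1 (http://www.microsoft.com/downloads/en/confirmation.aspx?familyid=a5c84275-3b97-4ab7-a40d-3802b2af5fc2&displaylang=en) and it works!"
2. Chinese text is garbled at run time: the program writes UTF8 on Linux and GBK on Windows, so check the encoding of your console. Garbled Chinese does not affect usage anyway; the message in question only asks you to open the image file and type in the code shown in it.
3. How do I enter a password containing special characters such as & | : on Windows? Escape them with ^; for example, if the password contains &, type ^&
4. After entering a CAPTCHA once, will I be asked again? At present, no (once you have entered it, later logins do not ask). There is no guarantee, though, that the Fetion server will not change its rules and force you to enter one again (for example if it considers your account abnormal, or if you log in or send messages very frequently).
5. How do I enter the CAPTCHA? The Fetion robot is a console program and cannot display images, so open the generated image in an image viewer and read it. If you are on Linux without X, download the image to a Windows machine to view it. Then type in what you read.
6. Error 494: send U to 12520 to lift the restriction.

Changelog:
20101205002: service mode supported
20101115005: FIXBUG: in some environments the CAPTCHA could not be fetched, with the message: getpiccodev4 return error xml (thanks to QQ user 五斗米 for helping)
20101113002: initial version with support for the latest Fetion V4 protocol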
服務模式開發使用指南

在服務模式下，飛信機器人將長期在線，可以用來構造交互性的機器人應用。具體演示可以加藍色理想網站飛信機器人：806908614。

服務模式運行方法：
fetion --mobile=手機號 --pwd=密碼
fetion --sid=飛信號 --pwd=密碼

服務模式支持電子郵箱注冊的飛信號碼

服務模式開發使用資料導航：

1.飛信機器人服務版配置視頻教程：http://bbs.it-adv.net/viewthread.php?tid=188&extra=page%3D1
（該視頻教程為windows版本，linux版本和此類似）
2. 飛信機器人框架配置指南：http://bbs.it-adv.net/viewthread.php?tid=174&extra=page%3D1
3. 自帶演示框架數據庫說明及操作指南（PDF）：http://bbs.it-adv.net/viewthread.php?tid=172&extra=page%3D1
4. 插件原理：http://bbs.it-adv.net/viewthread.php?tid=28&extra=page%3D1
5. 機器人PHP框架及數據庫SQL文件：http://www.it-adv.net/fetion/downng/plugins_sql.rar
6. 控制指令集：http://wiki.blueidea.com/index.php?title=%E9%A3%9E%E4%BF%A1%E6%9C%BA%E5%99%A8%E4%BA%BA/%E6%8E%A7%E5%88%B6%E6%8C%87%E4%BB%A4%E9%9B%86
7. 事件插件：http://wiki.blueidea.com/index.php?title=%E9%A3%9E%E4%BF%A1%E6%9C%BA%E5%99%A8%E4%BA%BA/%E4%BA%8B%E4%BB%B6%E6%8F%92%E4%BB%B6%E5%BC%80%E5%8F%91%E8%AF%B4%E6%98%8E
8.藍色理想飛信機器人WIKI：http://wiki.blueidea.com/index.php?title=%E9%A3%9E%E4%BF%A1%E6%9C%BA%E5%99%A8%E4%BA%BA

服務版本未授權的用戶使用時，有如下限制：
1、不支持加好友請求事件（handle_contact_request）
2、不支持刪除指令（buddy-delete）
3、不支持獲取好友信息指令（contact-info）
4、發送消息時，后面有網站信息

新版變化：
1、新版插件第一個參數傳遞的飛信號碼（以前傳遞的是手機號,由此帶來的問題是：如果還是用原來的框架，那么生成的cmd文件，前面是飛信號。但是飛信機器人主程序認的命令文件是手機號_id.cmd, 所以，請修改相應代碼，把飛信號換成手機號）
2、handle_contact_request，傳遞的userid(之前是uri)
3、buddy-delete 使用 userid
4、accept_contact_request: 使用 userid
5、buddy-data: 新增加一個字段：carrier-region，例如：CN.bj.10.

升級注意事項：
1、以前使用飛信機器人服務版框架的朋友進行升級時一定注意：因插件的第一個參數由以前的手機號改成了飛信號， plugin_buddy_data 中一段代碼需要刪除，否則會造成好友等幾個數據表清空。
2、V4協議中，用戶所屬城市信息由以前的省+市改成了 carrier-region ，plugin_buddy_data 中的 province 和 city 不在有效

posted @ 2012-11-10 10:36 不會飛的鳥閱讀(1153) | 評論 (5) | 編輯收藏

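[Repost] A string replacement function for STL string

The standard C++ string class has fewer functions than CString and is less powerful. But if you want to work with multi-byte strings in a Unicode build, you cannot use CString, so I wrote a Replace function similar to CString's.
string replace( const string& inStr, const char* pSrc, const char* pReplace )
{
     string str = inStr;
    string::size_type stStart = 0;
    string::iterator iter = str.begin();
    while( iter != str.end() )
    {
        // Find the start of the next occurrence, searching from the given position.
        string::size_type st = str.find( pSrc, stStart );
        if ( st == str.npos )
        {
            break;
        }
        iter = iter + st - stStart;
        // Replace the whole occurrence.
        str.replace( iter, iter + strlen( pSrc ), pReplace );
        iter = iter + strlen( pReplace );
        // Position of the character following the replacement.
        stStart = st + strlen( pReplace );
    }
    return str;
}

This version runs into problems when executing replace( "h h h h h h h h h h h h h h h h h h h ", " ", " " ). A likely cause is that str.replace may reallocate the string's buffer, which invalidates iter.
Here is a second approach, which works with positions instead of iterators:
string CComFunc::replace( const string& inStr, const char* pSrc, const char* pReplace )
{
    string strSrc = inStr;
    string::size_type pos=0;
    string::size_type srclen = strlen( pSrc );
    string::size_type dstlen = strlen( pReplace );
    // An empty search string would match at every position and loop forever.
    if( srclen == 0 )
    {
        return strSrc;
    }
    while( (pos=strSrc.find(pSrc, pos)) != string::npos)
    {
        strSrc.replace(pos, srclen, pReplace);
        pos += dstlen;
    }
    return strSrc;
}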
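The position-based approach can also be written as a free function that is easy to try out on its own. The sketch below is one way to do that; the name `replace_all` and the std::string parameters are choices made for this example, not part of the original class.

```cpp
#include <string>

// Replace every occurrence of `from` in `input` with `to`, byte by byte.
// The search resumes after the inserted text, so a replacement that
// contains `from` (e.g. "a" -> "aa") does not loop forever.
std::string replace_all(const std::string& input, const std::string& from,
                        const std::string& to) {
    if (from.empty()) {
        return input;  // nothing sensible to search for
    }
    std::string result = input;
    std::string::size_type pos = 0;
    while ((pos = result.find(from, pos)) != std::string::npos) {
        result.replace(pos, from.size(), to);
        pos += to.size();
    }
    return result;
}
```

For example, `replace_all("h h h", " ", "  ")` doubles every space, and `replace_all("aaa", "a", "aa")` yields "aaaaaa". Like the method above, it compares bytes, so it is only safe for single-byte text or for encodings where one character's bytes cannot be mistaken for another character.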

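Addendum: testing showed that the method above still fails for replace( "暴", "\\", "==" ).
On a Japanese system, "暴" occupies two bytes while "\\" occupies one, and that one byte has the same value as the second byte of "暴".
The find function of string compares byte by byte, so it replaces that byte and corrupts the text.
The comparison therefore has to be made character by character, not byte by byte. Testing showed that CString's Replace does not have this problem, so I rewrote the replace function along the lines of CString's.
CString changes character width between _MBCS and _UNICODE builds, whereas my replace function targets string only.
The code is as follows (it needs <mbstring.h>, <string.h>, <vector> and <algorithm>):
string CComFunc::replace( const string& inStr, const char* pSrc, const char* pReplace )
{
    int nSourceLen = (int)strlen(pSrc);
    if (nSourceLen == 0)
    {
        return inStr;
    }
    int nReplacementLen = (int)strlen(pReplace);
    int nOldLength = (int)inStr.length();

    // First pass: count the occurrences so the working buffer can be sized.
    int nCount = 0;
    const unsigned char* pScan = (const unsigned char*)inStr.c_str();
    while ((pScan = _mbsstr(pScan, (const unsigned char*)pSrc)) != NULL)
    {
        nCount++;
        pScan += nSourceLen;
    }
    if (nCount == 0)
    {
        return inStr;
    }

    // Work in a private buffer that is large enough for the final string;
    // writing through c_str() would overflow when the replacement is longer.
    int nNewLength = nOldLength + (nReplacementLen - nSourceLen) * nCount;
    vector<char> buf(max(nOldLength, nNewLength) + 1, '\0');
    memcpy(&buf[0], inStr.c_str(), nOldLength);
    LPSTR lpch = &buf[0];
    LPSTR lpszStart = lpch;
    LPSTR lpszTarget;

    // Second pass: the actual replacement.
    while ((lpszTarget = (CHAR*)_mbsstr(( const unsigned char * )lpszStart, ( const unsigned char * )pSrc)) != NULL)
    {
        // Number of bytes that follow the matched text.
        int nBalance = nOldLength - (int)(lpszTarget - lpch + nSourceLen);
        memmove(lpszTarget + nReplacementLen, lpszTarget + nSourceLen,
            nBalance * sizeof(CHAR));
        memcpy(lpszTarget, pReplace, nReplacementLen*sizeof(CHAR));
        lpszStart = lpszTarget + nReplacementLen;
        lpszStart[nBalance] = '\0';
        nOldLength += (nReplacementLen - nSourceLen);
    }
    return string(lpch);
}

The key to this method is the _mbsstr function, which is declared in the header "MBSTRING.H".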
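The trailing-byte problem can be reproduced without a Japanese system. The sketch below assumes that Shift-JIS encodes "暴" as the two bytes 0x96 0x5C (0x5C being the ASCII code of the backslash) and that lead bytes lie in 0x81-0x9F and 0xE0-0xFC. The helper names `is_sjis_lead_byte` and `sjis_find` are hypothetical; the code is a portable illustration of a character-aware search, not a replacement for _mbsstr.

```cpp
#include <string>

// True if c is the first byte of a two-byte Shift-JIS character
// (assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC).
bool is_sjis_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Find needle in Shift-JIS text, matching only at character boundaries.
// Returns std::string::npos when there is no such match.
std::string::size_type sjis_find(const std::string& text,
                                 const std::string& needle) {
    if (needle.empty()) {
        return std::string::npos;
    }
    std::string::size_type i = 0;
    while (i + needle.size() <= text.size()) {
        if (text.compare(i, needle.size(), needle) == 0) {
            return i;
        }
        // Step over a whole character: two bytes after a lead byte, else one.
        i += is_sjis_lead_byte(static_cast<unsigned char>(text[i])) ? 2 : 1;
    }
    return std::string::npos;
}
```

For the two-byte string 0x96 0x5C, a plain `find("\\")` reports a match at offset 1, in the middle of the character, whereas `sjis_find` reports no match. That difference is what separates the byte-wise and character-wise replace functions.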

posted @ 2012-10-26 16:20 by 不會飛的鳥
