No preamble this time, let's just look at a piece of code from LingosHook first~
int CHtmlDictParser::HtmlDataType1Proc(const std::wstring &html, const std::wstring &dictid, const TinyHtmlParser::CDocumentObject &doc, const TinyHtmlParser::CElementObject *dict, const TinyHtmlParser::CElementObject *pdiv, const HtmlDictParser::TDictResult &res, TResultMap &result) const
{
    // Descend two levels below pdiv to reach the first entry.
    const TinyHtmlParser::CElementObject *p = pdiv->child;
    if(p == NULL)
        return -1;
    p = p->child;

    while(p != NULL)
    {
        // Every node on the path to the headword must exist; otherwise stop quietly.
        if(p->child == NULL || p->child->child == NULL || p->child->child->sibling == NULL
            || p->child->child->sibling->child == NULL || p->child->child->sibling->child->child == NULL
            || p->child->child->sibling->child->child->child == NULL)
            return 0;
        std::wstring word = p->child->child->sibling->child->child->child->value;
        if(PushResult(word, res, result) != 0)
            return -1;

        // Skip one sibling between entries, so advance two siblings at a time.
        if(p->sibling == NULL || p->sibling->sibling == NULL)
            break;
        p = p->sibling->sibling;
    }
    return 0;
}
This function extracts the words from the HTML data below. But doesn't that if statement make your head spin?
<div id="dict_body_7AB175CC5F622A44A0DECE976AF22A16">
<div id="dict_gls_7AB175CC5F622A44A0DECE976AF22A16">
<div style="MARGIN: 5px 0px">
<div style="WIDTH: 100%">
<div style="FLOAT: left; LINE-HEIGHT: normal">
<img height="11" src=
"file:///C:/Program%20Files/Lingoes/Translator2.7/dict/image/entry_p.png"
width="10" align="absmiddle" border="0">
</div>
<div style="OVERFLOW-X: hidden; WIDTH: 100%">
<div style=
"MARGIN: 0px 0px 5px; COLOR: #808080; LINE-HEIGHT: normal">
<span style=
"FONT-SIZE: 10.5pt; COLOR: #000000; LINE-HEIGHT: normal"><b>
AC</b></span>
</div>
<div style="MARGIN: 0px 0px 5px">
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
公元前
</div>
</div>
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
<font color="navy">[計]</font> 存取周期,
累加器, 聲耦合器, 交流, 應用控制,
自動檢查, 自動計算機
</div>
</div>
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
<font color="navy">[化]</font> 交流; 交變電流
</div>
</div>
</div>
</div>
</div>
</div>
<div style=
"PADDING-RIGHT: 0px; BORDER-TOP: #c7d4dc 1px solid; PADDING-LEFT: 0px; PADDING-BOTTOM: 0px; PADDING-TOP: 5px">
</div>
<div style="MARGIN: 5px 0px">
<div style="WIDTH: 100%">
<div style="FLOAT: left; LINE-HEIGHT: normal">
<img height="11" src=
"file:///C:/Program%20Files/Lingoes/Translator2.7/dict/image/entry_p.png"
width="10" align="absmiddle" border="0">
</div>
<div style="OVERFLOW-X: hidden; WIDTH: 100%">
<div style=
"MARGIN: 0px 0px 5px; COLOR: #808080; LINE-HEIGHT: normal">
<span style=
"FONT-SIZE: 10.5pt; COLOR: #000000; LINE-HEIGHT: normal"><b>
Ac.</b></span>
</div>
<div style="MARGIN: 0px 0px 5px">
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
<font color="navy">[醫]</font> 錒(89號元素)
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
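As an aside, the long null-check chain in HtmlDataType1Proc could probably be tamed with a small path-walking helper. What follows is only a sketch, not what LingosHook actually does: `Node` is a hypothetical stand-in for TinyHtmlParser::CElementObject (only `child`, `sibling` and `value` are modelled), and `Walk` and `EntryWord` are names made up for this illustration. The path string "ccsccc" spells out the same steps as p->child->child->sibling->child->child->child.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for TinyHtmlParser::CElementObject; only the three
// members used by HtmlDataType1Proc are modelled here.
struct Node {
    const Node *child = nullptr;
    const Node *sibling = nullptr;
    std::wstring value;
};

// Follows a path such as "ccsccc" from start: 'c' moves to the first child,
// 's' to the next sibling. Returns nullptr as soon as a step is missing or
// the path contains an unknown character.
const Node *Walk(const Node *start, const char *path)
{
    assert(path != nullptr);
    const Node *p = start;
    for (; p != nullptr && *path != '\0'; ++path) {
        if (*path == 'c')
            p = p->child;
        else if (*path == 's')
            p = p->sibling;
        else
            return nullptr;
    }
    return p;
}

// Extracts the headword of one entry, mirroring the chain
// p->child->child->sibling->child->child->child->value.
// Returns false (leaving word untouched) if any node on the path is missing.
bool EntryWord(const Node *entry, std::wstring &word)
{
    const Node *b = Walk(entry, "ccsccc");
    if (b == nullptr)
        return false;
    word = b->value;
    return true;
}
```

With something like this, the six-way if inside the loop would shrink to a single `if(!EntryWord(p, word)) return 0;`, and a second dictionary layout would only need a different path string. Whether that is worth it depends on how much the layouts really differ.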
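The result data from Lingoes looks very regular, but internally there are very subtle differences. To improve LingosHook's recognition ability, I have no choice but to analyze this data very carefully and work out its patterns. The whole analysis reminds me of last year, when I was cracking WOW's MPQ files. Painful! If you are interested, take a look at the screenshots here~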
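Most of the dictionaries tested so far fall into two categories, and I have already written a separate function to handle each. Based on the "decryption" results I am optimizing the HTML processing to make it as fast and as general as possible. Once I have checked the results from a few more dictionaries, I should be able to release an update in a couple of days. Sigh, I'm exhausted. Luckily nothing has "come up" at work these past few days, because "decryption" eats up a huge amount of time~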