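Take parsing HTML as an example: Beautiful Soup is much more convenient to use than libtidy, though that may simply be because <div>Beautiful Soup wraps so much of the work for you.<br />
Here is an example using Beautiful Soup: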
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('List.htm'), 'html.parser')
for a in soup.find_all('a', class_='link'):
    print(a.get('href'))
The goal is to find every a node in the HTML whose class attribute is link and print the string value of its href attribute.
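To try this without a List.htm file on disk, here is a minimal sketch that parses an inline HTML string instead. The sample markup and the variable names `html` and `hrefs` are made up for illustration; only `BeautifulSoup`, `find_all` and `get` come from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for List.htm.
html = """
<html><body>
  <a class="link" href="/first">first</a>
  <a class="other" href="/skipped">skipped</a>
  <div><a class="link" href="/nested">nested</a></div>
  <a class="link">no href here</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# a.get('href') returns None when the attribute is missing,
# so drop those entries before printing.
hrefs = [a.get("href") for a in soup.find_all("a", class_="link")]
hrefs = [h for h in hrefs if h is not None]
print(hrefs)
```

Note that find_all searches the whole tree, so the nested link inside the div is found without any explicit recursion.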
With C++ and libtidy,
the corresponding code is as follows:
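#include <stdio.h>
#include <string.h>
#include <tidy.h>   /* may be <tidy/tidy.h>, depending on how libtidy is installed */

/* Report filter: returning no suppresses libtidy's diagnostic messages. */
Bool TIDY_CALL tidyFilterCb(TidyDoc tdoc, TidyReportLevel lvl, uint line, uint col, ctmbstr mssg)
{
    return no;
}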
void extractContent(TidyNode node, TidyDoc doc);
void parseContent(TidyNode node, TidyDoc doc)
{
    TidyNode child;
    for (child = tidyGetChild(node); child; child = tidyGetNext(child))
    {
        if (tidyNodeIsA(child))
            extractContent(child, doc);
        else
            parseContent(child, doc);
    }
}
void extractContent(TidyNode node, TidyDoc doc)
{
    if (yes == tidyNodeIsA(node))
    {
        TidyAttr cls = tidyAttrGetCLASS(node);
        if (cls != NULL)
        {
            /* tidyAttrValue can return NULL for an attribute without a value */
            ctmbstr value = tidyAttrValue(cls);
            if (value != NULL && !strcmp(value, "link"))
            {
                TidyAttr href = tidyAttrGetHREF(node);
                if (href != NULL)
                {
                    ctmbstr link = tidyAttrValue(href);
                    printf("link:%s\n", link ? link : "");
                    return;
                }
            }
        }
    }
    /* not a matching link: keep searching below this node */
    parseContent(node, doc);
}
void tidyParseHtml(const char* file)
{
    TidyDoc doc = tidyCreate();
    tidySetReportFilter(doc, tidyFilterCb);
    tidyParseFile(doc, file);
    TidyNode body = tidyGetBody(doc);
    /* Start at body itself so that a nodes that are direct children of body are checked too. */
    if (body)
        parseContent(body, doc);
    tidyRelease(doc);
}
That is still rather verbose.
Of course, the following Python code also gets the job done:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('List.htm'), 'html.parser')
links = soup.select('a[class="link"]')
for a in links:
    if a.has_attr('href'):
        print(a.get('href'))
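One difference between the two Python versions is worth knowing. As far as I understand Beautiful Soup's behavior, `class_='link'` matches any a tag that has link among its classes, while the attribute selector `a[class="link"]` only matches when the whole class attribute is exactly link (which is also what the strcmp in the libtidy version checks). A small sketch, with made-up markup and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second tag carries two classes.
html = (
    '<a class="link" href="/one">1</a>'
    '<a class="link external" href="/two">2</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword search: matches each tag whose class list contains "link".
by_keyword = [a["href"] for a in soup.find_all("a", class_="link")]

# Attribute selector: compares against the full attribute value.
by_attr = [a["href"] for a in soup.select('a[class="link"]')]

# CSS class selector: again matches on a single class token.
by_class = [a["href"] for a in soup.select("a.link")]

print(by_keyword, by_attr, by_class)
```

If the pages you are scraping put several classes on one tag, `a.link` is probably the selector you want.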
If you want to analyze web pages, I think Beautiful Soup is definitely a great tool.<br />Link: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

]]>