《編程之美》讀書(shū)筆記24: 3.5 最短摘要的生成
當(dāng)初看這道題時(shí),看了好了幾遍都沒(méi)看懂。后來(lái)總算弄明白:給出的字符串是用其它程序分好詞的,關(guān)鍵字符串也是用其它程序分好詞的,而不是按用戶(hù)直接輸入的字符串。比如書(shū)上給的例子:“微軟亞洲研究院 使命”,不是按空格分成兩個(gè)關(guān)鍵詞,“微軟亞洲研究院”和“使命”,而是按其它程序分成:“微軟”、“亞洲”、“研究院”和“使命”四個(gè)關(guān)鍵詞。
“最短摘要”應(yīng)該是指:包含所有關(guān)鍵字(關(guān)鍵字不要求按用戶(hù)輸入的順序排列)的長(zhǎng)度最短的摘要。書(shū)上的解法,把“最短摘要”理解成包含所有關(guān)鍵字且詞個(gè)數(shù)最少的摘要。
弄清了問(wèn)題,解決起來(lái)就很簡(jiǎn)單:
1 反復(fù)讀入字符串,直到碰到關(guān)鍵字(可以用set或unordered_set)。
2 更新該關(guān)鍵字字符串最近出現(xiàn)的位置。
3 若已經(jīng)找到所有的關(guān)鍵字,根據(jù)這些關(guān)鍵字的位置最小/最大值,計(jì)算摘要長(zhǎng)度
可以用set來(lái)維護(hù)這些位置值。
(實(shí)際上,只要求維護(hù)位置的最小值,還可以自行實(shí)現(xiàn)一個(gè)堆結(jié)構(gòu),節(jié)省空間。)
根據(jù)位置值計(jì)算長(zhǎng)度,需要先計(jì)算出分詞后的字符串,在未分詞的字符串的位置。
4 記錄長(zhǎng)度最短的摘要
若有m個(gè)關(guān)鍵字,待查詢(xún)字符串有n個(gè),時(shí)間復(fù)雜度大概為:O(n*log m)
(關(guān)鍵字一般都很短,可以認(rèn)為對(duì)關(guān)鍵字間的比較、計(jì)算哈希值時(shí)間復(fù)雜度為O(1))
另外,將關(guān)鍵字映射到數(shù)字,減少字符串比較,能進(jìn)一步提高效率。

最短摘要
#include<iostream>
#include<vector>
#include<string>
#include<set>

#define USE_CPLUSPLUS0x 1

#if USE_CPLUSPLUS0x
#include<unordered_map>
#else
#include<map>
#endif

#include<cassert>

using std::string;
using std::vector;
using std::cout;


/**//* //----
#if USE_CPLUSPLUS0x
struct Cmp{
bool operator()(const string* a, const string* b) const { return *a == *b; }
} ;
struct Hash{
size_t operator()(const string* a) const { return std::hash<string>()(*a);}
} ;
typedef std::unordered_map<const string*, size_t, Hash, Cmp> Dict;
#else
struct Cmp{
bool operator()(const string* sa, const string * sb) const { return *sa < *sb; }
} ;

typedef std::map<const string*, size_t, Cmp> Dict;
#endif
*/
#if USE_CPLUSPLUS0x
typedef std::unordered_map<string, size_t> Dict;
#else
typedef std::map<string, size_t> Dict;
#endif


template<typename T>
void output(T beg, T end, const char str[] = "", const char sep[] = "")


{
cout << str;
while (beg != end) cout << *beg++ << sep;
cout << "\n";
}

void solve(const string str[], const size_t str_sz,
const string keyword[], const size_t keyword_sz)


{
if (str_sz == 0 || keyword_sz == 0) return;
Dict key_idx; //關(guān)鍵字映射為數(shù)字,以減少字符串比較

for (size_t i = 0; i < keyword_sz; ++i)
{
//key_idx.insert(Dict::value_type(&keyword[i], i)); //----
key_idx.insert(Dict::value_type(keyword[i], i));
}
const size_t Pos_init = -1;
vector<size_t> prev_pos(keyword_sz, Pos_init); //上次碰到的關(guān)鍵字符串的位置
std::set<size_t> pos; //對(duì)關(guān)鍵字符串的位置進(jìn)行升序排列
vector<size_t> old_pos; //分詞后的字符串,在原字符串中的位置
old_pos.reserve(str_sz + 1);
old_pos.push_back(0);

for (size_t i = 0, sum = 0; i < str_sz; ++i)
{
sum += str[i].size();
old_pos.push_back(sum);
}
size_t beg = 0, end = 0, len = -1, count = 0; //記錄結(jié)果

for (size_t i = 0; i < str_sz; ++i)
{
//Dict::iterator it = key_idx.find(&str[i]); //----
Dict::iterator it = key_idx.find(str[i]);

if (it == key_idx.end()) continue;
assert(it->second < prev_pos.size());
size_t& last_pos = prev_pos[it->second];
if (last_pos != Pos_init) pos.erase(last_pos); //更新前要先刪除舊的
else ++count; //已匹配關(guān)鍵字 數(shù)加1
last_pos = i;
pos.insert(pos.end(), i);
assert(count <= keyword_sz);

if (count == keyword_sz)
{ //包含所有關(guān)鍵字
size_t pbeg = *pos.begin();
//std::set<size_t>::iterator it = pos.end();
//size_t pend = *--it + 1;
size_t pend = i + 1;
assert(pbeg < old_pos.size());
assert(pend < old_pos.size());
size_t plen = old_pos[pend] - old_pos[pbeg];

if (plen < len)
{
len = plen;
beg = pbeg;
end = pend;
}
}
}
output(&str[0], &str[str_sz], "words: ");
output(&keyword[0], &keyword[keyword_sz], "keywords: ", " ");

if (beg == end && !pos.empty())
{ //沒(méi)找到所有關(guān)鍵字
cout << "(" << pos.size() << "/" << keyword_sz << " Matched) ";
beg = *pos.begin();
std::set<size_t>::iterator it = pos.end();
end = *--it + 1;
}
output(&str[beg], &str[end], "result: ");
cout << "\n";
}


template<typename T, size_t sz>

inline size_t getsz(T (&)[sz])
{ return sz; }

int main()


{

string keyword[] =
{ "微軟", "計(jì)算機(jī)", "亞洲", "中國(guó)"};

string str[] =
{
"微軟","亞洲","研究院","成立","于","1998","年",",","我們","的","使命",
"是","使","未來(lái)","的","計(jì)算機(jī)","能夠","看","、","聽(tīng)","、","學(xué)",",",
"能","用","自然語(yǔ)言","與","人類(lèi)","進(jìn)行","交流","。","在","此","基礎(chǔ)","上",
",","微軟","亞洲","研究院","還","將","促進(jìn)","計(jì)算機(jī)","在","亞太","地區(qū)",
"的","普及",",","改善","亞太","用戶(hù)","的","計(jì)算","體驗(yàn)","。","”"
};

solve(str, getsz(str), keyword, getsz(keyword));
}
作者:
flyinghearts 出處:
http://www.cnblogs.com/flyinghearts/ 本文采用
知識(shí)共享署名-非商業(yè)性使用-相同方式共享 2.5 中國(guó)大陸許可協(xié)議進(jìn)行許可,歡迎轉(zhuǎn)載,但未經(jīng)作者同意必須保留此段聲明,且在文章頁(yè)面明顯位置給出原文連接,否則保留追究法律責(zé)任的權(quán)利。