Wikipedia:
Everyone knows Wikipedia. Its dumps can be downloaded from http://dumps.wikimedia.org/ , and a detailed guide is here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
But Wikipedia is only one project of the Wikimedia Foundation. Wikimedia hosts several other important projects, including:
Wiktionary: a semantically linked dictionary, similar in form to WordNet
Wikiquote: a collection of quotations from famous people
Wikibooks: free textbooks and manuals
Wikinews: a large body of news
Wikiversity: free educational materials
Wikisource: free source texts
All of the above can be downloaded from http://dumps.wikimedia.org/ .
There are also some smaller wiki projects, for example:
http://simple.wikipedia.org a wiki written in Basic English, aimed at children and beginners
http://simple.wiktionary.org a Wiktionary written in Basic English
There are many ways to process Wikipedia data; these are the two I recommend most:
jwpl: http://code.google.com/p/jwpl/
wikipedia-miner: http://wikipedia-miner.cms.waikato.ac.nz/wiki/
Next, another wiki site, this one commercial: http://www.wikia.com . Its users can create their own standalone wikis. Here are the top 250 Wikia sites:
http://wikis.wikia.com/wiki/List_of_Wikia_wikis
Wikia content is also available for download: http://community.wikia.com/wiki/Help:Database_download
Freebase:
I won't explain what Freebase is; here are the download addresses for its data:
http://wiki.freebase.com/wiki/Data_dumps Freebase's own data
http://wiki.freebase.com/wiki/WEX data that Freebase extracted from Wikipedia
YAGO2:
http://www.mpi-inf.mpg.de/yago-naga/yago/
dbpedia:
http://www.dbpedia.org
If you are looking for Linked Data, go here: http://www.thedatahub.org , which collects a lot of Linked Data
http://linkeddata.org/ has a diagram showing how the various Linked Data sets relate to each other and how influential they are.
If you are looking for web APIs of all kinds, go here: http://www.programmableweb.com
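Foreign governments are now opening up their data one after another. Here are open data sets from several governments:
http://data.gov.au Australia
http://data.dc.gov District of Columbia, USA
http://www.data.gov USA
http://data.gov.uk UK
http://databases.lapl.org/ open data sets for the Los Angeles area; now you know why Silicon Valley is so strong
http://www.gov.hk/en/theme/psi/welcome the Hong Kong government has also released a lot of data
Compare for a moment: foreign governments have done this much practical work, so what are those good-for-nothings in the Great Hall of the People doing?
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexAccess/current/web/download.html lexicon published by the US National Institutes of Health
http://www.census.gov/genealogy/www/data/2000surnames/index.html surname data from the US Census Bureau
https://www.cia.gov/library/publications/download/ the factbook published by the US Central Intelligence Agency, describing every country in the world
When even the health institutes, the census bureau and the CIA contribute this much to America's information infrastructure, we should see how large the gap between us and the US really is.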
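Thesauri:
http://www.nlm.nih.gov/mesh/filelist.html MeSH, a controlled vocabulary for medicine
http://id.loc.gov/download/ thesauri published by the US Library of Congress
Some triple data sets:
http://www.cs.utexas.edu/users/pclark/dart/ collected from the BNC (British National Corpus) and Reuters, ?3 million entries
http://reverb.cs.washington.edu/ a University of Washington project, ?5 million entries
http://www.cs.washington.edu/research/sherlock-hornclauses/ roughly ? to 3 million entries
http://www.cs.rochester.edu/research/knext ?.35 million entries, from the BNC and the Brown corpus
http://rtw.ml.cmu.edu/rtw/resources the Read the Web project; the data set is fairly small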
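Dictionaries:
http://wordnet.princeton.edu/ English WordNet
http://nlpwww.nict.go.jp/wn-ja/index.en.html Japanese WordNet
http://alpage.inria.fr/~sagot/wolf-en.html French WordNet
http://wordnet.ru/ Russian WordNet
http://cl.haifa.ac.il/projects/mwn/index.shtml Hebrew WordNet
http://wordnet.dk/dannet/menu?item=2 Danish WordNet
http://grial.uab.es/sensem/download?idioma=en Spanish WordNet
http://www.ling.helsinki.fi/en/lt/research/finnwordnet/download.shtml Finnish WordNet
All of these WordNets are free to download. It is infuriating that China, a great ancient civilization of five thousand years with a sea of literature and classical allusions, does not have a single free, public machine-readable dictionary. This is a disgrace for the Chinese language, for China, and for the Chinese nation. Especially you people at the Institute of Computing Technology and the Institute of Automation of the Chinese Academy of Sciences, what do you think? (Incidentally, best wishes to HowNet: may business boom and sales keep climbing.)
http://dico.fj.free.fr/dico.php Japanese-French dictionary
http://www.csse.monash.edu.au/~jwb/edict.html Japanese-English dictionary
http://cc-cedict.org/wiki/start Chinese-to-English dictionary. Finally something for Chinese, but sadly it was built by foreigners.
https://framenet.icsi.berkeley.edu based on frame semantics; probably not really a dictionary, but there was nowhere else to put it.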
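Corpora:
http://opus.lingfil.uu.se/ open parallel corpora
http://opus.lingfil.uu.se/OpenSubtitles_v2.php download address for a large collection of movie subtitles
http://www.statmt.org/europarl parallel corpus of the European Parliament
http://www.anc.org/OANC/ the Open American National Corpus
http://snap.stanford.edu/data/ Stanford University's SNAP project; it crawled a lot of data, but the data is rather old and only of research value.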
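I had never used any of this software before, so every piece has to be installed....
(1) Apache configuration
On Debian, once installation finishes, the configuration files shipped with the package live under /etc/apache2: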
tony@tonybox:/etc/apache2$ ls -l
total 72
-rw-r--r-- 1 root root 12482 2006-01-16 18:15 apache2.conf
drwxr-xr-x 2 root root 4096 2006-06-30 13:56 conf.d
-rw-r--r-- 1 root root 748 2006-01-16 18:05 envvars
-rw-r--r-- 1 root root 268 2006-06-30 13:56 httpd.conf
-rw-r--r-- 1 root root 12441 2006-01-16 18:15 magic
drwxr-xr-x 2 root root 4096 2006-06-30 13:56 mods-available
drwxr-xr-x 2 root root 4096 2006-06-30 13:56 mods-enabled
-rw-r--r-- 1 root root 10 2006-06-30 13:56 ports.conf
-rw-r--r-- 1 root root 2266 2006-01-16 18:15 README
drwxr-xr-x 2 root root 4096 2006-06-30 13:56 sites-available
drwxr-xr-x 2 root root 4096 2006-06-30 13:56 sites-enabled
drwxr-xr-x 2 root root 4096 2006-01-16 18:15
Among these:
apache2.conf
is the main configuration file of the apache2 server. Looking inside it, you will find the following:
# Include module configuration:
Include /etc/apache2/mods-enabled/*.load
Include /etc/apache2/mods-enabled/*.conf
# Include all the user configurations:
Include /etc/apache2/httpd.conf
# Include ports listing
Include /etc/apache2/ports.conf
# Include generic snippets of statements
Include /etc/apache2/conf.d/[^.#]*
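As this shows, apache2 splits its configuration into separate files by function, which makes it easier to manage.
conf.d
holds additional configuration snippets. By default only the charset snippet is provided: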
tony@tonybox:/etc/apache2/conf.d$ cat charset
AddDefaultCharset UTF-8
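If needed, the default encoding can be changed to GB2312, so that the file reads: AddDefaultCharset GB2312
httpd.conf
is an empty file.
magic
contains data for the mod_mime_magic module; it normally does not need to be modified.
ports.conf
is the configuration file that sets the IP addresses and ports the server listens on: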
tony@tonybox:/etc/apache2$ cat ports.conf
Listen 80
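mods-available
contains .conf and .load files: the configuration files for loading each of the modules available on the system. The mods-enabled directory contains symbolic links pointing to these files. As apache2.conf shows, the server loads modules through mods-enabled; in other words, only those files in mods-available that have a symbolic link in mods-enabled are loaded. The system also provides two commands, a2enmod and a2dismod, for maintaining these links. Both come from the apache2-common package, and their syntax is very simple: a2enmod [module] or a2dismod [module]
sites-available
contains the configuration files of the sites that have been set up. The sites-enabled directory contains symbolic links pointing to these files, and the server enables sites through those links. Each link in sites-enabled carries a numeric prefix, such as 000-default, which determines the startup order: the smaller the number, the higher the priority. The system provides two commands, a2ensite and a2dissite, for maintaining these links. Both come from the apache2-common package.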
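The link layout described above can be inspected with a few lines of Python. This is only a minimal sketch: the helper name `enabled_sites` is made up for illustration, and it assumes that "startup order" corresponds to the lexical order of the link names, which is what the numeric prefix suggests.

```python
import os


def enabled_sites(sites_enabled_dir):
    """Return (link_name, resolved_target) pairs, sorted by link name.

    Assumption: the server reads the links in lexical order, so a prefix
    such as 000- sorts ahead of 010-.
    """
    result = []
    for name in sorted(os.listdir(sites_enabled_dir)):
        path = os.path.join(sites_enabled_dir, name)
        if os.path.islink(path):
            # realpath follows the symbolic link into sites-available
            result.append((name, os.path.realpath(path)))
    return result
```

On a Debian box, `enabled_sites("/etc/apache2/sites-enabled")` should list each enabled site next to the file in sites-available that it points to.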
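/var/www
By default, the web pages to be published should be placed under /var/www. This default can be changed with the DocumentRoot option in the main configuration file.
Unpack mediawiki directly into Apache's tree (that is, under /var/www) and rename the unpacked directory to wiki.
Then open localhost/wiki and run the MediaWiki installer, which creates the database wikidb. It contains 41 tables. Before importing data, first empty the three tables page, revision and text: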
delete from page;
delete from revision;
delete from text;
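The XML database dump of a wiki in any language can be downloaded from http://dumps.wikimedia.org/backup-index.html . The downloaded file has a name like enwiki-20061130-pages-articles.xml.bz2 (the English edition); the wikis refresh their dumps roughly every two months.
Install mediawiki. Download the mediawiki source code; if the official site is blocked, it can be downloaded from the Chinese site www.allwiki.com . Unpack it into a directory that your Apache can serve and make its config directory world-writable (mode 777). Then open config/index.php in a browser and go through the setup; this generates a file named LocalSettings.php in the config directory, which you copy to the parent directory. Finally, don't forget to restore the original permissions on the config directory.
Import the file into the database.
Command: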
java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5
enwiki-20061130-pages-articles.xml.bz2 | mysql -u wikiuser -p wikidb
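Before running an import that can take hours, it can be useful to peek into the dump. The following is a rough sketch using only the Python standard library; the function name `iter_page_titles` is made up here, and it assumes the dump follows the usual MediaWiki export layout of `<page>` elements that each contain a `<title>`.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_page_titles(dump_path):
    """Yield the <title> text of every <page> in a .xml.bz2 dump.

    The file is decompressed and parsed as a stream, so the whole dump
    never has to fit in memory.
    """
    with bz2.open(dump_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Export files declare an XML namespace, so compare local names.
            if elem.tag.rsplit("}", 1)[-1] != "page":
                continue
            for child in elem:
                if child.tag.rsplit("}", 1)[-1] == "title":
                    yield child.text
                    break
            elem.clear()  # drop the finished page's subtree
```

For example, `itertools.islice(iter_page_titles("enwiki-20061130-pages-articles.xml.bz2"), 10)` would give the first ten titles without reading the rest of the file.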
See also: http://fuhao-987.iteye.com/blog/1044933
http://jgs80.blog.163.com/blog/static/3566265320076177435762/
Note: This information comes from "Bracketing Guidelines for Treebank II Style Penn Treebank Project" - part of the documentation that comes with the Penn Treebank.
Contents:
Bracket Labels
Clause Level
Phrase Level
Word Level
Function Tags
Form/function discrepancies
Grammatical role
Adverbials
Miscellaneous
Index of All Tags
Bracket Labels
Clause Level
S - simple declarative clause, i.e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion.
SBAR - Clause introduced by a (possibly empty) subordinating conjunction.
SBARQ - Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ.
SINV - Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.
SQ - Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.
Phrase Level
ADJP - Adjective Phrase.
ADVP - Adverb Phrase.
CONJP - Conjunction Phrase.
FRAG - Fragment.
INTJ - Interjection. Corresponds approximately to the part-of-speech tag UH.
LST - List marker. Includes surrounding punctuation.
NAC - Not a Constituent; used to show the scope of certain prenominal modifiers within an NP.
NP - Noun Phrase.
NX - Used within certain complex NPs to mark the head of the NP. Corresponds very roughly to N-bar level but used quite differently.
PP - Prepositional Phrase.
PRN - Parenthetical.
PRT - Particle. Category for words that should be tagged RP.
QP - Quantifier Phrase (i.e. complex measure/amount phrase); used within NP.
RRC - Reduced Relative Clause.
UCP - Unlike Coordinated Phrase.
VP - Verb Phrase.
WHADJP - Wh-adjective Phrase. Adjectival phrase containing a wh-adverb, as in how hot.
WHADVP - Wh-adverb Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing a wh-adverb such as how or why.
WHNP - Wh-noun Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing some wh-word, e.g. who, which book, whose daughter, none of which, or how many leopards.
WHPP - Wh-prepositional Phrase. Prepositional phrase containing a wh-noun phrase (such as of which or by whose authority) that either introduces a PP gap or is contained by a WHNP.
X - Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.
Word level
CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun (prolog version PRP-S)
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun (prolog version WP-S)
WRB - Wh-adverb
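To see how the bracket labels and word-level tags fit together, here is a small sketch that reads one bracketed tree and lists its (word, tag) pairs. It is an illustration only: `parse_tree` and `tagged_words` are made-up helper names, the example sentence is invented, and the parser assumes every bracket has a label (files that wrap each tree in an extra unlabeled pair of parentheses would need that wrapper stripped first).

```python
def parse_tree(text):
    """Parse a bracketing such as "(S (NP (DT The) (NN cat)) ...)".

    Returns nested (label, children) tuples; words are plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_words(node):
    """Collect (word, word-level tag) pairs from left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


tree = parse_tree(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
print(tagged_words(tree))
```

Clause-level and phrase-level labels (S, NP, VP, PP) appear on the inner nodes, while the word-level tags sit directly above each word.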
Function tags
Form/function discrepancies
-ADV (adverbial) - marks a constituent other than ADVP or PP when it is used adverbially (e.g. NPs or free ("headless") relatives). However, constituents that themselves are modifying an ADVP generally do not get -ADV. If a more specific tag is available (for example, -TMP) then it is used alone and -ADV is implied. See the Adverbials section.
-NOM (nominal) - marks free ("headless") relatives and gerunds when they act nominally.
Grammatical role
-DTV (dative) - marks the dative object in the unshifted form of the double object construction. If the preposition introducing the "dative" object is for, it is considered benefactive (-BNF). -DTV (and -BNF) is only used after verbs that can undergo dative shift.
-LGS (logical subject) - is used to mark the logical subject in passives. It attaches to the NP object of by and not to the PP node itself.
-PRD (predicate) - marks any predicate that is not VP. In the do so construction, the so is annotated as a predicate.
-PUT - marks the locative complement of put.
-SBJ (surface subject) - marks the structural surface subject of both matrix and embedded clauses, including those with null subjects.
-TPC ("topicalized") - marks elements that appear before the subject in a declarative sentence, but in two cases only:
if the fronted element is associated with a *T* in the position of the gap.
if the fronted element is left-dislocated (i.e. it is associated with a resumptive pronoun in the position of the gap).
-VOC (vocative) - marks nouns of address, regardless of their position in the sentence. It is not coindexed to the subject and does not get -TPC when it is sentence-initial.
Adverbials
Adverbials are generally VP adjuncts.
-BNF (benefactive) - marks the beneficiary of an action (attaches to NP or PP).
This tag is used only when (1) the verb can undergo dative shift and (2) the prepositional variant (with the same meaning) uses for. The prepositional objects of dative-shifting verbs with other prepositions than for (such as to or of) are annotated -DTV.
-DIR (direction) - marks adverbials that answer the questions "from where?" and "to where?" It implies motion, which can be metaphorical as in "...rose 5 pts. to 57-1/2" or "increased 70% to 5.8 billion yen". -DIR is most often used with verbs of motion/transit and financial verbs.
-EXT (extent) - marks adverbial phrases that describe the spatial extent of an activity. -EXT was incorporated primarily for cases of movement in financial space, but is also used in analogous situations elsewhere. Obligatory complements do not receive -EXT. Words such as fully and completely are absolutes and do not receive -EXT.
-LOC (locative) - marks adverbials that indicate place/setting of the event. -LOC may also indicate metaphorical location. There is likely to be some variation in the use of -LOC due to differing annotator interpretations. In cases where the annotator is faced with a choice between -LOC or -TMP, the default is -LOC. In cases involving SBAR, SBAR should not receive -LOC. -LOC has some uses that are not adverbial, such as with place names that are adjoined to other NPs and NAC-LOC premodifiers of NPs. The special tag -PUT is used for the locative argument of put.
-MNR (manner) - marks adverbials that indicate manner, including instrument phrases.
-PRP (purpose or reason) - marks purpose or reason clauses and PPs.
-TMP (temporal) - marks temporal or aspectual adverbials that answer the questions when, how often, or how long. It has some uses that are not strictly adverbial, such as with dates that modify other NPs at S- or VP-level. In cases of apposition involving SBAR, the SBAR should not be labeled -TMP. Only in "financialspeak," and only when the dominating PP is a PP-DIR, may temporal modifiers be put at PP object level. Note that -TMP is not used in possessive phrases.
Miscellaneous
-CLR (closely related) - marks constituents that occupy some middle ground between arguments and adjuncts of the verb phrase. These roughly correspond to "predication adjuncts", prepositional ditransitives, and some "phrasal verbs". Although constituents marked with -CLR are not strictly speaking complements, they are treated as complements whenever it makes a bracketing difference. The precise meaning of -CLR depends somewhat on the category of the phrase.
on S or SBAR - These categories are usually arguments, so the -CLR tag indicates that the clause is more adverbial than normal clausal arguments. The most common case is the infinitival semi-complement of use, but there are a variety of other cases.
on PP, ADVP, SBAR-PRP, etc - On categories that are ordinarily interpreted as (adjunct) adverbials, -CLR indicates a somewhat closer relationship to the verb. For example:
Prepositional Ditransitives
In order to ensure consistency, the Treebank recognizes only a limited class of verbs that take more than one complement (-DTV and -PUT and Small Clauses). Verbs that fall outside these classes (including most of the prepositional ditransitive verbs in class [D2]) are often associated with -CLR.
Phrasal verbs
Phrasal verbs are also annotated with -CLR or a combination of -PRT and PP-CLR. Words that are considered borderline between particle and adverb are often bracketed with ADVP-CLR.
Predication Adjuncts
Many of Quirk's predication adjuncts are annotated with -CLR.
on NP - To the extent that -CLR is used on NPs, it indicates that the NP is part of some kind of "fixed phrase" or expression, such as take care of. Variation is more likely for NPs than for other uses of -CLR.
-CLF (cleft) - marks it-clefts ("true clefts") and may be added to the labels S, SINV, or SQ.
-HLN (headline) - marks headlines and datelines. Note that headlines and datelines always constitute a unit of text that is structurally independent from the following sentence.
-TTL (title) - is attached to the top node of a title when this title appears inside running text. -TTL implies -NOM. The internal structure of the title is bracketed as usual.
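Since function tags are appended to a bracket label with hyphens, a node label can be taken apart mechanically. The sketch below is an assumption-laden illustration: `split_label` is a made-up helper, it treats a trailing all-digit part as a coindexation number, and it leaves hyphen-delimited special labels (for instance -NONE-, which the real Treebank uses for empty elements but which is not covered in the list above) untouched.

```python
def split_label(label):
    """Split a node label into (category, function_tags, index).

    "NP-SBJ-1"   -> ("NP", ["SBJ"], "1")
    "PP-LOC-CLR" -> ("PP", ["LOC", "CLR"], None)
    """
    # Labels such as -NONE- start and end with a hyphen; keep them whole.
    if label.startswith("-") and label.endswith("-"):
        return label, [], None
    parts = label.split("-")
    category, rest = parts[0], parts[1:]
    index = None
    if rest and rest[-1].isdigit():
        index = rest.pop()  # trailing number is assumed to be a coindex
    return category, rest, index


for example in ("NP-SBJ-1", "PP-LOC-CLR", "S-TPC-2", "VP"):
    print(example, split_label(example))
```

Word-level tags that contain other punctuation, such as PRP$, pass through unchanged because only hyphens are treated as separators.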
Index of All Tags
ADJP
-ADV
ADVP
-BNF
CC
CD
-CLF
-CLR
CONJP
-DIR
DT
-DTV
EX
-EXT
FRAG
FW
-HLN
IN
INTJ
JJ
JJR
JJS
-LGS
-LOC
LS
LST
MD
-MNR
NAC
NN
NNS
NNP
NNPS
-NOM
NP
NX
PDT
POS
PP
-PRD
PRN
PRP
-PRP
PRP$ or PRP-S
PRT
-PUT
QP
RB
RBR
RBS
RP
RRC
S
SBAR
SBARQ
-SBJ
SINV
SQ
SYM
-TMP
TO
-TPC
-TTL
UCP
UH
VB
VBD
VBG
VBN
VBP
VBZ
-VOC
VP
WDT
WHADJP
WHADVP
WHNP
WHPP
WP
WP$ or WP-S
WRB
X
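Recall and precision are two very important concepts and metrics in the design of search engines (and other retrieval systems).
Recall (召回率), also called the "completeness rate" (查全率).
Precision (准确率), also called "accuracy" (精度, 正确率).
When retrieving documents from a large collection, every document in the collection falls into one of four classes: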
              | Relevant | Not relevant
Retrieved     | A        | B
Not retrieved | C        | D
A: retrieved and relevant (found, and wanted)
B: retrieved but not relevant (found, but useless)
C: not retrieved but relevant (not found, yet actually wanted)
D: not retrieved and not relevant (not found, and useless anyway)
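We usually want as many of the relevant documents in the database as possible to be retrieved. This is the pursuit of recall, A/(A+C): the larger the better.
At the same time we want the retrieved documents to contain as many relevant ones and as few irrelevant ones as possible. This is the pursuit of precision, A/(A+B): the larger the better.
To summarize:
Recall: relevant documents retrieved / all relevant documents in the collection
Precision: relevant documents retrieved / all documents retrieved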
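The two ratios can be computed directly from the set of retrieved documents and the set of relevant ones. This is a minimal sketch; the function name `precision_recall` and the document ids are invented for the example, and returning 0.0 for an empty denominator is a convention chosen here, not something the definitions prescribe.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # A: retrieved and relevant
    b = len(retrieved - relevant)  # B: retrieved but not relevant
    c = len(relevant - retrieved)  # C: relevant but not retrieved
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 6 relevant documents exist, 2 of them overlap.
print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8}))
```

Class D (neither retrieved nor relevant) does not enter either formula, which is why the size of the whole collection is never needed.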
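Recall and precision are not necessarily tied to each other (as the formulas above show), yet on large collections the two metrics constrain one another.
Because the retrieval strategy is never perfect, loosening it so that more relevant documents are retrieved usually brings in some irrelevant results as well, which hurts precision.
Conversely, removing irrelevant documents from the results means making the retrieval strategy stricter, which in turn causes some relevant documents to be missed, which hurts recall.
Any retrieval or selection task on a large collection involves these two metrics. Since they constrain each other, we usually pick a suitable degree of strictness for the retrieval strategy according to our needs, neither too strict nor too loose, looking for a balance point between recall and precision. Where that balance point lies depends on the concrete requirements.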
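The trade-off can be made visible with a toy example in which the "retrieval strategy" is just a score threshold. Everything here is invented for illustration: the function name `sweep_thresholds`, the document ids, the scores and the relevance judgments.

```python
def sweep_thresholds(scored_docs, relevant, thresholds):
    """Return (threshold, precision, recall) rows.

    A document counts as retrieved when its score is >= the threshold, so a
    lower threshold stands for a looser retrieval strategy.
    """
    rows = []
    for t in thresholds:
        retrieved = {doc for doc, score in scored_docs.items() if score >= t}
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        rows.append((t, precision, recall))
    return rows


scores = {"d1": 0.9, "d2": 0.8, "d3": 0.6, "d4": 0.4, "d5": 0.2}
relevant = {"d1", "d2", "d4"}

for t, p, r in sweep_thresholds(scores, relevant, [0.7, 0.5, 0.1]):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With this toy data the strictest threshold retrieves only relevant documents but misses one, while the loosest threshold finds every relevant document at the cost of also returning the irrelevant ones.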
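Precision is actually fairly easy to grasp. The one that is hard to react to quickly is recall, rendered in Chinese as 召回率. I think this has to do with the wording: the meaning cannot be read directly off the literal sense of 召回.
I don't think 召回率 is a good translation. In Chinese, 召回 means calling something back, as when Sony batteries turn out to be faulty and the manufacturer recalls them.
Since the translation is poor, let's go back to the English word "recall". Besides the sense above, "order sth to return", it also means "remember".
Recall: the ability to remember sth. that you have learned or sth. that has happened in the past.
This must be the sense intended here, and with it the term becomes much easier to understand.
When we ask a retrieval system for all the details of some matter (by entering a query), recall refers to how many of those details the system can "recollect"; put plainly, it is the "ability to remember". The number of details it can recollect divided by all the details the system knows about the matter is the "memory rate", that is, recall.
Thinking of it this way makes it much easier.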
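For the definition of a Markov chain, see: http://zh.wikipedia.org/wiki/%E9%A6%AC%E5%8F%AF%E5%A4%AB%E9%8F%88
A hidden Markov model is an extension of the Markov chain above: the state s_t at any time t is not observable. At each time t the model emits a symbol, and that symbol depends on s_t and only on s_t; this is called the output independence assumption. For successful applications of hidden Markov models, see chapter 5 of Wu Jun's 《数学之美》 (The Beauty of Mathematics).
Er, it's almost time for work, so that will do as a summary. Back to coding......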
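As a concrete illustration of hidden states and state-dependent output, the sketch below computes the probability of an observed symbol sequence with the standard forward algorithm. The post itself does not describe this algorithm; it is brought in here only as an example, and the two-state weather model with all its probabilities is a made-up toy.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of the observation sequence, summed over all hidden paths.

    Each emitted symbol depends only on the current hidden state
    (the output independence assumption).
    """
    # alpha[s] = P(symbols so far, current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model: the weather is hidden, only the activity is observed.
STATES = ("Rainy", "Sunny")
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

print(forward_probability(("walk", "shop", "clean"), STATES, START_P, TRANS_P, EMIT_P))
```

The dictionary `alpha` is recomputed once per time step, so the cost grows linearly with the sequence length instead of with the number of possible hidden paths.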
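Today is June 1. My task requires reading Manning's Foundations of Statistical Natural Language Processing and then using mutual information. Every time a name sounds terribly advanced to me, it turns out not to be that hard once I actually start working on it.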
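Collocations
Collocations are characterized by limited compositionality.
There are three ways to identify collocation pairs: 1. using frequency information; 2. using the mean and variance of the distance between the focal word and the collocating word; 3. using hypothesis testing and mutual information.
1. Frequency
After filtering the corpus, pair up the remaining verbs and nouns two by two and count how many times each pair occurs together within a sentence or a paragraph; that count is the frequency.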
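The counting step can be sketched with the standard library. This is a rough illustration only: `pair_frequencies` is a made-up name, the `keep` predicate is a hypothetical stand-in for the part-of-speech filter mentioned above, and each unordered pair is counted at most once per sentence.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(sentences, keep=None):
    """Count in how many sentences each unordered word pair co-occurs.

    sentences: iterable of token lists.
    keep: optional predicate deciding which tokens survive filtering.
    """
    counts = Counter()
    for tokens in sentences:
        words = sorted({w for w in tokens if keep is None or keep(w)})
        # combinations() of a sorted list yields each pair in a fixed order
        counts.update(combinations(words, 2))
    return counts


corpus = [
    ["strong", "tea", "is", "good"],
    ["she", "drinks", "strong", "tea"],
    ["a", "powerful", "computer"],
]
stopwords = {"a", "is", "she"}
freq = pair_frequencies(corpus, keep=lambda w: w not in stopwords)
print(freq.most_common(3))
```

To count within paragraphs instead of sentences, the same function can be fed one token list per paragraph.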
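2. Mean and variance
Since the distance between two words can vary, compute the mean and variance of the offset between them.
The mean is simply the average offset.
The variance measures how far the individual offsets deviate from the mean:
s^2 = sum_{i=1..n} (d_i - d_mean)^2 / (n - 1)
where n is the number of times the two words co-occur, d_i is the offset of co-occurrence i, and d_mean is the sample mean of the offsets.
We can use this information to discover collocations, by looking for word pairs with a low deviation. A low deviation means the two words usually occur at about the same distance from each other. A deviation of zero means they always occur at exactly the same distance.
The variance is thus a measure of how peaked the distribution of one word's position relative to the other is.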
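As a hedged illustration of the offset statistics, the sketch below collects the signed offsets between two words within a small window and computes their mean and sample variance (dividing by n - 1, the usual sample-variance convention). The function name `offset_stats`, the window size and the example sentences are all assumptions made for this example.

```python
def offset_stats(sentences, word_a, word_b, window=5):
    """Return (offsets, mean, variance) for word_b relative to word_a.

    An offset is position(word_b) - position(word_a), collected whenever the
    two words are at most `window` tokens apart in the same sentence.
    """
    offsets = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word_a:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] == word_b:
                    offsets.append(j - i)
    n = len(offsets)
    if n < 2:
        return offsets, None, None  # variance needs at least two offsets
    mean = sum(offsets) / n
    variance = sum((d - mean) ** 2 for d in offsets) / (n - 1)
    return offsets, mean, variance


corpus = [
    "she knocked on his door".split(),
    "they knocked at the door".split(),
    "he knocked on the heavy door".split(),
]
print(offset_stats(corpus, "knocked", "door"))
```

A pair whose offsets cluster tightly, as in this toy corpus, gets a small variance and is therefore a collocation candidate under this method.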
Mutual information
Mutual information is computed as follows:
MI(a,b) = log( p(ab) / (p(a)*p(b)) )
where the logarithm is base 2 and p(x) is the probability that x occurs.
OK, fine, simple enough.. time to start writing code.
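A first version of that code might look like the sketch below. It assumes the probabilities are estimated as simple relative frequencies (count divided by the total number of observations), which is an assumption of this example and not something the formula itself dictates; the function name and the counts are invented.

```python
import math


def mutual_information(count_ab, count_a, count_b, total):
    """MI(a,b) = log2( p(ab) / (p(a) * p(b)) ) from raw counts.

    count_ab: how often a and b occur together
    count_a, count_b: how often each occurs on its own
    total: number of observations the counts were taken from
    """
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))


# The pair occurs 40 times more often than independence would predict.
print(mutual_information(count_ab=20, count_a=100, count_b=50, total=10000))
```

A value near zero means the two words co-occur about as often as chance would predict, while a large positive value points to a strong association. `count_ab` must be greater than zero, since the logarithm of zero is undefined.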
I. Books:
1. Speech and Language Processing, English 2nd edition
2. Foundations of Statistical Natural Language Processing, English edition
3. Natural Language Processing with Python, the companion book to NLTK
4. Learning Python, 3rd edition: a classic introduction to Python, detailed and painstakingly thorough
5. Pattern Recognition in Natural Language Processing
6. The EM Algorithm and Extensions
7. The Elements of Statistical Learning
8. Natural Language Understanding, English edition (apparently only the first 9 chapters)
9. Fundamentals of Speech Recognition: the scan quality is not great, but chapter 6 on HMMs is fairly detailed, and one of the authors is Lawrence Rabiner
10. A classic introduction to probability and statistics: An Introduction to Probability Theory and Its Applications (English edition, by William Feller)
 Volume 1, Volume 2, DjVuLibre reader (needed to read the two volumes)
11. An introductory book on natural language processing with Perl and Prolog: An Introduction to Language Processing with Perl and Prolog
12. Foreign machine learning books:
 1) "Programming Collective Intelligence": "a good introductory book on machine learning and data mining from recent years; building interest is the most important step, and starting with a huge tome easily scares people away."
 2) "Machine Learning": the undisputed classic of the machine learning field; after downloading, change the file extension to pdf. Douban review (by Wang Ning): an old book by a master. By today's standards the content is not very deep, and many chapters only touch on their topic, but it suits newcomers well (though not so new that they don't know algorithms and probability). The decision tree part, for example, is excellent, and since there has been no particularly big progress in recent years it is not out of date. The book is also a grand survey of several decades of machine learning work before 1997, and its reference list is extremely valuable. There are translated and photo-reprint editions in China; I don't know whether they are out of print.
 3) "Introduction to Machine Learning"
13. Foreign data mining books:
 1) "Data.Mining.Concepts.and.Techniques.2nd": a classic data mining book. Authors: Jiawei Han / Micheline Kamber. Publisher: Morgan Kaufmann. Comment: written by ethnic Chinese scientists, it explains deep material in quite simple terms.
 2) Data Mining: Practical Machine Learning Tools and Techniques
 3) Beautiful Data: The Stories Behind Elegant Data Solutions (Toby Segaran, Jeff Hammerbacher)
14. Foreign pattern recognition books:
 1) "Pattern Recognition"
 2) "Pattern Recognition Technologies and Applications"
 3) "An Introduction to Pattern Recognition"
 4) "Introduction to Statistical Pattern Recognition"
 5) "Statistical Pattern Recognition 2nd Edition"
 6) "Supervised and Unsupervised Pattern Recognition"
 7) "Support Vector Machines for Pattern Classification"
15. Foreign artificial intelligence books:
 1) Artificial Intelligence: A Modern Approach (2nd Edition): the undisputed classic of the AI field
 2) "Paradigms of Artificial Intelligence Programming: Case Studies in Common LISP"
16. Other related books:
 1) Programming the Semantic Web, Toby Segaran, Colin Evans, Jamie Taylor
 2) Learning Python, 4th edition, English
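II. Courseware:
1. "Statistical Natural Language Processing" slides by Liu Ting, Harbin Institute of Technology;
2. "Natural Language Processing" slides by Liu Bingquan, Harbin Institute of Technology;
3. "Lectures on Computational Linguistics" slides by Liu Qun, Institute of Computing Technology, Chinese Academy of Sciences;
4. "Natural Language Understanding" slides by Zong Chengqing, Institute of Automation, Chinese Academy of Sciences;
5. "Computational Linguistics" slides by Chang Baobao, Peking University;
6. "Foundations of Chinese Information Processing" slides and related code by Zhan Weidong, Peking University;
7. "Natural Language Processing" slides by Prof. Regina Barzilay, MIT; the opening chapters have been translated on 52nlp;
8. "Machine Learning Approaches for Natural Language Processing" slides by Michael Collins of MIT;
9. "Machine Learning" slides by Michael Collins;
10. "Advanced Natural Language Processing" slides by SMT expert Philipp Koehn;
11. "Empirical Methods in Natural Language Processing" slides by Philipp Koehn;
12. "Machine Translation" slides by Philipp Koehn;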
III. Language resources and open-source tools:
1. Brown corpus:
 a) Brown corpus in XML format, with part-of-speech tags;
 b) Brown corpus in plain text format, with part-of-speech tags;
 c) merged, with blank lines and leading spaces removed, for training part-of-speech taggers: browntest.zip
2. The list of corpus resources officially provided by NLTK
3. The list of open-source natural language processing tools on OpenNLP
4. The list of "resources for statistical natural language processing and corpus-based computational linguistics" maintained by the Stanford NLP group
5. Free Chinese information processing resources from the LDC
6. Chinese word segmentation tools:
 1) A Java version of MMSEG: mmseg-v0.3.zip, by solol; for details see "Introduction to Chinese Word Segmentation: Side Chapter".
 2) Zhang Huaping's ICTCLAS2010; this version is free for non-commercial use for one year. Download address:
http://cid-51de2738d3ea0fdd.skydrive.live.com/self.aspx/.Public/ICTCLAS2010-packet-release.rar
7. A batch of news corpora contributed by the enthusiastic reader "finallyliuyu", covering Tencent, Sina, NetEase, Phoenix and others, currently hosted on CSDN: http://finallyliuyu.download.csdn.net/
 In 2010 finallyliuyu also provided a batch of text classification corpora; for details see "Chinese News Classification Corpus for Amateur NLP Enthusiasts, Part 2".
IV. Literature:
1. Complete ACL-IJCNLP 2009 proceedings:
 a) Conference full papers, volume 1
 b) Conference full papers, volume 2
 c) Conference short papers, collected
 d) EMNLP-2009 collection from ACL09
 e) All ACL09 workshop papers, collected