對(duì)于hand written的lexical analyzer來(lái)說(shuō),NFA和DFA的運(yùn)用是不可避免的,除非你的grammer十分簡(jiǎn)單。
一旦給出了source program(也就是你想處理的character stream)的一個(gè)pattern的正則表達(dá)式,就可以構(gòu)造對(duì)應(yīng)的NFA,然后轉(zhuǎn)換為DFA,這個(gè)DFA就可以用來(lái)處理你的source program, 將里面能夠match這個(gè)pattern的lexeme全都找出來(lái)。按照這樣的流程,對(duì)于一種編程語(yǔ)言,不管是常用的語(yǔ)言,還是腳本語(yǔ)言,只要對(duì)所有的pattern構(gòu)造DFA,就能夠?qū)懗鲎约旱膌exical analyzer了。
有兩篇關(guān)于正則表達(dá)式到DFA的文章寫(xiě)的很好:
1.Writing own regular expression parser
By Amer Gerzic英文的
http://www.codeproject.com/KB/recipes/OwnRegExpressionsParser.aspx
有源碼
2.
《構(gòu)造正則表達(dá)式引擎》新鮮出爐啦!中文的,by vczh,華南理工大學(xué)
http://www.shnenglu.com/vczh/archive/2008/05/22/50763.html
閱讀完上面兩篇文章,寫(xiě)個(gè)能夠運(yùn)行的lexer就不成問(wèn)題了。
另外附上龍書(shū)(Compilers, principles techniques and tools)里一段token,pattern和lexeme術(shù)語(yǔ)的區(qū)別:
1. A t o k e n is a pair consisting of a token name and an optional attribute
value. The token name is
an abstract symbol representing a kind of
lexical unit(lexeme), e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the
parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token. notes:
1. more than one lexeme can match a pattern
2. 看看example 3.1