For a hand-written lexical analyzer, NFAs and DFAs are unavoidable unless your grammar is very simple.
Once you have a regular expression for a pattern in the source program (i.e., the character stream you want to process), you can construct the corresponding NFA and then convert it to a DFA. That DFA can then be run over your source program to find every lexeme that matches the pattern. Following this procedure, for any programming language — mainstream or scripting — as long as you construct a DFA for each of its patterns, you can write your own lexical analyzer.
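As a concrete illustration of the NFA-to-DFA step, here is a minimal sketch (my own code, not from the articles below) that applies subset construction to the Dragon Book's classic NFA for the pattern `(a|b)*abb`, then runs the resulting DFA over input strings. The state numbering and alphabet are assumptions for this example.

```python
# NFA for (a|b)*abb: state 0 loops on 'a'/'b', path 0 -a-> 1 -b-> 2 -b-> 3,
# with state 3 accepting. Transitions map state -> symbol -> set of states;
# an 'eps' key (unused here) would hold epsilon moves.
nfa = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {'b': {3}},
    3: {},
}
start, accepting, alphabet = 0, {3}, 'ab'

def eps_closure(states):
    # Follow epsilon edges until no new state appears.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa[s].get('eps', ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction():
    # Each DFA state is a frozenset of NFA states.
    d_start = eps_closure({start})
    dfa, work = {}, [d_start]
    while work:
        S = work.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for c in alphabet:
            move = set()
            for s in S:
                move |= nfa[s].get(c, set())
            T = eps_closure(move)
            dfa[S][c] = T
            work.append(T)
    return d_start, dfa

def matches(text):
    # Run the DFA: one table lookup per input character.
    d_start, dfa = subset_construction()
    S = d_start
    for c in text:
        S = dfa[S][c]
    return bool(S & accepting)
```

Running `matches("aabb")` gives `True`, while `matches("ab")` gives `False`, since the DFA only accepts strings ending in `abb`.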
Two articles that cover regular expressions to DFAs very well:

1. Writing own regular expression parser, by Amer Gerzic (in English, with source code):
http://www.codeproject.com/KB/recipes/OwnRegExpressionsParser.aspx

2. "Building a Regular Expression Engine" (《構造正則表達式引擎》), hot off the press, in Chinese, by vczh of South China University of Technology:
http://www.shnenglu.com/vczh/archive/2008/05/22/50763.html
After reading those two articles, writing a working lexer should be no problem.
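To show the overall shape of such a lexer, here is a minimal hand-written sketch (independent of the articles above). Each branch is effectively a tiny hard-coded DFA walked by hand; the token names `ID`, `NUM`, and `OP` are illustrative, not taken from any particular language spec.

```python
def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1  # skip whitespace between lexemes
        elif c.isalpha() or c == '_':
            # DFA for identifiers: letter/underscore, then letters/digits/underscores.
            j = i + 1
            while j < len(src) and (src[j].isalnum() or src[j] == '_'):
                j += 1
            tokens.append(('ID', src[i:j]))
            i = j
        elif c.isdigit():
            # DFA for integer literals: one or more digits.
            j = i + 1
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(('NUM', src[i:j]))
            i = j
        elif c in '+-*/=()':
            tokens.append(('OP', c))
            i += 1
        else:
            raise SyntaxError(f'unexpected character {c!r} at position {i}')
    return tokens
```

For example, `tokenize("x1 = 42 + y")` returns `[('ID', 'x1'), ('OP', '='), ('NUM', '42'), ('OP', '+'), ('ID', 'y')]` — each tuple is a (token name, lexeme) pair, matching the Dragon Book terminology quoted below.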
Also, here is the Dragon Book's (Compilers: Principles, Techniques, and Tools) explanation of the difference between the terms token, pattern, and lexeme:
1. A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit (lexeme), e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Notes:
1. More than one lexeme can match a pattern.
2. See Example 3.1 in the book.
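Note 1 above can be shown in a few lines (my own illustration, not the book's Example 3.1). The identifier pattern below is a common choice, written with Python's `re` module; the token name `id` follows the Dragon Book's convention:

```python
import re

# Pattern for the token 'id' (identifier): a letter or underscore,
# followed by any number of letters, digits, or underscores.
id_pattern = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

source = "count = count + total"
lexemes = id_pattern.findall(source)
# Two distinct lexemes ('count' and 'total') match the single pattern,
# and every occurrence is reported as an instance of the same token 'id'.
tokens = [('id', lx) for lx in lexemes]
```

Here `lexemes` is `['count', 'count', 'total']`: one pattern, one token name, but several lexemes.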