Typically the lexer will have a "mode" or "state" setting, which changes according
to the input. For example, on seeing a < character, the mode would change to
"tag" mode, and the lexer would tokenize appropriately until it sees a >.
Then it would enter "contents" mode, and the lexer would return all of
attr='moo' omg='wtf' as a single string. Programming language lexers, for example,
handle string literals this way:
string s1 = "y = x+5";
The y = x+5 would never be handled as a mathematical expression and then turned
back into a string. It's recognized as a string literal, because the " character
changes the lexer mode.
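As a rough illustration, here is a minimal sketch (in Python, with invented token
names) of what such a mode-switching lexer might look like. It is not taken from any
real XML library; the point is just that the same characters produce different tokens
depending on the current mode, and that '<' and '>' are what flip the mode:

def tokenize(source):
    # Hypothetical mode-switching lexer: "contents" mode outside tags, "tag" mode inside.
    tokens = []
    mode = "contents"
    i = 0
    while i < len(source):
        if mode == "contents":
            if source[i] == "<":
                tokens.append(("LT", "<"))
                mode = "tag"                   # '<' switches the lexer into tag mode
                i += 1
            else:
                # Everything up to the next '<' is one text token, even if it
                # happens to look like attr='moo' omg='wtf'.
                j = source.find("<", i)
                j = len(source) if j == -1 else j
                tokens.append(("TEXT", source[i:j]))
                i = j
        else:  # tag mode: names, '=', quoted values and '>' are separate tokens
            c = source[i]
            if c == ">":
                tokens.append(("GT", ">"))
                mode = "contents"              # '>' switches back to contents mode
                i += 1
            elif c.isspace():
                i += 1
            elif c == "=":
                tokens.append(("EQ", "="))
                i += 1
            elif c in "'\"":
                j = source.index(c, i + 1)     # naive: no escape handling
                tokens.append(("STRING", source[i + 1:j]))
                i = j + 1
            else:
                j = i
                while j < len(source) and (source[j].isalnum() or source[j] in "_-:./"):
                    j += 1
                tokens.append(("NAME", source[i:j]))
                i = j
    return tokens

print(tokenize("<tag attr='moo'>attr='moo' omg='wtf'</tag>"))

Run on that input, the attr='moo' inside the tag comes out as separate NAME, EQ and
STRING tokens, while the identical-looking text after the > comes out as one TEXT
token, simply because the lexer was in a different mode when it saw it.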
For languages like XML and HTML, it's probably easier to build a custom parser than
to use one of the parser generators like yacc, bison, or ANTLR. XML and HTML have a
different structure than programming languages, and it's programming languages that
those automatic tools are the better fit for.
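For example, a hand-written recursive-descent parser over the token stream from the
sketch above might look roughly like this (again with invented names, and ignoring
error handling, self-closing tags, comments and so on):

def parse_element(tokens, pos=0):
    # Expect: LT NAME (NAME EQ STRING)* GT ...children... LT NAME('/name') GT
    assert tokens[pos][0] == "LT"
    pos += 1
    name = tokens[pos][1]
    pos += 1
    attrs = {}
    while tokens[pos][0] == "NAME":            # attributes inside the open tag
        key = tokens[pos][1]
        assert tokens[pos + 1][0] == "EQ"
        attrs[key] = tokens[pos + 2][1]        # the STRING value
        pos += 3
    assert tokens[pos][0] == "GT"
    pos += 1
    children = []
    # Collect text and nested elements until the matching close tag.
    while not (tokens[pos][0] == "LT" and tokens[pos + 1][1] == "/" + name):
        if tokens[pos][0] == "TEXT":
            children.append(tokens[pos][1])
            pos += 1
        else:
            child, pos = parse_element(tokens, pos)
            children.append(child)
    pos += 3                                   # skip LT, NAME('/name'), GT
    return {"name": name, "attrs": attrs, "children": children}, pos

tree, _ = parse_element(tokenize("<tag attr='moo'>attr='moo' omg='wtf'</tag>"))
print(tree)  # {'name': 'tag', 'attrs': {'attr': 'moo'}, 'children': ["attr='moo' omg='wtf'"]}

The recursion mirrors the nesting of elements directly, which is why a hand-rolled
parser for this kind of structure tends to stay small and readable.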
If your parser needs to turn a list of tokens back into the string it came from,
that's a sign that something is wrong in the design. You need to parse it a different way.
How does the Lexer know that the string after the opening < must be tokenized into
separate attributes, while the string between > and the next < does not need to be?
It doesn't.
Wouldn't it need the Parser to tell it that the first string is within a tag body,
and the second case is outside a tag body?
Yes.
Generally, the lexer turns the input stream into a sequence of tokens. A token has no context -
that is, a token has the same meaning no matter where it occurs in the input stream.
Once the lexing process has completed, each token is treated as a single unit.
For XML, a generated lexer would typically identify integers, identifiers, string literals
and so on, as well as the control characters like '<' and '>', but not a whole tag. The work
of understanding what is an open tag, close tag, attribute, element, etc., is left to the
parser proper.
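As a concrete, made-up illustration: for an input like <size unit='cm'>42</size>, such
a lexer might emit a flat token stream along the lines of
LT IDENT(size) IDENT(unit) EQ STRING(cm) GT INT(42) LT SLASH IDENT(size) GT
and it is then the parser's grammar rules that recognize the first six tokens as an
open tag with one attribute, the INT as element content, and the last four tokens as
the matching close tag.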