Typically the lexer will have a "mode" or "state" setting, which changes according
to the input. For example, on seeing a < character, the mode would change to
"tag" mode, and the lexer would tokenize appropriately until it sees a >.
Then it would enter "contents" mode, and the lexer would return all of
attr='moo' omg='wtf' as a single string. Programming language lexers, for example,
handle string literals this way:
string s1 = "y = x+5";
The y = x+5 would never be handled as a mathematical expression and then turned
back into a string. It's recognized as a string literal, because the " character
changes the lexer mode.
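As a rough illustration, here is a minimal sketch (in Python, with invented token
names) of what such a mode-switching lexer might look like. It is not taken from any
real XML library; the point is just that the same characters produce different tokens
depending on the current mode, and that '<' and '>' are what flip the mode:

def tokenize(source):
    # Hypothetical mode-switching lexer: "contents" mode outside tags, "tag" mode inside.
    tokens = []
    mode = "contents"
    i = 0
    while i < len(source):
        if mode == "contents":
            if source[i] == "<":
                tokens.append(("LT", "<"))
                mode = "tag"                   # '<' switches the lexer into tag mode
                i += 1
            else:
                # Everything up to the next '<' is one text token, even if it
                # happens to look like attr='moo' omg='wtf'.
                j = source.find("<", i)
                j = len(source) if j == -1 else j
                tokens.append(("TEXT", source[i:j]))
                i = j
        else:  # tag mode: names, '=', quoted values and '>' are separate tokens
            c = source[i]
            if c == ">":
                tokens.append(("GT", ">"))
                mode = "contents"              # '>' switches back to contents mode
                i += 1
            elif c.isspace():
                i += 1
            elif c == "=":
                tokens.append(("EQ", "="))
                i += 1
            elif c in "'\"":
                j = source.index(c, i + 1)     # naive: no escape handling
                tokens.append(("STRING", source[i + 1:j]))
                i = j + 1
            else:
                j = i
                while j < len(source) and (source[j].isalnum() or source[j] in "_-:./"):
                    j += 1
                tokens.append(("NAME", source[i:j]))
                i = j
    return tokens

print(tokenize("<tag attr='moo'>attr='moo' omg='wtf'</tag>"))

Run on that input, the attr='moo' inside the tag comes out as separate NAME, EQ and
STRING tokens, while the identical-looking text after the > comes out as one TEXT
token, simply because the lexer was in a different mode when it saw it.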
For languages like XML and HTML, it's probably easier to build a custom parser than
to use one of the parser generators like yacc, bison, or ANTLR. XML and HTML have a
different structure than programming languages, and it's programming languages that
those automatic tools are the better fit for.
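For example, a hand-written recursive-descent parser over the token stream from the
sketch above might look roughly like this (again with invented names, and ignoring
error handling, self-closing tags, comments and so on):

def parse_element(tokens, pos=0):
    # Expect: LT NAME (NAME EQ STRING)* GT ...children... LT NAME('/name') GT
    assert tokens[pos][0] == "LT"
    pos += 1
    name = tokens[pos][1]
    pos += 1
    attrs = {}
    while tokens[pos][0] == "NAME":            # attributes inside the open tag
        key = tokens[pos][1]
        assert tokens[pos + 1][0] == "EQ"
        attrs[key] = tokens[pos + 2][1]        # the STRING value
        pos += 3
    assert tokens[pos][0] == "GT"
    pos += 1
    children = []
    # Collect text and nested elements until the matching close tag.
    while not (tokens[pos][0] == "LT" and tokens[pos + 1][1] == "/" + name):
        if tokens[pos][0] == "TEXT":
            children.append(tokens[pos][1])
            pos += 1
        else:
            child, pos = parse_element(tokens, pos)
            children.append(child)
    pos += 3                                   # skip LT, NAME('/name'), GT
    return {"name": name, "attrs": attrs, "children": children}, pos

tree, _ = parse_element(tokenize("<tag attr='moo'>attr='moo' omg='wtf'</tag>"))
print(tree)  # {'name': 'tag', 'attrs': {'attr': 'moo'}, 'children': ["attr='moo' omg='wtf'"]}

The recursion mirrors the nesting of elements directly, which is why a hand-rolled
parser for this kind of structure tends to stay small and readable.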
If your parser needs to turn a list of tokens back into the string it came from,
that's a sign that something is wrong in the design. You need to parse it a different way.
How does the Lexer know that the string after the opening < must be tokenized into
separate attributes, while the string between > and the next < does not need to be?
It doesn't.
Wouldn't it need the Parser to tell it that the first string is within a tag body,
and the second case is outside a tag body?
Yes.
Generally, the lexer turns the input stream into a sequence of tokens. A token has no context -
that is, a token has the same meaning no matter where it occurs in the input stream.
Once the lexing process has completed, each token is treated as a single unit.
For XML, a generated lexer would typically identify integers, identifiers, string literals
and so on, as well as the control characters like '<' and '>', but not a whole tag. The work
of understanding what is an open tag, close tag, attribute, element, etc., is left to the
parser proper.
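As a concrete, made-up illustration: for an input like <size unit='cm'>42</size>, such
a lexer might emit a flat token stream along the lines of
LT IDENT(size) IDENT(unit) EQ STRING(cm) GT INT(42) LT SLASH IDENT(size) GT
and it is then the parser's grammar rules that recognize the first six tokens as an
open tag with one attribute, the INT as element content, and the last four tokens as
the matching close tag.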