flex lexer - (F) Lex, how do I match negation? -
some language grammars use negations in rules. example, in dart specification following rule used:
~('\'|'"'|'$'|newline)
which means match not 1 of rules inside parenthesis. now, know in flex can negate character rules (ex: [^ab] , of rules want negate more complicated single character don't think use character rules that. example may need negate sequence '"""' multiline strings i'm not sure way in flex be.
(tl;dr: skip down bottom practical answer.)
the inverse of regular language regular language. in theory possible write inverse of regular expression regular expression. unfortunately, not easy.
the """
case, @ least, not difficult.
first, let's clear trying match.
strictly speaking "not """
" mean "any string other """
". include, example, x"""
.
so might tempting we're looking "any string not contain """
". (that is, inverse of .*""".*
). that's not quite correct either. typical usage tokenise input like:
"""this string might contain " or ""."""
if start after initial """
, longest string doesn't contain """
, find:
this string might contain " or "".""
whereas wanted was:
this string might contain " or "".
so turns out need "any string not end "
, doesn't contain """
", conjunction of 2 inverses: (~.*" ∧ ~.*""".*)
it's (relatively) easy produce state diagram that:
(note difference between above , state diagram "any string not contain """
" in state diagram, states accepting, , in 1 states 1 , 2 not accepting.)
now, challenge turn regular expression. there automated techniques doing that, regular expressions produce long , clumsy. case simple, though, because there 1 accepting state , need describe paths can end in state:
([^"]|\"([^"]|\"[^"]))*
this model work simple string, it's little more complicated when string not sequence of same character. example, suppose wanted match strings terminated end
rather """
. naively modifying above pattern result in:
([^e]|e([^n]|n[^d]))* <--- don't use
but regular expression match string
enendstuff shouldn't have been matched
the real state diagram we're looking
and 1 way of writing regular expression is:
([^e]|e(e|ne)*([^en]|n[^ed]))
again, produced tracing ways end in state 0:
[^e] stays in state 0 e in state 1: (e|ne)*: stay in state 1 [^en]: state 0 n[^ed]:back state 0 via state 2
this can lot of work, both produce , read. , results error-prone. (formal validation easier state diagrams, small class of problems, rather regular expressions can grow enormous).
a practical , scalable solution
practical flex rulesets use start conditions solve kind of problem. example, here how might recognize python triple-quoted strings:
%x tripleq start \"\"\" end \"\"\" %% {start} { begin( tripleq ); /* note: no return, flex continues */ } <tripleq>.|\n { /* append next token yytext instead of * replacing yytext next token */ yymore(); /* no return yet, flex continues */ } <tripleq>{end} { /* we've found end of string, * need rid of terminating """ */ yylval.str = malloc(yyleng - 2); memcpy(yylval.str, yytext, yyleng - 3); yylval.str[yyleng - 3] = 0; return string; }
this works because .
rule in start condition tripleq
not match "
if "
part of string matched {end}
; flex chooses longest match. made more efficient using [^"]+|\"|\n
instead of .|\n
, because result in longer matches , consequently fewer calls yymore()
; didn't write way above clarity.
this model easier extend. in particular, if wanted use <![cdata[
start , ]]>
terminator, we'd need change definitions
start "<![cdata[" end "]]>"
(and possibly optimized rule inside start condition, if using optimization suggested above.)
Comments
Post a Comment