flex lexer - (F) Lex, how do I match negation? -


some language grammars use negations in rules. example, in dart specification following rule used:

~('\'|'"'|'$'|newline) 

which means match not 1 of rules inside parenthesis. now, know in flex can negate character rules (ex: [^ab] , of rules want negate more complicated single character don't think use character rules that. example may need negate sequence '"""' multiline strings i'm not sure way in flex be.

(tl;dr: skip down bottom practical answer.)

the inverse of regular language regular language. in theory possible write inverse of regular expression regular expression. unfortunately, not easy.

the """ case, @ least, not difficult.

first, let's clear trying match.

strictly speaking "not """" mean "any string other """". include, example, x""".

so might tempting we're looking "any string not contain """". (that is, inverse of .*""".*). that's not quite correct either. typical usage tokenise input like:

"""this string might contain " or "".""" 

if start after initial """ , longest string doesn't contain """, find:

this string might contain " or ""."" 

whereas wanted was:

this string might contain " or "". 

so turns out need "any string not end " , doesn't contain """", conjunction of 2 inverses: (~.*" ∧ ~.*""".*)

it's (relatively) easy produce state diagram that:enter image description here

(note difference between above , state diagram "any string not contain """" in state diagram, states accepting, , in 1 states 1 , 2 not accepting.)

now, challenge turn regular expression. there automated techniques doing that, regular expressions produce long , clumsy. case simple, though, because there 1 accepting state , need describe paths can end in state:

([^"]|\"([^"]|\"[^"]))* 

this model work simple string, it's little more complicated when string not sequence of same character. example, suppose wanted match strings terminated end rather """. naively modifying above pattern result in:

([^e]|e([^n]|n[^d]))*   <--- don't use 

but regular expression match string

enendstuff shouldn't have been matched 

the real state diagram we're looking enter image description here

and 1 way of writing regular expression is:

([^e]|e(e|ne)*([^en]|n[^ed])) 

again, produced tracing ways end in state 0:

[^e] stays in state 0 e    in state 1:      (e|ne)*: stay in state 1      [^en]: state 0      n[^ed]:back state 0 via state 2 

this can lot of work, both produce , read. , results error-prone. (formal validation easier state diagrams, small class of problems, rather regular expressions can grow enormous).


a practical , scalable solution

practical flex rulesets use start conditions solve kind of problem. example, here how might recognize python triple-quoted strings:

%x tripleq start \"\"\" end   \"\"\" %%  {start}        { begin( tripleq ); /* note: no return, flex continues */ }  <tripleq>.|\n  { /* append next token yytext instead of                   * replacing yytext next token                   */                  yymore();                  /* no return yet, flex continues */                } <tripleq>{end} { /* we've found end of string,                   * need rid of terminating """                   */                  yylval.str = malloc(yyleng - 2);                  memcpy(yylval.str, yytext, yyleng - 3);                  yylval.str[yyleng - 3] = 0;                  return string;                } 

this works because . rule in start condition tripleq not match " if " part of string matched {end}; flex chooses longest match. made more efficient using [^"]+|\"|\n instead of .|\n, because result in longer matches , consequently fewer calls yymore(); didn't write way above clarity.

this model easier extend. in particular, if wanted use <![cdata[ start , ]]> terminator, we'd need change definitions

start "<![cdata[" end   "]]>" 

(and possibly optimized rule inside start condition, if using optimization suggested above.)


Comments

Popular posts from this blog

php - Submit Form Data without Reloading page -

linux - Rails running on virtual machine in Windows -