ParseKit - how to handle preprocessor statements correctly?
I wrote a C grammar for ParseKit that works perfectly, but preprocessor statements are driving me crazy. What are the correct symbol definitions for preprocessor statements?
Here's a short example of what I've tried ...
    @reportsCommentTokens = YES;
    @commentState = '/';
    @singleLineComments = '//';
    @multiLineComments = '/*' '*/';
    @commentState.fallbackState = delimitState;
    @delimitState.fallbackState = symbolState;

    @start = Empty | comments | preprocessor;
    comments = comment*;
    comment = Comment;
    @symbols = '#include';
    preprocessor = preprocessorIncludes;
    preprocessorIncludes = preprocessorIncludeStatement*;
    preprocessorIncludeStatement = preprocessorInclude quotedFileName*;
    preprocessorInclude = '#include';
    quotedFileName = QuotedString;
... but it doesn't work. Take it as a simplified example grammar that should catch comments and include statements with quotes (not with < >). I tried this grammar on this simple file ...
    /*
     * Cryptographic API.
     *
     * RIPEMD-256 - RACE Integrity Primitives Evaluation Message Digest.
     *
     * Based on the reference implementation by Antoon Bosselaers, ESAT-COSIC
     *
     * Copyright (c) 2008 Adrian-Ken Rueegsegger <email@example.com>
     *
     * This program is free software; you can redistribute it and/or modify it
     * under the terms of the GNU General Public License as published by the Free
     * Software Foundation; either version 2 of the License, or (at your option)
     * any later version.
     *
     */
    // Here's one line comment
    /* One line multiline comment */
    #include "ripemd.h"
    /* 2nd one line multiline comment */
... and it stops at /* One line multiline comment */, reports it as a comment token, and then silently fails.
So I tried splitting the '#include' symbol into ...
    @symbolState = '#' '#';
    @symbol = '#';
    numSymbol = '#';
    preprocessorInclude = numSymbol 'include';
... but it still doesn't help.
Maybe Todd can help, but what's the correct way to handle 'symbols' like '#include'?
Developer of ParseKit here.
Robert, your grammar is very close, but I found that your use of nested * (zero-or-more) modifiers was causing the grammar to fail.
I think the problem is that your @start grammar production already has Empty as a top-level option (|ed with the other two productions), but then the sub-productions for comments and preprocessor both contain productions with the * (zero-or-more) modifier. Those *s should really be + (one-or-more) modifiers because you have already accounted for the zero case with the top-level Empty.
I'm not entirely sure, but I don't think this problem is unique to ParseKit. I suspect the grammar itself was problematic and that the same issue would have shown up in any such grammar toolkit (I could be wrong).
With that in mind, some small tweaks to the grammar have fixed it for me. Here's the edited grammar with the small tweaks:
    @reportsCommentTokens = YES;
    @commentState = '/';
    @singleLineComments = '//';
    @multiLineComments = '/*' '*/';
    @commentState.fallbackState = delimitState;
    @delimitState.fallbackState = symbolState;

    @start = (comments | preprocessor)*;
    comments = comment+;
    comment = Comment;
    @symbols = '#include';
    preprocessor = preprocessorIncludes;
    preprocessorIncludes = preprocessorIncludeStatement+;
    preprocessorIncludeStatement = preprocessorInclude quotedFileName;
    preprocessorInclude = '#include';
    quotedFileName = QuotedString;
Notice that I replaced the top-level Empty with a *, swapped the nested *s for +s, and dropped the * after quotedFileName.
With this edited grammar, I get the desired output (truncated slightly for clarity):
[/* * Cryptographic API. ... */, // Here's one line comment, /* One line multiline comment */, #include, "ripemd.h", /* 2nd one line multiline comment */]/* * Cryptographic API. ... *//// Here's one line comment//* One line multiline comment *//#include/"ripemd.h"//* 2nd one line multiline comment */^
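One way to see why the top-level change matters is to look only at the shape of the token stream each @start rule allows. The sketch below is not ParseKit code: it abbreviates each token type to a single letter (an encoding made up for this illustration) and uses Python regular expressions as a stand-in for the two @start rules. Under that assumption, the original rule allows only comments or only includes, while the edited rule allows any mix of the two.

```python
import re

# Each token type is abbreviated to one letter (made up for this sketch):
#   C = Comment, I = '#include', Q = QuotedString
# The sample file tokenizes to: 3 comments, #include "ripemd.h", 1 comment.
TOKENS = "CCCIQC"

# Original:  @start = Empty | comment* | (preprocessorInclude quotedFileName*)*
ORIGINAL = re.compile(r"(?:|C*|(?:IQ*)*)")

# Edited:    @start = (comment+ | (preprocessorInclude quotedFileName)+)*
EDITED = re.compile(r"(?:C+|(?:IQ)+)*")


def accepts(pattern, tokens):
    """Return True if the whole token string is matched by the pattern."""
    return pattern.fullmatch(tokens) is not None


print(accepts(ORIGINAL, TOKENS))  # False: comments and includes cannot be mixed
print(accepts(EDITED, TOKENS))    # True: the outer * allows any interleaving
```

This only models which token sequences each rule can describe. It says nothing about how ParseKit searches for a match internally.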
Also, to find the issue, I rewrote the grammar to be simpler. It was easier to find the issue that way. Then I re-applied what I found to your original grammar. Here's the simplified grammar I came up with in case you are interested. This is how I think of this particular grammar in my mind:
    @reportsCommentTokens = YES;
    @commentState = '/';
    @singleLineComments = '//';
    @multiLineComments = '/*' '*/';

    @start = (comment | macro)*;
    comment = Comment;
    macro = include; // to support other macros, add: ` | define | ifdef` etc.
    include = '#' 'include' QuotedString;
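For readers following along without ParseKit, here is a rough hand-written Python equivalent of the simplified grammar. It is only a sketch under stated assumptions: the names `tokenize` and `parse` are hypothetical, and the regular-expression tokenizer is a stand-in that handles just comments, quoted strings, `#`, and words. It does not reflect how ParseKit's own tokenizer works.

```python
import re

# Stand-in tokenizer (not ParseKit's): comments, quoted strings, '#', and words.
TOKEN_RE = re.compile(r"""
    (?P<comment> /\*.*?\*/ | //[^\n]* )   # multi-line or single-line comment
  | (?P<quoted>  "[^"\n]*" )              # QuotedString
  | (?P<symbol>  \# )                     # the '#' symbol
  | (?P<word>    [A-Za-z_]\w* )           # a word such as include
  | (?P<ws>      \s+ )                    # whitespace, discarded
""", re.VERBOSE | re.DOTALL)


def tokenize(src):
    """Split source text into (kind, text) pairs, dropping whitespace."""
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse(tokens):
    """@start = (comment | macro)*;  include = '#' 'include' QuotedString;"""
    out, i = [], 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "comment":
            out.append(("comment", text))
            i += 1
        elif (kind, text) == ("symbol", "#"):
            rest = tokens[i + 1:i + 3]
            if len(rest) == 2 and rest[0] == ("word", "include") and rest[1][0] == "quoted":
                out.append(("include", rest[1][1]))
                i += 3
            else:
                raise SyntaxError("expected: '#' 'include' QuotedString")
        else:
            raise SyntaxError(f"unexpected token {text!r}")
    return out


sample = '/* a\n * b */\n// one line\n#include "ripemd.h"\n/* tail */\n'
for item in parse(tokenize(sample)):
    print(item)
```

Supporting further macros would mean adding more branches next to the `include` case, much as the grammar comment suggests adding `| define | ifdef`.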