Parser generator for inline documentation

To have a general-purpose documentation system that can extract inline documentation of multiple languages, a parser for each language is needed. A parser generator (which actually doesn't have to be that complete or efficient) is thus needed.

http://antlr.org/ is a nice parser generator that already has a number of grammars for popular languages. Are there better alternatives i.e. simpler ones that support generating parsers for even more languages out-of-the-box?

Answers


If you're only looking for "partial parsing", then you could use ANTLR's option to partially "lex" a token stream and ignore the rest of the tokens. You can do that by enabling the filter=true in a lexer-grammar. The lexer then tries to match any token you defined in your grammar, and when it can't match one of the tokens, it advances one single character (and ignores it) and then again tries to match one of your token at the next character:

lexer grammar Foo;

options {filter=true;}

StringLiteral
  :  ...
  ;

CharLiteral
  :  ...
  ;

SingleLineComment
  :  ...
  ;

MultiLineComment
  :  ...
  ;

When implemented properly, you can get the MultiLineComments (/* ... */) from a Java file quite easily without being afraid of single line comments and String- or char literals messing things up.

Obviously, your source files need to be valid to be able to properly tokenize a file, otherwise you get strange results!


My compiler uses Dypgen. This is a user extenisble GLR parser with lots of enrichments so it can parse many languages. The bootstrap grammar is EBNF like (it supports * + and ? directly in your productions). It is powerful enough to dynamically load extensions, a fact my compiler leverages: the bulk of my programming language has its syntax dynamically loaded at compiler startup.

Dypgen is written in Ocaml and generates Ocaml code.

There is a C++ GLR parser called Elkhound which is powerful enough to parse most of C++.

However, for your actual requirements, you do not really need to do any serious parsing: a regular expression matching engine is probably good enough. Googles re2 may be suitable (provides most PCRE functionality, a lot faster and with C++ interface).

Although this is less accurate, it is good enough because you can demand that inline documentation adhere to some simple formats. Most existing inline docs already do so for just this reason.


Where I work we used to use GOLD Parser. This is a lot simpler that Antlr and supports multiple languages. We have since moved to Antlr however as we needed to do more complex parsing, which we found Antlr was better for than GOLD.


Need Your Help

Calendars and Content types

sharepoint-2007

I've created a SharePoint calendar with content types: Available and Unavailable. The "All Day Event",

Customize color of text on Actionbar

android android-actionbar

I am just starting to learn android development and have been going through the tutorials from android. But I'm stuck at a part trying to change the text color and style of the action bar. I am not...