ANTLR 4: ISO-8859-15 encoded file, matching a string containing \u0161 (š)

I have this grammar:

KEY
: [a-zA-Z\u0160\u0161\u00C0-\u00FF][a-zA-Z_0-9\-\''\u0160\u0161\u00C0-\u00FF]* 
;

Reading an ISO-8859-15 encoded text file

new ANTLRFileStream(fileName, "ISO-8859-15")

with the string Milešovka. Why is š giving a token recognition error?

Trace:

 line 110:6 token recognition error at: ''exit    field, LT(1)={

EDIT: I am using ANTLR 4.5.1 (and have tested 4.4, same issue).

Answers


I think the problem might be in the way you generate the parser. I'm not sure exactly what could go wrong, but I managed to build a working example with your symbol that uses Maven to generate the lexer and parser from the grammar.

pom.xml

<build>
    <plugins>
        <plugin>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-maven-plugin</artifactId>
            <version>4.5</version>
            <configuration>
                <outputDirectory>src/main/java</outputDirectory>
                <listener>false</listener>
                <visitor>true</visitor>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>antlr4</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
    </plugins>
</build>

<dependencies>
    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-runtime</artifactId>
        <version>4.5.1</version>
    </dependency>
</dependencies>

LexerGrammar.g

lexer grammar TestLexer;

LBR: '[';
RBR: ']';
KEY
: [a-zA-Z\u0160\u0161\u00C0-\u00FF][a-zA-Z_0-9\-\''\u0160\u0161\u00C0-\u00FF]*
;

ParserGrammar.g

parser grammar TestParser;

options { tokenVocab=TestLexer; }

rul   : block+ ;
block  : LBR KEY RBR ;

Full example code is here


Ira Baxter's comment answers the question:

Does ANTLRFileStream always provide a stream of Unicode characters to the lexer? [Then \u0161 would be right] Or is that encoding just a way to tell it to read 8 bit bytes, without interpreting them? [Then \u00a8 would be the correct code for "š".]
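One way to check which of the two cases applies is to decode the byte directly. The following is a minimal sketch that assumes the JDK in use ships the ISO-8859-15 charset (common JDKs do); the class and method names are made up for illustration and are not part of ANTLR. It decodes the single byte 0xA8 under ISO-8859-15 and, for comparison, under ISO-8859-1.

```java
import java.nio.charset.Charset;

public class Latin9Check {
    // Decode one byte under the named charset and return the resulting code unit.
    // (Hypothetical helper for illustration only.)
    public static int decodeByte(int b, String charsetName) {
        String s = new String(new byte[] { (byte) b }, Charset.forName(charsetName));
        return s.charAt(0);
    }

    public static void main(String[] args) {
        // In ISO-8859-15 the byte 0xA8 is "š"; in ISO-8859-1 the same byte is "¨".
        System.out.printf("ISO-8859-15 0xA8 -> U+%04X%n", decodeByte(0xA8, "ISO-8859-15"));
        System.out.printf("ISO-8859-1  0xA8 -> U+%04X%n", decodeByte(0xA8, "ISO-8859-1"));
    }
}
```

If a stream decodes the file as ISO-8859-15, the lexer should see U+0161, which the \u0161 in the KEY rule covers. If the bytes are decoded as ISO-8859-1 (or not decoded at all), the lexer sees U+00A8, which is outside every range in the rule and would plausibly produce a token recognition error. It may also be worth confirming that the input file really is ISO-8859-15 and not UTF-8, where š is stored as two bytes.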

