ANTLR 4: ISO-8859-15 encoded file, matching a string containing \u0161 (š)

I have this grammar:

: [a-zA-Z\u0160\u0161\u00C0-\u00FF][a-zA-Z_0-9\-\''\u0160\u0161\u00C0-\u00FF]* 

I am reading an ISO-8859-15 encoded text file with

new ANTLRFileStream(fileName, "ISO-8859-15")

The file contains the string Milešovka. Why does š cause a token recognition error?


 line 110:6 token recognition error at: ''exit    field, LT(1)={

EDIT: I am using ANTLR 4.5.1 (and have also tested 4.4, with the same result).


I think the problem might be in how you generate the parser. I'm not sure exactly what goes wrong, but I put together a working example that recognizes your character and uses Maven to generate the lexer and parser from the grammar.





lexer grammar TestLexer;

LBR: '[';
RBR: ']';
KEY: [a-zA-Z\u0160\u0161\u00C0-\u00FF][a-zA-Z_0-9\-\''\u0160\u0161\u00C0-\u00FF]* ;


parser grammar TestParser;

options { tokenVocab=TestLexer; }

rul   : block+ ;
block  : LBR KEY RBR ;

Full example code is here

Ira Baxter's comment answers the question:

Does ANTLRFileStream always provide a stream of Unicode characters to the lexer? [Then \u0161 would be right] Or is that encoding just a way to tell it to read 8 bit bytes, without interpreting them? [Then \u00a8 would be the correct code for "š".]
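
The two possibilities in that comment can be told apart with the JDK alone. The following is a minimal sketch, not part of the original answers: the class name Latin9Check and the helper decode are made up for illustration, and it assumes the JDK in use provides the ISO-8859-15 charset. It decodes the single byte 0xA8 once as ISO-8859-15 and once as ISO-8859-1. If the byte is decoded as ISO-8859-15, the lexer would see U+0161, which the character set in the grammar lists explicitly. If the byte is passed through uninterpreted, the lexer would see U+00A8, which is neither listed nor inside \u00C0-\u00FF, so the rule could not match it.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class Latin9Check {
    // Decode one byte with the named charset and return the resulting code point.
    static int decode(int singleByte, String charsetName) {
        String s = new String(new byte[] {(byte) singleByte}, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // 0xA8 is the byte that encodes "š" in ISO-8859-15.
        System.out.printf("0xA8 as ISO-8859-15: U+%04X%n", decode(0xA8, "ISO-8859-15"));
        // The same byte read as ISO-8859-1 maps straight to U+00A8.
        System.out.printf("0xA8 as ISO-8859-1:  U+%04X%n", decode(0xA8, "ISO-8859-1"));
        // 0xA6 is the byte that encodes "Š" in ISO-8859-15.
        System.out.printf("0xA6 as ISO-8859-15: U+%04X%n", decode(0xA6, "ISO-8859-15"));
    }
}
```

Printing the code points of the offending line of the real input file in the same way would show which character actually sits at the position reported in the error.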
