Wednesday, September 1, 2010

Ant & Xerces

Have you seen errors like this when parsing XML from Ant tasks?

org.xml.sax.SAXParseException: Invalid encoding name "Cp1252".
  at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
  at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:208)
  ...

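The problem is the encoding named in the XML declaration:
<?xml version="1.0" encoding="Cp1252"?>
Cp1252 is Java's own name for the Windows-1252 code page rather than the standard IANA name (windows-1252), so this declaration often appears in XML files generated by Java programs.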

Bug 4665105 was filed for this issue, but it was closed as "Not a defect".

It turns out that the JRE's own libraries parse the same XML file without complaint, but Ant uses its own XML libraries, xml-apis.jar and xercesImpl.jar, found under ant/lib. If I delete these two files, XML parsing in Ant works fine.
Luckily, this issue seems to be solved in the latest Ant, 1.8.1. From the release notes:
* Ant no longer ships with Apache Xerces-J or the XML APIs but relies
on the Java runtime to provide a parser and matching API versions.
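
To check which parser a given environment actually picks up, a small diagnostic along these lines may help. It is only a sketch: the class name WhichParser and the describe() helper are made up for illustration, and the exact class name and location printed will depend on the JRE and on what is on the classpath.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

public class WhichParser {

    // Hypothetical helper: reports the JAXP factory implementation in use
    // and, where available, the location it was loaded from.
    static String describe() {
        Class<?> impl = DocumentBuilderFactory.newInstance().getClass();
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader may have no code source.
        String origin = (source == null || source.getLocation() == null)
                ? "the Java runtime itself"
                : source.getLocation().toString();
        return impl.getName() + " loaded from " + origin;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the output names a class under org.apache.xerces and points at xercesImpl.jar in ant/lib, the bundled Xerces is the parser being used.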

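A workaround:
When parsing the XML file, instead of
DocumentBuilder.parse(file)
use
DocumentBuilder.parse(new InputSource(new FileReader(file)))
Because the parser now receives already-decoded characters rather than raw bytes, it ignores the encoding specified in the XML declaration. Note that FileReader decodes the file using the platform default encoding, so the result is only correct when that matches the file's actual encoding.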
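
Put together as a complete program, the workaround might look like the sketch below. The class name ReaderBasedParse and its parse() helper are illustrative names, not part of any library; setting the system ID is an optional extra so that relative references inside the document can still be resolved.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ReaderBasedParse {

    // Hypothetical helper: parses the file through a character stream so
    // the parser does not act on the encoding named in the XML declaration.
    static Document parse(File file)
            throws IOException, SAXException, ParserConfigurationException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // FileReader decodes bytes using the platform default encoding.
        try (Reader reader = new FileReader(file)) {
            InputSource source = new InputSource(reader);
            // Optional: lets the parser resolve relative system identifiers.
            source.setSystemId(file.toURI().toString());
            return builder.parse(source);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(new File(args[0]));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```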
