How to strip insignificant whitespace out of HTML

I have to compare different versions of HTML pages for formatting and text changes. Unfortunately the guy/company who creates them uses some kind of HTML editor that re-wraps all the HTML every time (and adds tons of whitespace), which makes the versions hard to diff. So I am looking for a tool (preferably a Java library) that can reformat my HTML so that all insignificant spaces and newlines are removed.

That means, in

<h1>First Headline</h1> <h2>Second headline</h2>

the space between </h1> and <h2> should be removed, but in

<b>formatted</b> <i>text</i>

the whitespace must not be removed. I do not care about <pre>, <textarea> or <script> blocks, nor about CSS white-space properties that can change the behavior. I am just looking for a solution that strips most of the unnecessary whitespace (and I would rather leave too much whitespace in than too little).

(I am already collapsing runs of whitespace and replacing the whitespace before tags with newlines to make the text more readable, but there are still too many cases where, for example, an added newline between headlines or table cells/rows breaks my simple "solution".)
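Roughly, that normalization amounts to something like the sketch below. The class and method names are made up for illustration, and the exact regexes are an assumption rather than my real code. It also shows the failure case: two inputs that render the same no longer normalize to the same string once one of them gains whitespace between two headlines.

```java
final class SimpleNormalizer {
    // Collapse every run of whitespace to a single space, then turn a
    // space that sits directly before a tag into a newline.
    static String normalize(String html) {
        return html.replaceAll("\\s+", " ").replace(" <", "\n<");
    }

    public static void main(String[] args) {
        // Same rendering, but the second version has whitespace between the headlines.
        String a = "<h1>First Headline</h1><h2>Second headline</h2>";
        String b = "<h1>First Headline</h1>\n\n   <h2>Second headline</h2>";

        // prints false: b keeps a newline between </h1> and <h2>, a does not
        System.out.println(normalize(a).equals(normalize(b)));
    }
}
```

The approach cannot tell whether whitespace next to a tag matters, because it never looks at which element the tag belongs to.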

Answers


JTidy may be of use here. It is an HTML parser that tolerates ill-formed HTML and presents the document as a DOM. You can then override how that DOM is written out to drop whatever you're not interested in.
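As a rough sketch of the DOM-walking idea (this is not JTidy's own API): the code below uses the JDK's built-in XML parser as a stand-in, so it assumes the input is already well-formed XHTML. With JTidy you would obtain the `Document` from its parser instead. The `BLOCK` set, the class name and the method names are assumptions for illustration, and the list of block-level elements is deliberately incomplete. A whitespace-only text node is dropped only when it touches a block-level boundary, so whitespace between inline elements such as `<b>` and `<i>` survives.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class BlockWhitespaceStripper {
    // Assumption: a hand-picked, incomplete list of block-level elements.
    private static final Set<String> BLOCK = Set.of(
        "html", "head", "body", "div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
        "table", "thead", "tbody", "tfoot", "tr", "td", "th", "ul", "ol", "li");

    private static boolean isBlock(Node n) {
        return n != null && n.getNodeType() == Node.ELEMENT_NODE
            && BLOCK.contains(n.getNodeName().toLowerCase(Locale.ROOT));
    }

    // Removes whitespace-only text nodes that touch a block-level boundary.
    static void strip(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            boolean blank = child.getNodeType() == Node.TEXT_NODE
                && child.getNodeValue().trim().isEmpty();
            if (blank) {
                Node prev = child.getPreviousSibling();
                // At the start or end of an element, the parent itself is the boundary.
                boolean blockBefore = prev == null ? isBlock(parent) : isBlock(prev);
                boolean blockAfter = next == null ? isBlock(parent) : isBlock(next);
                if (blockBefore || blockAfter) {
                    parent.removeChild(child);
                }
            } else {
                strip(child);
            }
            child = next;
        }
    }

    // Parses well-formed XHTML, strips it, and serializes it without re-indenting.
    static String stripXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            strip(doc);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not process input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stripXhtml(
            "<body><h1>First Headline</h1> <h2>Second headline</h2>\n"
            + "<p><b>formatted</b> <i>text</i></p>\n</body>"));
    }
}
```

This errs on the side of keeping whitespace: anything next to an element that is not in `BLOCK` is left alone, which matches the "better too much than too little" requirement in the question. It does not treat `<pre>`, `<textarea>` or `<script>` specially.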


If this is for internal use only, consider converting the pages to XHTML and then canonicalizing the XML. That makes the results much easier to compare.

Tidy: http://tidy.sourceforge.net/ (output-xhtml option - http://tidy.sourceforge.net/docs/quickref.html#output-xhtml)

Canonicalize: http://en.wikipedia.org/wiki/Canonical_XML

