How to abbreviate HTML with Java?

A user enters text as HTML in a form, for example:

<p>this is my <strong>blog</strong> post, 
very <i>long</i> and written in <b>HTML</b></p>

I want to be able to output only part of the string (for example, the first 20 characters) without breaking the HTML structure of the user's input. In this case:

<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>

which renders as

this is my blog post, very l...

Is there a Java library able to do this, or a simple method to use?

MyLibrary.abbreviateHTML(string,20) ?


Since this is not easy to do correctly, I usually strip all tags and truncate. This gives good control over the size and appearance of the text, which usually ends up in places where you do need that control.

Note that you may find my proposal very conservative, and it is not really a proper answer to your question. But most of the time the alternatives are:

  • strip all tags and truncate
  • provide a separate rich-text field, editable by the content author, which serves as the truncated text. This of course only works in CMSes and the like
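The "strip all tags and truncate" option can be sketched roughly as follows. This is only a minimal illustration: the class and method names are made up for the example, the regex treats anything between < and > as a tag rather than really parsing HTML, and entities such as &amp; are left undecoded.

```java
final class PlainTextTeaser {
    /**
     * Hypothetical helper: removes anything that looks like a tag,
     * collapses whitespace, and cuts the text to at most maxChars.
     */
    static String stripAndTruncate(String html, int maxChars) {
        String text = html.replaceAll("<[^>]*>", "")  // drop tag-like runs
                          .replaceAll("\\s+", " ")     // collapse spaces and newlines
                          .trim();
        if (text.length() <= maxChars) {
            return text;                               // short enough, no ellipsis
        }
        return text.substring(0, maxChars) + "...";
    }
}
```

With the HTML from the question and a limit of 20, this returns "this is my blog post...". The result is plain text, so it still has to be escaped before being written back into a page.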

The reason truncating HTML is hard is that you don't know how the truncation will affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?

So the problem here is that HTML contains not only content and styling (bold, italics) but also structure (lists, tables, divs etc.). A good and safe implementation would therefore strip out everything apart from inline "styling" tags (bold, italics etc.) and truncate while keeping track of unclosed tags.
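As a rough sketch of the first half of that idea (dropping structure but keeping inline styling), something like the following could work. The whitelist of b, i, strong and em is an assumption for the example, the names are hypothetical, and the regex is not a real HTML parser; comments, scripts and malformed markup would need extra handling.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineTagFilter {
    // Assumed whitelist of inline "styling" tags; adjust as needed.
    private static final Set<String> KEEP =
            new HashSet<>(Arrays.asList("b", "i", "strong", "em"));

    // group(1) = optional slash, group(2) = tag name, rest = attributes.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Removes every tag not in KEEP; kept tags are re-emitted without attributes. */
    static String keepInlineTags(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            String replacement = KEEP.contains(name)
                    ? "<" + m.group(1) + name + ">"   // bare tag, attributes dropped
                    : "";                             // structural tag: remove it
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Re-emitting the kept tags without their attributes is a deliberate choice here: it avoids carrying things like onclick handlers into the output.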

I don't know of any library, but it should not be too complicated (for 80% of cases). You only need a simple "parser" that understands 4 types of tokens:

  • opening tags - everything that starts with < but not </ and ends with > but not />
  • closing tags - everything that starts with </ and ends with >
  • self-closing tags (like <br/>) - everything that starts with < but not </ and ends with />
  • normal characters - everything that is none of the other types

Then you walk through your input string and count the "normal characters". As you walk along the string and count, you copy every token to the output as long as the number of normal characters counted is less than or equal to the amount you want.

You also need to build a stack of the currently open tags while you walk through the input. Every time you pass an "opening tag" you push its name onto the stack; every time you find a closing tag, you remove the topmost tag name from the stack (hopefully the input is correct XHTML).

When you reach the required number of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack.

But be careful: this works only if the input is well-formed XML.
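A minimal sketch of the walk described above might look like this. The names are hypothetical, the "..." marker is my own addition, and it assumes well-formed XHTML-style input: a bare <br> would be treated as an opening tag, and an entity such as &amp; is counted character by character and could be cut in the middle.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class HtmlAbbreviator {
    /**
     * Copies tokens until maxChars normal characters have been written,
     * then closes whatever tags are still on the stack.
     */
    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();  // names of currently open tags
        int count = 0;                            // normal characters copied so far
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = (c == '<') ? html.indexOf('>', i) : -1;
            if (end < 0) {
                // Normal character (or a stray '<' that never closes).
                if (count == maxChars) {
                    out.append("...");            // limit reached: stop copying
                    break;
                }
                out.append(c);
                count++;
                i++;
                continue;
            }
            String tag = html.substring(i, end + 1);
            if (tag.startsWith("</")) {
                if (!open.isEmpty()) {
                    open.pop();                   // closing tag
                }
            } else if (!tag.endsWith("/>")) {
                open.push(tagName(tag));          // opening tag
            }                                     // self-closing: nothing to track
            out.append(tag);
            i = end + 1;
        }
        while (!open.isEmpty()) {                 // close unclosed tags, innermost first
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    /** Extracts the name from an opening tag such as <strong class="x">. */
    private static String tagName(String tag) {
        int j = 1;
        while (j < tag.length() && Character.isLetterOrDigit(tag.charAt(j))) {
            j++;
        }
        return tag.substring(1, j);
    }
}
```

For the single-line input "<p>this is my <strong>blog</strong> post, very <i>long</i> and written in <b>HTML</b></p>" and a limit of 20, this gives "<p>this is my <strong>blog</strong> post...</p>". Note that the ellipsis ends up inside the innermost element that is still open, and that whitespace and newlines count as normal characters.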

I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.

If you really want to abbreviate HTML, then just do it (cut the text at the desired length), pass the abbreviated result through and hope for the best.

It seems that there are a lot of libs and tools for this common task:
