HTMLInputFilter: HTML Input Sanitizer for Java

Joseph O'Connell announces HTMLInputFilter, an HTML input sanitizer for Java based on Cal Anderson's lib_filter for PHP (article). Joseph is looking for more people to test his open source library.

Writing and maintaining a sanitizer for complex input such as HTML is an error-prone, never-ending task that should be shared, but where are the open source Java libraries for this? Googling turns up only warnings and advice. And although simple input values can easily be sanitized with regular expressions, it's just as easy for inexperienced developers to make silly mistakes. I think it is a waste for everyone to hand-roll a sanitizer every time, given that a small number of input data types would cover the majority of use cases.

I wonder why there isn't a Jakarta project for this…

Update:

HTMLInputFilter is throwing unexpected errors, which is why it needs more testers:

java.lang.IndexOutOfBoundsException: No group 5
    at java.util.regex.Matcher.group(Matcher.java:463)
    at java.util.regex.Matcher.appendReplacement(Matcher.java:730)
    at com.josephoconnell.html.HTMLInputFilter.validateEntities(HTMLInputFilter.java:470)
    at com.josephoconnell.html.HTMLInputFilter.filter(HTMLInputFilter.java:198)
    at com.docuverse.daily.filter.HTMLInputFilterAdapter.filter(HTMLInputFilterAdapter.java:22)
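For anyone who wants to dig in: Matcher.appendReplacement treats "$N" in the replacement string as a reference to capture group N, so a "No group 5" error usually means the replacement text contained a literal "$5" while the pattern had fewer than five groups. Whether that is what happens inside validateEntities is unconfirmed. The sketch below only reproduces the same kind of failure with a made-up two-group pattern (the class name, pattern, and inputs are all hypothetical, not taken from HTMLInputFilter) and shows Matcher.quoteReplacement as one possible guard.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoGroupDemo {
    // Hypothetical pattern with only two capture groups; NOT the one HTMLInputFilter uses.
    private static final Pattern ENTITY = Pattern.compile("&(\\w+)(;?)");

    /** Passes the replacement through as-is, so "$5" is parsed as a group reference. */
    static String replaceUnsafe(String input, String replacement) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Throws IndexOutOfBoundsException if replacement refers to a missing group.
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Escapes '$' and '\' first, so the replacement is always inserted literally. */
    static String replaceSafe(String input, String replacement) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String textWithDollar = "costs $5"; // stands in for user-supplied text
        try {
            replaceUnsafe("a &amp; b", textWithDollar);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unsafe variant failed: " + e.getMessage());
        }
        System.out.println(replaceSafe("a &amp; b", textWithDollar));
    }
}
```

If the real cause turns out to be similar, quoting whatever user-derived text reaches appendReplacement would be the smallest fix; if the replacement is meant to contain group references, the pattern and the reference numbers need to be checked against each other instead.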

I don't have time to track down the cause. Also, HTMLInputFilter assumes that compiled regex patterns are cached internally. I remember a bug report about this from a long time ago, but I'm not sure whether it has been implemented and, if so, in which version.
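One way to sidestep the caching question entirely is to compile each pattern once and keep it in a static final field, instead of relying on whatever the runtime may or may not cache. A minimal sketch, with a hypothetical class name and a made-up comment-stripping pattern that is not taken from HTMLInputFilter:

```java
import java.util.regex.Pattern;

class PrecompiledPatterns {
    // Compiled once when the class is loaded. Pattern instances are immutable
    // and safe to share between threads; Matcher instances are not.
    // Hypothetical pattern for illustration: matches HTML comments.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Removes HTML comments using the shared, precompiled pattern. */
    static String stripComments(String html) {
        // A fresh Matcher per call; only the compiled Pattern is reused.
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("a<!-- hidden -->b"));
    }
}
```

By contrast, convenience methods such as String.replaceAll(regex, replacement) are documented as equivalent to calling Pattern.compile(regex) on each invocation, so code on a hot path that uses them depends on the cost of that compile step.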