We want to allow "normal" href links to other webpages, but we don't want to allow anyone to sneak in client-side scripting.
You'll have to use a whitelist of allowed protocols to be completely safe. If you use a blacklist, sooner or later you'll miss something like "telnet://" or "shell:" or some exploitable browser-specific thing you've never heard of...
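As a rough illustration of the whitelist idea, here is a minimal Python sketch (the same logic should port to PHP or any other language). The names `ALLOWED_SCHEMES` and `is_allowed_href` are made up for this example, and the set of allowed protocols is an assumption you would adjust to your own needs. Relative URLs have no scheme and are rejected here, which may or may not be what you want.

```python
from urllib.parse import urlsplit

# Assumption: only these protocols are acceptable for user-supplied links.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_allowed_href(href: str) -> bool:
    """Return True only if href starts with a whitelisted protocol."""
    # Browsers ignore surrounding whitespace and drop tabs/newlines inside
    # a URL, so remove them before looking at the scheme.
    cleaned = "".join(ch for ch in href.strip() if ch not in "\t\r\n")
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        # urlsplit raises on some malformed input; treat that as unsafe.
        return False
    return scheme in ALLOWED_SCHEMES
```

Anything not on the list is refused by default, so a protocol you have never heard of is blocked without you having to know about it.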
Nope, there's a lot more that you need to check.
First off, the URL could be encoded (using HTML entities, URL encoding, or a mixture of both).
Secondly, you need to check for malformed HTML, which the browser might guess at and end up letting some script through.
There's probably more that I've missed - you need to be careful!
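To make the encoding point concrete, here is a small Python sketch that decodes a value repeatedly before any checks are run on it. `fully_decode` and `max_rounds` are hypothetical names for this example. Decoding both layers is deliberately more aggressive than what a browser does, on the theory that a false alarm is cheaper than a missed script.

```python
import html
from urllib.parse import unquote


def fully_decode(value: str, max_rounds: int = 5) -> str:
    """Undo HTML-entity and percent-encoding until the value stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(html.unescape(value))
        if decoded == value:
            break  # nothing left to decode
        value = decoded
    return value
```

For instance, `&#106;avascript:alert(1)`, `%6Aavascript:alert(1)`, and the mixed form `&#37;6Aavascript:alert(1)` all decode to the same `javascript:` URL, which a check on the raw string would miss.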
You have to be extremely careful when taking user input. You'll want to do a whitelist as mentioned, but not just with the href. Example:
<img src="nosuchimage.blahblah" onerror="alert('Haxored!!!');" />
<a href="about:blank;" onclick="alert('Haxored again!!!');">click meh</a>
One option would be to disallow HTML altogether and use the same sort of formatting that some forums use. Just replace
That'll get you around the issues with mouseover handlers etc. Then just make sure the link starts with a white-listed protocol and doesn't contain a quote (a literal " or some encoded form that PHP or the browser might decode).
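A sketch of how the forum-style approach could look, in Python. The `[url=...]text[/url]` tag syntax is an assumption on my part (it is what many forums use), and `URL_TAG` and `render_links` are hypothetical names. The idea is to escape all user input first, so no raw HTML survives, and only then turn the known-safe tag into a link.

```python
import html
import re

# Assumed forum-style syntax: [url=http://...]label[/url].
# Only http/https targets without whitespace, brackets, or quotes match.
URL_TAG = re.compile(
    r"\[url=(https?://[^\s\[\]\"'<>]+)\](.+?)\[/url\]",
    re.IGNORECASE,
)


def render_links(text: str) -> str:
    """Escape everything, then expand whitelisted [url] tags into links."""
    escaped = html.escape(text, quote=True)  # neutralises <, >, &, and quotes
    return URL_TAG.sub(r'<a href="\1">\2</a>', escaped)
```

Tags whose target does not start with a whitelisted protocol simply do not match and are left as plain text.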
Sounds like you're looking for the companion function to PHP's strip_tags, which is strip_attributes. Unfortunately, it hasn't been written yet. (Hint, hint.)
There is, however, an interesting-looking suggestion in the strip_tags documentation, here:
In theory, this will strip anything that isn't an href, class, or id attribute from submitted links; it seems like you probably want to lock it down even further and accept only hrefs.
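For comparison, here is a rough Python sketch of the "just take hrefs" variant, built on the standard library's `html.parser` rather than on regular expressions. It is not the suggestion from the strip_tags documentation. `LinkOnlySanitizer` and `sanitize` are names invented for this example, and limiting links to http/https is an assumption. Every tag other than `<a>` is dropped, every attribute other than `href` is dropped, and all text is re-escaped on output.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkOnlySanitizer(HTMLParser):
    """Rebuild a fragment keeping only <a href="http(s)://..."> and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_links = 0  # number of <a> tags we actually emitted

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # drop every other tag, including <img> and <script>
        # The parser has already decoded HTML entities in attribute values.
        href = (dict(attrs).get("href") or "").strip()
        try:
            scheme = urlsplit(href).scheme.lower()
        except ValueError:
            return
        if scheme in ("http", "https"):
            self.out.append('<a href="%s">' % html.escape(href, quote=True))
            self.open_links += 1

    def handle_endtag(self, tag):
        # Only close links that we opened ourselves.
        if tag == "a" and self.open_links:
            self.out.append("</a>")
            self.open_links -= 1

    def handle_data(self, data):
        self.out.append(html.escape(data))


def sanitize(fragment: str) -> str:
    parser = LinkOnlySanitizer()
    parser.feed(fragment)
    parser.close()
    parser.out.append("</a>" * parser.open_links)  # close dangling links
    return "".join(parser.out)
```

Because the output is rebuilt from scratch instead of being edited in place, attributes such as `onclick` and `onerror` never make it through, whatever they are called.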