Is it a bad idea to escape HTML before inserting into a database instead of upon output?
I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.
The similar questions I've seen here seem to cover cases where HTML can still be used for formatting, so I'm asking about a case where HTML wouldn't be used at all.
You will also restrict yourself if you escape before inserting into your DB. Let's say you later decide to output JSON, plain text, etc. instead of HTML.
If you have stored escaped HTML in your DB, you would first have to unescape the stored value, only to re-escape it for the new format.
Also see this excellent OWASP article on XSS prevention.
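To make that concrete, here is a minimal sketch of escaping per output format at render time. The sample string and variable names are just assumptions for illustration.

```php
<?php
// Raw text exactly as the user typed it; this is what would be stored.
$raw = 'Tom & Jerry <3 "cheese"';

// Escape for an HTML context only at the moment of HTML output.
$html = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// The same raw value can be encoded as JSON with no unescaping step.
$json = json_encode(['title' => $raw]);

// Plain text (an SMS, a log line) needs no HTML escaping at all.
$text = $raw;

echo $html, PHP_EOL;
echo $json, PHP_EOL;
echo $text, PHP_EOL;
```

Because the database holds the raw value, each output path applies only its own encoding.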
Yes, because at some stage you'll want access to the original input entered. This is because...
- You never know how you want to display it - in JSON, in HTML, as an SMS?
- You may need to show it back to the user as is.
I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it is a regex, then look out for confused users who might type something like this...
They'll only get the 3 if it is a regex.
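A hypothetical illustration of that kind of loss: both the input string and the tag-stripping pattern below are my own assumptions, not anything from the question.

```php
<?php
// Hypothetical user input: ordinary text that happens to contain < and >.
$input = '3 < 4 and 5 > 2';

// A naive tag-stripping pattern (an assumption, not the asker's code).
$stripped = preg_replace('/<[^>]*>/', '', $input);

// Escaping on output keeps every character and is still safe to render.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump($stripped); // everything between < and > has been removed
var_dump($escaped);  // all of the original text survives, with < and > encoded
```

The stripping approach silently throws away part of what the user wrote, while escaping loses nothing.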
I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.
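A rough sketch of that idea, assuming a hypothetical helper named buildEntry() and hypothetical column names body_raw and body_html:

```php
<?php
// Hypothetical helper: build the row to save whenever text is created or edited.
function buildEntry(string $raw): array
{
    return [
        // Source of truth; loaded again when the user edits the entry.
        'body_raw'  => $raw,
        // Precomputed once here so normal page views can skip the escaping step.
        'body_html' => htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'),
    ];
}

$entry = buildEntry('Fish & Chips');
print_r($entry);
```

The escaped column is only a cache: it has to be regenerated every time the raw column changes.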
Another elusive issue: Suppose you are entering a record with the string R&B in its title. It will be stored as R&amp;B. Now assume we have a search function which uses the SQL:
$query = $database->prepare('SELECT * FROM table WHERE title LIKE ?');
$query->execute(array($searchString.'%'));
Now if someone searches for R&B, it won't match this row, as it is stored as R&amp;B. The same problem applies to equality comparisons, sorting, etc.
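The mismatch can be reproduced without a database. In this sketch, startsWith() is a hypothetical stand-in for the LIKE 'R&B%' prefix match, and the title is made up.

```php
<?php
// What ends up in the title column under each strategy.
$storedEscaped = htmlspecialchars('R&B Classics', ENT_QUOTES, 'UTF-8'); // R&amp;B Classics
$storedRaw     = 'R&B Classics';

// Hypothetical stand-in for the SQL prefix match: title LIKE 'R&B%'
function startsWith(string $value, string $prefix): bool
{
    return strncmp($value, $prefix, strlen($prefix)) === 0;
}

$search = 'R&B';
var_dump(startsWith($storedEscaped, $search)); // bool(false)
var_dump(startsWith($storedRaw, $search));     // bool(true)
```

You could escape the search term as well to make it match, but then every query, sort and comparison has to remember to do that.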
Of course, storing the raw text brings its own search issue: HTML tags become searchable, so a <span> will match when someone searches for span. This could be solved by delegating the search functionality to an external service like Solr, or by storing a version in a second field which is cleared of HTML tags, special characters and such (for full text search), similar to what @limscoder suggested.
One day you may be exposing your data via an API or something, and your API users may assume it is unescaped.
A few months later, a new team member joins. As a well-trained developer, he always escapes HTML on output, only to find that everything is now double-escaped (e.g. there are titles showing up like He said &quot;nuff&quot; instead of He said "nuff").
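A short sketch of how the double escaping comes about (variable names are illustrative):

```php
<?php
$raw = 'He said "nuff"';

// Step 1: the original code escapes before inserting into the database.
$stored = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
// $stored is: He said &quot;nuff&quot;

// Step 2: the new developer, following best practice, escapes again on output.
$output = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
// $output is: He said &amp;quot;nuff&amp;quot;

echo $output, PHP_EOL;
```

The browser decodes each &amp; back to &, so the visitor sees the literal text &quot;nuff&quot; on the page.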
The quote style of htmlspecialchars() (e.g. ENT_QUOTES, ENT_COMPAT) is going to bite you if you use anything other than the default and forget to use the same style for both storing and outputting.
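For instance, here is a sketch of a mismatch between the flag used when storing and the flag used when decoding later. The flags are passed explicitly so the result does not depend on the PHP version's defaults.

```php
<?php
$raw = "It's \"fine\"";

// Stored with ENT_QUOTES: both single and double quotes are encoded.
$stored = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
// $stored is: It&#039;s &quot;fine&quot;

// Decoded elsewhere with ENT_COMPAT: only the double quotes come back.
$mismatch = htmlspecialchars_decode($stored, ENT_COMPAT);
// $mismatch is: It&#039;s "fine"

// Decoded with the matching flag: the original text is restored.
$roundTrip = htmlspecialchars_decode($stored, ENT_QUOTES);

var_dump($mismatch, $roundTrip);
```

If you store raw text there is nothing to decode, so the flag only needs to be right in one place: the output.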
A similar issue happens when you use htmlentities() to store and htmlspecialchars() to output, or vice versa (with the corresponding counter-functions). Your HTML will be polluted with &Uuml;s, &Ccedil;s etc.
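A sketch of the htmlentities()/htmlspecialchars() mix-up, using a made-up word. The character is written as a Unicode escape so the file's own encoding doesn't matter.

```php
<?php
$raw = "\u{00DC}ber"; // "Über"

// Stored with htmlentities(): the umlaut becomes a named entity.
$stored = htmlentities($raw, ENT_QUOTES, 'UTF-8');
// $stored is: &Uuml;ber

// Output with htmlspecialchars(): the & of the entity gets escaped again.
$output = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
// $output is: &amp;Uuml;ber (the page shows the literal text &Uuml;ber)

// The wrong counter-function leaves the entity in place...
$stillEncoded = htmlspecialchars_decode($stored, ENT_QUOTES);

// ...while the matching one restores the original text.
$restored = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

var_dump($output, $stillEncoded, $restored === $raw);
```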
These mistakes become more likely when multiple developers work on the same codebase.