RegEx help needed with HTML tags extraction
I need to extract this text:
Line 1 text. Line 2 text. Line 2 some more text. Line 3 text, Line 4 text
from this HTML:
... <tr><td class="td_my_custom_text">Line 1 text. <br>Line 2 text. Line 2 some more text. <br>Line 3 text, <br>Line 4 text <br></td></tr><tr><td> </td></tr> ...
Using this RegEx: <td\ class="td_my_custom_text">[\s\S]*?</td> I have managed to get something close but not close enough. <td class="td_my_custom_text">, <br> and </td> are still inside and I am stuck.
- What needs to be changed in my regular expression to get rid of them?
- Is there some Windows tool to automate this job and copy just extracted data to new file(s)? I have 5000+ files like this one and I am thinking about making a small program using regex or html parser but I would like to know if there is a better approach first.
It looks like you're better off just stripping the tags, because that's essentially what you're doing.
You should also look at dasbinkenlight's link in his comment to understand more about HTML parsing.
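As for the first question, a capturing group keeps the surrounding <td ...> and </td> out of the result, and a second substitution drops the <br> tags that remain. Here is a minimal sketch in Python (my choice, since the question names no language); the function name extract_cell_text is made up for illustration, and the pattern assumes the opening tag is written exactly as in the sample:

```python
import re

# Group 1 captures only what sits between the opening and closing tag.
CELL_RE = re.compile(r'<td class="td_my_custom_text">([\s\S]*?)</td>')
# Matches any remaining tag, such as <br>.
TAG_RE = re.compile(r'<[^>]+>')


def extract_cell_text(html):
    """Return the text of the first matching cell, or '' if there is none."""
    match = CELL_RE.search(html)
    if match is None:
        return ""
    inner = TAG_RE.sub("", match.group(1))   # strip <br> and friends
    return " ".join(inner.split())           # collapse runs of whitespace


sample = ('... <tr><td class="td_my_custom_text">Line 1 text. <br>Line 2 text. '
          'Line 2 some more text. <br>Line 3 text, <br>Line 4 text <br></td></tr>'
          '<tr><td> </td></tr> ...')
print(extract_cell_text(sample))
```

This only holds up while the markup stays this regular. Extra attributes, a different attribute order, or nested tables would all defeat the pattern.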
You can use a regex substitution to remove all HTML tags (any text within < >), but in your example you will still be left with the contents of the second, otherwise empty cell. The best approach would be an HTML parser. Depending on your programming language, there may be libraries you can use.
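To illustrate the parser route, here is a rough sketch using Python's standard html.parser module. The language is my assumption, and the names CellTextExtractor and cell_text are hypothetical. It collects text only while inside a td whose class attribute is exactly td_my_custom_text, so the second cell is ignored. It does not handle nested tables or cells carrying several classes.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collects the text found inside <td class="td_my_custom_text"> cells."""

    def __init__(self, css_class="td_my_custom_text"):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.inside = False   # True while we are within a wanted cell
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; <br> tags simply fall through.
        if tag == "td" and dict(attrs).get("class") == self.css_class:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def cell_text(html):
    """Return the whitespace-normalised text of all wanted cells."""
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = ('... <tr><td class="td_my_custom_text">Line 1 text. <br>Line 2 text. '
          'Line 2 some more text. <br>Line 3 text, <br>Line 4 text <br></td></tr>'
          '<tr><td> </td></tr> ...')
print(cell_text(sample))
```

Unlike the tag-stripping regex, this does not care about attribute order or spacing inside the tag, and entities such as &amp; are decoded for you.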
You can try FakeRainBrigand's approach or even adapt it to VBScript: create a .vbs file and add the following test code:
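    Set objIE = CreateObject("InternetExplorer.Application")
    strHTML = "<tr><td class='td_my_custom_text'>Line 1 text. <br>Line 2 text.<br></td></tr><tr><td> </td></tr>"
    objIE.Navigate "about:blank"
    ' Wait until the blank page has loaded so that document.body exists
    Do While objIE.Busy Or objIE.ReadyState <> 4
        WScript.Sleep 50
    Loop
    objIE.Document.Body.innerHTML = strHTML
    MsgBox objIE.Document.Body.innerText
    objIE.Quit ' Close the hidden Internet Explorer instance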
Save the file. When you open it, a message box will show the text of the parsed HTML. You can then use Scripting.FileSystemObject to list all the files in a folder and process them one at a time. There are several examples of how to do this, e.g. "VBScript to detect today's modified files in a folder (including subfolders inside it)"; you can find others by googling "VBS list all files in folder".
You could use Internet Explorer's COM interface, for example from the AutoHotkey_L language.
    ex_html =
    (
    <tr><td class="td_my_custom_text">Line 1 text. <br>Line 2 text. Line 2 some more text. <br>Line 3 text, <br>Line 4 text <br></td></tr><tr><td> </td></tr>
    )
    pwb := ComObjCreate("InternetExplorer.Application")
    pwb.Navigate("about:blank")
    ; Wait until the blank page has loaded so that document.body exists
    while pwb.ReadyState != 4
        Sleep, 10
    pwb.document.body.innerHTML := ex_html
    text := pwb.document.body.innerText
    pwb.Quit()
    MsgBox % text
It navigates to a blank page, injects the HTML code, and then reads the innerText DOM property, which returns the text with all tags stripped.
Running the innerHTML and innerText lines in a loop allows for quickly cleaning all your HTML inputs. Read up on commands like FileRead and Loop (files & folders) for help on accessing multiple input files.
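If you would rather not depend on Internet Explorer, the same loop over files can be sketched in Python (my assumption, not something used elsewhere in this thread). The names convert_folder and extract_text, the *.html pattern, and the UTF-8 encoding are all placeholders to adjust to your data. Each input file gets a .txt file of the same name in the output folder.

```python
import re
from pathlib import Path

# Group 1 captures the cell contents without the surrounding <td> tags.
CELL_RE = re.compile(r'<td class="td_my_custom_text">([\s\S]*?)</td>')
TAG_RE = re.compile(r'<[^>]+>')


def extract_text(html):
    """Join the text of every matching cell, with tags removed."""
    cells = CELL_RE.findall(html)
    text = TAG_RE.sub("", " ".join(cells))
    return " ".join(text.split())


def convert_folder(src_dir, dst_dir, pattern="*.html"):
    """Write one .txt file per matching input file; return how many were written."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob(pattern)):
        # errors="replace" keeps the loop going if a file has a stray bad byte.
        html = path.read_text(encoding="utf-8", errors="replace")
        (dst / (path.stem + ".txt")).write_text(extract_text(html), encoding="utf-8")
        count += 1
    return count
```

Calling convert_folder with your input and output folders should get through 5000+ files quickly, since no browser is started per file.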