RegEx help needed with HTML tags extraction

I need to extract this text:

Line 1 text.
Line 2 text. Line 2 some more text.
Line 3 text,
Line 4 text

from this HTML:

...
<tr><td class="td_my_custom_text">Line 1 text. 
<br>Line 2 text. Line 2 some more text.
<br>Line 3 text, 
<br>Line 4 text
<br></td></tr><tr><td>&nbsp;</td></tr>
...

Using this RegEx: <td\ class="td_my_custom_text">[\s\S]*?</td> I have managed to get something close but not close enough. <td class="td_my_custom_text">, <br> and </td> are still inside and I am stuck.

  1. What needs to be changed in my regular expression to get rid of them?
  2. Is there some Windows tool to automate this job and copy just extracted data to new file(s)? I have 5000+ files like this one and I am thinking about making a small program using regex or html parser but I would like to know if there is a better approach first.

Answers


It looks like you're better off just stripping out the tags, because that's essentially what you're doing.
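
For question 1, one possible route is to wrap the lazy part of the pattern in a capturing group, so the match result exposes only the cell content, and then strip the leftover tags in a second pass. The sketch below uses Python purely as an assumption, since the question names no language; the function name `extract_cell_text` is made up for illustration.

```python
import html
import re

# Sample HTML copied from the question.
SAMPLE = '''<tr><td class="td_my_custom_text">Line 1 text. 
<br>Line 2 text. Line 2 some more text.
<br>Line 3 text, 
<br>Line 4 text
<br></td></tr><tr><td>&nbsp;</td></tr>'''

# The parentheses capture only what sits between the opening and closing tag.
CELL_RE = re.compile(r'<td class="td_my_custom_text">([\s\S]*?)</td>')
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')


def extract_cell_text(page):
    """Return the text of the first matching cell, or None if there is none."""
    match = CELL_RE.search(page)
    if match is None:
        return None
    inner = match.group(1)               # cell content without the <td> wrapper
    inner = BR_RE.sub('\n', inner)       # treat each <br> as a line break
    text = html.unescape(TAG_RE.sub('', inner))  # drop other tags, decode entities
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)  # skip empty lines


print(extract_cell_text(SAMPLE))
```

This only holds up while the markup stays as regular as in the sample; attribute order, extra attributes, or nested tables would break the pattern.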

You should also look at dasbinkenlight's link in his comment to understand more about HTML parsing.


You can use a regex substitution to remove all HTML tags (any text within < >), but in your example you will be left with the &nbsp; entity. The best approach would be an HTML parser. Depending on your programming language, there may be libraries you can use.
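
As a rough illustration of the parser route, here is a sketch built on Python's standard-library `html.parser` module. Python is an assumption here, and the class and function names (`CellTextParser`, `cell_text`) are hypothetical. The parser decodes entities such as &nbsp; itself, so nothing is left over to clean up by hand.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collects the text found inside <td> cells carrying a given class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.target_class = target_class
        self.depth = 0    # greater than 0 while inside a matching <td>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'td':
                self.depth += 1          # nested cell inside the target cell
            elif tag == 'br':
                self.parts.append('\n')  # a <br> becomes a line break
        elif tag == 'td':
            classes = (dict(attrs).get('class') or '').split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == 'td':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def cell_text(page, target_class='td_my_custom_text'):
    """Return the cleaned text of all matching cells, one line per row of text."""
    parser = CellTextParser(target_class)
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    lines = (line.strip() for line in ''.join(parser.parts).splitlines())
    return '\n'.join(line for line in lines if line)
```

Calling `cell_text(page)` on the HTML from the question should return just the four text lines. Unlike the regex, this does not care about attribute order or quoting style.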

You can try FakeRainBrigand's approach or even adapt it to VBScript: create a .vbs file and add the following test code:

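' Create an Internet Explorer instance (invisible by default) to parse the HTML
Set objIE = CreateObject("InternetExplorer.Application")

strHTML = "<tr><td class='td_my_custom_text'>Line 1 text. <br>Line 2 text.<br></td></tr><tr><td>&nbsp;</td></tr>"

objIE.Navigate "about:blank"
' Wait until the blank page has loaded before touching the document
Do While objIE.Busy Or objIE.ReadyState <> 4
    WScript.Sleep 50
Loop
objIE.document.body.innerHTML = strHTML

MsgBox objIE.document.body.innerText
' Close the browser so no hidden iexplore.exe process is left behind
objIE.Quit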

Save the file. When you open it, a message box shows the text extracted from the HTML. You can then use Scripting.FileSystemObject to list all files in a folder and process them one at a time. There are several examples of how to do this, e.g. "VBScript to detect today's modified files in a folder (including subfolders inside it)", or search for "VBS list all files in folder".
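
The same "list every file, process one at a time" idea can also be sketched outside VBScript. The version below is in Python (an assumption, not something the thread prescribes) and pairs a folder walk with a small regex-based extractor. The names `extract_text` and `convert_folder`, the `*.html` pattern, and the UTF-8 encoding are all placeholders to adjust for the real files.

```python
import html
import re
from pathlib import Path

CELL_RE = re.compile(r'<td class="td_my_custom_text">([\s\S]*?)</td>')


def extract_text(page):
    """Return the cleaned cell text, or None when the cell is missing."""
    match = CELL_RE.search(page)
    if match is None:
        return None
    inner = re.sub(r'<br\s*/?>', '\n', match.group(1), flags=re.IGNORECASE)
    text = html.unescape(re.sub(r'<[^>]+>', '', inner))
    return '\n'.join(s.strip() for s in text.splitlines() if s.strip())


def convert_folder(src_dir, dst_dir, pattern='*.html', encoding='utf-8'):
    """Write one .txt file per input file; return the inputs with no match."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    skipped = []
    for path in sorted(src.rglob(pattern)):   # rglob also walks subfolders
        page = path.read_text(encoding=encoding, errors='replace')
        text = extract_text(page)
        if text is None:
            skipped.append(path)              # keep track of files to inspect by hand
            continue
        out = dst / path.relative_to(src).with_suffix('.txt')
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(text + '\n', encoding=encoding)
    return skipped
```

A call such as `convert_folder('input_pages', 'extracted_text')` (both folder names are hypothetical) would mirror the input tree into the output folder. With 5000+ files, the returned list of skipped paths is worth checking before trusting the output.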


You could use Internet Explorer's COM interface. Here is an example in AutoHotkey_L:

ex_html =
(
<tr><td class="td_my_custom_text">Line 1 text. 
<br>Line 2 text. Line 2 some more text.
<br>Line 3 text, 
<br>Line 4 text
<br></td></tr><tr><td>&nbsp;</td></tr>
)


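pwb := ComObjCreate("InternetExplorer.Application")  ; IE instance, invisible by default
pwb.navigate("about:blank")
while pwb.busy || pwb.readyState != 4  ; wait until the blank page has loaded
    Sleep, 50
pwb.document.body.innerHTML := ex_html
text := pwb.document.body.innerText  ; rendered text, without the tags
pwb.quit()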


MsgBox % text

It navigates to a blank page, injects the HTML code, and then reads the innerText DOM property, which returns the rendered text with the tags removed.

Running the innerHTML and innerText lines in a loop allows for quickly cleaning all your HTML inputs. Read up on commands like FileRead and Loop (files & folders) for help on accessing multiple input files.

