python lxml not showing all content
I am trying to scrape a specific section of a web page and eventually calculate word frequency, but I am finding it difficult to get the entire text. As far as I understand from looking at the HTML, my script omits the parts of that section that start on a new line but have no <br> tag. My code:
    import urllib
    from lxml import html as LH
    import lxml
    import requests

    scripturl = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
    scripthtml = urllib.urlopen(scripturl).read()
    scripthtml = requests.get(scripturl)
    tree = LH.fromstring(scripthtml.content)
    script = tree.xpath('//div[@class="scrolling-script-container"]/text()')
    print script
    print type(script)
This is the output:
["\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r New York's classic rock \r q104.", '3.', ' \r \r Good morning.', " \r I'm jim kerr.", ' \r \r Coming up \r
When I iterate over the result, only the phrases that follow the \r and are followed by a comma or double comma are printed.
    for res in script:
        print res
The output is:
q104. 3. Good morning. I'm jim kerr.
I am not confined to lxml, but because I am rather new, I am less familiar with other methods.
An lxml element has both a text and a tail attribute. You are searching for text, but if there is an HTML element embedded in the element (br, for example), the element's text attribute only holds the text up to that first child. The text that follows each child element is stored in that child's tail attribute.
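    # xpath() returns a list, so take the first matching div
    script = tree.xpath('//div[@class="scrolling-script-container"]')[0]
    # text and tail are attributes, not methods, and either may be None
    print(" ".join((script.text or "", script.tail or "")))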
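To make the text/tail split concrete, here is a minimal sketch on a made-up fragment (the `fragment` string is hypothetical and only mimics the page's layout of text runs separated by br tags). It assumes lxml is installed and does not touch the real site.

```python
from lxml import html

# Hypothetical fragment: three text runs separated by <br> tags.
fragment = (
    '<div class="scrolling-script-container">'
    'first line<br>second line<br>third line'
    '</div>'
)
div = html.fromstring(fragment)

# .text holds only the text before the first child element.
print(repr(div.text))

# The text after each <br> is stored in that <br> element's .tail.
tails = [child.tail for child in div]
print(tails)

# Putting both together recovers every text run in document order.
all_text = [div.text] + tails
print(all_text)
```

So reading only the div's own text and tail will miss everything after the first br; the children's tails have to be collected too.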
This was bothering me, so I wrote out a solution:
    import requests
    import lxml
    from lxml import etree
    from io import StringIO

    parser = etree.HTMLParser()
    base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
    resp = requests.get(base_url)
    root = etree.parse(StringIO(resp.text), parser)
    script = root.xpath('//div[@class="scrolling-script-container"]')

    text_list = []
    for elem in script:
        print(elem.attrib)
        if hasattr(elem, 'text'):
            text_list.append(elem.text)
        if hasattr(elem, 'tail'):
            text_list.append(elem.tail)

    for elem in text_list:
        # only gets the first block of text before
        # it encounters a br tag
        print(elem)

    for elem in script:
        # prints everything
        for sib in elem.iter():
            print(sib.attrib)
            if hasattr(sib, 'text'):
                print(sib.text)
            if hasattr(sib, 'tail'):
                print(sib.tail)
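Since the end goal in the question is word frequency, here is one possible sketch of that last step. It runs on a small made-up `sample` string rather than the live page, and `script_words` is a hypothetical helper name. It uses `itertext()` to walk the element's text and all descendant text and tails in document order, then collapses whitespace so stray \r characters cannot hide text when printing (I am assuming that is part of what made the printed output look incomplete).

```python
import re
from collections import Counter

from lxml import html


def script_words(page_source):
    """Return the lowercase words found in the script container div."""
    tree = html.fromstring(page_source)
    divs = tree.xpath('//div[@class="scrolling-script-container"]')
    if not divs:
        return []
    # itertext() yields the div's text plus every descendant's text and tail.
    raw = " ".join(divs[0].itertext())
    # split() with no arguments drops \r, \n and \t along with spaces.
    cleaned = " ".join(raw.split())
    return re.findall(r"[a-z0-9']+", cleaned.lower())


# Hypothetical stand-in for the downloaded page.
sample = (
    '<html><body><div class="scrolling-script-container">'
    '( radio clicks, \r music plays )<br> \r Good morning.<br>'
    ' \r I\'m jim kerr. Good morning.'
    '</div></body></html>'
)

words = script_words(sample)
freq = Counter(words)
print(freq.most_common(3))
```

For the real page you would pass `resp.text` (or `resp.content`) from requests instead of `sample`. The word pattern here is deliberately simple, so something like "q104.3" would be split in two; adjust the regular expression if that matters.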