python lxml not showing all content

I am trying to scrape a specific section of a web page and eventually calculate word frequency, but I am finding it difficult to get the entire text. As far as I understand from looking at the HTML code, my script omits the parts of that section that start on a new line but have no <br> tag. My code:

from lxml import html as LH
import requests

scripturl = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"

# Download the page and parse it into an element tree
scripthtml = requests.get(scripturl)
tree = LH.fromstring(scripthtml.content)
# text() selects the text nodes that are direct children of the div
script = tree.xpath('//div[@class="scrolling-script-container"]/text()')
print(script)
print(type(script))

This is the output:

["\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r New York's classic rock \r q104.", '3.', ' \r \r Good morning.', " \r I'm jim kerr.", ' \r \r Coming up \r

When I iterate over the result, only the phrases that follow the \r and are followed by a comma or double comma are printed.

for res in script:
    print(res)

The output is:

q104. 3. Good morning. I'm jim kerr.

I am not confined to lxml, but because I am rather new, I am less familiar with other methods.

Answers


An lxml element has both a .text and a .tail attribute (they are attributes, not methods). You are searching for text, but if there is an HTML element embedded in the element (br, for example), the element's .text only goes as far as the first child element. The text that follows each child is stored in that child's .tail.

Try:

script = tree.xpath('//div[@class="scrolling-script-container"]')
# .text and .tail are attributes (either may be None), not methods
print(" ".join(filter(None, (script[0].text, script[0].tail))))
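To make the .text / .tail distinction concrete, here is a minimal sketch on a small inline snippet rather than the real page. The snippet is a hypothetical, well-formed stand-in for the script container, and the fallback to the standard library's ElementTree (which exposes the same attributes) is an assumption added only so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption for this sketch only: fall back to the standard library,
    # whose Element objects expose the same .text / .tail / itertext() API.
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed stand-in for the script container on the page.
snippet = (
    '<div class="scrolling-script-container">'
    "( radio clicks )<br/>Disc jockey:<br/>Good morning."
    "</div>"
)
div = etree.fromstring(snippet)

# .text holds only the text before the first child element.
first_block = div.text

# The text after each <br/> is stored on that <br/> element as .tail.
tails = [child.tail for child in div]

# itertext() yields the element's text plus every descendant's text and tail,
# in document order.
all_text = list(div.itertext())

print(first_block)
print(tails)
print(all_text)
```

If the goal is simply "all the text inside the div", itertext() is probably the shortest route, since it avoids handling .text and .tail by hand.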

This was bothering me, so I wrote out a solution:

import requests
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
resp = requests.get(base_url)
root = etree.parse(StringIO(resp.text), parser)

script = root.xpath('//div[@class="scrolling-script-container"]')
text_list = []

for elem in script:
    print(elem.attrib)
    # .text and .tail always exist on an element, but either may be None
    if elem.text is not None:
        text_list.append(elem.text)
    if elem.tail is not None:
        text_list.append(elem.tail)

# only gets the first block of text before
# it encounters a br tag
for elem in text_list:
    print(elem)

# prints everything
for elem in script:
    for sib in elem.iter():
        print(sib.attrib)
        if sib.text is not None:
            print(sib.text)
        if sib.tail is not None:
            print(sib.tail)
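
One more thing that may be contributing, judging only from the output quoted in the question: the strings contain \r (carriage return) characters. Many terminals treat \r as "go back to the start of the line", so when such a string is printed, only the text after its last \r stays visible. The sketch below uses the sample strings from the question (not a live download) to split on \r and \n before printing, and then counts words, since word frequency was the stated goal. The helper name clean_lines and the word pattern are illustrative choices, not part of any library.

```python
import re
from collections import Counter

# Sample strings copied from the output shown in the question.
chunks = [
    "\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r New York's classic rock \r q104.",
    "3.",
    " \r \r Good morning.",
    " \r I'm jim kerr.",
]


def clean_lines(chunks):
    """Split each chunk on \\r or \\n and drop blank pieces (hypothetical helper)."""
    lines = []
    for chunk in chunks:
        for piece in re.split(r"[\r\n]+", chunk):
            piece = piece.strip()
            if piece:
                lines.append(piece)
    return lines


lines = clean_lines(chunks)
for line in lines:
    print(line)  # no carriage returns left, so nothing gets overwritten

# Simple word-frequency count: lowercase letters, digits, and apostrophes.
words = re.findall(r"[a-z0-9']+", " ".join(lines).lower())
word_counts = Counter(words)
print(word_counts.most_common(5))
```

On the real page, the same cleaning could be applied to the list returned by the xpath query, or to the pieces yielded by itertext().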
