Getting attribute's value in Nokogiri to extract link URLs

I have a document which looks like this:

<div id="block">
    <a href="">link</a>
</div>

I can't get Nokogiri to return the value of the href attribute. I'd like to store the address in a Ruby variable as a string.


require 'nokogiri'
html = <<HTML
  <div id="block">
    <a href="">link</a>
  </div>
HTML
doc = Nokogiri::HTML(html)
doc.xpath('//div/a/@href')
#=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="">]

Or if you want to be more specific about the div:

>> doc.xpath('//div[@id="block"]/a/@href')
=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="">]
>> doc.xpath('//div[@id="block"]/a/@href').first.value
=> ""

require 'open-uri'
doc = Nokogiri::HTML(URI.open("[insert URL here]"))
href = doc.css('#block a')[0]["href"]

The variable href is assigned the value of the "href" attribute of the <a> element inside the element with id 'block'. doc.css('#block a') returns a Nokogiri::XML::NodeSet containing every element that matches #block a (here, a single <a> element). [0] picks the first element in that set, a Nokogiri::XML::Element, which supports hash-style access to its attributes. ["href"] looks up the attribute named "href" on that element and returns its value, a string containing the URL.

Having struggled with this question in various forms, I decided to write myself a tutorial disguised as an answer. It may be helpful to others.

Starting with this snippet:

require 'rubygems'
require 'nokogiri'

html = <<HTML
<div id="block1">
    <a href="">link1</a>
</div>
<div id="block2">
    <a href="">link2</a>
    <a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

extracting all the links

We can use xpath or css to find all the elements and then keep only the ones that have an href attribute:

nodeset = doc.xpath('//a')      # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact  # => ["", ""]

nodeset = doc.css('a')          # Get all anchors via css
nodeset.map {|element| element["href"]}.compact  # => ["", ""]

But there's a better way: in the above cases, the .compact is necessary because the searches return the "just a bookmark" element as well. We can use a more refined search to find just the elements that contain an href attribute:

attrs = doc.xpath('//a/@href')  # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value}   # => ["", ""]

nodeset = doc.css('a[href]')    # Get anchors w href attribute via css
nodeset.map {|element| element["href"]}  # => ["", ""]

finding a specific link

To find a link within the <div id="block2">:

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => ""

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => ""

If you know you're searching for just one link, you can use at_xpath or at_css instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value          # => ""

element = doc.at_css('div#block2 a[href]')
element['href']        # => ""

find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"]     # => ""

element = doc.at_css('a:contains("link2")')
element["href"]     # => ""

find text from a link

And what if you want to find the text associated with a particular link? Not a problem:

element = doc.at_xpath('//a[@href=""]')
element.text     # => "link2"

element = doc.at_css('a[href=""]')
element.text     # => "link2"

useful references

In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:

doc = Nokogiri::HTML("HTML ...")
link = doc.at_css("div[id='block'] > a")  # first matching <a> element, or nil
result = link['href']                     # value of its href attribute

data = '<html lang="en" class="">
    <a href="" media="all" rel="stylesheet" /> link1</a>
    <a href="" media="all" rel="stylesheet" />link2</a>
    <a href="" media="all" rel="stylesheet" />link3</a>
</html>'

Here is my attempt for the sample HTML above:

doc = Nokogiri::HTML(data)
doc.css('a').map { |link| link['href'] }
=> ["", "", ""]

document.at_css("#block a")["href"]

where document is the parsed Nokogiri HTML document.

