Issue with HTMLAgilityPack parsing HTML using C#

I'm trying to learn HtmlAgilityPack and XPath, and I'm attempting to get a list of companies (as HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx

I currently have the following code:

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

        // Create a request for the URL.        
        WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
        // Get the response.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        // Get the stream containing content returned by the server.
        Stream dataStream = response.GetResponseStream();
        // Open the stream using a StreamReader for easy access.
        StreamReader reader = new StreamReader(dataStream);
        // Read the content.
        string responseFromServer = reader.ReadToEnd();
        // Load the HTML into the HAP document.
        htmlDoc.LoadHtml(responseFromServer);

        HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
        foreach (HtmlAgilityPack.HtmlNode node in tl)
        {
            Debug.Write(node.InnerText);
        }            

        // Cleanup the streams and the response.
        reader.Close();
        dataStream.Close();
        response.Close();

I used an XPath add-on for Chrome, which gave me this XPath:

//*table[@id='indu_table']/tbody/tr[*]/td/b/a

When I run my project, I get an unhandled XPath exception saying the expression has an invalid token.

I'm not sure what's wrong with it. I've tried putting a number in the tr[*] section above, but I still get the same error.

I've been looking at this for the last hour. Is it something simple?

Thanks

Answers


Since the data comes from JavaScript, you have to parse the JavaScript rather than the HTML. The Agility Pack can't extract the data on its own, but it makes locating the script easier. The following shows how it could be done, using the Agility Pack to find the script and Newtonsoft Json.NET to parse the JavaScript array:

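// Requires: using System.Collections.Generic; System.Net; System.Text.RegularExpressions; HtmlAgilityPack; Newtonsoft.Json.Linq
HtmlDocument htmlDoc = new HtmlDocument();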
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  //Using Regex here to get just the array we're interested in...
  string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
  JArray jArray = JArray.Parse(stockArray);
  foreach (JToken token in jArray.Children())
  {
    listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
  }
}

To explain in a bit more detail: the data comes from one big JavaScript array on the page, var table_body = [... Each stock is one element of that array and is itself an array:

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

So by parsing the array, taking the first element of each entry, and appending it to the fixed base URL, we get the same links that the page's JavaScript generates.
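
For anyone who wants to try the idea without hitting the site, here is a minimal sketch of the same extraction run against a hard-coded script string. It is only an illustration under some assumptions: it swaps Json.NET for System.Text.Json (built into .NET Core 3.0 and later), the class and method names (StockLinks, ExtractLinks) are made up, and the sample input is just the one row quoted above wrapped in the table_body assignment.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack or Json.NET.
static class StockLinks
{
    // Pulls the table_body array out of a script's text and builds one
    // symbol URL per row, using the first element of each row.
    public static List<string> ExtractLinks(string script)
    {
        var links = new List<string>();

        // Same pattern as the answer; Singleline lets '.' cross line breaks
        // in case the array is spread over several lines.
        Match match = Regex.Match(script,
            @"table_body = (?<Array>\[.+?\]);", RegexOptions.Singleline);
        if (!match.Success)
        {
            return links; // no array found: return an empty list
        }

        using JsonDocument doc = JsonDocument.Parse(match.Groups["Array"].Value);
        foreach (JsonElement row in doc.RootElement.EnumerateArray())
        {
            // The first element of each row is the ticker symbol.
            string symbol = row[0].GetString() ?? "";
            links.Add("http://www.nasdaq.com/symbol/" + symbol.ToLowerInvariant());
        }
        return links;
    }
}
```

Calling StockLinks.ExtractLinks with the text of the script node (scriptNode.InnerText in the code above) should give the same list as the Json.NET version, provided the array on the page is valid JSON.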


Why not just use the Descendants("a") method? It's much simpler and more object-oriented: you get a collection of node objects, and then you can read the "href" attribute from each of them.

Sample code:

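// Requires using System.Linq; Descendants returns a sequence of nodes, so read the href from each one.
var hrefs = htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", ""));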

If you just need a list of links from a certain web page, this method will do just fine.


If you look at the page source for that URL, there isn't actually an element with id=indu_table. It appears to be generated dynamically (i.e. by JavaScript); the HTML you get when loading directly from the server does not reflect anything changed by client-side script. This is probably why it's not working.
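
As for the "invalid token" message itself, it is consistent with the expression copied from the Chrome add-on: //*table[...] puts a name directly after the * wildcard, which is not valid XPath, whereas the //*[@id='indu_table']/... form used in the posted code is syntactically fine. A rough way to check an expression before handing it to SelectNodes is to compile it with the framework's own XPath parser. This is only a sketch; the class and method names (XPathCheck, IsValidXPath) are invented for illustration.

```csharp
using System;
using System.Xml.XPath;

// Hypothetical helper for checking XPath syntax only.
static class XPathCheck
{
    // Returns true if the expression compiles as XPath 1.0,
    // false if the parser rejects it (for example, with an invalid token).
    public static bool IsValidXPath(string expression)
    {
        try
        {
            XPathExpression.Compile(expression);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }
}
```

With this, IsValidXPath("//*table[@id='indu_table']/tbody/tr[*]/td/b/a") should return false, while IsValidXPath("//*[@id='indu_table']/tbody/tr[*]/td/b/a") should return true. Note that a syntactically valid expression can still match nothing, and SelectNodes returns null in that case, so it is worth checking the result for null before the foreach.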

