Is there a simple way in R to extract only the text elements of an HTML page?

I think this is known as 'screen scraping', but I have no experience with it. I just need a simple way of extracting the text you'd normally see in a browser when visiting a URL.

Answers


I had to do this once upon a time myself.

One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/

library(RCurl)
library(RTidyHTML)
library(XML)

We use RCurl to connect to the website of interest. It is an R interface to the libcurl library, and it has many options that let you access websites the default functions in base R would have difficulty with.

We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R interface to the libtidy library.

We use XML to parse the HTML code with our XPath expressions. It is an R interface to the libxml2 library.

Anyway, here's what you do (minimal code; more options are available, see the help pages of the corresponding functions):

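u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u)       # download the raw HTML as a string
doc <- tidyHTML(doc.raw)   # repair malformed markup
html <- htmlTreeParse(doc, useInternalNodes = TRUE)   # parse into an internal (libxml2) tree
# Select every text node under <body> that is not inside script, style, or noscript
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
cat(unlist(txt))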

There may be some problems with this approach, but I can't remember all of them off the top of my head. I don't think my XPath expression works with every web page: sometimes it might not filter out script code, and on some pages it may not work at all. Best to experiment!
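
To see what the XPath filter above keeps and drops without touching the network, here is a minimal offline sketch. It assumes only the XML package is installed; the inline page and the variable names (page, pieces, visible_text) are made up for illustration, and htmlParse() stands in for the getURL/tidyHTML steps.

```r
library(XML)

# Small inline page (made-up content) so the example runs offline
page <- paste0(
  "<html><head><style>p { color: red; }</style></head>",
  "<body><h1>Title</h1><script>var x = 1;</script>",
  "<p>First <b>bold</b> paragraph.</p>",
  "<noscript>Enable JavaScript</noscript></body></html>"
)

# Parse the string directly instead of downloading a URL
doc <- htmlParse(page, asText = TRUE)

# Same XPath idea as above: text nodes under <body>, skipping script/style/noscript
nodes <- unlist(xpathApply(
  doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue
))

# Trim whitespace, drop empty fragments, and join into one string
pieces <- trimws(nodes)
pieces <- pieces[nzchar(pieces)]
visible_text <- paste(pieces, collapse = " ")
print(visible_text)
```

The result should contain the heading and paragraph text but none of the script, style, or noscript content. Real pages are messier than this toy one, so treat it as a starting point.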

P.S. Another way, which I think works almost perfectly for scraping all the text from an HTML page, is the following (basically getting Internet Explorer to do the conversion for you):

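library(RDCOMClient)   # Windows only: drives Internet Explorer through COM
u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
ie <- COMCreate("InternetExplorer.Application")
ie$Navigate(u)
# Wait until the page has finished loading before reading it
while (ie[["Busy"]]) Sys.sleep(0.1)
txt <- list()
txt[[u]] <- ie[["document"]][["body"]][["innerText"]]   # text as rendered by the browser
ie$Quit()
print(txt)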

However, I've never liked doing this. Not only is it slow, but if you vectorise it over a vector of URLs and Internet Explorer crashes on a bad page, R might hang or crash too (I don't think ?try helps much in this case). It's also prone to allowing pop-ups. It's been a while since I've done this, but I thought I should point it out.
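
For what it's worth, here is a rough sketch of how a loop over many URLs could at least survive ordinary R errors on individual pages. The names scrape_all, fetch_text, and fake_fetch are hypothetical placeholders, not part of any package, and fake_fetch just simulates a failing page. This only catches errors raised inside R; it would not rescue a session where the external browser process itself hangs, which is the failure described above.

```r
# fetch_text is a hypothetical placeholder for whatever function scrapes one page
scrape_all <- function(urls, fetch_text) {
  out <- lapply(urls, function(u) {
    tryCatch(
      fetch_text(u),
      error = function(e) {
        # Report the failure and record NA instead of aborting the whole run
        message("Failed on ", u, ": ", conditionMessage(e))
        NA_character_
      }
    )
  })
  names(out) <- urls
  out
}

# Toy fetcher for demonstration only: fails on any URL containing "bad"
fake_fetch <- function(u) {
  if (grepl("bad", u, fixed = TRUE)) stop("could not load page")
  paste("text of", u)
}

res <- scrape_all(
  c("http://example.com/good", "http://example.com/bad"),
  fake_fetch
)
print(res)
```

In real use you would pass your own single-page scraper in place of fake_fetch and check afterwards which entries came back as NA.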


Well, it's not exactly an R way of doing it, but it's as simple as they come: the OutWit plugin for Firefox. The basic version is free and helps to extract tables and other content.

Ah, and if you really want to do it the hard way in R, this link is for you:


I've had good luck with the readHTMLTable() function of the XML package. It returns a list of all tables on the page.

> library(XML)
> url <- 'http://en.wikipedia.org/wiki/World_population'
> allTables <- readHTMLTable(url)

There can be many tables on each page.

> length(allTables)
[1] 17

So just select the one you want.

> tbl <- allTables[[3]]

The biggest hassle can be installing the XML package. It's big, and it needs the libxml2 library (and, under Linux, it needs the libxml2 development files too, which provide the xml2-config script; on Debian that is the libxml2-dev package). The second biggest hassle is that HTML tables often contain junk you don't want, besides the data you do want.
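
As an illustration of the "junk" problem, here is a small offline sketch of one way to clean a numeric column after readHTMLTable(). The table and its values are made up, and the cleanup rules (strip bracketed footnote markers, strip thousands separators) are assumptions about what a typical page needs rather than a general recipe.

```r
library(XML)

# Toy table with made-up values, so the example runs offline
page <- paste0(
  "<html><body><table>",
  "<tr><th>Region</th><th>Population</th></tr>",
  "<tr><td>Region A</td><td>1,234,567[1]</td></tr>",
  "<tr><td>Region B</td><td>89,000[note 2]</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(page, asText = TRUE)

# Keep cells as character strings so they can be cleaned before conversion
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
tbl <- tables[[1]]

raw <- tbl[[2]]                                   # second column, still text
no_notes <- gsub("\\[[^]]*\\]", "", raw)          # drop bracketed footnote markers
population <- as.numeric(gsub(",", "", no_notes, fixed = TRUE))  # drop separators
print(population)
```

Doing the cleanup on character columns first avoids the NA values you would get from calling as.numeric() on the raw cells.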


The best solution is the htm2txt package.

library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text <- gettxt(url)

For details, see https://CRAN.R-project.org/package=htm2txt.
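
If you already have the HTML as a string rather than a URL, the same package also exports an htm2txt() function. A minimal sketch, assuming the htm2txt package is installed; the HTML snippet is made up and the exact whitespace in the result may vary between package versions:

```r
library(htm2txt)

# Made-up HTML snippet, converted without any network access
html <- "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
txt <- htm2txt(html)
cat(txt, "\n")
```

This can be handy for testing your text cleanup on saved pages before pointing gettxt() at live URLs.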

