How to ignore links that do not fit the established conditions and continue with scraping?


I would like to know how to skip links that do not meet the conditions set for titulo, data_hora and texto, so that the scraping of the site can continue.

This is the error raised when a link lacks one of these elements or does not meet the conditions: "Error in data.frame(titulo, data_hora, texto) : arguments imply differing number of rows: 1, 0"

Below is the script:

# load libraries
library(XML)

#url_base <- ""

url_base <- ""

url_base <- gsub("bolt", "PDT", url_base)

links_saocarlos <- c()
for (i in 1:4) {
  # substitute the loop index for the "koxa" placeholder in the URL
  url1 <- gsub("koxa", i, url_base)
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//div[@class='item']/a", xmlGetAttr, name = "href")
  links <- paste("", links, sep = "")
  links_saocarlos <- c(links_saocarlos, links)
}


dados <- data.frame()
for (links in links_saocarlos) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  titulo <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  # this is the line that fails when one of the three pieces is missing
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

agregar <- aggregate(dados$texto,list(dados$titulo,dados$data_hora),paste,collapse=' ')


In your case, I think an if would solve it. For example, replace the line that adds the row to the data frame with:

if (length(titulo) == 1 && length(data_hora) == 1 && length(texto) == 1) {
    dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

In other words, "only add this new row if all of its elements exist".
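The check can also be pulled into a small helper, sketched below. Two things here are assumptions rather than part of the answer: the name pagina_completa is hypothetical, and texto is allowed to hold more than one paragraph (it is selected with /p, and the aggregate call in the question collapses several rows per page). The values are made up and stand in for what xpathSApply returns.

```r
# Hypothetical helper (not in the original answer): TRUE when a page has
# exactly one title, exactly one timestamp and at least one paragraph.
pagina_completa <- function(titulo, data_hora, texto) {
  length(titulo) == 1 && length(data_hora) == 1 && length(texto) >= 1
}

# Made-up values standing in for what xpathSApply() returns
sem_data <- pagina_completa("Title", character(0), "Only paragraph")
varios_p <- pagina_completa("Title", "01/01/2015 10h00", c("First.", "Second."))

# data.frame() recycles the length-1 columns across the paragraphs
linhas <- nrow(data.frame(titulo = "Title",
                          data_hora = "01/01/2015 10h00",
                          texto = c("First.", "Second.")))
```

If every page should contribute exactly one row, keep the stricter length(texto) == 1 test from the answer instead.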

However, you could make your scraper more robust as follows:


library(plyr)  # provides failwith() and ldply()

# returns a data frame for one link, or NULL if anything inside fails
raspar <- failwith(NULL, function(links) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  titulo <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  data.frame(titulo, data_hora, texto)
})

dados <- ldply(links_saocarlos, raspar)

The failwith function wraps another function so that, when an error occurs, it returns a default value (here NULL) instead of stopping execution. This is very useful in web scraping, where connection problems and similar issues are common and can cause unexpected errors in the code.
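For readers without plyr, the same idea can be sketched in base R with tryCatch. The names com_padrao, raspar_falso and raspar_seguro are hypothetical, and the fake scraper only simulates a failing page; this is an illustration of the pattern, not plyr's implementation.

```r
# Hypothetical helper: wraps f so that any error yields `default`
# instead of stopping the script.
com_padrao <- function(default, f) {
  function(...) tryCatch(f(...), error = function(e) default)
}

# Stand-in for a scraper that fails on some links
raspar_falso <- function(link) {
  if (link == "bad") stop("could not read page")
  data.frame(link = link, stringsAsFactors = FALSE)
}

raspar_seguro <- com_padrao(NULL, raspar_falso)

resultado_ok <- raspar_seguro("good")   # a one-row data frame
resultado_erro <- raspar_seguro("bad")  # NULL, and execution continues
```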

Also, using plyr (the ldply function) has some advantages over your for loop. The main one is that you do not grow the object dynamically, which is usually much faster. Another advantage is that you can pass the .progress = "text" argument to get a progress bar in your code 🙂

dados <- ldply(links_saocarlos, raspar, .progress = "text")
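The "do not grow the object" point can also be illustrated in base R: collect one result per link in a list, drop the failures, and bind everything once at the end. This is only a sketch of the pattern; links_exemplo and raspar_exemplo are made-up stand-ins, not the real links or scraper.

```r
# Made-up inputs: the middle link simulates a page that could not be scraped
links_exemplo <- c("a", "bad", "b")

raspar_exemplo <- function(link) {
  if (link == "bad") return(NULL)
  data.frame(link = link, n = nchar(link), stringsAsFactors = FALSE)
}

# One list element per link, instead of calling rbind() inside a loop
resultados <- lapply(links_exemplo, raspar_exemplo)

# Drop the failed pages, then bind all rows in a single call
resultados <- Filter(Negate(is.null), resultados)
dados_exemplo <- do.call(rbind, resultados)
```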