How to use a remote driver on a computer behind a proxy with the R package RSelenium?

Question:

Well, I need to access a site on my network, but the network is protected by a proxy.

Some sites can be accessed with the httr and rvest packages, but others cannot; logging in to a site, for example, is something I can't do. Example:

pro <- use_proxy("minha.proxy", porta, "meuusuario", "minhasenha")
my_session <- html_session(url, pro)

I usually use this use_proxy() call to reach the URL I want and get through the proxy.

But on certain sites, when it comes to logging in, this approach does not work; in other words, I can't log in.
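
A login attempt with this approach looks roughly like the sketch below (the URL and the form field names are placeholders, not the real site):

library(httr)
library(rvest)

## Hypothetical login attempt through the proxy; the URL and the
## form field names are placeholders.
pro        <- use_proxy("minha.proxy", porta, "meuusuario", "minhasenha")
login_page <- html_session("http://exemplo.com/login", pro)
form       <- html_form(login_page)[[1]]
form       <- set_values(form, usuario = "meuusuario", senha = "minhasenha")
logged_in  <- submit_form(login_page, form)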

The alternative I found was to use a remote driver via rsDriver(browser = c("chrome")), for example. On my personal PC I can run all the code through the RSelenium remote driver, but on the work network I can't. The best options I found while searching were:

1)

cprof <- list(chromeOptions = list(
  args = c('--proxy-server=http://minha.proxy:porta',
           '--proxy-auth=usuario:senha')))
driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)

2)

cprof <- list(chromeOptions = list(
  args = c('--proxy-server=http://ip:porta',
           '--proxy-auth=usuario:senha')))
driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)

Both of these are meant to get through the proxy, but each one returns:

checking Selenium Server versions:
BEGIN: PREDOWNLOAD
Error in open.connection(con, "rb") : 
  Timeout was reached: Connection timed out after 10000 milliseconds

This error is what usually happens when the connection doesn't get through the proxy (I think!); note that it occurs during the Selenium Server version check, before any browser is even started.
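
A quick way to check that (a sketch, reusing the proxy settings from above):

library(httr)

pro <- use_proxy("minha.proxy", porta, "meuusuario", "minhasenha")
## If this also times out, R itself is not getting through the proxy,
## which would explain the PREDOWNLOAD failure above.
GET("http://www.google.com", pro, timeout(10))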

So, is there any way to get through the proxy and open my remote driver? If you have anything to contribute, I will be grateful!

Answer:

I found a solution to my problem!

As I'm on an institutional network, I need a proxy to browse the internet. For RStudio to use the proxy, it must be defined within the IDE (in the function you are going to use, as in the question) or through the environment variables, as in Reference 1.

That's what I did: I set the environment variables as in Reference 2:

variable name: http_proxy
variable value: https://user_id:password@your_proxy:your_port/

variable name: https_proxy
variable value: https://user_id:password@your_proxy:your_port
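
The same can be done for the current R session with Sys.setenv() (a minimal sketch, using the same placeholder credentials):

## Set the proxy for the running R session only; replace the
## placeholders with your own user, password, proxy host and port.
Sys.setenv(
  http_proxy  = "https://user_id:password@your_proxy:your_port/",
  https_proxy = "https://user_id:password@your_proxy:your_port"
)
Sys.getenv(c("http_proxy", "https_proxy"))  ## confirm they are set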

This was the first step. Then I followed the steps described by @JdeMello in Reference 3. Basically, what I did was download and install node.js (download node), then install puppeteer.js, create a file in a text editor named scrape_mustard.js (see the file content in Reference 3), and run scrape_mustard.js with node to "create the page", using the system() function in RStudio.

Here's a script:

setwd("C:\\Program Files\\nodejs") ### 
#OBS.: Tive que mudar o diretório para a pasta no disco C onde o nodejs foi instalado.

## system("npm i puppeteer") ## Esta função fez instalar o Puppteer

library(magrittr)
system("node scrape_mustard.js") ## Rodar o scape_mustard.js e criar a página que preciso

library(httr)
html <- xml2::read_html("~/PAGINA/page.html") ## ler html

html %>% 
rvest::html_nodes("h1") ## capturar o que existe na tag h1

Difficulties:

  • As I installed node.js on the C drive, the working directory in RStudio had to be changed to that folder;
  • The scrape_mustard.js file (the name can be changed) also had to be moved to the nodejs folder on the C drive;
  • The page to fetch must be defined inside scrape_mustard.js, that is, the file has to be edited every time before running it (I did this with writeLines(); see the sketch after this list). But if the file is in the nodejs folder on the C drive (as in my case), editing it requires administrator permission.
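
Here is a sketch of how scrape_mustard.js could be generated from R with writeLines() (the Puppeteer code below is illustrative and the URL is a placeholder; the actual file content I used is in Reference 3):

## Hypothetical generator for scrape_mustard.js; edit url before use.
url <- "http://exemplo.com/pagina"
js  <- c(
  "const puppeteer = require('puppeteer');",
  "const fs = require('fs');",
  "(async () => {",
  "  const browser = await puppeteer.launch();",
  "  const page = await browser.newPage();",
  sprintf("  await page.goto('%s');", url),
  "  fs.writeFileSync('page.html', await page.content());",
  "  await browser.close();",
  "})();"
)
## Writing into C:\Program Files requires administrator permission.
writeLines(js, "C:\\Program Files\\nodejs\\scrape_mustard.js")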

NOTE: I haven't worked on submitting the login page yet, but I was able to get the page I wanted, which was not possible before. Maybe I followed the steps in the references imperfectly, but the first step has been taken, and I thought it was fair to share it.

Alternatively, I will try the Docker approach mentioned by @José; I'm still studying it. I hope I was clear! Thank you, guys!
