Well, I need to access a site on my network, but it sits behind a proxy.
Some sites I can reach through the proxy using the httr and rvest packages; others I cannot. Logging in to a site, for example, fails. Example:
pro <- use_proxy("minha.proxy", porta, "meuusuario", "minhasenha")
my_session <- html_session(url, pro)
I usually use this use_proxy() call to reach the URL I want and get through the proxy.
But on certain sites, when I try to log in, this approach does not work -- or rather, I cannot log in.
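For context, a minimal sketch of what the use_proxy() object actually carries; the host, port, and credentials below are placeholders, not real values:

```r
library(httr)

# Placeholder proxy details -- replace with your institution's real values
pro <- use_proxy("minha.proxy", 8080, "meuusuario", "minhasenha")

# use_proxy() returns an httr "request" object; curl applies its options
# (proxy host, port, user/password) to every request made with it
class(pro)
pro$options$proxy
pro$options$proxyport
```

Passing this object to html_session() (as above) or to httr::GET() routes those requests through the proxy.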
The alternative I found was to use a remote driver via
rsDriver(browser = c("chrome")), for example. On my personal PC I can run all the code through the RSelenium remote driver; on the work network I can't. The best options I found while searching were:
cprof <- list(chromeOptions = list(
  args = c('--proxy-server=http://minha.proxy:porta',
           '--proxy-auth=usuario:senha')))
driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)

cprof <- list(chromeOptions = list(
  args = c('--proxy-server=http://ip:porta',
           '--proxy-auth=usuario:senha')))
driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)
Both are meant to get through the proxy, but they all return:

checking Selenium Server versions:
BEGIN: PREDOWNLOAD
Error in open.connection(con, "rb") :
  Timeout was reached: Connection timed out after 10000 milliseconds
This is the error that usually appears when the request does not get through the proxy (I think!).
So, is there any way to get through the proxy and open my remote driver? If you have anything to contribute, I will be grateful!
I found a solution to my problem!
Since I'm on an institutional network, I need a proxy to browse the internet. For RStudio to use the proxy, you either define it inside the IDE (in the function you are going to use, as in the question) or change the environment variables, as in Reference 1.
That's what I did: I set the environment variables from Reference 2:
variable name: http_proxy
variable value: https://user_id:password@your_proxy:your_port/
variable name: https_proxy
variable value: https://user_id:password@your_proxy:your_port
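These variables can also be set for the current R session only, without touching the Windows system settings, via Sys.setenv(); the values below are the same placeholders as above, not real credentials:

```r
# Placeholder credentials -- substitute your real user, password, proxy and port
Sys.setenv(
  http_proxy  = "http://user_id:password@your_proxy:your_port/",
  https_proxy = "http://user_id:password@your_proxy:your_port/"
)

# Confirm they are visible to the session (and to anything R spawns)
Sys.getenv("http_proxy")
```

Note that session-level variables set this way disappear when R restarts, whereas the Windows environment variables persist.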
That was the first step. Then I followed the steps described by @JdeMello in Reference 3. Basically, what I did was download and install
node.js (download node), then I installed
puppeteer.js, created a file in a text editor named
scrape_mustard.js (see the file content in Reference 3), and ran
node to create the page, using the
system() function in RStudio.
Here's a script:
setwd("C:\\Program Files\\nodejs") # Note: I had to change the working directory to the folder on the C drive where node.js was installed
system("npm i puppeteer")          # This installs Puppeteer
library(magrittr)
system("node scrape_mustard.js")   # Run scrape_mustard.js and create the page I need
library(httr)
html <- xml2::read_html("~/PAGINA/page.html") # read the HTML
html %>% rvest::html_nodes("h1")   # capture what is inside the h1 tag
- Since I installed node on the C drive, the working directory in RStudio had to be changed to that folder;
- The scrape_mustard.js file (the name can be changed) also had to be moved to the nodejs folder on the C drive;
- The page to fetch must be defined inside scrape_mustard.js, that is, the file has to be edited before each run (I did it with
writeLines()); but if the file is in the nodejs folder on the C drive (as in my case), that requires administrator permission.
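The writeLines() step above can be sketched like this; the Puppeteer script body is my guess at the minimum needed, not the actual file content from Reference 3, and the URL is a placeholder:

```r
# Placeholder target page -- replace with the page you need behind the proxy
target_url <- "https://example.com"

# Minimal Puppeteer script: open the page and dump its HTML to page.html
js <- sprintf('
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("%s");
  require("fs").writeFileSync("page.html", await page.content());
  await browser.close();
})();', target_url)

# Writing into C:\Program Files\nodejs needs administrator permission,
# so here the file is written to the current working directory instead
writeLines(js, "scrape_mustard.js")
```

After that, system("node scrape_mustard.js") regenerates page.html for the new URL.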
NOTE: I haven't worked on submitting the login form yet, but I was able to fetch the page I wanted, which was not possible before. Maybe I followed the steps from the references imperfectly, but the first step has been taken, so I thought it was fair to share.
Alternatively, I will try the Docker approach mentioned by @José; I'm still studying it. I hope I was clear! Thank you, guys!