Web scraping with Python (Selenium and Requests)

Question:

Hi,

I'm trying to scrape a page that is protected by a login. I've managed to log in both via Requests and via Selenium, but the problem occurs after login.

The login page is https://eduardocavalcanti.com/login and, after logging in, it automatically redirects to https://eduardocavalcanti.com/dashboard

When I log in through the browser and then open https://eduardocavalcanti.com/an_fundamentalista/petr/, it loads without problems because I'm already logged in.

But this is not working with Requests. Even though I ask it to access https://eduardocavalcanti.com/an_fundamentalista/petr/, it ends up on another page.

I'm new to this area; I've done some research, but I haven't found a good reference to start from.

Requests code:

import requests
from bs4 import BeautifulSoup

loginPage = 'https://eduardocavalcanti.com/login/'
protectedPage = 'https://eduardocavalcanti.com/dashboard'
petrUrl = 'https://eduardocavalcanti.com/an_fundamentalista/petr/'
payload = {
    'user_login': 'meu_email@gmail.com',
    'password': 'minhasenha'
}

sess = requests.Session()
sess.post(loginPage, data=payload)
#petr = sess.get(protectedPage)
petr = sess.get(petrUrl)
soup = BeautifulSoup(petr.content, 'html.parser')
print(soup)

Selenium code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox()
browser.get("https://eduardocavalcanti.com/an_fundamentalista/itsa/") 
time.sleep(10)
username = browser.find_element_by_name("user_login")
password = browser.find_element_by_name("user_pass")
username.send_keys("meu_email@hotmail.com")
password.send_keys("minha_senha")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(5)
browser.get("https://eduardocavalcanti.com/an_fundamentalista/petr/")
DadosEmpresa = browser.find_element_by_xpath("/html/body").text
#DadosEmpresa = browser.find_elements_by_xpath("/html/body")
#for item in DadosEmpresa:
    #print(item.text)

The problem I run into is that the raw text Selenium returns would take a lot of work to turn into a Python dictionary. Is there a way to get the page's tables from Selenium in a more structured format, so that I could use BeautifulSoup?
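
To illustrate what I mean, this is roughly the direction I was imagining: feed Selenium's page source into BeautifulSoup and walk the table rows. The table layout below (label/value rows) is just a guess, since I don't know the real markup of the logged-in page:

from bs4 import BeautifulSoup

# Sketch only: uses the `browser` from the Selenium code above
soup = BeautifulSoup(browser.page_source, "html.parser")
tabela = soup.find("table")                    # first <table> on the page (assumption)
dados = {}
for linha in tabela.find_all("tr"):
    celulas = [c.get_text(strip=True) for c in linha.find_all(["th", "td"])]
    if len(celulas) == 2:                      # assuming rows of the form "label | value"
        dados[celulas[0]] = celulas[1]
print(dados)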

Regarding Requests, is there any blocking on the website that prevents it from accessing the page? I've tried using cookies and adding time.sleep, but nothing worked.
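
To make the question more concrete, here is a rough sketch of the kind of check I have in mind to see whether the login POST really worked. Copying the hidden <input> fields from the login form into the payload is just a guess on my part (many login forms require extra fields such as a nonce or redirect URL), and I note that my Selenium code finds the password field by the name user_pass rather than password:

import requests
from bs4 import BeautifulSoup

loginPage = 'https://eduardocavalcanti.com/login/'
petrUrl = 'https://eduardocavalcanti.com/an_fundamentalista/petr/'

sess = requests.Session()

# Load the login page and copy any hidden <input> fields into the payload
# (guess: the form may need more than user_login/password, e.g. a nonce or redirect)
loginHtml = BeautifulSoup(sess.get(loginPage).content, 'html.parser')
payload = {inp.get('name'): inp.get('value', '')
           for inp in loginHtml.find_all('input', type='hidden')
           if inp.get('name')}
payload['user_login'] = 'meu_email@gmail.com'
payload['password'] = 'minhasenha'
payload['user_pass'] = 'minhasenha'   # guess: the Selenium code finds this field by name "user_pass"

resp = sess.post(loginPage, data=payload)
print(resp.status_code, resp.url)     # where did the POST land?

petr = sess.get(petrUrl)
print(petr.url)                       # if this is not the petr URL, the session is not logged in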

Answer:

The easiest way to turn an HTML table into a more structured format is with the Pandas library. Since I don't have access to the logged-in area, here is a code example for you to adapt to your table:

import pandas as pd 
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox()
browser.get("https://eduardocavalcanti.com/an_fundamentalista/itsa/") 
time.sleep(10)
username = browser.find_element_by_name("user_login")
password = browser.find_element_by_name("user_pass")
username.send_keys("meu_email@hotmail.com")
password.send_keys("minha_senha")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(5)
browser.get("https://eduardocavalcanti.com/an_fundamentalista/petr/")
# pd.read_html needs the page's HTML, not the visible text, so grab page_source
DadosEmpresa = browser.page_source

# Here you use Pandas to turn the page's HTML tables into DataFrames:
# read_html returns a list with one DataFrame per <table> found on the page
df_tabela = pd.read_html(DadosEmpresa)
df = df_tabela[0]                                             # pick the table you want from the list
df = df[['id da empresa', 'nome da empresa', 'descricao']]    # change to match the column headers of your table on the site
df.columns = ['id', 'empresa', 'descricao']                   # rename the columns however you like
print(df)
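
If it helps to see read_html on its own: it always returns a list with one DataFrame per <table> it finds. The HTML below is made up just to illustrate the behavior:

import io
import pandas as pd

html = """
<table>
  <tr><th>id da empresa</th><th>nome da empresa</th><th>descricao</th></tr>
  <tr><td>1</td><td>PETR</td><td>Petroleo</td></tr>
  <tr><td>2</td><td>ITSA</td><td>Holding</td></tr>
</table>
"""

# read_html needs a parser library such as lxml installed
tabelas = pd.read_html(io.StringIO(html))   # list of DataFrames, one per <table>
df = tabelas[0]
df.columns = ['id', 'empresa', 'descricao']
print(df)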

Here is a tutorial to help you use Pandas: How to use Pandas read_html to Scrape Data from HTML Tables

