I wrote a parser in Python that collects URLs and headers. How do I make it visit the individual pages (the URLs are already in the database) and parse them?

Question:

I wrote a parser that collects URLs and headers and stores them in a database. How can I make the parser visit each of these URLs and copy data from the page into the database? I tried doing it with loops, but it doesn't work.

Answer:

./requrce_url_extract.py

#!/usr/bin/env python3

import re
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup


# Raw string, so that \w and \. reach the regex engine instead of being treated as string escapes
url_re = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

def get_url(page):
    '''Fetch a page and return all absolute URLs found in its <a href> attributes as a list'''
    # Close the connection as soon as the page has been read
    with urlopen(page) as html:
        bsobj = BeautifulSoup(html.read(), 'lxml')
    return [link.get('href') for link in bsobj.find_all('a', href=re.compile(url_re))]

def lister(ls):
    for i in ls:
        print("\tURL {}".format(i))

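def main():
    if len(sys.argv) != 2:
        sys.exit("usage: {} URL".format(sys.argv[0]))
    page = sys.argv[1]

    # URLs found on the start page
    parent_urls = get_url(page)

    for p_url in parent_urls:
        print(p_url)
        try:
            # Visit each found URL and list the URLs on that page
            lister(get_url(p_url))
        except (OSError, ValueError) as err:
            # A dead or unsupported link should not abort the whole run
            print("\tskipped: {}".format(err))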

if __name__ == "__main__":
    main()

Example output:

$ ./requrce_url_extract.py  "http://gnu.org"  | head -n20
https://www.fsf.org/associate/support_freedom?referrer=4052
        URL https://www.fsf.org
        URL https://fsf.org/associate/benefits
        URL https://my.fsf.org/join/check
        URL https://my.fsf.org/associate/support_freedom/renew_fsf
        URL https://my.fsf.org/user?destination=civicrm%2Fcontribute%2Ftransact%3Freset%3D1%26id%3D38
        URL https://www.gnu.org/thankgnus
        URL https://www.fsf.org/about/free-software-foundation-privacy-policy
        URL https://civicrm.org/
        URL https://www.fsf.org/about/free-software-foundation-privacy-policy
        URL http://agpl.fsf.org/crm.fsf.org/CURRENT/
        URL https://weblabels.fsf.org/crm.fsf.org/CURRENT/
https://www.pureos.net/
        URL http://repo.pureos.net/pureos/pool/main/
        URL https://tracker.pureos.net/
        URL https://twitter.com/puri_sm
        URL https://mastodon.social/@purism
        URL https://creativecommons.org/licenses/by-sa/4.0/
        URL https://puri.sm/
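
The script above takes its start page from the command line. Since the question keeps the URLs in a database, here is a minimal sketch of the same loop driven by a database instead. Everything about the storage is an assumption, because the question does not show it: the sketch uses SQLite and a hypothetical table pages(url, title), and it copies only the page title as a stand-in for whatever data is needed. The names TitleParser, extract_title, fetch_html and crawl_stored_urls are made up for illustration. To stay self-contained it parses with the standard library's html.parser; the BeautifulSoup call from get_url could be used in its place.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_html(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_stored_urls(conn, fetch=fetch_html):
    """Visit every URL in the (hypothetical) pages table and store its title."""
    # Read all URLs first so the SELECT is finished before the UPDATEs start.
    urls = [row[0] for row in conn.execute("SELECT url FROM pages")]
    for url in urls:
        try:
            html = fetch(url)
        except (OSError, ValueError) as exc:
            # One unreachable or malformed URL should not stop the whole run.
            print("skipped {}: {}".format(url, exc))
            continue
        conn.execute("UPDATE pages SET title = ? WHERE url = ?",
                     (extract_title(html), url))
    conn.commit()
```

With a database file that already contains such a table, the call would be crawl_stored_urls(sqlite3.connect(path)), where path is the location of that file. The fetch parameter exists so the loop can be tried without network access by passing a function that returns canned HTML.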