Question:
I wrote a parser that collects URLs and headers and stores them in a database. How can I implement functionality where the parser visits those URLs and copies data from them into the database? I tried doing it with loops, but somehow it doesn't work.
Answer:
The script below takes a start page, extracts every link from it, then loops over those links and prints the URLs found on each linked page, one level deep. Save it as ./requrce_url_extract.py:
#!/usr/bin/env python3
import re
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Raw string: \w in a plain string is an invalid escape sequence in Python 3.
url_re = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

def get_url(page):
    '''Fetch a page and return all absolute URLs found in its <a> tags.'''
    url_ls = []
    html = urlopen(page)
    bsobj = BeautifulSoup(html.read(), 'lxml')
    for link in bsobj.find_all('a', href=re.compile(url_re)):
        url_ls.append(link.get('href'))
    return url_ls

def lister(ls):
    for i in ls:
        print("\tURL {}".format(i))

def main():
    page = sys.argv[1]
    parent_urls = get_url(page)       # level 1: links on the start page
    for p_url in parent_urls:
        print(p_url)
        lister(get_url(p_url))        # level 2: links on each linked page

if __name__ == "__main__":
    main()
Example output:
$ ./requrce_url_extract.py "http://gnu.org" | head -n20
https://www.fsf.org/associate/support_freedom?referrer=4052
URL https://www.fsf.org
URL https://fsf.org/associate/benefits
URL https://my.fsf.org/join/check
URL https://my.fsf.org/associate/support_freedom/renew_fsf
URL https://my.fsf.org/user?destination=civicrm%2Fcontribute%2Ftransact%3Freset%3D1%26id%3D38
URL https://www.gnu.org/thankgnus
URL https://www.fsf.org/about/free-software-foundation-privacy-policy
URL https://civicrm.org/
URL https://www.fsf.org/about/free-software-foundation-privacy-policy
URL http://agpl.fsf.org/crm.fsf.org/CURRENT/
URL https://weblabels.fsf.org/crm.fsf.org/CURRENT/
https://www.pureos.net/
URL http://repo.pureos.net/pureos/pool/main/
URL https://tracker.pureos.net/
URL https://twitter.com/puri_sm
URL https://mastodon.social/@purism
URL https://creativecommons.org/licenses/by-sa/4.0/
URL https://puri.sm/
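The script above only prints what it finds. To copy data from those URLs into the database, as the question asks, loop over the stored URLs, fetch each page, and insert the parsed result. Below is a minimal sketch using sqlite3; the database file name, the urls and pages tables, and the choice of saving the <title> and page text are assumptions, so adapt them to your actual schema:

#!/usr/bin/env python3
import sqlite3
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Assumed schema: one table of collected URLs, one table for fetched content.
conn = sqlite3.connect('parser.db')
conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)')

# Walk every stored URL, fetch it, and save the parsed data back.
for (url,) in conn.execute('SELECT url FROM urls').fetchall():
    try:
        html = urlopen(url, timeout=10).read()
    except Exception as err:    # a dead link shouldn't kill the whole run
        print('skip {}: {}'.format(url, err))
        continue
    bsobj = BeautifulSoup(html, 'lxml')
    title = bsobj.title.string if bsobj.title else ''
    body = bsobj.get_text(' ', strip=True)
    conn.execute('INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)',
                 (url, title, body))
    conn.commit()               # commit per page so a crash loses little
conn.close()

Run your collector first so urls is populated (or have get_url() insert into it directly); each fetched page then becomes one row in pages. With another database (MySQL, PostgreSQL) the pattern is the same, only the driver changes.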