UTF-8 decoding error in Python 3

Question:

I have a problem in Python 3. The following code:

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import urllib.request

page = urllib.request.urlopen("http://beans-r-us.biz/prices.html")

text = page.read().decode('utf8')

print(text)

gives the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 1265: invalid continuation byte

and I don't know how to fix it.

Note: I'm still a beginner in programming. This code is from the book "Head First Programming", and its purpose is to display the contents of the site.

Answer:

The error happens on the following line:

text = page.read().decode('utf8')

It tries to decode the page at the URL above as UTF-8, but fails when it reaches a byte sequence that is not valid UTF-8. The content of the page is as follows:

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=shift_jis"><meta http-equiv="Content-Language" content="ja,en"><script type="text/javascript">\r\n\r\n  var _gaq = _gaq || [];\r\n  _gaq.push([\'_setAccount\', \'UA-20569835-2\']);\r\n  _gaq.push([\'_trackPageview\']);\r\n\r\n  (function() {\r\n    var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\n    ga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-analytics.com/ga.js\';\r\n    var s = document.getElementsByTagName(\'script\')[0]; s.parentNode.insertBefore(ga, s);\r\n  })();\r\n\r\n</script><title>404 Not Found</title></head><body oncontextmenu="return false;" style="width: 100% !important; height: 2600px !important;">\r\n<center><a href="http://cgi.i-mobile.co.jp/ad_link.aspx?guid=on&asid=32341&pnm=0&asn=1"><img border="0" src="http://cgi.i-mobile.co.jp/ad_img.aspx?guid=on&asid=32341&pnm=0&asn=1&asz=0&atp=2&lnk=6666ff&bg=&txt=000000&pbb=1"></a></center>\r\n<center><a href="http://cgi.i-mobile.co.jp/ad_link.aspx?guid=on&asid=32341&pnm=0&asn=2"><img border="0" src="http://cgi.i-mobile.co.jp/ad_img.aspx?guid=on&asid=32341&pnm=0&asn=2&asz=0&atp=2&lnk=6666ff&bg=&txt=000000"></a></center>\r\n\r\n\r\n<center><FONT SIZE="2">ミンナ�ホが選んだ�ゥ11/07のランキング�ソ</FONT></center>\r\n<center><FONT SIZE="2">�ソ ��位 �ソ</FONT></center>\r\n\r\n<br>\r\n<center><FONT SIZE="2">�ソ ��位 �ソ</FONT></center>\r\n\r\n<a name="madop"></a>\r\n<br>\r\n<center><font size="2">他のキーワードで探してみる</FONT></center><center>\r\n<form method="get" action="/genre23.php">\r\n<font size="2"><input type="text" name="query2" value="" size="8"><font size="4">\r\n<SELECT name="genre">\r\n<OPTION value="3">��</OPTION>\r\n\r\n</SELECT>\r\n</FONT><input type="submit" value=" 探す�マ "></FONT>\r\n<input type="hidden" name="cache" value=""><input type="hidden" name="fname" value="">\r\n</form>\r\n</center><br>\r\n<center><font size="2" color="red"><b><a 
href="/inq/disclaimer.php?ngdom=beans-r-us.biz&ngk=retire%20your%20vehicle">利用規約・削除依頼</a></b></FONT></center>\r\n<br></body></html>'

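As you can see, the page declares charset=shift_jis in its meta tag and contains Japanese text, so its bytes are not valid UTF-8. The decoder most likely failed on one of those characters. Note also that the page title is "404 Not Found": the URL apparently no longer serves the price page the book expects, so even a successful decode would not show the original content.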
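One possible way around the error is to decode with the charset the page itself declares instead of assuming UTF-8. The sketch below is only an illustration: the helper names sniff_charset and decode_page and the sample bytes are made up for this example, and it assumes the page announces its encoding in a meta tag near the top, as the page above does. It works on bytes you already have, so it runs without a network connection.

```python
import re


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in an HTML meta tag, or `default` if none is found.

    Hypothetical helper: it only looks at the first 2048 bytes.
    """
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default


def decode_page(raw: bytes) -> str:
    """Decode page bytes with the declared charset, replacing undecodable bytes."""
    encoding = sniff_charset(raw)
    try:
        # errors="replace" substitutes U+FFFD for bad bytes instead of raising
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows; fall back to UTF-8
        return raw.decode("utf-8", errors="replace")


# Made-up sample: a tiny page encoded as Shift JIS, like the one above
sample = (
    '<meta http-equiv="content-type" content="text/html; charset=shift_jis">'
    "<p>利用規約</p>"
).encode("shift_jis")

print(decode_page(sample))
```

With the code from the question, this would be used as text = decode_page(page.read()). If the server sends the charset in its HTTP Content-Type header, page.headers.get_content_charset() is another place to look it up; it returns None when no charset is given.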
