Web scraping - Python urllib.request - headers that are likely to work
I'm working on a little script to fetch info from websites, and I'm having trouble with HTTP errors.
    import urllib.request

    req = urllib.request.Request(lnk['href'], headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    })
    page = urllib.request.urlopen(req)
When it tries to fetch, for example, http://www.guru99.com/node-js-tutorial.html, I get a long series of errors, ending with 406 Not Acceptable:
    Traceback (most recent call last):
      File "get_links.py", line 45, in <module>
        page = urllib.request.urlopen(req)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
        return opener.open(url, data, timeout)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
        response = meth(req, response)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
        'http', request, response, code, msg, hdrs)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
        return self._call_chain(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
        result = func(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 406: Not Acceptable
Googling around, I have found that I should fix the headers (as I have done above), and there are lots of tutorials on how to fix headers. Except that it does not work.
Is there a set of headers that will not cause problems with most sites? Is there a Python module someone else has created that includes commonly-working headers? Is there a way to retry several times with different headers until I get a good response?
This seems like a problem that everyone doing web scraping with Python deals with, and I haven't found a decent solution.
The following set of headers seems to be working for everything I have tested. If anyone else has suggestions, please offer them. I'm still interested in solutions for trying different headers if one set doesn't work.
    req = urllib.request.Request(lnk['href'], headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/35.0.1916.47 Safari/537.36',
    })
    page = urllib.request.urlopen(req)
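On the question of trying different headers when one set fails, here is a minimal sketch of one possible approach, not a tested or definitive solution. The names `HEADER_SETS` and `fetch_with_fallback`, the `opener` parameter, and the 10-second timeout are all my own assumptions for illustration; the two header sets are simply the ones from this post. It tries each set in order and moves on to the next whenever the server answers with an HTTP error.

```python
import urllib.error
import urllib.request

# Hypothetical list of candidate header sets, tried in order.
# Both are taken from this post; add or reorder as needed.
HEADER_SETS = [
    {'User-Agent': 'Mozilla/5.0',
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/35.0.1916.47 Safari/537.36'},
]


def fetch_with_fallback(url, header_sets=HEADER_SETS,
                        opener=urllib.request.urlopen, timeout=10):
    """Return the first successful response, trying each header set in turn.

    `opener` defaults to urllib.request.urlopen; it is a parameter only so
    the function can be exercised without network access.
    """
    last_error = None
    for headers in header_sets:
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener(req, timeout=timeout)
        except urllib.error.HTTPError as err:
            # The server rejected this header set (e.g. 406); try the next.
            last_error = err
    if last_error is None:
        raise ValueError('header_sets must not be empty')
    # Every header set failed: re-raise the most recent HTTP error.
    raise last_error
```

In the script above this would replace the two lines that build and open the request, e.g. `page = fetch_with_fallback(lnk['href'])`. Note that it only retries on `HTTPError`; connection failures (`urllib.error.URLError`) still propagate immediately, since different headers are unlikely to help there.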