Web scraping - Python urllib.request - headers that are likely to work

- July 15, 2013

I'm working on a little script to fetch info from websites. I'm having trouble with HTTP errors.

    import urllib.request

    # lnk['href'] holds the URL to fetch
    req = urllib.request.Request(
        lnk['href'],
        headers={
            'User-Agent': 'Mozilla/5.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        })
    page = urllib.request.urlopen(req)

When it tries to fetch, for example, http://www.guru99.com/node-js-tutorial.html, I get a long series of errors, ending with 406 Not Acceptable:

    Traceback (most recent call last):
      File "get_links.py", line 45, in <module>
        page = urllib.request.urlopen(req)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
        return opener.open(url, data, timeout)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
        response = meth(req, response)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
        'http', request, response, code, msg, hdrs)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
        return self._call_chain(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
        result = func(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 406: Not Acceptable

Googling around, I have found that I should fix my headers (as I have done above), and there are lots of tutorials on how to fix headers. Except that it doesn't work.

Is there a set of headers that will not cause a problem with most sites? Is there a Python module that someone else has created that includes commonly working headers? Is there a way to retry several times with different headers until I get a response?

This seems like a problem that anyone doing web scraping with Python deals with, and I haven't found a decent solution.

The following set of headers seems to be working for the sites I have tested. If anyone else has suggestions, please offer them. I'm still interested in solutions for trying different headers if one set doesn't work.

    req = urllib.request.Request(
        lnk['href'],
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
        })
    page = urllib.request.urlopen(req)
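
For trying different headers when one set fails, something along these lines might work. This is only a minimal sketch, not a tested solution: `fetch_with_fallback`, `HEADER_SETS`, and the `opener` parameter are hypothetical names introduced here, and the candidate header sets are simply the two from this post, with no guarantee that either works on a given site.

```python
import urllib.error
import urllib.request

# Hypothetical candidate header sets, tried in order. These are assumptions
# taken from this post, not a list known to work everywhere.
HEADER_SETS = [
    {
        "User-Agent": "Mozilla/5.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/35.0.1916.47 Safari/537.36",
    },
]


def fetch_with_fallback(url, header_sets=HEADER_SETS,
                        opener=urllib.request.urlopen, timeout=10):
    """Try each header set in turn and return the first successful response.

    If every attempt ends in an HTTP error, the last HTTPError is re-raised.
    """
    last_error = None
    for headers in header_sets:
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener(req, timeout=timeout)
        except urllib.error.HTTPError as err:
            # The server answered with an error status (such as 406):
            # remember it and move on to the next header set.
            last_error = err
    if last_error is None:
        raise ValueError("header_sets must not be empty")
    raise last_error
```

With this in place, the fetch in the script would become `page = fetch_with_fallback(lnk['href'])`. Only `urllib.error.HTTPError` is caught, so connection failures and timeouts still propagate immediately, since different headers are unlikely to fix those.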
