Python 2: urllib & urllib2

References: urllib - open arbitrary resources by URL, urllib2 - extensible library for opening URLs



This article assumes basic knowledge of HTTP, cookies, and related topics; for a quick introduction, see here.

urllib and urllib2 are two modules in the Python 2 standard library. Detailed information about the functions, variables, and classes in each module can be viewed with:

>>> import urllib
>>> import urllib2
>>> help(urllib)
>>> help(urllib2)



Basic usage

>>> import urllib
>>> sock = urllib.urlopen('http://www.xinhuanet.com/')  # open the URL, returning a file-like object
>>> htmlSource = sock.read()                            # read the page source
>>> sock.close()
>>> print htmlSource

The fetched page can be parsed directly, or saved to disk as page.html:

>>> urllib.urlretrieve('http://www.xinhuanet.com/', './page.html')
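
urlretrieve() also accepts an optional reporthook callback, called once per block downloaded with (block count, block size, total size), which can be used to report progress (a minimal sketch; the hook name and output format are illustrative):

>>> import urllib
>>> def report(blocks, block_size, total_size):
...     print '%d / %d bytes' % (blocks * block_size, total_size)  # total_size is -1 if unknown
...
>>> urllib.urlretrieve('http://www.xinhuanet.com/', './page.html', report)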

Fetched pages can be parsed with regular expressions, or with dedicated modules such as Beautiful Soup; for encoding and decoding issues, see here; for content loaded asynchronously via Ajax or dynamic HTML, selenium can be used.
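
As a quick illustration of the regular-expression approach, the page title can be extracted like this (a minimal sketch; the pattern is an illustration, not a robust HTML parser):

>>> import re
>>> match = re.search(r'<title>(.*?)</title>', htmlSource, re.S)  # htmlSource from the example above
>>> if match:
...     print match.group(1)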

Information in the response

>>> import urllib2
>>> request = urllib2.Request("http://www.xinhuanet.com")  # create a Request object
>>> response = urllib2.urlopen(request)
>>> print response.read()       # the body can only be read once; later reads return ''
>>> print response.readline()
>>> print response.readlines()
>>> print response.fileno()
>>> print response.info()       # the response headers
>>> print response.getcode()    # the HTTP status code
>>> print response.geturl()     # the URL actually retrieved (after any redirects)
>>> response.close()
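
The object returned by info() behaves like a dictionary of header fields, so individual headers can be read directly (a minimal sketch):

>>> response = urllib2.urlopen('http://www.xinhuanet.com')
>>> print response.info().getheader('Content-Type')  # a single header field
>>> print response.info().keys()                     # all header field names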

While writing code, you can call help() at any time for documentation, e.g.:

>>> help(urllib2.Request)
>>> help(urllib2.urlopen)
>>> help(urllib.urlopen)



POST & GET

For background on the POST and GET methods, see here. Below we only show how to handle them in Python.

POST

>>> import urllib, urllib2
>>> values = {"username": "NAME", "passwd": "PASSWD"}  # define the data to POST as a dict
>>> data = urllib.urlencode(values)                    # URL-encode the dict
>>> url = "https://login.uuspider.com/"
>>> request = urllib2.Request(url, data)               # passing data makes this a POST request
>>> response = urllib2.urlopen(request)
>>> print response.read()

GET

>>> import urllib, urllib2
>>> values = {"username": "NAME", "passwd": "PASSWD"}  # define the query parameters as a dict
>>> data = urllib.urlencode(values)                    # URL-encode the dict
>>> url = "https://login.uuspider.com/"
>>> url_get = url + "?" + data                         # append the query string to build the GET url
>>> request = urllib2.Request(url_get)
>>> response = urllib2.urlopen(request)
>>> print response.read()
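
To encode a single value rather than a whole dictionary, urllib.quote() and urllib.quote_plus() can be used (a minimal sketch):

>>> import urllib
>>> print urllib.quote('han shot first')       # han%20shot%20first
>>> print urllib.quote_plus('han shot first')  # han+shot+first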



Setting headers

>>> import urllib, urllib2
>>> values = {"username": "NAME", "passwd": "PASSWD"}  # define the data to POST as a dict
>>> data = urllib.urlencode(values)                    # URL-encode the dict
>>> url = "https://login.uuspider.com/"
>>> user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'  # the User-Agent string
>>> referer = 'http://about.uuspider.com/'             # the Referer
>>> headers = {'User-Agent': user_agent, 'Referer': referer}  # build the headers dict
>>> request = urllib2.Request(url, data, headers)      # create a Request object with headers
>>> response = urllib2.urlopen(request)
>>> print response.read()
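
Headers can also be attached to an existing Request object one at a time with add_header(), reusing the variables defined above (a minimal sketch):

>>> request = urllib2.Request(url, data)
>>> request.add_header('User-Agent', user_agent)
>>> request.add_header('Referer', referer)
>>> response = urllib2.urlopen(request)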



Proxy servers

>>> import urllib2
>>> proxies = {'http': 'http://127.0.0.1:8087', 'https': 'https://127.0.0.1:8087'}
>>> proxy_handler = urllib2.ProxyHandler(proxies)
>>> opener = urllib2.build_opener(proxy_handler)      # build an opener that routes through the proxy
>>> response = opener.open('https://www.google.com')  # pass a URL (or a Request object) to the opener
>>> print response.read()
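
If every subsequent request should go through the proxy, the opener built above can be installed as the process-wide default, after which plain urllib2.urlopen() uses it too (a minimal sketch):

>>> urllib2.install_opener(opener)
>>> response = urllib2.urlopen('https://www.google.com')  # now routed through the proxy as well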

A proxy can also be used like this:

>>> import urllib
>>> proxies = {'http': 'http://127.0.0.1:8087', 'https': 'https://127.0.0.1:8087'}
>>> response = urllib.urlopen('https://www.google.com', proxies=proxies)
>>> print response.read()



timeout

>>> import urllib2
>>> response = urllib2.urlopen('http://www.uuspider.com', timeout=15)

Note that urllib.urlopen() and urllib2.urlopen() take different arguments (only urllib2.urlopen() accepts a timeout), as help() shows:

>>> help(urllib.urlopen)
urlopen(url, data=None, proxies=None)
>>> help(urllib2.urlopen)
urlopen(url, data=None, timeout=<object object>)
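
Since urllib.urlopen() takes no timeout argument, a global default can be set through the socket module instead (a minimal sketch; note this affects every socket created afterwards in the process):

>>> import socket
>>> import urllib
>>> socket.setdefaulttimeout(15)  # seconds; applies to all new connections
>>> response = urllib.urlopen('http://www.uuspider.com')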



Exception handling

import urllib2

url = 'http://www.uuspider.com'
try:
    response = urllib2.urlopen(url)
    print response.read().decode('utf-8')
except urllib2.HTTPError as e:  # a subclass of URLError, so it must be caught first
    print e.code                # the HTTP status code, e.g. 404
except urllib2.URLError as e:
    print e.reason              # why the server could not be reached

Exceptions can also be handled in other ways; one alternative is sketched below.
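
Since urllib2.HTTPError is a subclass of urllib2.URLError, both cases can be caught with a single except clause (a minimal sketch):

import urllib2

url = 'http://www.uuspider.com'
try:
    response = urllib2.urlopen(url)
except urllib2.URLError as e:
    if hasattr(e, 'code'):      # an HTTPError: the server returned an error status
        print 'The server returned error', e.code
    elif hasattr(e, 'reason'):  # a plain URLError: the server could not be reached
        print 'Failed to reach the server:', e.reason
else:
    print response.read()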
