Example: Fetching HTML Page Resources in Python with the urllib2 Module

2020-08-21 10:51 larry Python

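This article walks through an example of fetching HTML page resources with Python's urllib2 module. The addresses of the pages to fetch are kept in a separate list file, which makes them easy to organize and reuse.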

First, put the URLs to fetch in a separate list file, one per line (the script reads it as u1.list):

    http://www.zzvips.com/article/83440.html
    http://www.zzvips.com/article/83437.html
    http://www.zzvips.com/article/83430.html
    http://www.zzvips.com/article/83449.html
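Because the list file is plain text with one URL per line, loading it takes only a few lines. The sketch below is an illustration and not part of the original script: `load_urls` is a hypothetical helper name, `io.StringIO` stands in for an opened list file, and the code assumes Python 3.

```python
import io


def load_urls(lines):
    """Return the non-empty, whitespace-stripped entries from an iterable of lines."""
    urls = []
    for line in lines:
        url = line.strip()
        if url:  # ignore blank lines
            urls.append(url)
    return urls


# io.StringIO stands in for open('u1.list') so the sketch runs without a file.
sample = io.StringIO(
    "http://www.zzvips.com/article/83440.html\n"
    "\n"
    "http://www.zzvips.com/article/83437.html\n"
)
urls = load_urls(sample)
print(len(urls))  # prints 2
```

Keeping the parsing in its own function means the same list can be reused by other scripts without copying the loop.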

Next, let's look at the program itself. The code is as follows:

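    #!/usr/bin/python
    # Python 2 script: the urllib2 module exists only in Python 2.
    import os
    import urllib2


    def Cdown_data(fileurl, fpath, dpath):
        """Download fileurl and save it as fpath, creating dpath if needed."""
        if not os.path.exists(dpath):
            os.makedirs(dpath)
        try:
            getfile = urllib2.urlopen(fileurl)
            data = getfile.read()
            # Write in binary mode so the bytes are saved unchanged.
            f = open(fpath, 'wb')
            f.write(data)
            f.close()
        except Exception as e:
            print fileurl, 'download failed:', e


    with open('u1.list') as lines:
        for line in lines:
            URI = line.strip()
            # Skip URLs with a query string or percent-encoded characters.
            # (The original test, "'?' and '%' in URI", only checked for '%'.)
            if '?' in URI or '%' in URI:
                continue
            # Exactly two slashes means scheme://host with no path to save.
            elif URI.count('/') == 2:
                continue
            elif URI.count('/') > 2:
                #print URI,URI.count('/')
                try:
                    # Directory: host plus path, without the last segment.
                    dirpath = URI.rpartition('/')[0].split('//')[1]
                    #filepath = URI.split('//')[1].split('/')[1]
                    # Local file path: everything after "scheme://".
                    filepath = URI.split('//')[1]
                    if filepath:
                        print URI, filepath, dirpath
                        Cdown_data(URI, filepath, dirpath)
                except Exception:
                    print URI, 'error'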
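urllib2 is available only in Python 2; in Python 3 its functionality lives in `urllib.request`. As a rough sketch of how the same idea could look there, the block below splits the script into three small functions. The names `should_skip`, `local_paths`, and `download`, and the `timeout` default, are assumptions made for this illustration and do not come from the original script.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlopen


def should_skip(url):
    """Return True for URLs that should not be downloaded."""
    # "'?' and '%' in url" would evaluate to just "'%' in url",
    # so each character is tested explicitly.
    if '?' in url or '%' in url:
        return True
    # scheme://host with no path contains exactly two slashes;
    # blank lines contain none.
    return url.count('/') <= 2


def local_paths(url):
    """Map a URL to (file path, directory path), both relative."""
    parts = urlsplit(url)
    filepath = parts.netloc + parts.path
    dirpath = filepath.rpartition('/')[0]
    return filepath, dirpath


def download(url, timeout=10):
    """Fetch url, save it under host/path, and return the saved file path."""
    filepath, dirpath = local_paths(url)
    os.makedirs(dirpath, exist_ok=True)
    with urlopen(url, timeout=timeout) as response, open(filepath, 'wb') as out:
        out.write(response.read())
    return filepath
```

To process the list file, loop over its lines, strip each one, and call `download(url)` for every URL where `should_skip(url)` is false. Wrapping the call in `try`/`except OSError` catches both network failures (`URLError` is a subclass of `OSError`) and file-system errors, so one bad URL does not stop the run.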