Example: Scraping Dynamically Loaded NetEase Comments with Python

2021-03-01 00:24 小傲娇的认真 Python

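This article shows how to scrape NetEase news comments with Python. Using a worked example, it covers fetching the comment data, converting the JSON response into Python objects, and extracting fields with the help of regular expressions. Readers who need this may find it a useful reference.

The example below shows how to scrape NetEase's dynamically loaded comments with Python. The details are as follows: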

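After opening the page source of a NetEase news article, I found that it did not contain the comments I was after.

After some study I learned that the page source is only the "skeleton" of the full page, and the content I wanted is filled in afterwards. To find it, open the browser's developer tools and look through the resources the page loads for the one that carries the comments.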

[Screenshot of the developer tools; the circled part is the resource type.]

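Opening the URL of that resource shows data in JSON format. The tools I had already learned, regular expressions and bs (BeautifulSoup), were both awkward for this, so I read up on JSON and found that once the JSON is converted into Python objects, pulling the content out by key and index is a clear path forward.

A closer look, however, turned up a problem.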
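As a rough illustration of that conversion, the sketch below strips a JSONP wrapper and parses what remains with `json.loads`. The callback name `getData` matches the `callback=getData` parameter in the URLs used in the final code. The helper name `unwrap_jsonp` and the sample payload are made up for this example and only mimic the shape the final code relies on (a `commentIds` list plus a `comments` mapping).

```python
import json


def unwrap_jsonp(text, callback="getData"):
    """Return the parsed JSON payload of a 'callback(...);' response."""
    text = text.strip()
    prefix = callback + "("
    if not text.startswith(prefix):
        raise ValueError("not a %s(...) response" % callback)
    # Drop the trailing ';' (if present), then check for the closing ')'
    body = text[len(prefix):].rstrip(";").rstrip()
    if not body.endswith(")"):
        raise ValueError("missing closing parenthesis")
    return json.loads(body[:-1])


# Hypothetical sample, shaped like the data the final code reads
sample = (
    'getData({"commentIds": ["1000000001"], '
    '"comments": {"1000000001": {"content": "hello", '
    '"user": {"nickname": "tester"}}}});'
)
data = unwrap_jsonp(sample)
first_id = data["commentIds"][0]
print(data["comments"][first_id]["content"])
```

Slicing off the wrapper at both ends, rather than substituting every `);` in the text, avoids touching a comment body that happens to contain those characters.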


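When pulling each comment out of this data, something looked off: for a comment that replies to another comment, the entry also carries the data of the comment it replies to. I therefore used a regular expression to pick out the right one.

The final code is as follows: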
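A minimal sketch of that extraction, assuming (hypothetically) that a reply's entry in `commentIds` is a single string holding several numeric ids joined by a separator, with the comment's own id last. The sample ids and the helper name `own_comment_id` are invented for illustration; only the `\d{10,}` pattern is taken from the article's code.

```python
import re

# Ids are assumed to be runs of at least ten digits, as in the article's code
ID_PATTERN = re.compile(r"\d{10,}")


def own_comment_id(entry):
    """Return the last id found in entry, or entry itself if none matches."""
    ids = ID_PATTERN.findall(entry)
    return ids[-1] if ids else entry


# Hypothetical entries: a reply chain and a plain top-level comment
print(own_comment_id("1000000001,1000000002,1000000003"))
print(own_comment_id("1000000001"))
```

Falling back to the entry itself keeps the lookup working for a key that the pattern does not match.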

    # coding=utf-8
    # Ported to Python 3; the original used Python 2's urllib.urlopen.
    __author__ = 'kongmengfan123'

    import json
    import re
    import time
    import urllib.request


    def gethothtml(url):  # hot comments
        page = urllib.request.urlopen(url)
        html = page.read().decode('utf-8')  # assumes a UTF-8 response
        return get_json(html)


    def gethnewtml():  # latest comments: 5 pages of 30 each
        for i in range(5):
            # Parenthesize i * 30: '%' and '*' share precedence and bind left to right
            url = 'http://comment.news.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/C4QFIJNS0001875O/comments/newList?offset=%d&limit=30&showLevelThreshold=72&headLimit=1&tailLimit=2&callback=getData&ibc=newspc&_=1478010624978' % (i * 30)
            page = urllib.request.urlopen(url)
            html = page.read().decode('utf-8')
            time.sleep(1)  # pause between requests
            get_json(html)


    def get_json(json_):
        # Strip the JSONP wrapper "getData(" ... ");" so the rest parses as JSON
        begain = re.compile(r'^\s*getData\(')
        end_ = re.compile(r'\);?\s*$')
        json_ = begain.sub('', json_)
        json_ = end_.sub('', json_)
        ajson = json.loads(json_)
        lis = ajson['commentIds']  # key of each comment
        xulie = re.compile(r'\d{10,}')
        for key in lis:  # visit every entry, including the last one
            # A reply's entry also holds the ids of the comments it answers;
            # the last id is the comment itself.
            bia = xulie.findall(key)
            comment = ajson['comments'][bia[-1] if bia else key]
            user = comment['user']
            # Fall back to the location when the user has no nickname
            name = user.get('nickname') or user.get('location', '')
            w.write(name + '|' + comment['content'] + '\n')
        return lis


    # get_json writes through the module-level file object w
    with open('wangyi.txt', 'w', encoding='utf-8') as w:
        w.write('Username' + '|' + 'Hot comments' + '\n')
        hot_ = gethothtml('http://comment.news.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/C4QFIJNS0001875O/comments/hotList?offset=0&limit=40&showLevelThreshold=72&headLimit=1&tailLimit=2&callback=getData&ibc=newspc')
        w.write('Username' + '|' + 'Latest comments' + '\n')
        gethnewtml()
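One detail in the paging loop is easy to get wrong. In Python, `%` and `*` have the same precedence and are evaluated left to right, so an expression of the form `'...%d...' % i*30` formats the string with `i` first and then repeats the result 30 times. The product has to be parenthesized. The sketch below shows the difference with a shortened, made-up URL template and a hypothetical helper `page_url`; neither is part of the article's code.

```python
# Hypothetical, shortened template; the real URL carries more parameters
TEMPLATE = "http://example.invalid/newList?offset=%d&limit=30"


def page_url(page, page_size=30):
    """Build the URL for a zero-based page number."""
    return TEMPLATE % (page * page_size)


wrong = "offset=%d" % 2 * 3    # formats with 2, then repeats the string 3 times
right = "offset=%d" % (2 * 3)  # formats with the product

print(wrong)  # offset=2offset=2offset=2
print(right)  # offset=6
print([page_url(p) for p in range(5)])
```

With zero-based page numbers, five pages of 30 comments correspond to offsets 0, 30, 60, 90 and 120.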