Using Splash with Python 3

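Splash is a JavaScript rendering service. It is a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5. The QT reactor makes the service fully asynchronous, which lets it exploit WebKit concurrency through the QT main loop.
Some of Splash's features:

process multiple web pages in parallel
get HTML source or take screenshots
turn off images or use Adblock Plus rules to make rendering faster
execute custom JavaScript in the page context
control the page rendering process with Lua scripts
develop Splash Lua scripts in Splash-Jupyter notebooks
get detailed rendering information in HAR format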

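1. Installing scrapy-splash

Installing scrapy-splash involves two components. The first is the Splash service itself, which is installed through Docker; running the container starts a Splash service whose HTTP interface loads JavaScript pages. The second is the scrapy-splash Python library, which lets Scrapy use that service. The installation below is done in three steps: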

(1) Install Docker

    # Install the required packages:
    yum install -y yum-utils device-mapper-persistent-data lvm2
    # Set up the stable repository:
    yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    # Install Docker CE:
    yum install docker-ce
    # Start Docker:
    systemctl start docker
    # Verify the installation:
    docker run hello-world

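(2) Install the Splash service

Pull the scrapinghub/splash image with Docker, then start a container to create the Splash service:

    docker pull scrapinghub/splash
    docker run -d -p 8050:8050 scrapinghub/splash
    # Open port 8050 in a browser to verify that the installation succeeded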

(3) Install the scrapy-splash Python package

    pip3 install scrapy-splash
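With both parts installed, a Scrapy project still has to be pointed at the Splash service. The fragment below is a sketch of the settings.py entries described in the scrapy-splash README. The SPLASH_URL value is an assumption (a container reachable on localhost, port 8050); adjust it to your own host.

```python
# settings.py (fragment); assumes Splash listens on localhost:8050
SPLASH_URL = "http://localhost:8050"

# Middlewares that route requests through Splash and keep cookies in sync
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments in the request queue
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware replacements for the default dupe filter and HTTP cache storage
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

The priority numbers are the ones suggested in the README; they only need to keep the Splash middlewares ahead of HTTP compression.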

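2. Splash Lua scripts

Once the Splash service is running, open port 8050 of the service in a browser, for example http://localhost:8050, to see its web interface.

The page has an input box that defaults to http://google.com. Replace it with the page you want to render, for example https://www.baidu.com, and click the Render me button. The result page includes a rendered screenshot, HAR loading statistics, and the page source.

The HAR data shows that Splash carried out the whole rendering process, including loading the CSS and JavaScript. The three parts of the result correspond to the three values (html, png, har) in the return statement of the script shown below the input box:

    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end

This script is written in Lua. It first loads the page with go(), waits for the page to load with wait(), and then returns the source, a screenshot, and the HAR data.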

Now modify the default script so that it visits www.baidu.com and returns the page title through JavaScript, then run it:

    function main(splash, args)
      assert(splash:go("https://www.baidu.com"))
      assert(splash:wait(0.5))
      local title = splash:evaljs("document.title")
      return {
        title = title
      }
    end

Result:

    Splash Response: Object
    title: "百度一下，你就知道"

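This shows that Splash renders a page through this entry script, so we can modify the script to analyze the fetched page and return whatever result we need. The function must be named main(). It can return either a table (shown as a dictionary-like object) or a string:

    function main(splash)
      return {
        hello="world"
      }
    end

Result:

    Splash Response: Object
    hello: "world"

    function main(splash)
      return "world"
    end

Result:

    Splash Response: "world"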

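3. Properties and methods of the splash object

In the examples above, the first parameter of main() is splash. This object is similar to the WebDriver object in Selenium: its properties and methods control the loading process. Some commonly used properties:

splash.args: holds the parameters passed with the request, such as url. For a GET request it contains the query parameters; for a POST request it contains the submitted data. splash.args is also available as the optional second parameter of main(), args:

    function main(splash, args)
        local url = args.url
    end

    -- The second parameter args is the same as splash.args, so the following is equivalent:
    function main(splash)
        local url = splash.args.url
    end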
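To make the link between a request and splash.args concrete, here is a minimal Python sketch that assembles the JSON body for the execute endpoint described in section 6. The helper name build_execute_payload and the localhost address in the comment are assumptions for illustration; every key other than lua_source is what the script would read as args.url, args.wait, and so on.

```python
import json

LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return splash:html()
end
"""


def build_execute_payload(lua_source, **script_args):
    """Merge the script and its arguments into one JSON-serializable dict."""
    payload = {"lua_source": lua_source}
    payload.update(script_args)  # these keys surface as splash.args in Lua
    return payload


payload = build_execute_payload(LUA_SCRIPT, url="https://www.baidu.com", wait=0.5)
body = json.dumps(payload)
# To send it (assumes the requests package and Splash on localhost:8050):
#   requests.post("http://localhost:8050/execute", data=body,
#                 headers={"Content-Type": "application/json"})
```

Keeping the script arguments separate from the Lua source means the same script can be reused for different pages without editing the Lua text.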

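splash.js_enabled: enables or disables execution of JavaScript embedded in the page. Defaults to true (JavaScript enabled).

splash.resource_timeout: the default timeout for network requests, in seconds. Setting it to 0 or nil means no timeout: splash.resource_timeout=nil

splash.images_enabled: enables or disables image loading. Images are loaded by default: splash.images_enabled=true

splash.plugins_enabled: enables or disables browser plugins. Disabled by default: splash.plugins_enabled=false

splash.scroll_position: gets or sets the current scroll position of the main window: splash.scroll_position={x=50,y=600}

    function main(splash, args)
      assert(splash:go('https://www.toutiao.com'))
      splash.scroll_position={y=400}
      return {
        png = splash:png()
      }
    end

This scrolls down 400 pixels before taking the screenshot.

splash.html5_media_enabled: enables or disables HTML5 media, including HTML5 video and audio (for example, playback of <video> elements)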

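Methods of the splash object:

splash:go(): requests a URL. It can issue GET and POST requests and accepts request headers, form data, and so on. Usage:

    ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}

Parameters: url is the URL to request. baseurl is optional and is the base URL used to resolve relative resource paths. headers is optional and sets the request headers. http_method is the HTTP method as a string, GET by default. body is the request body string sent with a POST request (typically used for payloads such as application/json). formdata, empty by default, is the form data for a POST request and is sent with the content type application/x-www-form-urlencoded.

The method returns the pair ok and reason. If ok is nil, the page failed to load and reason contains the error information.

    function main(splash, args)
      local ok, reason = splash:go{"http://httpbin.org/post", http_method="POST", body="name=germey"}
      if ok then
        return splash:html()
      end
    end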

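splash:wait(): controls how long to wait for the page

    ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}

time is the number of seconds to wait. cancel_on_redirect, false by default, stops the wait when a redirect occurs and returns the redirect result. cancel_on_error, true by default, stops the wait when a loading error occurs.

The result is again the pair ok and reason.

    function main(splash, args)
      splash:go("https://www.toutiao.com")
      local ok, reason = splash:wait(1)
      return {
        ok=ok,
        reason=reason
      }
    end

If ok is true, the wait completed and the page was returned successfully.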

splash:jsfunc()

    lua_func = splash:jsfunc(func)

This method converts a JavaScript function into a function that can be called from Lua. The JavaScript source is passed as a string wrapped in double square brackets. Global JavaScript functions can be wrapped directly.

    function main(splash, args)
      local get_div_count = splash:jsfunc([[
      function () {
        var body = document.body;
        var divs = body.getElementsByTagName('div');
        return divs.length;
      }
      ]])
      splash:go("https://www.baidu.com")
      return ("there are %s divs"):format(
        get_div_count())
    end

Result:

    Splash Response: "there are 21 divs"

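splash:evaljs(): executes a JavaScript snippet in the page context and returns the result of the last statement

    local title = splash:evaljs("document.title")

This returns the page title.

splash:runjs(): runs JavaScript code in the page context. It is similar to evaljs(), but it is meant for performing actions or declaring functions rather than returning a value

    function main(splash, args)
      splash:go("https://www.baidu.com")
      splash:runjs("foo = function() { return 'bar' }")
      local result = splash:evaljs("foo()")
      return result
    end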

splash:autoload(): registers JavaScript to be loaded automatically on every page load

    ok, reason = splash:autoload{source_or_url, source=nil, url=nil}

Parameters:

source_or_url - a string containing JavaScript source code, or a URL to load the JavaScript code from;
source - a string containing JavaScript source code;
url - a URL to load the JavaScript source code from

This method only loads the JavaScript code or library; it does not perform any action. To act on the loaded code, call evaljs() or runjs().

    function main(splash, args)
      splash:autoload([[
        function get_document_title(){
          return document.title;
        }
      ]])
      splash:go("https://www.baidu.com")
      return splash:evaljs("get_document_title()")
    end

Loading a JavaScript library from a URL:

    function main(splash, args)
      assert(splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js"))
      assert(splash:go("https://www.taobao.com"))
      local version = splash:evaljs("$.fn.jquery")
      return 'jquery version: ' .. version
    end

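splash:call_later(): schedules a callback to run after a delay

    timer = splash:call_later(callback, delay)

callback is the function to run and delay is the delay in seconds.

    function main(splash, args)
      local snapshots = {}
      local timer = splash:call_later(function()
        snapshots["a"] = splash:png()
        splash.scroll_position={y=500}
        splash:wait(1.0)
        snapshots["b"] = splash:png()
      end, 2)
      splash:go("https://www.toutiao.com")
      splash:wait(3.0)
      return snapshots
    end

The callback fires 2 seconds after it is scheduled: it takes a screenshot, scrolls down 500 pixels, waits 1 second, and takes a second screenshot. The 3-second wait in main() gives it time to finish.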

splash:http_get(): sends an HTTP GET request and returns the response

    response = splash:http_get{url, headers=nil, follow_redirects=true}

url is the URL to load, headers adds HTTP headers, and follow_redirects controls whether redirects are followed automatically (true by default).

    local reply = splash:http_get("http://example.com")

This returns a response object; the result is not loaded into the browser page.

splash:http_post(): sends an HTTP POST request

    response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}

body specifies the data to send, such as form data.

    function main(splash, args)
      local treat = require("treat")
      local json = require("json")
      local response = splash:http_post{"http://httpbin.org/post",
          body=json.encode({name="germey"}),
          headers={["content-type"]="application/json"}
        }
      return {
        html=treat.as_string(response.body),
        url=response.url,
        status=response.status
      }
    end

Result:

    html:{"args":{},"data":"{\"name\": \"germey\"}","files":{},"form":{},"headers":{"accept-encoding":"gzip, deflate","accept-language":"en,*","connection":"close","content-length":"18","content-type":"application/json","host":"httpbin.org","user-agent":"mozilla/5.0 (x11; linux x86_64) applewebkit/602.1 (khtml, like gecko) splash version/9.0 safari/602.1"},"json":{"name":"germey"},"origin":"221.218.181.223","url":"http://httpbin.org/post"}
    status: 200
    url: http://httpbin.org/post

splash:set_content(): sets the content of the current page

    ok, reason = splash:set_content{data, mime_type="text/html; charset=utf-8", baseurl=""}

    function main(splash)
        assert(splash:set_content("<html><body><h1>hello</h1></body></html>"))
        return splash:png()
    end

splash:html(): returns the page source as a string

    function main(splash, args)
      splash:go("https://httpbin.org/get")
      return splash:html()
    end

splash:png(): returns a screenshot of the page in PNG format

splash:jpeg(): returns a screenshot of the page in JPEG format

splash:har(): returns a description of the page loading process in HAR format

splash:url(): returns the URL currently being visited

splash:get_cookies(): returns the cookies of the current page

splash:add_cookie(): adds a cookie for the current page

    function main(splash)
        splash:add_cookie{"sessionid", "237465ghgfsd", "/", domain="http://example.com"}
        splash:go("http://example.com/")
        return splash:get_cookies()
    end

Result:

    Splash Response: Array[1]
    0: Object
    domain: "http://example.com"
    httponly: false
    name: "sessionid"
    path: "/"
    secure: false
    value: "237465ghgfsd"

splash:clear_cookies(): clears all cookies

splash:delete_cookies{name=nil,url=nil}: deletes the specified cookies

splash:get_viewport_size(): returns the size (width and height) of the current browser viewport

splash:set_viewport_size(width,height): sets the size (width and height) of the current browser viewport

splash:set_viewport_full(): resizes the browser viewport so that it fits the whole page

splash:set_user_agent(): overrides the User-Agent request header

splash:set_custom_headers(headers): sets the request headers

    function main(splash)
      splash:set_custom_headers({
         ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
         ["Site"] = "httpbin.org",
      })
      splash:go("http://httpbin.org/get")
      return splash:html()
    end

splash:on_request(callback): registers a function to be called before each HTTP request

splash:get_version(): returns Splash version information

splash:mouse_press(): triggers a mouse press event

splash:mouse_release(): triggers a mouse release event

splash:send_keys(): sends keyboard events to the page context, for example the Enter key: splash:send_keys("key_Enter")

splash:send_text(): sends text to the page context

splash:select(): selects the first node that matches a CSS selector. If several nodes match, only one is returned.

    function main(splash)
      splash:go("https://www.baidu.com/")
      local input = splash:select("#kw")
      input:send_text('splash')
      splash:wait(3)
      return splash:png()
    end

splash:select_all(): selects all nodes that match a CSS selector

    function main(splash)
      local treat = require('treat')
      assert(splash:go("https://www.zhihu.com"))
      assert(splash:wait(1))
      local texts = splash:select_all('.ContentLayout-mainColumn .ContentItem-title')
      local results = {}
      for index, text in ipairs(texts) do
        results[index] = text.node.textContent
      end
      return treat.as_array(results)
    end

This returns the text content of every matching node.

splash:mouse_click(): triggers a mouse click event

    function main(splash)
      splash:go("https://www.baidu.com/")
      local input = splash:select("#kw")
      input:send_text('splash')
      local submit = splash:select('#su')
      submit:mouse_click()
      splash:wait(3)
      return splash:png()
    end

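For other properties and methods available in Splash scripts, refer to the official documentation.

4. Response objects

Response objects are returned by Splash methods such as splash:http_get() or splash:http_post(), and are passed to the callbacks registered with splash:on_response and splash:on_response_headers. They expose the following information:

response.url: the URL of the response

response.status: the HTTP status code of the response

response.ok: true on success, false otherwise

response.headers: the HTTP headers of the response

response.info: a table with the response data in HAR response format

response.body: the raw response body as a binary object; use treat.as_string to convert it to a string

response.request: the request object that produced this response

response:abort(): aborts reading of the response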

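5. Element objects

Element objects wrap JavaScript DOM nodes. They are created whenever a method returns any kind of DOM node (Node, Element, HTMLElement, and so on). splash:select and splash:select_all return element objects.

element:mouse_click(): triggers a mouse click event on the element

element:mouse_hover(): triggers a mouse hover event on the element

element:styles(): returns the computed styles of the element

element:bounds(): returns the bounding client rectangle of the element

element:png(): returns a screenshot of the element in PNG format

element:jpeg(): returns a screenshot of the element in JPEG format

element:visible(): checks whether the element is visible

element:focused(): checks whether the element has focus

element:text(): returns the text content of the element

element:info(): returns detailed information about the element

element:field_value(): returns the value of a field element such as input, select, textarea, or button

element:form_values(values='auto'/'list'/'first'): if the element is a form, returns a table with the form's values; the values argument selects one of three result formats

element:fill(values): fills in the form with the provided values

element:send_keys(keys): sends keyboard events to the element, for example the Enter key with send_keys('key_Enter'); for other key names, refer to the Splash documentation

element:send_text(): sends a string to the element

element:submit(): submits the form element

element:exists(): checks whether the element exists in the DOM

Element properties:

element.node: exposes all DOM methods and properties of the element, but not the methods and properties defined by Splash

element.inner_id: the internal identifier of the element

Supported DOM properties inherited from outside Splash (some are read-only):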

Properties inherited from HTMLElement:

accessKey
accessKeyLabel (read-only)
contentEditable
isContentEditable (read-only)
dataset (read-only)
dir
draggable
hidden
lang
offsetHeight (read-only)
offsetLeft (read-only)
offsetParent (read-only)
offsetTop (read-only)
spellcheck
style - a table with styles which can be modified
tabIndex
title
translate

Properties inherited from Element:

attributes (read-only) - a table with attributes of the element
classList (read-only) - a table with class names of the element
className
clientHeight (read-only)
clientLeft (read-only)
clientTop (read-only)
clientWidth (read-only)
id
innerHTML
localeName (read-only)
namespaceURI (read-only)
nextElementSibling (read-only)
outerHTML
prefix (read-only)
previousElementSibling (read-only)
scrollHeight (read-only)
scrollLeft
scrollTop
scrollWidth (read-only)
tabStop
tagName (read-only)

Properties inherited from Node:

baseURI (read-only)
childNodes (read-only)
firstChild (read-only)
lastChild (read-only)
nextSibling (read-only)
nodeName (read-only)
nodeType (read-only)
nodeValue
ownerDocument (read-only)
parentNode (read-only)
parentElement (read-only)
previousSibling (read-only)
rootNode (read-only)
textContent

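6. Calling the Splash HTTP API

Splash can be controlled through its HTTP API by sending GET requests or POSTing data. It provides the endpoints below; passing the appropriate parameters in the request returns different kinds of content.

(1) render.html returns the HTML of the JavaScript-rendered page

Parameters:

url: the URL to render, a string

baseurl: the base URL used to render the page

timeout: the rendering timeout, 30 seconds by default

resource_timeout: the timeout for individual network requests

wait: the time to wait for updates after the page has loaded, 0 by default

proxy: a proxy profile name or a proxy URL in the format [protocol://][user:password@]proxyhost[:port]

js: the JavaScript profile name

js_source: JavaScript code to execute in the page

filters: a comma-separated list of request filter names

allowed_domains: a list of allowed domain names

images: 1 to download images, 0 to skip them; 1 by default

headers: the HTTP headers to set, as a JSON array

body: the data to send with a POST request

http_method: the HTTP method, GET by default

html5_media: whether to enable HTML5 media, 1 to enable and 0 to disable; 0 by default

    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.html?url=https://www.baidu.com&wait=3&images=0')
    print(response.text)  # the rendered page source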
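Concatenating the target address into the query string works for a simple URL, but a target that carries its own query string would have its parameters read as Splash parameters. A minimal sketch of a safer approach, percent-encoding every parameter with the standard library; the helper name splash_render_url, the localhost address, and the example.com target are assumptions for illustration.

```python
from urllib.parse import urlencode


def splash_render_url(base, endpoint, target_url, **options):
    """Build a Splash render URL with every parameter percent-encoded."""
    params = {"url": target_url}
    params.update(options)  # e.g. wait=3, images=0
    return "{}/{}?{}".format(base.rstrip("/"), endpoint, urlencode(params))


request_url = splash_render_url(
    "http://localhost:8050", "render.html",
    "https://example.com/search?q=python&page=2",
    wait=3, images=0,
)
print(request_url)
```

With requests, passing the same dictionary through the params argument of requests.get() achieves the same encoding.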

(2) render.png returns a screenshot of the page in PNG format

    import requests

    url = 'http://172.16.32.136:8050/'
    # Specify the image width and height
    response = requests.get(url + 'render.png?url=https://www.taobao.com&wait=5&width=1000&height=700&render_all=1')
    with open('taobao.png', 'wb') as f:
        f.write(response.content)
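If rendering fails, the response body is an error message rather than an image, and writing it to disk produces an unreadable .png file. A small sketch of a guard that checks the standard 8-byte PNG file signature before saving; the helper names are hypothetical.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """Return True if the bytes start with the 8-byte PNG file signature."""
    return data[:8] == PNG_SIGNATURE


def save_screenshot(data, path):
    """Write the screenshot to disk, refusing anything that is not a PNG."""
    if not looks_like_png(data):
        raise ValueError("response is not a PNG image: %r" % data[:80])
    with open(path, "wb") as f:
        f.write(data)
```

In the example above, save_screenshot(response.content, 'taobao.png') would replace the with block.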

(3) render.jpeg returns a screenshot in JPEG format

    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.jpeg?url=https://www.taobao.com&wait=5&width=1000&height=700&render_all=1')
    with open('taobao.jpeg', 'wb') as f:
        f.write(response.content)

(4) render.har returns the HAR data of the page load

    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.har?url=https://www.jd.com&wait=5')
    print(response.text)
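The raw HAR text is long, so it is usually easier to parse it and pull out the fields of interest. The sketch below lists the URL and status of each request, relying only on the standard HAR layout (log, entries, request, response); the sample data is hand-written for illustration and is not real Splash output.

```python
import json


def summarize_har(har):
    """Return (url, status) pairs for every entry in a HAR document."""
    entries = har.get("log", {}).get("entries", [])
    return [(e["request"]["url"], e["response"]["status"]) for e in entries]


# Tiny hand-written HAR fragment standing in for json.loads(response.text)
sample = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://www.jd.com/"}, "response": {"status": 200}},
  {"request": {"url": "https://www.jd.com/missing.js"}, "response": {"status": 404}}
]}}
""")

for url, status in summarize_har(sample):
    print(status, url)
```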

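(5) render.json combines the functionality of the previous endpoints and returns the result as JSON

Parameters:

html: whether to include the HTML in the output; 1 to include it, 0 to leave it out; 0 by default

png: whether to include a PNG screenshot; 1 to include it, 0 to leave it out; 0 by default

jpeg: whether to include a JPEG screenshot; 1 to include it, 0 to leave it out; 0 by default

iframes: whether to include information about child frames in the output; 0 by default

script: whether to include the result of the executed JavaScript statement in the output

console: whether to include the JavaScript console messages in the output

history: whether to include the history of requests and responses for the main frame of the page

har: whether to include HAR data in the output

    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.json?url=https://httpbin.org&html=1&png=1&history=1&har=1')
    print(response.text)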
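Because JSON cannot carry raw bytes, the screenshot in a render.json response arrives as base64 text under the png key. A minimal sketch of decoding it; the helper name extract_png is hypothetical, and the sample payload is a stand-in containing only the PNG signature rather than a real image.

```python
import base64
import json


def extract_png(render_json_text):
    """Decode the base64 'png' field of a render.json response, if present."""
    data = json.loads(render_json_text)
    encoded = data.get("png")
    return base64.b64decode(encoded) if encoded else None


# Stand-in for response.text; a real response carries a full image.
fake_text = json.dumps({
    "html": "<html></html>",
    "png": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
})
png_bytes = extract_png(fake_text)
```

The decoded bytes can then be written to a file opened in 'wb' mode, as in the render.png example.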

(6) execute runs a Lua script, which makes it possible to interact with the page

Parameters:

lua_source: the Lua script source

timeout: the timeout

allowed_domains: the list of allowed domain names

proxy: the proxy to use

filters: the request filters to apply

    import requests
    from urllib.parse import quote

    lua = '''
    function main(splash)
        return 'hello'
    end
    '''
    url = 'http://172.16.32.136:8050/execute?lua_source=' + quote(lua)
    response = requests.get(url)
    print(response.text)

Getting the body, URL, and status code of a page through a Lua script:

    import requests
    from urllib.parse import quote

    lua = '''
    function main(splash,args)
        local treat=require("treat")
        local response=splash:http_get("http://httpbin.org/get")
        return {
            html=treat.as_string(response.body),
            url=response.url,
            status=response.status
        }
    end
    '''
    url = 'http://172.16.32.136:8050/execute?lua_source=' + quote(lua)
    response = requests.get(url)
    print(response.text)

Result:

    {"status": 200, "html": "{\"args\":{},\"headers\":{\"accept-encoding\":\"gzip, deflate\",\"accept-language\":\"en,*\",\"connection\":\"close\",\"host\":\"httpbin.org\",\"user-agent\":\"mozilla/5.0 (x11; linux x86_64) applewebkit/602.1 (khtml, like gecko) splash version/9.0 safari/602.1\"},\"origin\":\"221.218.181.223\",\"url\":\"http://httpbin.org/get\"}\n", "url": "http://httpbin.org/get"}

7. Example

Scraping Python book data from JD:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @time    : 2018/7/9 13:33
    # @author  : py.qi
    # @file    : jd.py
    # @software: pycharm
    import re

    import requests
    import pymongo
    from pyquery import PyQuery as pq

    SPLASH_RENDER_URL = 'http://192.168.146.140:8050/render.html'

    client = pymongo.MongoClient('localhost', port=27017)
    db = client['jd']


    def page_parse(html):
        """Yield one dict per product found in the rendered search page."""
        doc = pq(html, parser='html')
        for item in doc('#J_goodsList .gl-item').items():
            # Lazily loaded images keep their address in data-lazy-img
            img = item('.p-img img')
            image = img.attr('src') or img.attr('data-lazy-img')
            yield {
                'image': 'https:' + image if image else None,
                'price': item('.p-price').text()[:6],
                'title': re.sub('\n', '', item('.p-name').text()),
                'commit': item('.p-commit').text()[:-3],
            }


    def save_to_mongo(data):
        # insert_one() replaces insert(), which was removed in PyMongo 4
        if db['jd_collection'].insert_one(data):
            print('Saved to MongoDB', data)
        else:
            print('MongoDB storage error', data)


    def main(number):
        target = 'https://search.jd.com/Search?keyword=python&page={}'.format(number)
        # Passing params lets requests percent-encode the target URL, so its own
        # query string is not mistaken for Splash parameters
        response = requests.get(SPLASH_RENDER_URL,
                                params={'url': target, 'wait': 1, 'images': 0})
        for item in page_parse(response.text):
            save_to_mongo(item)


    if __name__ == '__main__':
        for number in range(1, 200, 2):
            print('Fetching page {}'.format(number))
            main(number)