pyspider crawler code for scraping a movie website

2017-11-30 / 0 comments / 177 views / indexed
Note: this article was last updated on October 27, 2021 and has not been revised since; some content may be out of date.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-11-30 15:46:23
# Project: ttwanda_3

import re

from pyspider.libs.base_handler import *

# Matches the film listing page and its paginated pages.
FILM_LIST_URL = re.compile(r'http://www\.ttwanda\.com/film(?:/page/\d+|$)')


class Handler(BaseHandler):
    crawl_config = {
    }

    # Restart the crawl from the home page once a day.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.ttwanda.com', callback=self.index_page)

    # Treat a fetched index page as fresh for 10 days.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if FILM_LIST_URL.match(each.attr.href):
                self.crawl(each.attr.href, callback=self.film_list_page, save={})

    def film_list_page(self, response):
        for each in response.doc('article.u-movie').items():
            # Copy the saved dict so each film task carries its own poster
            # and rating instead of sharing one mutable object.
            result = dict(response.save or {})
            result['poster'] = each('img').attr['data-original']
            result['star'] = each('.pingfen').text()
            self.crawl(each('.list-poster a[href^="http"]').attr.href,
                       callback=self.film_detail_page, save=result, priority=1)
        # Queue the next listing page only if the pager links to one.
        next_url = response.doc('.next-page a').attr('href')
        if next_url:
            self.crawl(next_url, callback=self.index_page)

    def film_detail_page(self, response):
        # Every link in the play list leads to a page that embeds a video.
        for each in response.doc('.mplay-list a').items():
            self.crawl(each.attr.href, callback=self.film_video_page, save=response.save)

    def film_video_page(self, response):
        result = dict(response.save or {})
        result['title'] = response.doc('.player_box>strong').text()
        result['url'] = response.url
        # print(self.get_taskid(self.task))
        # The player type and video id are declared in an inline script.
        for each in response.doc('script').items():
            match = re.search(r'var play_type="(\w+)",vid="(\w+)";', each.text())
            if match:
                result['vtype'] = match.group(1)
                result['vid'] = match.group(2)
        return result
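
The two regular expressions in the handler can be checked without running pyspider. The following is a minimal sketch, not part of the original project: the URL filter is written with its dots escaped, the helper names `is_film_list_url` and `extract_video_info` are hypothetical, and any input passed to them here is made up for illustration rather than captured from the site.

```python
import re

# URL filter from index_page, with the dots escaped.
FILM_LIST_URL = re.compile(r'http://www\.ttwanda\.com/film(?:/page/\d+|$)')

# Pattern from film_video_page that pulls the player type and video id.
PLAY_INFO = re.compile(r'var play_type="(\w+)",vid="(\w+)";')


def is_film_list_url(url):
    """Return True for the film listing page and its paginated pages."""
    return FILM_LIST_URL.match(url) is not None


def extract_video_info(script_text):
    """Return (vtype, vid) found in an inline script, or None if absent."""
    match = PLAY_INFO.search(script_text)
    return match.groups() if match else None
```

Calling `is_film_list_url` on a few candidate links and `extract_video_info` on a sample script string shows quickly whether a change to either pattern still accepts the pages the crawler is meant to follow.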