I am trying to get the latest reviews from the Google Play Store. I am following the question on getting the latest reviews here.
The method specified in the above link's answer works fine in the Scrapy shell, but when I try it in my crawler it gets ignored.
Code snippet:
    import re
    import sys
    import time
    import urllib
    import urlparse

    from scrapy import Spider
    from scrapy.spider import BaseSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

    from play.items import playapp


    class playspider(CrawlSpider):
        name = "play"
        allowed_domains = ["play.google.com"]
        start_urls = [
            "https://play.google.com/store/apps"
        ]
        rules = (
            Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parsecategory', follow=True),
        )

        def parsecategory(self, response):
            """Gets the categories from the store home page and calls parselinks for each category."""
            #something here......
            yield Request(categoryapps, callback=self.parselinks)

        def parselinks(self, response):
            '''Gets the links from a category page and passes the individual links to the parseapp function.'''
            #something here
            yield Request(link, callback=self.parseapp)

        def parseapp(self, response):
            '''Parses the app's page for info about the app.'''
            #application page parsing ......
            frmdata = {"id": "com.supercell.boombeach", "reviewtype": '0', "reviewsortorder": '0', "pagenum": '0'}
            url = "https://play.google.com/store/getreviews"
            yield FormRequest(url, callback=self.parse_data, formdata=frmdata)
            yield app

        def parse_data(self, response):
            # do stuff with the data...
            print '\n\n---------------i here------------------\n\n'
The function parse_data is never called. I asked on the #scrapy IRC channel and in a few other places, but got no help. Please help me with this.
This is the debug output on the terminal:
    DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
    2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.hinditranslate)
So the POST request is indeed getting sent, but the callback method is not called.
It seems you haven't been changing the id in the form data, so every request asks for the same app.
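    import json
    import re
    from time import sleep

    from scrapy.http import FormRequest
    from scrapy.selector import Selector


    def parseapp(self, response):
        apps = list(set(response.xpath('//a[@class="card-click-target"]/@href').extract()))
        url = "https://play.google.com/store/getreviews"
        for app in apps:
            # str.strip() removes a set of characters, not a prefix,
            # so take everything after "id=" instead
            _id = app.split('id=')[-1]
            form_data = {"id": _id, "reviewtype": '0', "reviewsortorder": '0', "pagenum": '0'}
            # crude throttle; Scrapy's DOWNLOAD_DELAY setting is the non-blocking alternative
            sleep(5)
            yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

    def parse_data(self, response):
        # the callback name must match the one passed to FormRequest above
        response_data = re.findall(r"\[\[.*", response.body)
        if response_data:
            try:
                text = json.loads(response_data[0] + ']')
                sell = Selector(text=text[0][2])
            except (ValueError, IndexError):
                pass
            # extract whatever you want using sell.xpath('your_xpath_here')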
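The two string-handling steps in the code above, taking the package id from each link and pulling the review HTML out of the response body, can be tried without Scrapy. This is only a rough sketch: the helper names package_id, review_form_data and extract_review_html are made up for illustration, and FAKE_BODY is a fabricated body that merely mirrors the shape the code above implies (a line starting with "[[" whose element [0][2] holds the HTML). It is not a captured response from Google Play.

```python
import json
import re

try:  # Python 3
    from urllib.parse import urlparse, parse_qs
except ImportError:  # Python 2, as used by the spider above
    from urlparse import urlparse, parse_qs


def package_id(href):
    """Return the value of the id query parameter of an app link, or None."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


def review_form_data(href):
    """Build the form data for one app, with the id taken from its link."""
    return {"id": package_id(href), "reviewtype": "0",
            "reviewsortorder": "0", "pagenum": "0"}


def extract_review_html(body):
    """Decode the first line starting with '[[' and return element [0][2]."""
    match = re.search(r"\[\[.*", body)  # '.' stops at the end of the line
    if match is None:
        return None
    try:
        # the closing bracket sits on the next line, so add it back
        payload = json.loads(match.group(0) + "]")
        return payload[0][2]
    except (ValueError, IndexError, KeyError, TypeError):
        return None


# ASSUMPTION: fabricated body for illustration only, not a real response.
FAKE_HTML = '<div class="single-review">...</div>'
FAKE_BODY = ")]}'\n\n[" + json.dumps(["ecr", 1, FAKE_HTML, 1]) + "\n]"

if __name__ == "__main__":
    print(review_form_data("/store/apps/details?id=com.supercell.boombeach"))
    print(extract_review_html(FAKE_BODY))
```

Parsing the query string avoids the pitfall of str.strip(), which treats its argument as a set of characters and can eat the start or end of the package name as well.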
A sample review I am getting after cleaning the data:
    <div class="single-review">
      <a href="/store/people/details?id=106726831005267540508">
        <img class="author-image" alt="lorence gerona avatar image" src="https://lh3.googleusercontent.com/ufp_tstjbouy7kue5xasga=w48-c-h48">
      </a>
      <div class="review-header" data-expand-target="" data-reviewid="gp:aoqptohnsexa_p6jfrjd6hf5h71fpy91tnaeodjtfitu-zpfki9znysnp1hecgfpgefu9xqwjl_j-03tx0e9lw">
        <div class="review-info">
          <span class="author-name">
            <a href="/store/people/details?id=106726831005267540508">lorence gerona</a>
          </span>
          <span class="review-date">3 june 2015</span>
          <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&reviewid=z3a6qu9xcfrpsg5zrxhhx1a2skzsskq2sey1adcxznbzotf0tmfft0rqdgzpvhutelbga2k5wm5zc05wmuhfy0dgcedfznu5ehf3skxfai0wm1r4mgu5bhc" title="link review"></a>
          <div class="review-source" style="display:none"> </div>
          <div class="review-info-star-rating">
            <div class="tiny-star star-rating-non-editable-container" aria-label="rated 5 stars out of 5 stars">
              <div class="current-rating" style="width: 100%;"> </div>
            </div>
          </div>
        </div>
        <div class="rate-review-wrapper">
          <div class="play-button icon-button small rate-review" title="spam" data-rating="spam">
            <div class="icon spam-flag"></div>
          </div>
          <div class="play-button icon-button small rate-review" title="helpful" data-rating="helpful">
            <div class="icon thumbs-up"></div>
          </div>
          <div class="play-button icon-button small rate-review" title="unhelpful" data-rating="unhelpful">
            <div class="icon thumbs-down"></div>
          </div>
        </div>
      </div>
      <div class="review-body">
        <span class="review-title">team boom beach</span> amazing game can defeat hammerman
        <div class="review-link" style="display:none">
          <a class="id-no-nav play-button tiny" href="#" target="_blank">full review</a>
        </div>
      </div>
    </div>
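To give an idea of how fields could be pulled out of a review like the one above, here is a small sketch that uses only the standard library's HTMLParser instead of a Scrapy Selector, so it can be run on its own. The class ReviewParser, the function parse_review and the trimmed SAMPLE string are all illustrative assumptions. Inside the spider, XPath on the Selector (for example sell.xpath('//span[@class="review-date"]/text()')) would probably be the more natural route.

```python
try:  # Python 3
    from html.parser import HTMLParser
except ImportError:  # Python 2
    from HTMLParser import HTMLParser


class ReviewParser(HTMLParser):
    """Collect the text of a few <span> fields plus the star-rating label."""

    # class attribute on the span -> key in the result
    FIELDS = {"author-name": "author",
              "review-date": "date",
              "review-title": "title"}

    def __init__(self):
        HTMLParser.__init__(self)
        self.review = {}
        self._field = None  # field whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # the rating is only available as the aria-label of the star container
        if "star-rating-non-editable-container" in classes and attrs.get("aria-label"):
            self.review["rating"] = attrs["aria-label"]
        if tag == "span" and self._field is None:
            for cls in classes:
                if cls in self.FIELDS:
                    self._field = self.FIELDS[cls]
                    self.review[self._field] = ""

    def handle_endtag(self, tag):
        # none of the collected spans contains another span
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.review[self._field] += data


def parse_review(html):
    """Return a dict of the collected fields with whitespace trimmed."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.review.items()}


# Trimmed copy of the sample review above, kept short for illustration.
SAMPLE = (
    '<div class="single-review">'
    '<span class="author-name"> <a href="/store/people/details?id=106726831005267540508">'
    'lorence gerona</a> </span>'
    '<span class="review-date">3 june 2015</span>'
    '<div class="tiny-star star-rating-non-editable-container" '
    'aria-label="rated 5 stars out of 5 stars"></div>'
    '<div class="review-body"><span class="review-title">team boom beach</span>'
    ' amazing game can defeat hammerman</div>'
    '</div>'
)

if __name__ == "__main__":
    print(parse_review(SAMPLE))
```

The free text of the review body that follows the title span is not collected by this sketch; that would need its own handling.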