Scrapy XPath extraction and an encoding problem - stevec - PTT (批踢踢實業坊)


Author: stevec (steve) 2014-11-29 19:20:32

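I'm not sure why this happens, so I'd like to ask everyone to take a look.
The code below uses XPath in Scrapy to extract some information I want.
raw_html_article_content_ stores the part of the page I want to extract,
and raw stores a larger region (the whole page),
so in theory raw should contain everything in raw_html_article_content_.
However, the same text comes out slightly differently in raw than in raw_html_article_content_.
For example:
raw: 結婚並無Z&gt;B (this matches what Chrome shows when viewing the page source)
raw_html_article_content_ : 結婚並無Z>B
How can I make the text stored in raw match what is in raw_html_article_content_?
PS: environment is Windows 7, Python 2.7, Scrapy 1.4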
# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
import urllib2

# Download the page and wrap it so Scrapy selectors can be used on it.
address = "http://www.ptt.cc/bbs/Boy-Girl/M.1416362560.A.881.html"
response = urllib2.urlopen(address)
html = response.read()
html_response = HtmlResponse(address, body=html)
sel = Selector(html_response)

# Text of the footer span; every node before the last such span is the article body.
recog_assist_word = u"※ 文章網址: "
xpath = u"""/html/body/div[@id="main-container"]/div[@id="main-content"]/
span[@class="f2" and text()="%s"][last()]/preceding-sibling::node()""" % recog_assist_word

# Join all extracted nodes into one string.
raw_html_article_content_ = u"".join(sel.xpath(xpath).extract())
# Serialize the whole document from the root element.
raw = sel.xpath(u"/html").extract()[0]
print raw_html_article_content_
print raw

Author: dritchie (卍~邁斯納效應~卍) 2014-11-30 01:27:00

That encoding is called an HTML entity.
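To make that concrete, here is a minimal sketch of decoding an entity with the Python standard library. The sample string is made up for illustration and is not taken from the page in question. The import fallback is there because the helper lives in a different place on Python 2.7 (the environment mentioned above) than on Python 3.

```python
try:
    from html import unescape            # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7
    unescape = HTMLParser().unescape

# Illustrative input: "&gt;" is the entity form of ">".
escaped = u"Z&gt;B"
decoded = unescape(escaped)

print(decoded)
```

Decoding like this only makes sense on a piece of text, not on a whole serialized page, because unescaping a full HTML document would also turn escaped characters back into what looks like markup.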

Author: stevec (steve) 2014-11-30 11:03:00

Thanks! But how do I get named entities to display correctly in Python? And why does Scrapy sometimes fix them for me and sometimes not? What's the trick here?
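One likely explanation for the "sometimes" is the kind of node being extracted: a text node is returned as already-decoded text, while an element is serialized back into markup, which escapes ">" again. The sketch below shows that distinction using the standard library's ElementTree as a stand-in. It does not call Scrapy itself, so treat the parallel with Scrapy selectors as an assumption, and the one-line document is hypothetical. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line document whose source markup contains the entity "&gt;".
source = "<div>Z&gt;B</div>"
elem = ET.fromstring(source)

# The parsed text content: the entity has already been decoded by the parser.
text_node = elem.text
# The element written back out as markup: ">" is escaped again.
serialized = ET.tostring(elem, encoding="unicode")

print(text_node)
print(serialized)
```

If the same distinction applies to the code in the question, selecting text (for example with an XPath ending in text()) rather than serializing the /html element would be the way to get the decoded form.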
