[Question] Beginner web scraping (getting blocked)

Author: etudiant (weiwei)   2022-10-09 22:41:29
Hi everyone, I'm back with another question. I've been scraping data from a crowdfunding platform, but quite often several pages in the middle return nothing, and then work again a little later. I'd like to ask what might be causing this... I'm not sure whether it's related to the "checking your network connection" screen in the image below; I sometimes hit that screen when I click through pages too quickly as well. If it is related, is there a way around it? Thanks!
https://i.imgur.com/GdWEijn.jpg
My code is attached below.
(The idea is to first grab each project's id from the listing pages, substitute it into the URL to fetch each project's details, and finally export everything to Excel.)
import requests
import bs4
import time
import random
import pandas as pd

collect_title = []
collect_category = []
collect_goal = []
collect_final = []
collect_people = []  # empty lists that will hold the scraped data

def get_page_info(URL):
    # The age_checked_for cookie passes the 18+ check on one page-9 project
    headers = {
        'cookie': 'age_checked_for=12925;',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/104.0.0.0 Safari/537.36',
    }
    response = requests.get(URL, headers=headers)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    data = soup.find_all('div', 'text-black project text-neutral-600 mx-4 mb-4 pb-12 relative h-full')
    for t in data:
        # take the project path from the <a> tag's href attribute
        href = t.find("a", "block").attrs["href"]
        link = "https://www.zeczec.com" + href
        response_2 = requests.get(link, headers=headers)
        # parse the contents of the project page we just fetched
        soup_2 = bs4.BeautifulSoup(response_2.text, "html.parser")
        main_info = soup_2.find_all('div', 'container lg:my-8')
        for i in main_info:
            #category = i.find('a', 'underline text-neutral-600 font-bold inline-block').text.strip()
            category = i.find_all('a', 'underline text-neutral-600 font-bold inline-block')[1].text.strip()
            title = i.find('h2', 'text-xl mt-2 mb-1 leading-relaxed').text.strip()
            final_cash = i.find('div', 'text-2xl font-bold js-sum-raised whitespace-nowrap leading-relaxed').text.strip()
            goal_cash = i.find('div', 'text-xs leading-relaxed').text.strip()
            people = i.find('span', 'js-backers-count').text.strip()
            final = '類別:{} 標題:{} 目標:{} 實際:{} 贊助人數:{}'.format(
                category, title, goal_cash[6:], final_cash[3:], people)
            print(final)
            collect_category.append(category)
            collect_title.append(title)
            collect_goal.append(goal_cash[6:])    # drop the label prefix before the number
            collect_final.append(final_cash[3:])  # drop the label prefix before the number
            collect_people.append(people)         # store everything in the collect lists
        time.sleep(2)

for i in range(1, 13):
    print("第" + str(i) + "頁")  # "page i"
    URL = "https://www.zeczec.com/categories?category=1&page=" + str(i) + "&type=0"
    get_page_info(URL)
    delay = random.choice([3, 7, 8, 5])  # random delay between listing pages
    time.sleep(delay)

print(len(collect_goal))  # how many rows were scraped
#print(collect_final)
#print(collect_people)
col1 = "類別"      # category
col2 = "標題"      # title
col3 = "目標金額"  # goal amount
col4 = "實際金額"  # amount raised
col5 = "贊助人數"  # number of backers
data = pd.DataFrame({col1: collect_category, col2: collect_title, col3: collect_goal,
                     col4: collect_final, col5: collect_people})  # one column per list
data.to_excel('音樂.xlsx', sheet_name='sheet1', index=False)
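One thing worth trying (not from the original post): reuse a single requests.Session with an automatic retry policy, so transient 429/5xx responses are retried with exponentially growing waits instead of silently yielding empty pages. The `make_session` name is illustrative; `Retry` ships with urllib3, which requests depends on. A minimal sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # One Session reuses TCP connections across all requests,
    # and the mounted adapter retries failed requests automatically.
    session = requests.Session()
    retry = Retry(
        total=5,                                     # at most 5 retries per URL
        backoff_factor=2,                            # exponentially growing waits
        status_forcelist=[429, 500, 502, 503, 504],  # throttling / server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# usage inside get_page_info (headers as in the script above):
# session = make_session()
# response = session.get(URL, headers=headers, timeout=10)
```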
Screenshot of the problem: several pages suddenly return nothing, like this. I don't know whether it's requests itself or my requests being too frequent?
https://i.imgur.com/PaN0z1N.jpg
Author: surimodo (好吃棉花糖)   2022-10-09 23:59:00
  It's just too frequent. Raise the time.sleep values.
Author: cocoaswifty (coco)   2022-10-09 23:59:00
  You basically answered it yourself at the end.
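Following up on the replies: if the empty pages are rate limiting, raising the sleeps helps, and wrapping each request in a retry loop with randomized exponential backoff keeps one blocked page from losing data. A sketch under that assumption; `backoff_delay` and `fetch_with_retries` are illustrative names, and `fetch` is any zero-argument callable such as `lambda: requests.get(URL, headers=headers, timeout=10)`:

```python
import random
import time

def backoff_delay(attempt, base=3.0, cap=60.0):
    # Exponential backoff: base * 2**attempt, capped, plus up to 1s of
    # random jitter so the requests don't look clockwork-regular.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def fetch_with_retries(fetch, max_attempts=4, base=3.0):
    # Retry when `fetch` raises or returns None, sleeping longer each time.
    for attempt in range(max_attempts):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception:
            pass  # transient failure; back off and try again
        time.sleep(backoff_delay(attempt, base=base))
    return None  # give up; the caller can skip this page instead of crashing
```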