[Question] Beginner web scraping (getting blocked)

Author: etudiant (weiwei)   2022-10-09 22:41:29
Hi everyone, I'm back with another question. I've been scraping data from a crowdfunding platform, but quite often several pages in the middle return nothing, and then work again a little later. I'd like to ask what might be causing this... I'm not sure whether it's related to the "checking your network connection" screen in the image below; I sometimes hit that screen when I click through pages too quickly as well. If it is related, is there a way around it? Thanks!
https://i.imgur.com/GdWEijn.jpg
My code is attached below.
(The idea is to first grab each project's id from the listing pages, substitute it into the URL to fetch each project's details, and finally export everything to Excel.)
import requests
import bs4
import time
import random
import pandas as pd

collect_title = []
collect_category = []
collect_goal = []
collect_final = []
collect_people = []  # empty lists that will hold the scraped data

def get_page_info(URL):
    # The age_checked_for cookie passes the 18+ check on one page-9 project
    headers = {
        'cookie': 'age_checked_for=12925;',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/104.0.0.0 Safari/537.36',
    }
    response = requests.get(URL, headers=headers)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    data = soup.find_all('div', 'text-black project text-neutral-600 mx-4 mb-4 pb-12 relative h-full')
    for t in data:
        # take the project path from the <a> tag's href attribute
        href = t.find("a", "block").attrs["href"]
        link = "https://www.zeczec.com" + href
        response_2 = requests.get(link, headers=headers)
        # parse the contents of the project page we just fetched
        soup_2 = bs4.BeautifulSoup(response_2.text, "html.parser")
        main_info = soup_2.find_all('div', 'container lg:my-8')
        for i in main_info:
            #category = i.find('a', 'underline text-neutral-600 font-bold inline-block').text.strip()
            category = i.find_all('a', 'underline text-neutral-600 font-bold inline-block')[1].text.strip()
            title = i.find('h2', 'text-xl mt-2 mb-1 leading-relaxed').text.strip()
            final_cash = i.find('div', 'text-2xl font-bold js-sum-raised whitespace-nowrap leading-relaxed').text.strip()
            goal_cash = i.find('div', 'text-xs leading-relaxed').text.strip()
            people = i.find('span', 'js-backers-count').text.strip()
            final = '類別:{} 標題:{} 目標:{} 實際:{} 贊助人數:{}'.format(
                category, title, goal_cash[6:], final_cash[3:], people)
            print(final)
            collect_category.append(category)
            collect_title.append(title)
            collect_goal.append(goal_cash[6:])    # drop the label prefix before the number
            collect_final.append(final_cash[3:])  # drop the label prefix before the number
            collect_people.append(people)         # store everything in the collect lists
        time.sleep(2)

for i in range(1, 13):
    print("第" + str(i) + "頁")  # "page i"
    URL = "https://www.zeczec.com/categories?category=1&page=" + str(i) + "&type=0"
    get_page_info(URL)
    delay = random.choice([3, 7, 8, 5])  # random delay between listing pages
    time.sleep(delay)

print(len(collect_goal))  # how many rows were scraped
#print(collect_final)
#print(collect_people)
col1 = "類別"      # category
col2 = "標題"      # title
col3 = "目標金額"  # goal amount
col4 = "實際金額"  # amount raised
col5 = "贊助人數"  # number of backers
data = pd.DataFrame({col1: collect_category, col2: collect_title, col3: collect_goal,
                     col4: collect_final, col5: collect_people})  # one column per list
data.to_excel('音樂.xlsx', sheet_name='sheet1', index=False)
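One thing worth trying (not from the original post): reuse a single requests.Session with an automatic retry policy, so transient 429/5xx responses are retried with exponentially growing waits instead of silently yielding empty pages. The `make_session` name is illustrative; `Retry` ships with urllib3, which requests depends on. A minimal sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # One Session reuses TCP connections across all requests,
    # and the mounted adapter retries failed requests automatically.
    session = requests.Session()
    retry = Retry(
        total=5,                                     # at most 5 retries per URL
        backoff_factor=2,                            # exponentially growing waits
        status_forcelist=[429, 500, 502, 503, 504],  # throttling / server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# usage inside get_page_info (headers as in the script above):
# session = make_session()
# response = session.get(URL, headers=headers, timeout=10)
```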
Screenshot of the problem: several pages suddenly return nothing, like this. I don't know whether it's requests itself or my requests being too frequent?
https://i.imgur.com/PaN0z1N.jpg
Author: surimodo (好吃棉花糖)   2022-10-09 23:59:00
  It's just too frequent. Raise the time.sleep values.
Author: cocoaswifty (coco)   2022-10-09 23:59:00
  You basically answered it yourself at the end.
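Following up on the replies: if the empty pages are rate limiting, raising the sleeps helps, and wrapping each request in a retry loop with randomized exponential backoff keeps one blocked page from losing data. A sketch under that assumption; `backoff_delay` and `fetch_with_retries` are illustrative names, and `fetch` is any zero-argument callable such as `lambda: requests.get(URL, headers=headers, timeout=10)`:

```python
import random
import time

def backoff_delay(attempt, base=3.0, cap=60.0):
    # Exponential backoff: base * 2**attempt, capped, plus up to 1s of
    # random jitter so the requests don't look clockwork-regular.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def fetch_with_retries(fetch, max_attempts=4, base=3.0):
    # Retry when `fetch` raises or returns None, sleeping longer each time.
    for attempt in range(max_attempts):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception:
            pass  # transient failure; back off and try again
        time.sleep(backoff_delay(attempt, base=base))
    return None  # give up; the caller can skip this page instead of crashing
```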