[Python] Complete Python scraper code, yours for the taking

Honkers · 2025-3-5 20:21:54

Writing a Python scraper can be a bit daunting for beginners. While practicing, you can start from a ready-made template like this one, which saves time and effort.

The script uses Python to scrape data from a target site and saves it to an Excel file in the same directory.

Here is the code:

import re
import urllib.error
import urllib.request

import xlwt
from bs4 import BeautifulSoup


def main():
    baseurl = "http://jshk.com.cn"
    datalist = getDate(baseurl)
    savepath = ".\\jshk.xls"
    saveDate(datalist, savepath)
    # askURL("http://jshk.com.cn/")


# Regular expressions for the fields pulled out of each item block:
# detail link, image, title, rating, number of ratings, one-line synopsis.
findlink = re.compile(r'<a href="(.*?)">')
findimg = re.compile(r'<img.*src="(.*?)"', re.S)
findtitle = re.compile(r'<span class="title">(.*)</span>')
findrating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
findjudge = re.compile(r'<span>(\d*)人评价</span>')
findinq = re.compile(r'<span class="inq">(.*)</span>')


def getDate(baseurl):
    """Fetch 10 pages (25 items each) and extract one row of fields per item."""
    datalist = []
    for i in range(0, 10):
        url = baseurl + str(i * 25)      # page offset appended to the base URL
        html = askURL(url)
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            data = []
            item = str(item)
            link = re.findall(findlink, item)[0]
            data.append(link)
            img = re.findall(findimg, item)[0]
            data.append(img)
            title = re.findall(findtitle, item)[0]
            data.append(title)
            rating = re.findall(findrating, item)[0]
            data.append(rating)
            judge = re.findall(findjudge, item)[0]
            data.append(judge)
            inq = re.findall(findinq, item)
            if len(inq) != 0:
                data.append(inq[0].replace("。", ""))
            else:
                data.append(" ")
            print(data)
            datalist.append(data)
    print(datalist)
    return datalist


def askURL(url):
    """Request a page with a browser User-Agent and return the decoded HTML."""
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


def saveDate(datalist, savepath):
    """Write the scraped rows to an .xls workbook with xlwt."""
    workbook = xlwt.Workbook(encoding='utf-8')
    worksheet = workbook.add_sheet('电影', cell_overwrite_ok=True)
    # Column headers: detail link, image, title, rating, number of ratings, synopsis
    col = ("电影详情", "图片", "影片", "评分", "评价数", "概况")
    for i in range(0, 6):
        worksheet.write(0, i, col[i])
    for i in range(0, len(datalist)):
        print("第%d条" % (i + 1))        # progress: row number
        data = datalist[i]
        for j in range(0, 6):
            worksheet.write(i + 1, j, data[j])
    workbook.save(savepath)


if __name__ == '__main__':
    main()
    print("爬取完毕")                    # scraping finished

You can copy and paste it as-is.

To scrape a different site, change the URL and update the HTML parsing rules (the handling of "item" in the code) to match that site's markup, as sketched below.
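As a minimal sketch of what such an adaptation might look like: the snippet below assumes a hypothetical listing page whose entries are <div class="product"> blocks containing a name and a price. The URL, the tag/class names, and the column headers are all placeholders you would replace with the real site's structure.

import urllib.request

import xlwt
from bs4 import BeautifulSoup


def scrape_products(baseurl, savepath):
    """Sketch: adapt the template to a hypothetical product-listing page."""
    head = {"User-Agent": "Mozilla/5.0"}
    request = urllib.request.Request(baseurl, headers=head)
    html = urllib.request.urlopen(request).read().decode("utf-8")

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Replace 'div.product', 'span.name' and 'span.price' with the
    # actual tags/classes of the site you are scraping.
    for item in soup.find_all("div", class_="product"):
        name = item.find("span", class_="name")
        price = item.find("span", class_="price")
        rows.append([
            name.get_text(strip=True) if name else "",
            price.get_text(strip=True) if price else "",
        ])

    # Same xlwt pattern as the template: header row first, then one row per item.
    workbook = xlwt.Workbook(encoding="utf-8")
    worksheet = workbook.add_sheet("products", cell_overwrite_ok=True)
    for col, header in enumerate(("name", "price")):
        worksheet.write(0, col, header)
    for row_idx, row in enumerate(rows, start=1):
        for col_idx, value in enumerate(row):
            worksheet.write(row_idx, col_idx, value)
    workbook.save(savepath)


# Example call (placeholder URL):
# scrape_products("http://example.com/products", ".\\products.xls")

The idea is the same as in the template: one function fetches and parses a page, and one function writes the collected rows to the spreadsheet; only the selectors and column layout change from site to site.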
