Python's ability to scrape data needs no introduction. This article walks through building a simple crawler from scratch to scrape second-hand housing data from 5i5j (我爱我家).
1. URL analysis
1.1 Link analysis
The 5i5j link for Hangzhou listings is https://hz.5i5j.com/zufang/o8q1ni/
where
o8 sorts by most recently published;
q1 filters for residential listings;
i is the page number, so the link for page 1 is https://hz.5i5j.com/zufang/o8q1n1/.
Looping over the pages you need is therefore enough to fetch the raw data:
import requests

for i in range(1, pages_wanted):  # pages_wanted: placeholder for however many pages you want
    url = 'https://sz.5i5j.com/ershoufang/o8q1n' + str(i) + '/'
    response = requests.get(url).text.encode("utf-8")
1.2 Page analysis
The figure below shows a 5i5j listing page together with its HTML source.
From it we can see that
the house address sits at '.listCon>div>p>a:nth-child(2)';
the unit price at '.listCon>div>div>p:nth-child(2)';
the house description at '.listCon>div>p:nth-child(1)';
the total price at '.listCon>div>div>p>strong';
the publish time at '.listCon>div>p:nth-child(3)'.
All of these fields can therefore be extracted with the BeautifulSoup module:
import requests
from bs4 import BeautifulSoup

for i in range(1, pages_wanted):  # pages_wanted: placeholder for however many pages you want
    url = 'https://sz.5i5j.com/ershoufang/o8q1n' + str(i) + '/'
    response = requests.get(url).text.encode("utf-8")
    # parse the page and pull out each field
    soup = BeautifulSoup(response, 'html.parser')
    addrs = soup.select('.listCon>div>p>a:nth-child(2)')
    unit_prices = soup.select('.listCon>div>div>p:nth-child(2)')
    discriptions = soup.select('.listCon>div>p:nth-child(1)')
    publish_times = soup.select('.listCon>div>p:nth-child(3)')
    totals = soup.select('.listCon>div>div>p>strong')
2. Program logic
With the analysis above in place, we can start writing the program logic.
2.1 Import the basic modules
Import the modules needed later on:
import requests
from bs4 import BeautifulSoup
import time
import random
import datetime
import pymysql
2.2 Preparing IPs and headers
2.2.1 Build the IP pool and header pool
Set up an IP pool and a header pool and route requests through proxies, so the site does not block us for scraping too heavily:
ip_pool = [
    '117.26.88.236:8080',
    '125.111.147.64:8080',
    '218.6.107.40:8888',
    '122.143.91.172:8080',
    '110.90.221.125:8888',
    '111.126.76.22:9999',
    # ...
]
headers_pool = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    # ...
]
2.2.2 Functions for picking an IP and a header
With the two pools in place, add two helper functions that return a random IP and a random header:
def get_ip():
    # pick a random proxy; random.choice avoids hard-coding the pool
    # size the way randrange(0, 7) did, which could go out of range
    ip = random.choice(ip_pool)
    proxy_ip = "http://" + ip
    # cover both schemes so the proxy is also applied to https URLs
    proxies = {'http': proxy_ip, 'https': proxy_ip}
    return proxies

def get_headers():
    return random.choice(headers_pool)
2.3 Fetching data from the page
Create a function that extracts the page data; the logic follows the analysis in 1.2:
def get_info(response):
    soup = BeautifulSoup(response, 'html.parser')
    addrs = soup.select('.listCon>div>p>a:nth-child(2)')
    unit_prices = soup.select('.listCon>div>div>p:nth-child(2)')
    discriptions = soup.select('.listCon>div>p:nth-child(1)')
    publish_times = soup.select('.listCon>div>p:nth-child(3)')
    totals = soup.select('.listCon>div>div>p>strong')
    # the function continues in 2.4 with cleaning and the database insert
2.4 Cleaning the data and inserting it into the database
Continuing after the selectors in the function from 2.3, clean the data and store it; a MySQL database is used below:
conn = pymysql.connect(host="your_db_host", user="your_user", password="your_password", database="your_database", charset="utf8")
cursor = conn.cursor()
for addr, unit_price, discription, publish_time, total in zip(addrs, unit_prices, discriptions, publish_times, totals):
    layout, area, direction, storey, fitUp, buildTime = discription_split(discription.get_text())
    publishTime = publish_times_split(publish_time.get_text())
    unitPrice = unit_price.get_text().replace('单价', '').replace('元/m²', '')
    houseName = addr.get_text()
    totalPrice = total.get_text()
    sql = """INSERT INTO house_price(platform,house_name,layout,area,direction,storey,fit_up,build_time,publish_time,unit_price,total_price) VALUES (2,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    cursor.execute(sql, (houseName, layout, area, direction, storey, fitUp, buildTime, publishTime, unitPrice, totalPrice))
conn.commit()  # commit once after all rows are inserted
cursor.close()
conn.close()
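The INSERT above assumes a house_price table already exists. As a minimal sketch only, here is a table definition that would match the columns used; the column types and lengths are my assumptions, not taken from the original article:

create_sql = """
CREATE TABLE IF NOT EXISTS house_price (
    id INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key, assumed
    platform INT,
    house_name VARCHAR(255),
    layout VARCHAR(64),
    area VARCHAR(32),
    direction VARCHAR(32),
    storey VARCHAR(64),
    fit_up VARCHAR(64),
    build_time VARCHAR(32),
    publish_time VARCHAR(32),
    unit_price VARCHAR(32),
    total_price VARCHAR(32)
) DEFAULT CHARSET=utf8
"""
cursor.execute(create_sql)  # run once with the same cursor before inserting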
2.5 Other data-processing functions
2.5.1 Date handling
The scraped dates contain non-date strings such as '今天发布' (published today) and '昨天发布' (published yesterday); convert these into actual dates:
def publish_times_split(publish_times):
    split_times = publish_times.replace(' ', '').split('·')
    publish_time = split_times[2]
    if publish_time == '今天发布':
        publish_time = str(datetime.date.today())
    elif publish_time == '昨天发布':
        publish_time = str(datetime.date.today() - datetime.timedelta(days=1))
    return publish_time
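As a quick check: the function only cares about the third '·'-separated field, so with a made-up raw string (the first two fields below are hypothetical placeholders, not real page data):

raw = '关注1次 · 浏览5次 · 今天发布'  # hypothetical raw time string
print(publish_times_split(raw))       # prints today's date, e.g. '2024-05-01'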
2.5.2 Description handling
The house description comes as a single string; split it into fields:
def discription_split(discription):
    split_discription = discription.replace(' ', '').split('·')
    layout = area = direction = storey = fit_up = build_time = ''
    try:
        layout = split_discription[0]
        area = split_discription[1].replace('平米', '')
        direction = split_discription[2]
        storey = split_discription[3]
        fit_up = split_discription[4]
        build_time = split_discription[5]
    finally:
        # returning from finally swallows the IndexError raised by a short
        # description, so any missing fields simply fall back to ''
        return layout, area, direction, storey, fit_up, build_time
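A quick illustration with a made-up description string in the '·'-separated format the function expects (not taken from a real page):

desc = '3室2厅 · 89平米 · 南 · 中层/18层 · 精装 · 2015年'  # hypothetical description
print(discription_split(desc))
# ('3室2厅', '89', '南', '中层/18层', '精装', '2015年')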
2.6 Main loop fetching the pages
The main block loops over the pages, fetches each one through a proxy, and hands the response to the processing function:
if __name__ == '__main__':
    for i in range(1, pages_wanted):  # pages_wanted: placeholder for however many pages you want
        url = 'https://sz.5i5j.com/ershoufang/o8q1n' + str(i) + '/'
        header = {'User-Agent': get_headers()}
        try:
            res = requests.get(url, headers=header, proxies=get_ip()).text.encode("utf-8")
            get_info(res)
        except requests.RequestException as e:
            print(e)
        time.sleep(5)  # pause between pages to avoid hammering the site
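A fixed five-second pause makes the request rhythm easy to fingerprint. Since random is already imported, a randomized interval is a small optional refinement (my suggestion, not part of the original script):

time.sleep(random.uniform(3, 8))  # sleep 3-8 seconds instead of a fixed 5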