Python实现网络爬虫

python-web-crawler

2022年03月23日 16点39分41秒 • 技术 • 5711字 •

前言

最近在参与一个数据收集的项目，需要大量获取图像及链接等，用人力显然是完成不过来了，于是索性就做个爬虫，一劳永逸了。这里因为项目比较小，对效率要求不大，就选择了使用 Python 而不是 C 语言。 (也因为 Python 用起来更省事) 本文所含代码可直接跳转#代码查看

效果

img1

效果如上图，即输入网页链接，自动提取所含图片链接，同时自动转化相对路径为绝对路径，方便下载。最后每行一个 print 出来，方便统一存储/下载。

实现方式

Python 在爬虫方面已经十分成熟，这里引用第三方库 BeautifulSoup 与 urllib，若无这些库请下载:

pip install bs4
pip install urllib

*命令行执行即可

依赖库准备完后，引用：

from urllib.request import urlopen,build_opener,ProxyHandler
from bs4 import BeautifulSoup as bf
from urllib import request
import random

此处引用 random 以及 build_opener 与 ProxyHandler 是为了后续反爬， (毕竟默认 UA 是 Python.Urllib) 接着配置 UA 池与 IP 代理池，防止被反爬(若项目规模较小可忽略此步)

UA

user_agent_list = [
"Mozilla/5.0(Macintosh;IntelMacOSX10.6;rv:2.0.1)Gecko/20100101Firefox/4.0.1",
"Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1)",
"Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11",
"Mozilla/5.0(Macintosh;IntelMacOSX10_7_0)AppleWebKit/535.11(KHTML,likeGecko)Chrome/17.0.963.56Safari/535.11",
"Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)",
"Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;Trident/4.0;SE2.XMetaSr1.0;SE2.XMetaSr1.0;.NETCLR2.0.50727;SE2.XMetaSr1.0)"
]

随机 UA

headers ={
'User-Agent':random.choice(user_agent_list) ## 随机抽取UA
}
ip_list=[
'125.120.62.26', ##IP池
'66.249.93.118'
]

IP


ip={
'http':random.choice(ip_list) ##随机抽取 IP
}
link = input("在此输入网址:http://")
htmlurl = "https://"+str(link) #链接整合，若 input 中输入了带 http 头的链接可忽略此行
req = request.Request(htmlurl,headers=headers) #请求整合

其中，ip_list 推荐使用Github@jhao104/proxy_pool开源的 IP 代理池。 代码中所列 IP 均为演示作用，若需应用请自行设置 在此就完成了 UA 与 IP 的随机分配，反爬基本完成 不过反爬归反爬，也请自觉遵守 robot 协议，合理利用爬虫 下一步，发出请求：

用 ProxyHandler 创建代理 ip 对象

pro_han = ProxyHandler(ip)

使用 build_opener 替代 urlopen()创建一个对象

opener = build_opener(pro_han)

发送请求

res = opener.open(req)

到这里为止，整个请求结束，之后用 BeautifulSoup 解析: *下面已用 bs 代指 beautifulsoup

obj = bf(res.read(),'html.parser') #解析 html
title = str(obj.head.title) #获取标题
print("站点标题:",title,"正在查找图片")
pic_info = obj.find_all('img') #查询 img 标签

这里也给出不含反爬的请求： (基本同上，唯一的区别是直接用 urlopen 打开链接)

html = urlopen("https://"+link)
obj = bs(html.read(),'html.parser') #解析 html
title = str(obj.head.title) #获取标题
print("站点标题:",title,"正在查找图片")
pic_info = obj.find_all('img') #查询 img 标签

到这里我们已经成功将网页中所含的所有 img 标签以列表形式存储在了变量 pic_info 中，接下来遍历输出即可：

j = 0 #配置遍历
for i in pic_info:
    j += 1
    pic = str(i['src']) #转为字符串，方便查询
    if "http" not in pic: #检测 http 头
        if "data" in pic: #检测是否为 DataURIScheme
            continue
        else:
            if "//" in pic: #格式补全
                print("http:"+pic)
            else:
                if pic[0] == "/": #适配相对路径
                    print("http://"+link+pic)
                else:
                    print("http://"+link+"/"+pic)
    else:
        print(pic) #直接 print 绝对路径

上图套了四个 if-else，作用分别是检测是否有 http 头、是否为内嵌 base64 图片、是否以//简写路径、是否使用相对路径，到这里为止，整个程序就结束了整个示例程序可分为引用-配置-请求-分析-输出 5 个部分，除了爬取图片，也可将上面的pic_info = obj.find_all('img')改成其他标签，比如改成 meta 可爬取简介，也可在特定站点内通过 zaifindall 内添加对应的 class(class="xxx")及 id(id_="xxx")来获取对应标签内的信息，实现更多功能。

代码

完整版

from urllib.request import urlopen,build_opener,ProxyHandler
from bs4 import BeautifulSoup as bf
from urllib import request
import random
# UA
user_agent_list = [
    "Mozilla/5.0(Macintosh;IntelMacOSX10.6;rv:2.0.1)Gecko/20100101Firefox/4.0.1",
    "Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1)",
    "Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11",
    "Mozilla/5.0(Macintosh;IntelMacOSX10_7_0)AppleWebKit/535.11(KHTML,likeGecko)Chrome/17.0.963.56Safari/535.11",
    "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)",
    "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;Trident/4.0;SE2.XMetaSr1.0;SE2.XMetaSr1.0;.NETCLR2.0.50727;SE2.XMetaSr1.0)"
]
# 随机UA
headers ={
    'User-Agent':random.choice(user_agent_list)
}
ip_list=[
    '209.97.171.128',
    '114.250.25.19',
    '125.120.62.26',
    '66.249.93.118',
    '1.202.113.240',
]
# IP
ip={
    'http':random.choice(ip_list)
}
link = input("在此输入网址:http://")
htmlurl = "https://"+str(link)
req = request.Request(htmlurl,headers=headers)
# 创建代理ip对象
pro_han = ProxyHandler(ip)
# 使用build_opener创建一个对象
opener = build_opener(pro_han)
# 发送请求
res = opener.open(req)
obj = bf(res.read(),'html.parser') #解析html
title = str(obj.head.title)
print("站点标题:",title,"正在查找图片")
pic_info = obj.find_all('img')
j = 0 #配置遍历
for i in pic_info:
    j += 1
    pic = str(i['src'])
    if "http" not in pic:
        if "data" in pic:
            continue
        else:
            if "//" in pic:
                print("http:"+pic) 
            else:
                if pic[0] == "/":
                    print("http://"+link+pic)  
                else:
                    print("http://"+link+"/"+pic) 
          
    else:
        print(pic)

基础版

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
link = input("在此输入网址:http://")
html = urlopen("https://"+link)
obj = bs(html.read(),'html.parser') #解析html
title = str(obj.head.title)
print("站点标题:",title,"正在查找图片")
pic_info = obj.find_all('img')
j = 0 #配置遍历
for i in pic_info:
    j += 1
    pic = str(i['src'])
    if "http" not in pic:
        if "data" in pic:
            continue
        else:
            if "//" in pic:
                print("http:"+pic) 
            else:
                if pic[0] == "/":
                    print("http://"+link+pic)  
                else:
                    print("http://"+link+"/"+pic) 
          
    else:
        print(pic)

原创内容使用知识共享署名-非商业性使用-相同方式共享 4.0 (CC BY-NC-ND 4.0)协议授权。转载请注明出处。

最后编辑于 2024年07月29日 10点25分18秒

aaf69a18-e0f8-4cd2-b35c-ad00d4e0cc17

上一篇
论静态页中伪动态的实现 下一篇
JS递归遍历伪数组

0/1000

登录后发送评论

RavelloH's Blog

LOADing...

Python实现网络爬虫

python-web-crawler

前言

效果

实现方式

UA

随机 UA

IP

用 ProxyHandler 创建代理 ip 对象

使用 build_opener 替代 urlopen()创建一个对象

发送请求

代码

RavelloH's

BLOG

框架状态

00:00

通知中心