一个小爬虫

今天就写了个爬CVPR年会的爬虫,没别的,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
# -*- coding: utf-8 -*-
"""Crawl the CVPR open-access listing and download papers whose titles match a keyword."""
import re
import urllib.request
import os
import argparse

# Change the year in LISTING_URL (and the paths below) to crawl another edition.
BASE_URL = 'http://openaccess.thecvf.com/'
LISTING_URL = BASE_URL + 'CVPR2020.py'


def sanitize_filename(title):
    """Return *title* with characters unsafe in file names replaced by underscores."""
    for ch in (':', '"', '?', '/', '+', ' '):
        title = title.replace(ch, '_')
    return title


def main():
    """Fetch the listing page and download every keyword-matching paper PDF."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('--keyword', type=str, default='detection')
    args = parser.parse_args()

    # Imported lazily so this module can be imported without requests installed.
    import requests

    # Fetch the listing page that holds every paper's title and PDF link.
    data = requests.get(LISTING_URL).text

    # PDF links appear as href="...pdf">pdf; titles live inside *_paper.html anchors.
    linklist = re.findall(r"(?<=href=\").+?pdf(?=\">pdf)|(?<=href=\').+?pdf(?=\">pdf)", data)
    namelist = re.findall(r"(?<=href=\").+?2020_paper.html\">.+?</a>", data)  # year appears here too

    num = len(linklist)
    localpath = './CVPR2020/{}/'.format(args.keyword)  # year appears here too
    if not os.path.exists(localpath):
        os.makedirs(localpath)

    # Compile the case-insensitive keyword pattern once, outside the loop.
    # NOTE(review): the keyword is interpreted as a regular expression, as in
    # the original; pass a plain word unless regex matching is intended.
    searchmodel = re.compile(r'{}'.format(args.keyword), re.IGNORECASE)

    # zip() also guards against the two lists differing in length, which the
    # original index-based loop would have turned into an IndexError.
    for cnt, (url, name_tag) in enumerate(zip(linklist, namelist)):
        # The paper title sits between the '>' and '<' of the anchor tag.
        filename = sanitize_filename(name_tag.split('<')[0].split('>')[1])

        # Skip papers whose title contains no word matching the keyword.
        if not any(searchmodel.search(word) for word in filename.split('_')):
            continue

        filepath = localpath + filename + '.pdf'
        if os.path.exists(filepath):
            # Bug fix: the original printed the literal '{}' placeholder here.
            print('file [{}.pdf] exist, skip downloading'.format(filename))
            continue

        print('[' + str(cnt) + "/" + str(num) + "] Downloading -> " + filepath)
        try:
            urllib.request.urlretrieve(BASE_URL + url, filepath)
        except Exception:
            # Best effort: report the failure and move on to the next paper.
            print('download failed: ' + filepath)

    print('finished')


if __name__ == '__main__':
    main()