[3] Crawling a Web Page, Comparing File Contents, and Sending a Slack Alert (Web Crawling)

The code is available on GitHub.

https://github.com/JaeYeongSong/Blog/tree/main/Crawling

Last time, we looked at how to crawl only the tags we want.

This time, we will take the crawled data and compare the contents of the saved files.

Today's goal: when a new post is uploaded to my Tistory blog, send an alert to Slack. The script crawls the blog's post list, and if the result differs from the previous crawl, it sends a Slack message.

To use Slack from Python through its API, please read the Slack API posts below first.

Slack API, Part 1

https://xsop.tistory.com/13

[1] Building a Bot with the Slack API

Slack API, Part 2

https://xsop.tistory.com/14

[2] Sending Messages with a Slack API Bot

Now let's look at the code.

There are three source files today.

from bs4 import BeautifulSoup
import requests
import sys
import time

# URL of the page to crawl
url = "https://xsop.tistory.com/category"

tm = time.localtime()

# Current time fields (year, month, day, hour, minute)
now_year_real = tm.tm_year
now_mon_real = tm.tm_mon
now_day_real = tm.tm_mday
now_hour_real = tm.tm_hour
now_min_real = tm.tm_min

# Zero-pad each field to at least two digits
now_year = str(now_year_real).zfill(2)
now_mon = str(now_mon_real).zfill(2)
now_day = str(now_day_real).zfill(2)
now_hour = str(now_hour_real).zfill(2)
now_min = str(now_min_real).zfill(2)

# Fetch the page and collect every element whose class is "name"
html = requests.get(url)
bs_html = BeautifulSoup(html.content, "html.parser")
bsObject = bs_html.find_all(attrs={"class":"name"})

# File name built from the timestamp ('시' = hour, '분' = minute)
now_ymdhm = now_year + '-' + now_mon + '-' + now_day + ' ' + now_hour + '시' + ' ' + now_min + '분'

# Redirect standard output to the snapshot file
sys.stdout = open(f'D:/list/{now_ymdhm}.txt', 'w', encoding='UTF-8')

print(bsObject) # Write the crawled elements to the txt file

▲ index/Crawling.py

This code does the crawling.

Here is what you need to change.

Put the address of the web page you want to crawl in the url variable at the top.

Next, specify which tag, class, or id to crawl, or whether to crawl the whole page, by writing one of the following inside the parentheses of bs_html.find_all():

Target	Argument
Tag	'placeholder'
class	attrs={"class":"placeholder"}
id	id='placeholder'

placeholder: the name of the tag, class, or id
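
To make the three forms concrete, here is a minimal sketch that runs find_all() against a small inline HTML string instead of a live page. The HTML, the tag names, and the variable names are made up for illustration; only the find_all() call forms come from the table above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real blog page
html = """
<div id="content">
  <a class="name" href="/1">First post</a>
  <a class="name" href="/2">Second post</a>
  <span class="date">2021</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("a")                       # by tag name
by_class = soup.find_all(attrs={"class": "name"})  # by class
by_id = soup.find_all(id="content")               # by id

# Keeping only the text of each element gives a shorter snapshot
titles = [tag.get_text() for tag in by_class]
print(titles)
```

Saving only the text, as in titles, is one option if you want the comparison to ignore attribute changes in the markup. The scripts in this post save the tags themselves.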

Next, in the sys.stdout line below,

sys.stdout = open(f'D:/list/{now_ymdhm}.txt', 'w', encoding='UTF-8')

▲ File save path in index/Crawling.py

enter the path inside open(). The folder has to exist already, because open() does not create it.
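
As an alternative to redirecting sys.stdout, the snapshot can be written with a plain with open(...) block. The sketch below is only one way to do it. The helper names snapshot_path and save_snapshot are hypothetical, and the file name uses an ASCII-only pattern (2021-01-02 03-04.txt) rather than the 시/분 pattern in Crawling.py. The demo writes to a temporary folder, not D:/list/.

```python
import datetime
import tempfile
from pathlib import Path


def snapshot_path(folder, now):
    """Build a path like <folder>/2021-01-02 03-04.txt for the given time."""
    # Zero-padded fields keep the names in chronological order when sorted.
    return Path(folder) / (now.strftime("%Y-%m-%d %H-%M") + ".txt")


def save_snapshot(folder, text, now=None):
    """Write text to a timestamped file and return the path that was used."""
    now = now or datetime.datetime.now()
    path = snapshot_path(folder, now)
    # Create the folder if it is missing, so open() does not fail.
    path.parent.mkdir(parents=True, exist_ok=True)
    # The with block closes the file, so nothing stays buffered.
    with open(path, "w", encoding="UTF-8") as f:
        f.write(text)
    return path


# Demo in a temporary folder with a fixed time
with tempfile.TemporaryDirectory() as demo_folder:
    fixed_time = datetime.datetime(2021, 1, 2, 3, 4)
    saved = save_snapshot(demo_folder, '[<a class="name">post</a>]', fixed_time)
    saved_name = saved.name
    saved_text = saved.read_text(encoding="UTF-8")
```

In Crawling.py this would be called as save_snapshot('D:/list/', str(bsObject)), assuming you want the same content that print(bsObject) writes, minus the trailing newline.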

Next is the code that compares the file contents.

import datetime
import filecmp
import os
import sys

import requests

myToken = "your key"  # Slack bot token

def post_message(token, channel, text):
    # Send a message to a Slack channel through chat.postMessage
    response = requests.post("https://slack.com/api/chat.postMessage",
        headers={"Authorization": "Bearer "+token},
        data={"channel": channel,"text": text}
    )
    print(response)

folder_path = 'D:/list/' # Folder that holds the crawled txt files

# each_file_path_and_gen_time: (path, creation time) for every file
each_file_path_and_gen_time = []
for each_file_name in os.listdir(folder_path):
    # getctime: creation time of the given path (on Windows)
    each_file_path = os.path.join(folder_path, each_file_name)
    each_file_gen_time = os.path.getctime(each_file_path)
    each_file_path_and_gen_time.append(
        (each_file_path, each_file_gen_time)
    )

# A comparison needs at least two snapshots
if len(each_file_path_and_gen_time) < 2:
    print("Fewer than two files to compare.")
    sys.exit(0)

# Sort by creation time, newest first
newest_first = sorted(each_file_path_and_gen_time, key=lambda x: x[1], reverse=True)
most_recent_file_1 = newest_first[0][0]  # most recent file
most_recent_file_2 = newest_first[1][0]  # second most recent file

nows = datetime.datetime.now()
print_now = nows.strftime('%Y-%m-%d %H:%M:%S')

# True when the contents are identical; shallow=False compares the bytes
is_same = filecmp.cmp(most_recent_file_1, most_recent_file_2, shallow=False)

if not is_same:
    print(f"{print_now} - A change was detected.")
    post_message(myToken,"#Test", f"`{print_now} - A change was detected.`")
    sys.exit(1)

▲ index/Compare.py

This code compares the crawled data.

filecmp.cmp() returns True when the two files have the same contents and False when they differ.

True means the contents are the same, so no new post has been uploaded.

So we only need to send a Slack alert when the result is False.
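
The True/False behavior described above can be tried without crawling anything. This is a small sketch using temporary files. The helper names has_changed and write are hypothetical, and passing shallow=False is my own choice here to force a content comparison.

```python
import filecmp
import os
import tempfile


def has_changed(old_path, new_path):
    """Return True when the two files differ in content."""
    # shallow=False compares the file contents instead of
    # trusting os.stat() signatures (size and modification time).
    return not filecmp.cmp(old_path, new_path, shallow=False)


def write(path, text):
    """Write text to path as UTF-8."""
    with open(path, "w", encoding="UTF-8") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "first.txt")
    second = os.path.join(folder, "second.txt")
    third = os.path.join(folder, "third.txt")
    write(first, "post A")
    write(second, "post A")
    write(third, "post A, post B")
    same_result = has_changed(first, second)  # identical contents
    diff_result = has_changed(first, third)   # a new post appeared

print(same_result, diff_result)
```

Wrapping the call in has_changed() flips the result, so the alert condition reads as "if has_changed(...)" instead of "if the comparison is False".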

Here is what you need to change.

Put your Slack API bot token (key) in myToken.

Then change the channel passed to post_message.

The uploaded code uses #Test.
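
If you would rather not paste the token into the source file, one option is to read it from an environment variable. The sketch below only builds the request and does not send anything. The function name slack_request and the variable name SLACK_BOT_TOKEN are assumptions for this example, not something Slack requires.

```python
import os


def slack_request(channel, text, token=None):
    """Build the URL, headers and form data for chat.postMessage."""
    # SLACK_BOT_TOKEN is a hypothetical variable name chosen for this sketch.
    token = token or os.environ.get("SLACK_BOT_TOKEN", "")
    if not token:
        raise ValueError("Slack token is missing")
    return {
        "url": "https://slack.com/api/chat.postMessage",
        "headers": {"Authorization": "Bearer " + token},
        "data": {"channel": channel, "text": text},
    }


# Demo with a placeholder token; nothing is sent over the network
demo = slack_request("#Test", "hello", token="test-token")
print(demo["headers"])
```

The returned dict can be passed straight to requests.post(**demo). As far as I know, Slack usually answers with HTTP 200 even when a call is rejected and reports the problem in the ok and error fields of the JSON body, so checking response.json() is more informative than printing the response object.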

Set folder_path to the folder where Crawling.py saves its txt files.

In Crawling.py, I used the list folder on the D drive.

Use whatever path your own crawling program saves its txt files to.
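
Picking the two latest snapshots from that folder can also be written as a small function. This is a sketch under a few assumptions: the helper name two_newest is hypothetical, it looks only at .txt files, and it sorts by modification time (st_mtime) rather than getctime, because getctime means creation time on Windows but metadata change time on Unix.

```python
import os
import tempfile
from pathlib import Path


def two_newest(folder):
    """Return (newest, second_newest) .txt paths, or None if fewer than two exist."""
    files = sorted(
        Path(folder).glob("*.txt"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if len(files) < 2:
        return None
    return files[0], files[1]


# Demo in a temporary folder with hand-set modification times
with tempfile.TemporaryDirectory() as folder:
    for i, name in enumerate(["a.txt", "b.txt", "c.txt"]):
        p = Path(folder) / name
        p.write_text("x", encoding="UTF-8")
        os.utime(p, (1000 + i, 1000 + i))  # c.txt becomes the newest
    newest, second = two_newest(folder)
    newest_name, second_name = newest.name, second.name
    # With a single file left there is nothing to compare
    (Path(folder) / "b.txt").unlink()
    (Path(folder) / "c.txt").unlink()
    too_few = two_newest(folder)

print(newest_name, second_name, too_few)
```

Returning None when there are fewer than two files lets the caller skip the comparison on the very first run.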

Finally, these two Python programs have to run repeatedly every few minutes, so we will automate that with start.py.

from time import sleep
import os
import threading
import datetime
import requests

myToken = "your key"  # Slack bot token

def post_message(token, channel, text):
    # Send a message to a Slack channel through chat.postMessage
    response = requests.post("https://slack.com/api/chat.postMessage",
        headers={"Authorization": "Bearer "+token},
        data={"channel": channel,"text": text}
    )
    print(response)

now_times = datetime.datetime.now()
nows = now_times.strftime('%Y-%m-%d %H:%M:%S')

text_intro = "-------------------------------------------------"
print(f"{nows} - The program has started.")
text_intro_01 = f"{nows} - The program has started."

print(f"{nows} - Fetching from Tistory and comparing every 5 minutes (300 seconds).")
text_intro_02 = f"{nows} - Fetching from Tistory and comparing every 5 minutes (300 seconds)."

post_message(myToken,"#Test", text_intro)
post_message(myToken,"#Test", text_intro_01)
post_message(myToken,"#Test", text_intro_02)

sleep(5)

# Fresh timestamp for the first crawl
re_now_times = datetime.datetime.now()
re_nows = re_now_times.strftime('%Y-%m-%d %H:%M:%S')

# First run crawls only, because Compare.py needs two snapshots.
# Running a bare .py path relies on the Windows file association.
os.system('D:/Crawling.py')
print(f"{re_nows} - The Crawling program was run.")
print(f"{re_nows} - The programs will run again in 300 seconds.")

sleep(300)

def restart():
    now_times = datetime.datetime.now()
    now = now_times.strftime('%Y-%m-%d %H:%M:%S')

    os.system('D:/Crawling.py')
    print(f"{now} - The Crawling program was run.")

    os.system('D:/Compare.py')
    print(f"{now} - The Compare program was run.")

    # Schedule the next run
    threading.Timer(300, restart).start()

restart()

▲ index/start.py

This program is start.py.

It runs Crawling.py and Compare.py, described above, automatically.

Here is what you need to change.

Put your Slack API bot token (key) in myToken.

Then change the channel passed to post_message.

The uploaded code uses #Test.

Next, pass the paths of the two programs, Crawling.py and Compare.py, to os.system().

Finally, at the end of restart():

threading.Timer(300, restart).start()

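▲ Repeat interval setting in index/start.py

Set how many seconds to wait between runs. The sleep(300) before restart() is the wait before the first comparison, so change it too if you want the same interval there.

The value is in seconds, so use 300 for 5 minutes, 600 for 10 minutes, and so on.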
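
For comparison, the same repetition can be written as a plain loop instead of a chain of timers. This is only a sketch. The names run_script and run_every are hypothetical, and run_every takes a rounds argument and an injectable sleep function so the demo finishes immediately. A real watcher would loop until stopped.

```python
import subprocess
import sys
import time


def run_script(path):
    """Run a Python script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, path]).returncode


def run_every(interval_seconds, job, rounds, sleep=time.sleep):
    """Call job() `rounds` times, waiting interval_seconds between calls."""
    results = []
    for i in range(rounds):
        results.append(job())
        # No wait is needed after the final round
        if i < rounds - 1:
            sleep(interval_seconds)
    return results


# Demo with a fake sleep that records the waits instead of pausing
waits = []
demo_results = run_every(300, lambda: "ran", rounds=3, sleep=waits.append)
print(demo_results, waits)
```

With the real scripts, the job could be something like lambda: (run_script('D:/Crawling.py'), run_script('D:/Compare.py')). Calling the interpreter through sys.executable avoids depending on how .py files are associated on the machine.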

Thank you for reading.

That was how to crawl a web page, compare the file contents, and send an alert to Slack.
