Scraping the school's announcements involves working with XML and HTML, so let's start with a quick introduction to the BeautifulSoup4 package.

BeautifulSoup4

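BeautifulSoup4 (bs4 for short) is a Python package for parsing HTML and XML documents. It lets you extract data from a web page easily, without resorting to complicated string manipulation.

Don't ask why it's called Beautiful Soup; the author simply felt like calling it that.

Initialization

Before you start, install the package:

pip install beautifulsoup4

Then create a parser like this: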

from bs4 import BeautifulSoup

# Sample data
html = """
<html>
  <body>
    <h1>Hello World</h1>
    <p>這是一段<br/>文字</p>
    <p>這是第二段文字</p>
  </body>
</html>
"""
# h1: level-1 heading
# p: paragraph
# br: line break
soup = BeautifulSoup(html, "html.parser")

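The raw source fetched from a web page looks like the sample above.
To extract a specific piece of data, we would normally need complicated string processing to locate the right section of text.
bs4 does this for us with very little effort.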

find and find_all

find() returns the first element that matches the condition.
Note that it returns the element itself, not the content inside it.

soup.find("p")
soup.p # this shortcut works too
# <p>這是一段<br/>文字</p>

find_all() returns a list of all elements that match the condition.

soup.find_all("p") # [<p>這是一段<br/>文字</p>, <p>這是第二段文字</p>]
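
One detail worth knowing when a tag may be absent: find() returns None when nothing matches, while find_all() returns an empty list. A minimal sketch, reusing the sample HTML from above (the table lookup is only an illustrative tag that the sample does not contain):

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Hello World</h1>
    <p>這是一段<br/>文字</p>
    <p>這是第二段文字</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
print(len(paragraphs))         # 2

missing = soup.find("table")   # the sample has no <table>
print(missing)                 # None

print(soup.find_all("table"))  # []

# Check the result before using it; calling .text on None raises AttributeError
heading = soup.find("h1")
if heading is not None:
    print(heading.text)        # Hello World
```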

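Getting text

Reading a tag's text attribute gives you the text inside it.
A <br> tag is a separate element with no text of its own, so it simply disappears from the result.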

tag = soup.p
print(tag.text)
# 這是一段文字

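If you read the text of a higher-level tag (body, for example), or of the soup itself,
you get all the text inside it without having to target a specific tag.
(The output shown below omits the extra blank lines and indentation that come from the whitespace in the sample HTML.)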

print(soup.text)
# Hello World
# 這是一段文字
# 這是第二段文字

This way the tags are stripped out and only the text remains, with no complicated string handling on our part.

Handling line breaks

We can use replace_with() to replace an entire tag.

for br in soup.find_all('br'): # find every tag that represents a line break
    br.replace_with('\n') # replace it with Python's newline character

print(soup.text)

# Hello World
# 這是一段
# 文字
# 這是第二段文字
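
As an alternative to replacing the br tags by hand, get_text() accepts a separator and a strip argument. The sketch below assumes you want every text fragment on its own line; unlike replace_with(), it leaves the soup unmodified.

```python
from bs4 import BeautifulSoup

html = "<body><h1>Hello World</h1><p>這是一段<br/>文字</p><p>這是第二段文字</p></body>"
soup = BeautifulSoup(html, "html.parser")

# separator is placed between text fragments; strip=True trims each fragment
# and drops the ones that contain only whitespace
text = soup.get_text(separator="\n", strip=True)
print(text)
# Hello World
# 這是一段
# 文字
# 這是第二段文字
```

Because the br tag splits its paragraph into two fragments, the separator ends up where the line break was. The trade-off is that every tag boundary becomes a newline, not just br.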

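Fetching announcement data

First, go to the school website, open the latest news page, scroll to the very bottom, and find the RSS feed icon.

An RSS feed works much like an API: the publisher keeps the document up to date, and because every piece of data sits inside a named tag, its structure makes the content easy to extract.

Clicking the icon opens a document that looks a lot like HTML but is actually XML.
Each <item> tag contains <title>, <link>, <description>, and <pubDate>.
bs4 can parse all of these so the content is easy to access.

Copy the feed's URL and open a new Python file for testing.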
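
In RSS 2.0, the pubDate value normally follows the RFC 822 date format, which Python's standard library can parse. A small sketch, assuming the school's feed follows that convention (the date string below is a made-up example, not taken from the feed):

```python
from email.utils import parsedate_to_datetime

# Hypothetical pubDate value in RFC 822 format
raw = "Mon, 01 Sep 2025 08:30:00 +0800"

published = parsedate_to_datetime(raw)  # timezone-aware datetime
print(published.strftime("%Y-%m-%d %H:%M"))  # 2025-09-01 08:30
```

A shorter date like this could later be used wherever the raw pubDate text feels too long.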

Basic test: fetching announcement data

We can fetch the data with requests and then hand it to bs4 for parsing.

import requests
from bs4 import BeautifulSoup

feed = requests.get('https://www.hs.ntnu.edu.tw/rssfeeds?a=T0RESTEyNTIxODAyNTk2MjI2MVRDaW50ZWx5&b=T0RESTYyaW50ZWx5&c=T0RESU1EVTNOakl5TXpjPXdBek55SWpOeElrVGludGVseQ==').text

soup = BeautifulSoup(feed, 'xml')

Run the script at this point. If you get the following error,
install the lxml package with pip (pip install lxml).

bs4.exceptions.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?
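
If you would rather check for the parser before running the script, one option is to test whether lxml is importable. This is only a convenience sketch; has_lxml is a hypothetical helper name, not part of bs4.

```python
import importlib.util


def has_lxml() -> bool:
    """Return True if the lxml package can be imported in this environment."""
    return importlib.util.find_spec("lxml") is not None


if not has_lxml():
    print("lxml is missing; install it with: pip install lxml")
```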

Next, let's try getting the titles.

items = soup.find_all('item') # get every item first
for item in items:
    print(item.title.text) # print the text of the title under each item

# Or take only the first item
item = soup.item
print(item.title.text)
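
The request in this test file has no timeout and does not check the HTTP status, so a slow or failing server can hang the script or hand bs4 an error page. Below is a sketch of a slightly more defensive fetch; fetch_feed is a hypothetical helper and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_feed(url: str, timeout: float = 10.0) -> bytes:
    """Download the feed and return the raw response bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    # Return bytes so the parser can use the encoding declared in the XML itself
    return response.content
```

The result can be passed straight to the parser, for example BeautifulSoup(fetch_feed(url), 'xml'), since BeautifulSoup accepts bytes as well as str.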

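Processing the body

So far so good, but there is one small problem:
if you look closely at the description you get back, it is HTML rather than plain text.
That means we need a second soup to process it,
using the techniques covered earlier.

For now, let's just get comfortable with extracting the text of a single article.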

content = soup.item.description.text

soup2 = BeautifulSoup(content, 'html.parser')

for br in soup2.find_all('br'):
    br.replace_with('\n')

print(soup2.text)

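With that, we have extracted the text of one article. You can also wrap this in a function that takes a single item.

def extract(item):
    # Parse the description's HTML and return it as plain text
    soup = BeautifulSoup(item.description.text, 'html.parser')
    for br in soup.find_all('br'):
        br.replace_with('\n')

    return soup.text

# Just take the first article for now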
item = soup.item
print(extract(item))
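
Once the tags are stripped, announcement bodies often contain stray spaces and runs of blank lines. The following is a sketch of an optional clean-up step that could be applied to the string returned by extract(); tidy_text is a hypothetical helper, and the sample string is invented for illustration.

```python
def tidy_text(text: str) -> str:
    """Trim each line and collapse consecutive blank lines into one."""
    result = []
    for line in text.splitlines():
        line = line.strip()
        if line == "" and result and result[-1] == "":
            continue  # skip a blank line that directly follows another blank line
        result.append(line)
    return "\n".join(result).strip()


sample = "  第一行  \n\n\n\n第二行\n   \n第三行\n"
print(tidy_text(sample))
# 第一行
#
# 第二行
#
# 第三行
```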

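Integrating with the bot

Here is the overall approach:
After fetching the announcements, create a select menu and pass all the item elements into the menu object.
Use self.add_option() to add options dynamically, setting each option's value to that item's index in the item list.

Once the user makes a choice, we receive the value, which is the position of the chosen item in the list.
With that item in hand, we can use the methods above to get its title, date, body, and link.

Command

# Imports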
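
The index round trip described above can be sketched without Discord at all. The titles below are hypothetical stand-ins for the parsed item elements, and the tuples stand in for select options:

```python
# Hypothetical announcement titles standing in for the parsed <item> elements
titles = ["Exam schedule", "Sports day", "Library hours"]

# Build (label, value) pairs the way the select menu will:
# the value is the item's index, stored as a string
options = [(title, str(index)) for index, title in enumerate(titles)]
print(options)
# [('Exam schedule', '0'), ('Sports day', '1'), ('Library hours', '2')]

# Later, the callback receives the chosen value as a string
chosen_value = "1"
chosen_title = titles[int(chosen_value)]  # convert back to int to index the list
print(chosen_title)  # Sports day
```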
from bs4 import BeautifulSoup
from discord import ui
import requests

@bot.tree.command()
async def anno(interaction:discord.Interaction):
    await interaction.response.defer() # always defer first (the "thinking" state), because fetching the data takes time

    feed = requests.get('https://www.hs.ntnu.edu.tw/rssfeeds?a=T0RESTEyNTIxODAyNTk2MjI2MVRDaW50ZWx5&b=T0RESTYyaW50ZWx5&c=T0RESU1EVTNOakl5TXpjPXdBek55SWpOeElrVGludGVseQ==').text

    soup = BeautifulSoup(feed, 'xml')
    items = soup.find_all('item') # get the list of all articles

    view = ui.View()
    view.add_item(Select(items))

    await interaction.followup.send(view=view)
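
Note that requests.get() is a blocking call, so while it waits for the school's server the bot cannot process other events. For a small bot this is usually tolerable, but one common way around it is asyncio.to_thread() (Python 3.9+). The sketch below uses time.sleep as a stand-in for the network request; slow_fetch is a hypothetical function, and in the command you would wrap the requests.get call the same way.

```python
import asyncio
import time


def slow_fetch() -> str:
    """Stand-in for a blocking call such as requests.get(...).text."""
    time.sleep(0.2)
    return "<rss></rss>"


async def main() -> str:
    # to_thread runs the blocking function in a worker thread,
    # so the event loop stays free to handle other events in the meantime
    feed = await asyncio.to_thread(slow_fetch)
    return feed


if __name__ == "__main__":
    print(asyncio.run(main()))
```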

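Select menu

For the select menu, we need each title together with its position in the list.
A handy trick for this is enumerate().
It wraps the iterable you loop over in a for loop and yields each element's position along with its value,
as a (position, value) pair.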
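
To see the pairs themselves, you can turn the result into a list. The colors list here is just illustrative data; the optional start argument changes the first position, though for the select menu the default of 0 is what we want, because the value is used as a list index.

```python
colors = ['red', 'green', 'blue']

# Each element produced by enumerate is a (position, value) tuple
pairs = list(enumerate(colors))
print(pairs)  # [(0, 'red'), (1, 'green'), (2, 'blue')]

# start shifts the numbering without changing the values
numbered = list(enumerate(colors, start=1))
print(numbered)  # [(1, 'red'), (2, 'green'), (3, 'blue')]
```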

fruits = ['apple', 'banana', 'guava']

for index, fruit in enumerate(fruits):
    print(f"{fruit} is at index {index}")

# Output:
"""
apple is at index 0
banana is at index 1
guava is at index 2
"""

class Select(ui.Select):
    def __init__(self, items):
        super().__init__(
            placeholder="請選擇一個標題",
            min_values=1,
            max_values=1
        )

        self.items = items # keep it as an attribute, because callback needs it

        for index, item in enumerate(items):
            # label is the title, value is the index (value must be a str, so we convert it), and description is the date
            self.add_option(label=item.title.text, value=str(index), description=item.pubDate.text)

    async def callback(self, interaction:discord.Interaction):
        await interaction.response.defer()

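        # Get the index of the announcement the user picked, then look it up in self.items
        item = self.items[int(self.values[0])] # value is a string, so convert it back to an int
        content = extract(item)

        # A triple-quoted string can span several lines like this
        # Indentation inside the string is kept as-is, so the lines must start at the left margin
        message = f"""
## {item.title.text}
{item.pubDate.text}

{content}
""".strip() # strip() removes the leading and trailing whitespace, including the extra newlines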
        view = ui.View()
        view.add_item(Link(item.link.text))
        await interaction.message.edit(content=message, view=view)
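
Discord enforces size limits that this code can run into with a real feed: a select menu holds at most 25 options, an option's label and description are limited to 100 characters each, and a message's content is limited to 2000 characters. The helper below is a sketch of one way to stay inside them; clip and the constant names are hypothetical, and cutting with an ellipsis is just one possible policy.

```python
MAX_OPTIONS = 25       # options per select menu
MAX_OPTION_TEXT = 100  # characters in an option label or description
MAX_MESSAGE = 2000     # characters in message content


def clip(text: str, limit: int) -> str:
    """Return text unchanged if it fits, otherwise cut it and end with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1] + "…"


long_title = "公告" * 80  # 160 characters
print(len(clip(long_title, MAX_OPTION_TEXT)))  # 100
print(clip("short", MAX_OPTION_TEXT))          # short
```

In the class above, this could be applied as enumerate(items[:MAX_OPTIONS]), label=clip(item.title.text, MAX_OPTION_TEXT), and content=clip(message, MAX_MESSAGE).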

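Link button

Sending the text directly is not enough, because some announcements include attachments that we cannot forward.
So we let the user follow a link to the original article, reusing the link button from earlier.
Editing the message this way also replaces the original select menu, making it single-use.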

class Link(ui.Button):
    def __init__(self, link):
        super().__init__(label="點我前往原文", url=link)
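
Text pulled out of XML can carry surrounding whitespace, and a link button needs a well-formed URL. The sketch below is an optional sanity check, assuming the announcement links are ordinary http or https URLs; clean_link is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlparse


def clean_link(raw: str) -> Optional[str]:
    """Return the trimmed URL if it is an http(s) link, otherwise None."""
    url = raw.strip()
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return url
    return None


print(clean_link("  https://www.hs.ntnu.edu.tw/  "))  # https://www.hs.ntnu.edu.tw/
print(clean_link("not a link"))                        # None
```

When clean_link returns None, one option is to skip adding the button rather than pass an invalid URL.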

Full code

# Additional imports
import requests
from bs4 import BeautifulSoup
from discord import ui

@bot.tree.command()
async def anno(interaction:discord.Interaction):
    await interaction.response.defer() # always defer first (the "thinking" state), because fetching the data takes time

    feed = requests.get('https://www.hs.ntnu.edu.tw/rssfeeds?a=T0RESTEyNTIxODAyNTk2MjI2MVRDaW50ZWx5&b=T0RESTYyaW50ZWx5&c=T0RESU1EVTNOakl5TXpjPXdBek55SWpOeElrVGludGVseQ==').text

    soup = BeautifulSoup(feed, 'xml')
    items = soup.find_all('item') # get the list of all articles

    view = ui.View()
    view.add_item(Select(items))

    await interaction.followup.send(view=view)

def extract(item):
    # Parse the description's HTML and return it as plain text
    soup = BeautifulSoup(item.description.text, 'html.parser')
    for br in soup.find_all('br'):
        br.replace_with('\n')

    return soup.text

class Select(ui.Select):
    def __init__(self, items):
        super().__init__(
            placeholder="請選擇一個標題",
            min_values=1,
            max_values=1
        )

        self.items = items # keep it as an attribute, because callback needs it

        for index, item in enumerate(items):
            # label is the title, value is the index (value must be a str, so we convert it), and description is the date
            self.add_option(label=item.title.text, value=str(index), description=item.pubDate.text)

    async def callback(self, interaction:discord.Interaction):
        await interaction.response.defer()

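        # Get the index of the announcement the user picked, then look it up in self.items
        item = self.items[int(self.values[0])] # value is a string, so convert it back to an int
        content = extract(item)

        # A triple-quoted string can span several lines like this
        # Indentation inside the string is kept as-is, so the lines must start at the left margin
        message = f"""
## {item.title.text}
{item.pubDate.text}

{content}
""".strip() # strip() removes the leading and trailing whitespace, including the extra newlines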
        view = ui.View()
        view.add_item(Link(item.link.text))
        await interaction.message.edit(content=message, view=view)

class Link(ui.Button):
    def __init__(self, link):
        super().__init__(label="點我前往原文", url=link)