Second Brain: Crafted, Curated, Connected, Compounded on October 02
Scraping HEY-Screener Emails

This script scrapes denied and approved email addresses from the HEY Screener using Python and the BeautifulSoup library. It sends HTTP requests with the requests library and parses the HTML pages to extract the emails. The script handles pagination by following the "Older" button in a loop until all pages are traversed, and it writes the scraped addresses to two text files: one for denied and one for approved emails.


If you are wondering how the whole workflow works, there will be an upcoming note, HEY-Screener in (Neo)Mutt; for now, you can check all scripts in my Mutt dotfiles.

# Grabbing emails from ScreenedIn and Out from the current screener page

This is the HEY Screener URL we want to scrape from: https://app.hey.com/my/clearances?page=3. The CSS classes to grab are `screened-person--denied` and `screened-person--approved`.
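As a minimal sketch of how those two classes are selected, here is the extraction logic run against a simplified stand-in for the screener markup (only the class names are taken from the real page; the HTML snippet and addresses are made up):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real screener markup; only the class names
# screened-person--denied / screened-person--approved and
# screened-person__details come from the actual page.
html = """
<div class="screened-person screened-person--denied">
  <div class="screened-person__details"><span>spam@example.com</span></div>
</div>
<div class="screened-person screened-person--approved">
  <div class="screened-person__details"><span>friend@example.com</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select each screened person by class, then pull the email text out of
# the nested details span.
denied = [e.select_one(".screened-person__details span").get_text(strip=True)
          for e in soup.select(".screened-person--denied")]
approved = [e.select_one(".screened-person__details span").get_text(strip=True)
            for e in soup.select(".screened-person--approved")]

print(denied)    # ['spam@example.com']
print(approved)  # ['friend@example.com']
```

The full script below applies exactly this selection, just against the live pages.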

This is the second option I created, after the Console one didn't scale and only worked for a single page.

Then I tried to find an open API (see further below for how you can find it). Once I found one, I used Python to loop through all pages via the "Older" button and run the same extraction on each page:

```python
import requests
from bs4 import BeautifulSoup
import os


def scrape_emails(url, cookies):
    page = 1
    denied_emails = []
    approved_emails = []
    with requests.Session() as session:
        while True:
            response = session.get(url, params={"page": page}, cookies=cookies)
            soup = BeautifulSoup(response.text, "html.parser")

            # Extract emails
            for element in soup.select(".screened-person--denied"):
                email = element.select_one(".screened-person__details span")
                if email:
                    denied_emails.append(email.get_text(strip=True))

            for element in soup.select(".screened-person--approved"):
                email = element.select_one(".screened-person__details span")
                if email:
                    approved_emails.append(email.get_text(strip=True))

            # Check for the 'Older' button/link
            next_page_link = soup.select_one(
                'a.paginator__next[href*="/my/clearances?page="]'
            )
            if not next_page_link:
                break  # No more pages

            page += 1
            # if page == 3:
            #     break

    return denied_emails, approved_emails


def write_to_file(filename, email_list):
    with open(filename, "w") as file:
        for email in email_list:
            file.write(f"{email}\n")


cookies = {
    # Set ENV variable with hey cookie. Load the screener and search in the
    # network tab for the `https://app.hey.com/my/clearances?page=` request.
    # There you see the cookies used. Might need to change after re-login.
    "_csrf_token": os.getenv("HEY_COOKIE"),
}

url = "https://app.hey.com/my/clearances"
denied_emails, approved_emails = scrape_emails(url, cookies)

# Write the lists to files
write_to_file("denied_emails.txt", denied_emails)
write_to_file("approved_emails.txt", approved_emails)

print("Denied Emails:", denied_emails)
print("Approved Emails:", approved_emails)
```

See the latest version on GitHub.

Make sure to set the ENV cookie. You can do that by loading the screener and searching the network tab for the https://app.hey.com/my/clearances?page= request.

There you can see the cookies used; they might need to be changed after a re-login.
See below:
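Assuming a POSIX shell, exporting the cookie before running the script could look like this (the value is a placeholder, not a real token):

```shell
# Export the HEY session cookie so the Python script can read it via
# os.getenv("HEY_COOKIE"). The value below is a placeholder; paste the real
# _csrf_token value copied from your browser's network tab.
export HEY_COOKIE="paste-your-_csrf_token-here"
```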

# Manually per page: Console (JavaScript)

Use the Console in the browser's developer tools to extract all ScreenedIn/Out emails from the current page:

```javascript
const extractEmails = (className) => {
    const emails = [];
    document.querySelectorAll(`.${className}`).forEach(element => {
        const emailElement = element.querySelector('.screened-person__details');
        if (emailElement) {
            const emailText = emailElement.textContent.trim();
            const emailMatch = emailText.match(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/);
            if (emailMatch) {
                emails.push(emailMatch[0]);
            }
        }
    });
    return emails;
};

const deniedEmails = extractEmails('screened-person--denied');
const approvedEmails = extractEmails('screened-person--approved');

console.log('Denied Emails:', deniedEmails);
console.log('Approved Emails:', approvedEmails);
```

Origin: HEY-Screener in (Neo)Mutt
References: Getting the Data – Scraping
Created 2023-11-21
