MarkTechPost@AI · July 24
A Code Implementation to Efficiently Leverage LangChain to Automate PubMed Literature Searches, Parsing, and Trend Visualization

This article shows how to use LangChain and the PubmedQueryRun tool to build an efficient pipeline for retrieving and analyzing biomedical literature. The tutorial walks through running targeted searches, such as "CRISPR gene editing," and then parsing, caching, and exploring the results. You will learn how to extract publication dates, titles, and abstracts, store queries for instant reuse, and prepare the data for visualization or further analysis. A class named AdvancedPubMedResearcher wraps the whole toolkit, streamlining programmatic literature exploration and intelligent querying, including research trend analysis, visualization, and topic comparison.

📚 **PubMed search and parsing:** The PubmedQueryRun tool runs precise searches on PubMed for a given topic (e.g., "CRISPR gene editing"), parses the raw text it returns, and extracts each paper's publication date, title, and abstract, along with the abstract word count, laying the groundwork for later analysis (a minimal direct invocation of the tool is sketched after this list).

🗄️ **Query caching and reuse:** The system caches each search query together with its results and a timestamp, so identical queries can be retrieved instantly without repeated API calls, improving efficiency and making past searches easy to revisit.

📊 **Research trend visualization:** By analyzing papers across multiple research topics, the pipeline generates intuitive charts, including paper counts per topic, the distribution of abstract lengths, a publication timeline, and a word cloud of frequent title terms, helping users quickly grasp the dynamics of a research field.

⚖️ **Topic comparison:** The tool accepts two research topics, retrieves the literature for each, and compares them on metrics such as paper count and average abstract length, providing a basis for understanding how different research directions differ.

🧠 **AI-powered intelligent queries:** With the Gemini LLM integrated, the system can handle more complex natural-language questions, using AI to analyze and answer research queries and further raising the level of automation in literature retrieval.
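As a minimal sketch of the search step in the first bullet, the underlying LangChain tool can be invoked directly, before any of the class machinery introduced below. The query string and the preview slice are illustrative choices:

```python
# Minimal sketch: invoke the LangChain PubMed tool directly.
# Requires the langchain-community package installed later in this tutorial.
from langchain_community.tools.pubmed.tool import PubmedQueryRun

pubmed_tool = PubmedQueryRun()

# The tool returns one plain-text blob describing the top matches;
# the parser later in this tutorial splits it on markers such as
# "Published:", "Title:", and "Summary::".
raw_results = pubmed_tool.invoke("CRISPR gene editing")
print(raw_results[:500])  # preview the first 500 characters
```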

In this tutorial, we introduce the Advanced PubMed Research Assistant, which guides you through building a streamlined pipeline for querying and analyzing biomedical literature. We focus on leveraging the PubmedQueryRun tool to perform targeted searches, such as "CRISPR gene editing," and then parse, cache, and explore those results. You'll learn how to extract publication dates, titles, and summaries; store queries for instant reuse; and prepare your data for visualization or further analysis.

```python
!pip install -q langchain-community xmltodict pandas matplotlib seaborn wordcloud google-generativeai langchain-google-genai

import os
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from collections import Counter
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

from langchain_community.tools.pubmed.tool import PubmedQueryRun
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
```

We install and configure all the essential Python packages, including langchain-community, xmltodict, pandas, matplotlib, seaborn, and wordcloud, as well as Google Generative AI and LangChain Google integrations. We import core data‑processing and visualization libraries, silence warnings, and bring in the PubmedQueryRun tool and ChatGoogleGenerativeAI client. Finally, we prepare to initialize our LangChain agent with the PubMed search capability.

```python
class AdvancedPubMedResearcher:
    """Advanced PubMed research assistant with analysis capabilities"""

    def __init__(self, gemini_api_key=None):
        """Initialize the researcher with optional Gemini integration"""
        self.pubmed_tool = PubmedQueryRun()
        self.research_cache = {}

        if gemini_api_key:
            os.environ["GOOGLE_API_KEY"] = gemini_api_key
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-1.5-flash",
                temperature=0,
                convert_system_message_to_human=True
            )
            self.agent = self._create_agent()
        else:
            self.llm = None
            self.agent = None

    def _create_agent(self):
        """Create LangChain agent with PubMed tool"""
        tools = [
            Tool(
                name="PubMed Search",
                func=self.pubmed_tool.invoke,
                description="Search PubMed for biomedical literature. Use specific terms."
            )
        ]

        return initialize_agent(
            tools,
            self.llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True
        )

    def search_papers(self, query, max_results=5):
        """Search PubMed and parse results"""
        print(f"Searching PubMed for: '{query}'")

        try:
            results = self.pubmed_tool.invoke(query)
            papers = self._parse_pubmed_results(results)

            self.research_cache[query] = {
                'papers': papers,
                'timestamp': datetime.now(),
                'query': query
            }

            print(f"Found {len(papers)} papers")
            return papers

        except Exception as e:
            print(f"Error searching PubMed: {str(e)}")
            return []

    def _parse_pubmed_results(self, results):
        """Parse PubMed search results into structured data"""
        papers = []

        publications = results.split('\n\nPublished: ')[1:]

        for pub in publications:
            try:
                lines = pub.strip().split('\n')

                pub_date = lines[0] if lines else "Unknown"

                title_line = next((line for line in lines if line.startswith('Title: ')), '')
                title = title_line.replace('Title: ', '') if title_line else "Unknown Title"

                summary_start = None
                for i, line in enumerate(lines):
                    if 'Summary::' in line:
                        summary_start = i + 1
                        break

                summary = ""
                if summary_start:
                    summary = ' '.join(lines[summary_start:])

                papers.append({
                    'date': pub_date,
                    'title': title,
                    'summary': summary,
                    'word_count': len(summary.split()) if summary else 0
                })

            except Exception as e:
                print(f"Error parsing paper: {str(e)}")
                continue

        return papers

    def analyze_research_trends(self, queries):
        """Analyze trends across multiple research topics"""
        print("Analyzing research trends...")

        all_papers = []
        topic_counts = {}

        for query in queries:
            papers = self.search_papers(query, max_results=3)
            topic_counts[query] = len(papers)

            for paper in papers:
                paper['topic'] = query
                all_papers.append(paper)

        df = pd.DataFrame(all_papers)

        if df.empty:
            print("No papers found for analysis")
            return None

        self._create_visualizations(df, topic_counts)

        return df

    def _create_visualizations(self, df, topic_counts):
        """Create research trend visualizations"""
        plt.style.use('seaborn-v0_8')
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle('PubMed Research Analysis Dashboard', fontsize=16, fontweight='bold')

        topics = list(topic_counts.keys())
        counts = list(topic_counts.values())

        axes[0, 0].bar(range(len(topics)), counts, color='skyblue', alpha=0.7)
        axes[0, 0].set_xlabel('Research Topics')
        axes[0, 0].set_ylabel('Number of Papers')
        axes[0, 0].set_title('Papers Found by Topic')
        axes[0, 0].set_xticks(range(len(topics)))
        axes[0, 0].set_xticklabels([t[:20] + '...' if len(t) > 20 else t for t in topics],
                                   rotation=45, ha='right')

        if 'word_count' in df.columns and not df['word_count'].empty:
            axes[0, 1].hist(df['word_count'], bins=10, color='lightcoral', alpha=0.7)
            axes[0, 1].set_xlabel('Abstract Word Count')
            axes[0, 1].set_ylabel('Frequency')
            axes[0, 1].set_title('Distribution of Abstract Lengths')

        try:
            dates = pd.to_datetime(df['date'], errors='coerce')
            valid_dates = dates.dropna()
            if not valid_dates.empty:
                axes[1, 0].hist(valid_dates, bins=10, color='lightgreen', alpha=0.7)
                axes[1, 0].set_xlabel('Publication Date')
                axes[1, 0].set_ylabel('Number of Papers')
                axes[1, 0].set_title('Publication Timeline')
                plt.setp(axes[1, 0].xaxis.get_majorticklabels(), rotation=45)
        except Exception:
            axes[1, 0].text(0.5, 0.5, 'Date parsing unavailable',
                            ha='center', va='center', transform=axes[1, 0].transAxes)

        all_titles = ' '.join(df['title'].fillna('').astype(str))
        if all_titles.strip():
            clean_titles = re.sub(r'[^a-zA-Z\s]', '', all_titles.lower())

            try:
                wordcloud = WordCloud(width=400, height=300, background_color='white',
                                      max_words=50, colormap='viridis').generate(clean_titles)
                axes[1, 1].imshow(wordcloud, interpolation='bilinear')
                axes[1, 1].axis('off')
                axes[1, 1].set_title('Common Words in Titles')
            except Exception:
                axes[1, 1].text(0.5, 0.5, 'Word cloud unavailable',
                                ha='center', va='center', transform=axes[1, 1].transAxes)

        plt.tight_layout()
        plt.show()

    def comparative_analysis(self, topic1, topic2):
        """Compare two research topics"""
        print(f"Comparing '{topic1}' vs '{topic2}'")

        papers1 = self.search_papers(topic1)
        papers2 = self.search_papers(topic2)

        avg_length1 = sum(p['word_count'] for p in papers1) / len(papers1) if papers1 else 0
        avg_length2 = sum(p['word_count'] for p in papers2) / len(papers2) if papers2 else 0

        print("\nComparison Results:")
        print(f"Topic 1 ({topic1}):")
        print(f"  - Papers found: {len(papers1)}")
        print(f"  - Avg abstract length: {avg_length1:.1f} words")

        print(f"\nTopic 2 ({topic2}):")
        print(f"  - Papers found: {len(papers2)}")
        print(f"  - Avg abstract length: {avg_length2:.1f} words")

        return papers1, papers2

    def intelligent_query(self, question):
        """Use AI agent to answer research questions (requires Gemini API)"""
        if not self.agent:
            print("AI agent not available. Please provide Gemini API key.")
            print("Get free API key at: https://makersuite.google.com/app/apikey")
            return None

        print(f"Processing intelligent query with Gemini: '{question}'")
        try:
            response = self.agent.run(question)
            return response
        except Exception as e:
            print(f"Error with AI query: {str(e)}")
            return None
```

We encapsulate the PubMed querying workflow in our AdvancedPubMedResearcher class, initializing the PubmedQueryRun tool and an optional Gemini-powered LLM agent for advanced analysis. We provide methods to search for papers, parse and cache results, analyze research trends with rich visualizations, and compare topics side by side. This class streamlines programmatic exploration of biomedical literature and intelligent querying in just a few method calls.
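One detail worth noting: search_papers always issues a fresh API call and only writes to research_cache; it never reads from it. A small wrapper can supply the instant reuse described earlier. The helper below is a hypothetical sketch, not part of the tutorial's class, and the 24-hour freshness window is an assumption for illustration:

```python
from datetime import datetime, timedelta

def cached_search(researcher, query, max_age_hours=24):
    """Hypothetical helper: serve fresh cached results, else re-query PubMed.

    `researcher` is an AdvancedPubMedResearcher instance; the 24-hour
    freshness window is an illustrative choice, not from the source.
    """
    entry = researcher.research_cache.get(query)
    if entry and datetime.now() - entry['timestamp'] < timedelta(hours=max_age_hours):
        print(f"Cache hit for '{query}' ({len(entry['papers'])} papers)")
        return entry['papers']
    return researcher.search_papers(query)

# Usage: the second call is served from the cache, with no API round trip.
researcher = AdvancedPubMedResearcher()
papers = cached_search(researcher, "CRISPR gene editing")
papers = cached_search(researcher, "CRISPR gene editing")
```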

```python
def main():
    """Main tutorial demonstration"""
    print("Advanced PubMed Research Assistant Tutorial")
    print("=" * 50)

    # Initialize researcher
    # Uncomment next line and add your free Gemini API key for AI features
    # Get your free API key at: https://makersuite.google.com/app/apikey
    # researcher = AdvancedPubMedResearcher(gemini_api_key="your-gemini-api-key")
    researcher = AdvancedPubMedResearcher()

    print("\n1⃣ Basic PubMed Search")
    papers = researcher.search_papers("CRISPR gene editing", max_results=3)

    if papers:
        print("\nFirst paper preview:")
        print(f"Title: {papers[0]['title']}")
        print(f"Date: {papers[0]['date']}")
        print(f"Summary preview: {papers[0]['summary'][:200]}...")

    print("\n\n2⃣ Research Trends Analysis")
    research_topics = [
        "machine learning healthcare",
        "CRISPR gene editing",
        "COVID-19 vaccine"
    ]

    df = researcher.analyze_research_trends(research_topics)

    if df is not None:
        print(f"\nDataFrame shape: {df.shape}")
        print("\nSample data:")
        print(df[['topic', 'title', 'word_count']].head())

    print("\n\n3⃣ Comparative Analysis")
    papers1, papers2 = researcher.comparative_analysis(
        "artificial intelligence diagnosis",
        "traditional diagnostic methods"
    )

    print("\n\n4⃣ Advanced Features")
    print("Cache contents:", list(researcher.research_cache.keys()))

    if researcher.research_cache:
        latest_query = list(researcher.research_cache.keys())[-1]
        cached_data = researcher.research_cache[latest_query]
        print(f"Latest cached query: '{latest_query}'")
        print(f"Cached papers count: {len(cached_data['papers'])}")

    print("\nTutorial complete!")
    print("\nNext steps:")
    print("- Add your FREE Gemini API key for AI-powered analysis")
    print("  Get it at: https://makersuite.google.com/app/apikey")
    print("- Customize queries for your research domain")
    print("- Export results to CSV with: df.to_csv('research_results.csv')")

    print("\nBonus: To test AI features, run:")
    print("researcher = AdvancedPubMedResearcher(gemini_api_key='your-key')")
    print("response = researcher.intelligent_query('What are the latest breakthroughs in cancer treatment?')")
    print("print(response)")


if __name__ == "__main__":
    main()
```

We implement the main function to orchestrate the full tutorial demo, guiding users through basic PubMed searches, multi‑topic trend analyses, comparative studies, and cache inspection in a clear, numbered sequence. We wrap up by highlighting the next steps, including adding your Gemini API key for AI features, customizing queries to your domain, and exporting results to CSV, along with a bonus snippet for running intelligent, Gemini-powered research queries.
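As a quick sketch of the CSV export mentioned in the next steps, the DataFrame returned by analyze_research_trends can be written straight to disk; the topic and filename here are just examples:

```python
# Export the trend-analysis results to CSV; the filename is illustrative.
researcher = AdvancedPubMedResearcher()
df = researcher.analyze_research_trends(["CRISPR gene editing"])
if df is not None:
    df.to_csv("research_results.csv", index=False)
```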

In conclusion, we have now demonstrated how to harness the power of PubMed programmatically, from crafting precise search queries to parsing and caching results for quick retrieval. By following these steps, you can automate your literature review process, track research trends over time, and integrate advanced analyses into your workflows. We encourage you to experiment with different search terms, dive into the cached results, and extend this framework to support your ongoing biomedical research.


Check out the CODES here. All credit for this research goes to the researchers of this project.


