MarkTechPost@AI September 8
Building and Using a Bioinformatics AI Agent


This tutorial demonstrates how to build an advanced yet easy-to-use bioinformatics AI agent with Biopython and common Python libraries, designed to run seamlessly in Google Colab. By consolidating sequence retrieval, molecular analysis, visualization, multiple sequence alignment, phylogenetic tree construction, and motif searching into a single streamlined class, it offers a hands-on way to explore the full range of biological sequence analysis. Users can start from the built-in sample sequences (the SARS-CoV-2 Spike protein, the human insulin precursor, and the E. coli 16S rRNA) or fetch custom sequences directly from NCBI. With built-in visualization powered by Plotly and Matplotlib, researchers and students can quickly run comprehensive DNA and protein analyses with no setup beyond a Colab notebook.

✨ **Integrated core functionality**: The agent consolidates sequence retrieval, molecular analysis, visualization, multiple sequence alignment, phylogenetic tree construction, and motif searching into a single easy-to-use Python class, greatly simplifying the workflow.

🧬 **Flexible sequence handling**: Users can fetch sequences through NCBI's Entrez module or work with the built-in samples provided in the tutorial (the SARS-CoV-2 Spike protein, the human insulin precursor, and the E. coli 16S rRNA), which covers a range of research needs.

📊 **Interactive visualization**: With Plotly and Matplotlib integrated, the agent generates rich interactive charts of sequence composition, GC content, molecular weight, and phylogenetic trees, helping users interpret results at a glance.

🚀 **Comprehensive analysis pipeline**: The agent provides an end-to-end workflow covering basic sequence analysis, protein structure and property analysis, comparative analysis, codon usage profiling, and sliding-window GC content analysis, supporting everything from introductory exercises to more advanced bioinformatics work.

💻 **Easy to deploy and use**: Designed for Google Colab, the agent needs only a handful of dependencies to get up and running, with no complex local environment configuration, which substantially lowers the barrier to bioinformatics analysis.

In this tutorial, we demonstrate how to build an advanced yet accessible Bioinformatics AI Agent using Biopython and popular Python libraries, designed to run seamlessly in Google Colab. By combining sequence retrieval, molecular analysis, visualization, multiple sequence alignment, phylogenetic tree construction, and motif searches into a single streamlined class, the tutorial provides a hands-on approach to explore the full spectrum of biological sequence analysis. Users can start with built-in sample sequences such as the SARS-CoV-2 Spike protein, Human Insulin precursor, and E. coli 16S rRNA, or fetch custom sequences directly from NCBI. With built-in visualization tools powered by Plotly and Matplotlib, researchers and students alike can quickly perform comprehensive DNA and protein analyses without needing prior setup beyond a Colab notebook. Check out the FULL CODES here.

!pip install biopython pandas numpy matplotlib seaborn plotly requests beautifulsoup4 scipy scikit-learn networkx
!apt-get update
!apt-get install -y clustalw

import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from Bio import SeqIO, Entrez, Align, AlignIO, Phylo  # AlignIO is needed for reading ClustalW alignments below
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import warnings

warnings.filterwarnings('ignore')
Entrez.email = "your_email@example.com"

We begin by installing essential bioinformatics and data science libraries, along with ClustalW for sequence alignment. We then import Biopython modules, visualization tools, and supporting packages, while setting up Entrez with our email to fetch sequences from NCBI. This ensures our Colab environment is fully prepared for advanced sequence analysis. Check out the FULL CODES here.
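Before building the agent, it can be worth running a quick, self-contained sanity check. This is a minimal sketch of ours (not part of the original tutorial) that confirms Biopython imported correctly and that the SeqUtils helpers used throughout behave as expected on a short, arbitrary DNA fragment:

import Bio
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight

print("Biopython version:", Bio.__version__)

# A short, made-up demo fragment (39 bp, divisible by 3 so it can be translated)
demo = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print("GC content (%):", round(gc_fraction(demo) * 100, 2))
print("Approximate DNA molecular weight:", round(molecular_weight(demo, seq_type="DNA"), 2))
print("Translation:", demo.translate())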

class BioPythonAIAgent:
    def __init__(self, email="your_email@example.com"):
        self.email = email
        Entrez.email = email
        self.sequences = {}
        self.analysis_results = {}
        self.alignments = {}
        self.trees = {}

    def fetch_sequence_from_ncbi(self, accession_id, db="nucleotide", rettype="fasta"):
        # Download a record from NCBI and cache it under its accession
        try:
            handle = Entrez.efetch(db=db, id=accession_id, rettype=rettype, retmode="text")
            record = SeqIO.read(handle, "fasta")
            handle.close()
            self.sequences[accession_id] = record
            return record
        except Exception as e:
            print(f"Error fetching sequence: {str(e)}")
            return None

    def create_sample_sequences(self):
        # Register three built-in demo sequences: two protein sequences and one rRNA sequence
        covid_spike = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"
        human_insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
        e_coli_16s = "AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAATGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGCGTTAAGGTTAATAACCTTGGCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA"

        sample_sequences = [
            ("COVID_Spike", covid_spike, "SARS-CoV-2 Spike Protein"),
            ("Human_Insulin", human_insulin, "Human Insulin Precursor"),
            ("E_coli_16S", e_coli_16s, "E. coli 16S rRNA")
        ]

        for seq_id, seq_str, desc in sample_sequences:
            record = SeqRecord(Seq(seq_str), id=seq_id, description=desc)
            self.sequences[seq_id] = record

        return sample_sequences

    def analyze_sequence(self, sequence_id=None, sequence=None):
        # Basic composition, GC%, molecular weight and (when possible) translation metrics
        if sequence_id and sequence_id in self.sequences:
            seq_record = self.sequences[sequence_id]
            seq = seq_record.seq
            description = seq_record.description
        elif sequence:
            seq = Seq(sequence)
            description = "Custom sequence"
        else:
            return None

        analysis = {
            'length': len(seq),
            'composition': {}
        }

        for base in ['A', 'T', 'G', 'C']:
            analysis['composition'][base] = seq.count(base)

        if 'A' in analysis['composition'] and 'T' in analysis['composition']:
            analysis['gc_content'] = round(gc_fraction(seq) * 100, 2)
            try:
                analysis['molecular_weight'] = round(molecular_weight(seq, seq_type='DNA'), 2)
            except Exception:
                # Rough fallback: average molecular weight of a base pair is ~650 Da
                analysis['molecular_weight'] = len(seq) * 650

        try:
            if len(seq) % 3 == 0:
                protein = seq.translate()
                analysis['translation'] = str(protein)
                analysis['stop_codons'] = protein.count('*')

                if '*' not in str(protein)[:-1]:
                    prot_analysis = ProteinAnalysis(str(protein)[:-1])
                    analysis['protein_mw'] = round(prot_analysis.molecular_weight(), 2)
                    analysis['isoelectric_point'] = round(prot_analysis.isoelectric_point(), 2)
                    analysis['protein_composition'] = prot_analysis.get_amino_acids_percent()
        except Exception:
            pass

        key = sequence_id if sequence_id else "custom"
        self.analysis_results[key] = analysis

        return analysis

    def visualize_composition(self, sequence_id):
        # Interactive Plotly dashboard of composition and bulk properties
        if sequence_id not in self.analysis_results:
            return

        analysis = self.analysis_results[sequence_id]

        fig = make_subplots(
            rows=2, cols=2,
            specs=[[{"type": "pie"}, {"type": "bar"}],
                   [{"colspan": 2}, None]],
            subplot_titles=("Nucleotide Composition", "Base Count", "Sequence Properties")
        )

        labels = list(analysis['composition'].keys())
        values = list(analysis['composition'].values())

        fig.add_trace(
            go.Pie(labels=labels, values=values, name="Composition"),
            row=1, col=1
        )

        fig.add_trace(
            go.Bar(x=labels, y=values, name="Count", marker_color=['red', 'blue', 'green', 'orange']),
            row=1, col=2
        )

        properties = ['Length', 'GC%', 'MW (kDa)']
        prop_values = [
            analysis['length'],
            analysis.get('gc_content', 0),
            analysis.get('molecular_weight', 0) / 1000
        ]

        fig.add_trace(
            go.Scatter(x=properties, y=prop_values, mode='markers+lines',
                       marker=dict(size=10, color='purple'), name="Properties"),
            row=2, col=1
        )

        fig.update_layout(
            title=f"Comprehensive Analysis: {sequence_id}",
            showlegend=False,
            height=600
        )

        fig.show()

    def perform_multiple_sequence_alignment(self, sequence_ids):
        # Pairwise global alignments between every pair of stored sequences
        if len(sequence_ids) < 2:
            return None

        sequences = []
        for seq_id in sequence_ids:
            if seq_id in self.sequences:
                sequences.append(self.sequences[seq_id])

        if len(sequences) < 2:
            return None

        from Bio.Align import PairwiseAligner
        aligner = PairwiseAligner()
        aligner.match_score = 2
        aligner.mismatch_score = -1
        aligner.open_gap_score = -2
        aligner.extend_gap_score = -0.5

        alignments = []
        for i in range(len(sequences)):
            for j in range(i + 1, len(sequences)):
                alignment = aligner.align(sequences[i].seq, sequences[j].seq)[0]
                alignments.append(alignment)

        return alignments

    def create_phylogenetic_tree(self, alignment_key=None, sequences=None):
        # Align with the external ClustalW binary, then build a UPGMA tree from identity distances
        if alignment_key and alignment_key in self.alignments:
            alignment = self.alignments[alignment_key]
        elif sequences:
            records = []
            for i, seq in enumerate(sequences):
                record = SeqRecord(Seq(seq), id=f"seq_{i}")
                records.append(record)
            SeqIO.write(records, "temp.fasta", "fasta")

            try:
                # ClustalwCommandline lives in Bio.Align.Applications (deprecated in recent Biopython);
                # a failed import, or a missing clustalw2 executable, is handled by the except below.
                from Bio.Align.Applications import ClustalwCommandline
                clustalw_cline = ClustalwCommandline("clustalw2", infile="temp.fasta")
                stdout, stderr = clustalw_cline()
                alignment = AlignIO.read("temp.aln", "clustal")
                os.remove("temp.fasta")
                os.remove("temp.aln")
                os.remove("temp.dnd")
            except Exception:
                return None
        else:
            return None

        calculator = DistanceCalculator('identity')
        dm = calculator.get_distance(alignment)

        constructor = DistanceTreeConstructor()
        tree = constructor.upgma(dm)

        tree_key = f"tree_{len(self.trees)}"
        self.trees[tree_key] = tree

        return tree

    def visualize_tree(self, tree):
        fig, ax = plt.subplots(figsize=(10, 6))
        Phylo.draw(tree, axes=ax)
        plt.title("Phylogenetic Tree")
        plt.tight_layout()
        plt.show()

    def protein_structure_analysis(self, sequence_id):
        # ProtParam-based protein properties for sequences that translate cleanly
        if sequence_id not in self.sequences:
            return None

        seq = self.sequences[sequence_id].seq

        try:
            if len(seq) % 3 == 0:
                protein = seq.translate()
                if '*' not in str(protein)[:-1]:
                    prot_analysis = ProteinAnalysis(str(protein)[:-1])

                    structure_analysis = {
                        'molecular_weight': prot_analysis.molecular_weight(),
                        'isoelectric_point': prot_analysis.isoelectric_point(),
                        'amino_acid_percent': prot_analysis.get_amino_acids_percent(),
                        'secondary_structure': prot_analysis.secondary_structure_fraction(),
                        'flexibility': prot_analysis.flexibility(),
                        'gravy': prot_analysis.gravy()
                    }

                    return structure_analysis
        except Exception:
            pass

        return None

    def comparative_analysis(self, sequence_ids):
        # Side-by-side comparison of previously analyzed sequences
        results = []

        for seq_id in sequence_ids:
            if seq_id in self.analysis_results:
                analysis = self.analysis_results[seq_id].copy()
                analysis['sequence_id'] = seq_id
                results.append(analysis)

        df = pd.DataFrame(results)

        if len(df) > 1:
            fig = make_subplots(
                rows=2, cols=2,
                subplot_titles=("Length Comparison", "GC Content", "Molecular Weight", "Composition Heatmap")
            )

            fig.add_trace(
                go.Bar(x=df['sequence_id'], y=df['length'], name="Length"),
                row=1, col=1
            )

            if 'gc_content' in df.columns:
                fig.add_trace(
                    go.Scatter(x=df['sequence_id'], y=df['gc_content'], mode='markers+lines', name="GC%"),
                    row=1, col=2
                )

            if 'molecular_weight' in df.columns:
                fig.add_trace(
                    go.Bar(x=df['sequence_id'], y=df['molecular_weight'], name="MW"),
                    row=2, col=1
                )

            fig.update_layout(title="Comparative Sequence Analysis", height=600)
            fig.show()

        return df

    def codon_usage_analysis(self, sequence_id):
        # Count codons and plot the 20 most frequent
        if sequence_id not in self.sequences:
            return None

        seq = self.sequences[sequence_id].seq

        if len(seq) % 3 != 0:
            return None

        codons = {}
        for i in range(0, len(seq) - 2, 3):
            codon = str(seq[i:i + 3])
            codons[codon] = codons.get(codon, 0) + 1

        codon_df = pd.DataFrame(list(codons.items()), columns=['Codon', 'Count'])
        codon_df = codon_df.sort_values('Count', ascending=False)

        fig = px.bar(codon_df.head(20), x='Codon', y='Count',
                     title=f"Top 20 Codon Usage - {sequence_id}")
        fig.show()

        return codon_df

    def motif_search(self, sequence_id, motif_pattern):
        # Exact-substring scan; returns all 0-based start positions of the motif
        if sequence_id not in self.sequences:
            return []

        seq = str(self.sequences[sequence_id].seq)
        positions = []

        for i in range(len(seq) - len(motif_pattern) + 1):
            if seq[i:i + len(motif_pattern)] == motif_pattern:
                positions.append(i)

        return positions

    def gc_content_window(self, sequence_id, window_size=100):
        # Sliding-window GC content with a step of a quarter window
        if sequence_id not in self.sequences:
            return None

        seq = self.sequences[sequence_id].seq
        gc_values = []
        positions = []

        for i in range(0, len(seq) - window_size + 1, window_size // 4):
            window = seq[i:i + window_size]
            gc_values.append(gc_fraction(window) * 100)
            positions.append(i + window_size // 2)

        fig = go.Figure()
        fig.add_trace(go.Scatter(x=positions, y=gc_values, mode='lines+markers',
                                 name=f'GC Content (window={window_size})'))
        fig.update_layout(
            title=f"GC Content Sliding Window Analysis - {sequence_id}",
            xaxis_title="Position",
            yaxis_title="GC Content (%)"
        )
        fig.show()

        return positions, gc_values

    def run_comprehensive_analysis(self, sequence_ids):
        # End-to-end pipeline: per-sequence analyses plus a comparative summary
        results = {}

        for seq_id in sequence_ids:
            if seq_id in self.sequences:
                analysis = self.analyze_sequence(seq_id)
                self.visualize_composition(seq_id)

                gc_analysis = self.gc_content_window(seq_id)
                codon_analysis = self.codon_usage_analysis(seq_id)

                results[seq_id] = {
                    'basic_analysis': analysis,
                    'gc_window': gc_analysis,
                    'codon_usage': codon_analysis
                }

        if len(sequence_ids) > 1:
            comparative_df = self.comparative_analysis(sequence_ids)
            results['comparative'] = comparative_df

        return results

We define a BioPythonAIAgent class that allows us to fetch or create sequences, run core analyses (composition, GC%, translation, and protein properties), and visualize results interactively. We also perform pairwise alignments, build phylogenetic trees, scan motifs, profile codon usage, analyze GC content with sliding windows, and compare multiple sequences, then bundle everything into one comprehensive pipeline. Check out the FULL CODES here.
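The driver code below works with the built-in samples, but the same agent can pull a record straight from NCBI via Entrez. Here is a hedged sketch of that path; the accession and email are placeholders, and the call needs network access plus a valid address registered with Entrez:

custom_agent = BioPythonAIAgent(email="your_email@example.com")

# NM_000207 is used purely as an illustrative nucleotide accession (human insulin mRNA);
# substitute any accession of interest.
record = custom_agent.fetch_sequence_from_ncbi("NM_000207", db="nucleotide", rettype="fasta")
if record is not None:
    print(record.id, len(record.seq), "bases")
    print(custom_agent.analyze_sequence("NM_000207"))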

agent = BioPythonAIAgent()
sample_seqs = agent.create_sample_sequences()

for seq_id, _, _ in sample_seqs:
    agent.analyze_sequence(seq_id)

results = agent.run_comprehensive_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])

print("BioPython AI Agent Tutorial Complete!")
print("Available sequences:", list(agent.sequences.keys()))
print("Available methods:", [method for method in dir(agent) if not method.startswith('_')])

We instantiate the BioPythonAIAgent, generate sample sequences (COVID Spike, Human Insulin, and E. coli 16S), and run a full analysis pipeline. The outputs confirm that our agent successfully performs nucleotide, codon, and GC-content analyses while also preparing comparative visualizations. Finally, we print the list of available sequences and supported methods, indicating that the agent’s full analytical capabilities are now ready for use. Check out the FULL CODES here.
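Because the agent stores everything in plain dictionaries (self.sequences and self.analysis_results), the numbers behind the plots can also be inspected directly. The following convenience snippet is our own addition, not part of the tutorial pipeline; it tabulates the stored metrics with pandas:

import pandas as pd

# Build a seq_id x metric table from the cached analysis dictionaries
summary = pd.DataFrame({
    seq_id: {
        "length": res.get("length"),
        "gc_content_percent": res.get("gc_content"),
        "molecular_weight": res.get("molecular_weight"),
    }
    for seq_id, res in agent.analysis_results.items()
}).T

print(summary)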

agent.visualize_composition('COVID_Spike')
agent.gc_content_window('E_coli_16S', window_size=50)
agent.codon_usage_analysis('COVID_Spike')

comparative_df = agent.comparative_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])
print(comparative_df)

motif_positions = agent.motif_search('COVID_Spike', 'ATG')
print(f"ATG start codons found at positions: {motif_positions}")

tree = agent.create_phylogenetic_tree(sequences=[
    str(agent.sequences['COVID_Spike'].seq[:300]),
    str(agent.sequences['Human_Insulin'].seq[:300]),
    str(agent.sequences['E_coli_16S'].seq[:300])
])
if tree:
    agent.visualize_tree(tree)

We visualize nucleotide composition, scan E. coli 16S GC% in sliding windows, and profile codon usage for the COVID Spike sequence. We then compare sequences side-by-side, search for the “ATG” motif, and build and plot a quick phylogenetic tree from the first 300 characters of each sequence.
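The motif_search method matches exact substrings only. For degenerate motifs, a regular expression over the raw sequence string works just as well; the sketch below is our own extension rather than an agent method, and the pattern (GGA, any two bases, then CC) is purely illustrative:

import re

seq_str = str(agent.sequences['E_coli_16S'].seq)
pattern = re.compile(r"GGA..CC")  # arbitrary degenerate motif: GGA + any 2 bases + CC
hits = [(m.start(), m.group()) for m in pattern.finditer(seq_str)]
print(f"Degenerate motif hits in E_coli_16S (first 10 shown): {hits[:10]} (total {len(hits)})")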

In conclusion, we have a fully functional BioPython AI Agent capable of handling multiple layers of sequence analysis, from basic nucleotide composition to codon usage profiling, GC-content sliding windows, motif searches, and even comparative analyses across species. The integration of visualization and phylogenetic tree construction provides both intuitive and in-depth insights into genetic data. Whether for academic projects, bioinformatics education, or research prototyping, this Colab-friendly workflow showcases how open-source tools like Biopython can be harnessed with modern AI-inspired pipelines to simplify and accelerate biological data exploration.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.

The post How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis appeared first on MarkTechPost.

