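php – Extracting XML from XML embedded in HTML
Author: Internet
I am trying to get the XML that is rendered here http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml, but it is a bit tricky because they do not offer any support for it. The goal is to get the XML into PHP so that it can be processed there.
Can anyone give me a hint?
Solution: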
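XML that is rendered through HTML is not really XML, nor is it really HTML.
What you are looking for is called textContent in DOMDocument. It gives you only the text of that HTML, just as a browser would display it "as text".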
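As a minimal sketch of what textContent does, the following uses a made-up fragment (not the real NCBI markup) in which the XML is displayed by escaping the angle brackets and wrapping the tag names in span elements. The markup and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical markup: XML shown as text, with escaped brackets and
// tag names wrapped in <span> elements for highlighting.
$html = '<html><body><div class="xml-tag">'
      . '&lt;<span>ID</span>&gt;859730&lt;/<span>ID</span>&gt;'
      . '</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$div = $doc->getElementsByTagName('div')->item(0);

// textContent concatenates the text of all descendants and decodes entities,
// so the highlighting markup disappears and the escaped brackets become real ones.
$text = $div->textContent;
echo $text, "\n"; // prints: <ID>859730</ID>
```

The string that comes back is plain XML source text, which is why it can later be handed to an XML parser.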
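So all you need to do is load the HTML document into a DOMDocument. Because the page contains errors, libxml's internal error handling is switched on while loading: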
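$url = 'http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml';
$doc = new DOMDocument();
// Buffer libxml parse errors instead of emitting PHP warnings for the invalid HTML.
libxml_use_internal_errors(TRUE);
$doc->loadHTMLFile($url);
// Discard the buffered errors and switch back to normal error reporting.
libxml_clear_errors();
libxml_use_internal_errors(FALSE);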
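If you want to see what the parser complained about rather than silently dropping it, the buffered errors can be read before they are cleared. The sketch below assumes an inline, deliberately sloppy HTML string in place of the remote page so that it runs without network access. Which messages appear, if any, depends on the libxml2 version in use.

```php
<?php
// Hypothetical sloppy markup standing in for the fetched page.
$html = '<html><body><div><bogus>tag</bogus> a & b</div></body></html>';

// Buffer parse errors; the function returns the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // free the error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Restoring the value returned by libxml_use_internal_errors() instead of hard-coding FALSE keeps the snippet from changing a setting that calling code may rely on.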
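The next part requires specific knowledge of the page being scraped. In your case, the XML is the text content of all div tags whose class attribute is "xml-tag" and which *follow* the tag with the id "ResultView".
These tags can easily be obtained with an XPath query, and their text content is then stored in an array: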
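$xpath = new DOMXPath($doc);
// All div.xml-tag siblings that come after the element with id "ResultView".
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');
$buffer = array();
foreach ($nodes as $node) {
    // Each div holds one piece of the XML as plain text.
    $buffer[] = $node->textContent;
}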
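To make the effect of the following-sibling axis concrete, here is a small sketch on an invented page skeleton. The HTML is an assumption and much simpler than the real page; only the XPath expression is taken from the answer.

```php
<?php
// Hypothetical page skeleton: a marker element followed by sibling divs.
$html = '<html><body>'
      . '<div class="xml-tag">ignored: before the marker</div>'
      . '<div id="ResultView"></div>'
      . '<div class="xml-tag">&lt;DB&gt;</div>'
      . '<div class="other">ignored: wrong class</div>'
      . '<div class="xml-tag">biosample&lt;/DB&gt;</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');

$buffer = array();
foreach ($nodes as $node) {
    $buffer[] = $node->textContent;
}

// Only the two matching divs after the marker contribute to the result.
$xml = implode('', $buffer);
echo $xml, "\n"; // prints: <DB>biosample</DB>
```

Note that [@class="xml-tag"] compares the whole attribute value, so a div with several classes would not match. If the page used multiple classes, a predicate such as contains(concat(' ', normalize-space(@class), ' '), ' xml-tag ') would be one way to handle that.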
So all that is left now is to create a new DOMDocument, load the XML buffer into it, apply some nice formatting, and output it:
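$new = new DOMDocument();
// Drop whitespace-only text nodes so that formatOutput can re-indent the tree.
$new->preserveWhiteSpace = FALSE;
$new->formatOutput = TRUE;
$new->loadXML(implode('', $buffer));
// Write the formatted XML to the output stream.
$new->save('php://output');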
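The reformatting step can also be tried on its own. The sketch below assumes a short unformatted XML string (a fragment of the data on this page) in place of the scraped buffer, and it checks the return value of loadXML(), which the original snippet does not do.

```php
<?php
// Hypothetical stand-in for the concatenated buffer: XML on a single line.
$raw = '<SAMPLE_LINK><ENTREZ_LINK><DB>biosample</DB><ID>859730</ID></ENTREZ_LINK></SAMPLE_LINK>';

$new = new DOMDocument();
$new->preserveWhiteSpace = false; // ignore whitespace-only text nodes on load
$new->formatOutput = true;        // indent the output on save

// loadXML() returns false if the text is not well-formed XML.
if ($new->loadXML($raw) === false) {
    throw new RuntimeException('The extracted text is not well-formed XML.');
}

// saveXML() returns the document as a string instead of writing to a stream.
$pretty = $new->saveXML();
echo $pretty;
```

Using saveXML() rather than save('php://output') is a design choice for cases where the XML should be processed further in PHP, which is what the question asks for. The loaded $new document can also be queried directly with DOMXPath.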
The roughly 20 lines of code in the three steps above (load, query, reformat) produce the following output:
<?xml version="1.0"?>
<EXPERIMENT_PACKAGE>
<EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
<IDENTIFIERS>
<PRIMARY_ID>ERX086768</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
<TITLE/>
<STUDY_REF accession="ERP000913" refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" refcenter="SC">
<IDENTIFIERS>
<PRIMARY_ID>ERP000913</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
</IDENTIFIERS>
</STUDY_REF>
<DESIGN>
<DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION>
<SAMPLE_DESCRIPTOR accession="ERS074283" refname="MR223754-sc-2011-11-18T11:31:44Z-1306470" refcenter="SC">
<IDENTIFIERS>
<PRIMARY_ID>ERS074283</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
</IDENTIFIERS>
</SAMPLE_DESCRIPTOR>
<LIBRARY_DESCRIPTOR>
<LIBRARY_NAME>4008297</LIBRARY_NAME>
<LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<PAIRED NOMINAL_LENGTH="250"/>
</LIBRARY_LAYOUT>
</LIBRARY_DESCRIPTOR>
<SPOT_DESCRIPTOR>
<SPOT_DECODE_SPEC>
<READ_SPEC>
<READ_INDEX>0</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Forward</READ_TYPE>
<BASE_COORD>1</BASE_COORD>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>1</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Reverse</READ_TYPE>
<RELATIVE_ORDER follows_read_index="0"/>
</READ_SPEC>
</SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>
</DESIGN>
<PLATFORM>
<ILLUMINA>
<INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
</ILLUMINA>
</PLATFORM>
<PROCESSING/>
</EXPERIMENT>
<SUBMISSION accession="ERA119046" center_name="SC" submission_date="2012-04-17T09:29:50Z" alias="ERP000913-sc-20120417-2" lab_name="">
<IDENTIFIERS>
<PRIMARY_ID>ERA119046</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID>
</IDENTIFIERS>
</SUBMISSION>
<STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" center_name="SC" accession="ERP000913">
<IDENTIFIERS>
<PRIMARY_ID>ERP000913</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
</IDENTIFIERS>
<DESCRIPTOR>
<STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE>
<STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
<STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT>
<CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME>
<STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION>
</DESCRIPTOR>
</STUDY>
<SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470" center_name="SC" accession="ERS074283">
<IDENTIFIERS>
<PRIMARY_ID>ERS074283</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
</IDENTIFIERS>
<SAMPLE_NAME>
<COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME>
<TAXON_ID>119602</TAXON_ID>
<SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME>
</SAMPLE_NAME>
<SAMPLE_LINKS>
<SAMPLE_LINK>
<ENTREZ_LINK>
<DB>biosample</DB>
<ID>859730</ID>
</ENTREZ_LINK>
</SAMPLE_LINK>
</SAMPLE_LINKS>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE>
<TAG>Strain</TAG>
<VALUE>MR223754</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>Sample Description</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-StrainOrLine</TAG>
<VALUE>MR223754</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-Sex</TAG>
<VALUE>not applicable</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-Species</TAG>
<VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<RUN_SET>
<RUN alias="SC_RUN_7229_8#56" center_name="SC" accession="ERR109334" total_spots="2708543" total_bases="406281450" size="334475592" load_done="true" published="2012-04-27 20:11:35" is_public="true" cluster_name="public" static_data_available="1">
<IDENTIFIERS>
<PRIMARY_ID>ERR109334</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
<EXPERIMENT_REF refname="SC_EXP_7229_8#56" refcenter="SC" accession="ERX086768">
<IDENTIFIERS>
<PRIMARY_ID>ERX086768</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
</EXPERIMENT_REF>
<Pool>
<Member member_name="" accession="ERS074283" sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470" spots="2708543" bases="406281450"/>
</Pool>
</RUN>
</RUN_SET>
</EXPERIMENT_PACKAGE>
So do not reinvent the wheel; just get to know the existing tools. It is sometimes easier than it looks at first glance.
Tags: html, ncbi, php, xml Source: https://codeday.me/bug/20190901/1780286.html