PHP simple DOMDocument scraping: excluding td elements

2019-10-25 12:39:01 Author: Internet

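I am just trying to get the data from all the <td> elements that sit inside <tr> elements. My problem is that, because of the structure of the table I am scraping, I need to exclude every element that has a colspan attribute, i.e. <td colspan=12>.
As the code below shows, getting the table data is straightforward, but because of the table structure I need to exclude every cell that carries a colspan attribute.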

<?php

$html = file_get_contents('http://www.superxv.com/fixtures/'); //get the html returned from the following url

$game_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)) { //if any html is actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); //remove error
    $xpath = new DOMXPath($game_doc);

    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        //$cells2 = $rows->getElementsByTagName('th');
        echo '<pre>';
        // Get the scraped columns
        echo $dayDateBye[] = $cells->item(0)->textContent;
        echo $homeTeam[] = $cells->item(1)->textContent;
        echo $awayTeam[] = $cells->item(2)->textContent;
        echo $venue[] = $cells->item(3)->textContent;
        echo $timeGMT[] = $cells->item(5)->textContent;
        echo $timeZA[] = $cells->item(10)->textContent;
        echo '</pre>';
    }
}

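Here you can see the table structure: it shows five or so rows of fixtures, and then the structure changes when a new week starts. The only thing I can find to identify and skip these structural changes is the <td colspan=12> element. This is tricky because the td elements have no class name, only that attribute to identify them.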

Any input is appreciated.

Solution:

You can skip those rows by checking how many td cells each row contains:

<?php

$html = file_get_contents('http://www.superxv.com/fixtures/'); //get the html returned from the following url

$game_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)) { //if any html is actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); //remove error
    $xpath = new DOMXPath($game_doc);

    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        if( $cells->length > 1 ){ // skip rows that hold only a single (colspan) cell
            //$cells2 = $rows->getElementsByTagName('th');
            echo '<pre>';
            // Get the scraped columns
            echo $dayDateBye[] = $cells->item(0)->textContent;
            echo $homeTeam[] = $cells->item(1)->textContent;
            echo $awayTeam[] = $cells->item(2)->textContent;
            echo $venue[] = $cells->item(3)->textContent;
            echo $timeGMT[] = $cells->item(5)->textContent;
            echo $timeZA[] = $cells->item(10)->textContent;
            echo '</pre>';
        }
    }
}

?>
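
As an alternative to counting cells, the filter can be expressed in the XPath query itself, so that rows containing a td with a colspan attribute are never returned. The following is a minimal sketch, not the answer's original method. It assumes the attribute is spelled `colspan` in the real markup, and it uses a small hypothetical HTML sample (`$sampleHtml`, with made-up team and venue names) in place of the live fixtures page so that it runs on its own.

```php
<?php

// Hypothetical sample markup standing in for the real fixtures page.
$sampleHtml = <<<HTML
<table>
  <tr><td colspan="12">Week 1</td></tr>
  <tr><td>Sat 1 Feb</td><td>Home A</td><td>Away B</td><td>Venue X</td></tr>
  <tr><td colspan="12">Week 2</td></tr>
  <tr><td>Sat 8 Feb</td><td>Home C</td><td>Away D</td><td>Venue Y</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect libxml parse warnings instead of printing them
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Rows of the first table that have no td child carrying a colspan attribute.
$fixtures = [];
foreach ($xpath->query('(//table)[1]//tr[not(td[@colspan])]') as $row) {
    $cells = [];
    // Relative query: only the td children of the current row.
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    $fixtures[] = $cells;
}

print_r($fixtures);
```

With the sample above, `$fixtures` holds only the two fixture rows and the two week-header rows are left out. Against the real page, the column indexes used in the answer (0, 1, 2, 3, 5, 10) would still need to be checked against each row's cell count before reading them.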

Tags: web-scraping, domdocument, html, php
Source: https://codeday.me/bug/20191025/1928820.html