I'm trying to get the values of the following table. I tried both curl/regex (I know it's not recommended) and DOM separately, but wasn't able to get the values properly.

There are multiple rows in the page, so I'll need to use a foreach. I need an exact match of the structure below.

<tr>
    <td width="75" style="NS">
        <img src="NS" width="64" alt="INEEDTHISVALUE">
    </td>
    <td style="NS">
        <a href="NS">NS</a>
    </td>
    <td style="NS">INEEDTHISVALUETOO</td>
</tr>

NS = non-static values. They change for each td and a element because the table is colored with inline CSS. They may contain special characters such as ; and /, as well as digits and letters.

I'm using the simple_html_dom class, which can be found here: http://htmlparsing.com/php.html

The code below gets every td on the page, but I need more specific output: only the two values marked in the table row above.

What I've tried so far:

$html = file_get_html("URL");
foreach($html->find('td') as $td) {
    echo $td."<br>";
}

REGEX & CURL

$site = "URL";
$ch = curl_init();
$hc = "YahooSeeker-Testing/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; Yahoo! Search - Web Search)";
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_USERAGENT, $hc);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$site = curl_exec($ch);
curl_close($ch);
preg_match_all('@<tr><td width="75" style="(.*?)"><img src="/folder/link/(.*?)" width="64" alt="(.*?)"></td><td style="(.*?)"><a href="/folder2/link2/(.*?)">(.*?)</a></td><td style="(.*?)">(.*?)</td></tr>@', $site, $arr);
var_dump($arr); // returns empty array, WHY?

1 Answer

You can do it like this without a library:

$results = array();
$doc = new DOMDocument();
$doc->loadHTML($site);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//tr') as $tr) {
    $results[] = array(
        'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),
        'td_text' => $xpath->query('td[last()]', $tr)->item(0)->nodeValue
    );
}

print_r($results);

It will give you:

Array
(
    [0] => Array
        (
            [img_alt] => INEEDTHISVALUE 1
            [td_text] => INEEDTHISVALUETOO 1
        )

    [1] => Array
        (
            [img_alt] => INEEDTHISVALUE 2
            [td_text] => INEEDTHISVALUETOO 2
        )

)

Relevant documentation: PHP: DOMXPath::query
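
A slightly more defensive variant of the loop above may help on real pages. This is only a sketch: it assumes some rows have no img in their first cell and that the markup contains tags libxml complains about (g:plusone is used here as an example). The sample markup assigned to $site is made up for illustration. libxml_use_internal_errors() collects parser warnings instead of printing them, and the null checks skip rows that do not have the expected structure.

```php
<?php
// Sample markup standing in for the fetched page ($site in the question).
$site = <<<HTML
<table>
<tr>
    <td width="75" style="a"><img src="x.png" width="64" alt="INEEDTHISVALUE 1"></td>
    <td style="b"><a href="/y">link</a></td>
    <td style="c">INEEDTHISVALUETOO 1</td>
</tr>
<tr><td colspan="3"><g:plusone></g:plusone> row without an image</td></tr>
</table>
HTML;

// Collect libxml warnings internally instead of emitting PHP notices/warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($site);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$results = array();

foreach ($xpath->query('//tr') as $tr) {
    // item(0) returns null when the query matched nothing.
    $img  = $xpath->query('td[1]/img', $tr)->item(0);
    $last = $xpath->query('td[last()]', $tr)->item(0);

    // Skip rows that do not have the expected structure.
    if ($img === null || $last === null) {
        continue;
    }

    $results[] = array(
        'img_alt' => $img->getAttribute('alt'),
        'td_text' => trim($last->nodeValue),
    );
}

print_r($results);
```

Only the first row ends up in $results; the row without an image is skipped instead of triggering a fatal error.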

  • It works, thank you. But I can't load an external HTML file that way; I'll look into the documentation to find out how. Thanks!
    – salep
    Commented Jun 7, 2015 at 7:44
  • But it doesn't; I think it's being broken by the HTML file. I get errors like this:
    Notice: DOMDocument::loadHTML(): Namespace prefix g is not defined in Entity, line: 167 in /Applications/MAMP/htdocs/fetch/test.php on line 153
    Warning: DOMDocument::loadHTML(): Tag g:plusone invalid in Entity, line: 167 in /Applications/MAMP/htdocs/fetch/test.php on line 153
    Fatal error: Call to a member function getAttribute() on null in /Applications/MAMP/htdocs/fetch/test.php on line 158
    – salep
    Commented Jun 7, 2015 at 9:13
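
On loading an external HTML file: DOMDocument also has a loadHTMLFile() method that reads markup from a path. The sketch below is only an illustration. It writes made-up sample markup to a temporary file (a hypothetical stand-in for the real page) and reads the alt attributes back with XPath.

```php
<?php
// Hypothetical stand-in for the external page: sample markup in a temp file.
$path = tempnam(sys_get_temp_dir(), 'tbl');
file_put_contents(
    $path,
    '<table><tr>'
    . '<td><img src="x.png" alt="INEEDTHISVALUE"></td>'
    . '<td><a href="/y">NS</a></td>'
    . '<td>INEEDTHISVALUETOO</td>'
    . '</tr></table>'
);

// Parse the file directly instead of passing a string to loadHTML().
$doc = new DOMDocument();
$loaded = $doc->loadHTMLFile($path);
unlink($path);

if ($loaded === false) {
    throw new RuntimeException('Could not load the HTML file');
}

// Select the alt attribute of the image in the first cell of every row.
$xpath = new DOMXPath($doc);
$alts = array();
foreach ($xpath->query('//tr/td[1]/img/@alt') as $attr) {
    $alts[] = $attr->nodeValue;
}

print_r($alts);
```

With a real page, $path would be the file's location; whether a remote URL works there depends on the PHP configuration (allow_url_fopen).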
