The DomCrawler Component

The DomCrawler Component eases DOM navigation for HTML and XML documents.

Note

While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.

Installation

You can install the component in two different ways:

* Install it via Composer (symfony/dom-crawler on Packagist);
* Use the official Git repository (https://github.com/symfony/DomCrawler).

Usage

The Crawler class provides methods to query and manipulate HTML and XML documents.

An instance of the Crawler represents a set (SplObjectStorage) of DOMElement objects, which are basically nodes that you can traverse easily:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

foreach ($crawler as $domElement) {
    print $domElement->nodeName;
}

Specialized Link and Form classes are useful for interacting with HTML links and forms as you traverse through the HTML tree.
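
For example, assuming a document that contains a login link (the markup and URI below are purely illustrative), you could extract the link's absolute URI like this:

use Symfony\Component\DomCrawler\Crawler;

// the second constructor argument is the base URI used to
// resolve relative links found in the document
$crawler = new Crawler(
    '<html><body><a href="/login">Log in</a></body></html>',
    'http://example.com/'
);

$link = $crawler->selectLink('Log in')->link();
print $link->getUri(); // http://example.com/login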

Node Filtering

Using XPath expressions is really easy:

$crawler = $crawler->filterXPath('descendant-or-self::body/p');

Tip

DOMXPath::query is used internally to actually perform an XPath query.
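
To illustrate, the filterXPath() call above is roughly equivalent to this sketch using the DOM extension directly (assuming $html holds the document from the first example):

$document = new \DOMDocument();
$document->loadHTML($html);

$xpath = new \DOMXPath($document);
$nodes = $xpath->query('descendant-or-self::body/p');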

Filtering is even easier if you have the CssSelector Component installed. This allows you to use jQuery-like selectors to traverse:

$crawler = $crawler->filter('body > p');

An anonymous function can be used to filter with more complex criteria:

$crawler = $crawler->filter('body > p')->reduce(function ($node, $i) {
    // keep nodes at even positions
    return ($i % 2) == 0;
});

To remove a node, the anonymous function must return false.

Note

All filter methods return a new Crawler instance with filtered content.
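
Because filtering never modifies the Crawler it is called on, you can derive several independent selections from the same instance:

$paragraphs = $crawler->filter('body > p');
$messages = $crawler->filter('p.message');

// $crawler itself still represents the whole document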

Node Traversing

Access a node by its position in the list:

$crawler->filter('body > p')->eq(0);

Get the first or last node of the current selection:

$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();

Get the nodes at the same level as the current selection:

$crawler->filter('body > p')->siblings();

Get the nodes at the same level that come after or before the current selection:

$crawler->filter('body > p')->nextAll();
$crawler->filter('body > p')->previousAll();

Get all the child or parent nodes:

$crawler->filter('body')->children();
$crawler->filter('body > p')->parents();

Note

All the traversal methods return a new Crawler instance.
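
This means the traversal methods can be chained; for instance, to grab the text of the first paragraph (text() is covered in the next section):

$text = $crawler->filter('body > p')->first()->text();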

Accessing Node Values

Access the value of the first node of the current selection:

$message = $crawler->filterXPath('//body/p')->text();

Access the attribute value of the first node of the current selection:

$class = $crawler->filterXPath('//body/p')->attr('class');

Extract attribute and/or node values from the list of nodes:

$attributes = $crawler
    ->filterXPath('//body/p')
    ->extract(array('_text', 'class'))
;

Note

The special attribute _text represents a node's value.
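
Given the example document from the beginning of this section, the extracted values would look something like this (an attribute that is missing on a node comes back as an empty string):

array(
    array('Hello World!', 'message'),
    array('Hello Crawler!', ''),
)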

Call an anonymous function on each node of the list:

$nodeValues = $crawler->filter('p')->each(function ($node, $i) {
    return $node->text();
});

The anonymous function receives the node and the position as arguments. The result is an array of values returned by the anonymous function calls.
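
With the example document from the beginning of this section, the snippet above should yield the paragraph texts:

print_r($nodeValues);

// Array
// (
//     [0] => Hello World!
//     [1] => Hello Crawler!
// )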

Adding the Content

The crawler supports multiple ways of adding the content:

$crawler = new Crawler('<html><body /></html>');

$crawler->addHtmlContent('<html><body /></html>');
$crawler->addXmlContent('<root><node /></root>');

$crawler->addContent('<html><body /></html>');
$crawler->addContent('<root><node /></root>', 'text/xml');

$crawler->add('<html><body /></html>');
$crawler->add('<root><node /></root>');

Note

When dealing with character sets other than ISO-8859-1, always add HTML content using the addHtmlContent() method, which lets you pass the target character set as its second argument.
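
For instance, to load markup that you know is UTF-8 encoded (the markup below is illustrative):

$crawler->addHtmlContent(
    '<html><body><p>Hëllo Wörld!</p></body></html>',
    'UTF-8'
);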

As the Crawler's implementation is based on the DOM extension, it is also able to interact with native DOMDocument, DOMNodeList and DOMNode objects:

$document = new \DOMDocument();
$document->loadXML('<root><node /><node /></root>');
$nodeList = $document->getElementsByTagName('node');
$node = $document->getElementsByTagName('node')->item(0);

$crawler->addDocument($document);
$crawler->addNodeList($nodeList);
$crawler->addNodes(array($node));
$crawler->addNode($node);
$crawler->add($document);

These methods on the Crawler are intended to initially populate it, not to further manipulate a DOM (though this is possible). However, since the Crawler is a set of DOMElement objects, you can use any method or property available on DOMElement, DOMNode or DOMDocument. For example, you could get the HTML of a Crawler with something like this:

$html = '';

foreach ($crawler as $domElement) {
    $html .= $domElement->ownerDocument->saveHTML($domElement);
}

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.