Cover of the book Symfony 5: The Fast Track

Symfony 5: The Fast Track is the best book to learn modern Symfony development, from zero to production. +300 pages in full color showing how to combine Symfony with Docker, APIs, queues & async tasks, Webpack, Single-Page Applications, etc.

Buy printed version
WARNING: You are browsing the documentation for Symfony 2.0 which is not maintained anymore. Consider upgrading your projects to Symfony 5.1.

The DomCrawler Component

The DomCrawler Component

The DomCrawler Component eases DOM navigation for HTML and XML documents.

Note

While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.

Installation

You can install the component in many different ways:

Usage

The Symfony\Component\DomCrawler\Crawler class provides methods to query and manipulate HTML and XML documents.

An instance of the Crawler represents a set (SplObjectStorage) of DOMElement objects, which are basically nodes that you can traverse easily:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

foreach ($crawler as $domElement) {
    print $domElement->nodeName;
}

Specialized Symfony\Component\DomCrawler\Link and Symfony\Component\DomCrawler\Form classes are useful for interacting with html links and forms as you traverse through the HTML tree.

Node Filtering

Using XPath expressions is really easy:

$crawler = $crawler->filterXPath('descendant-or-self::body/p');

Tip

DOMXPath::query is used internally to actually perform an XPath query.

Filtering is even easier if you have the CssSelector Component installed. This allows you to use jQuery-like selectors to traverse:

$crawler = $crawler->filter('body > p');

Anonymous function can be used to filter with more complex criteria:

$crawler = $crawler->filter('body > p')->reduce(function ($node, $i) {
    // filter even nodes
    return ($i % 2) == 0;
});

To remove a node the anonymous function must return false.

Note

All filter methods return a new Symfony\Component\DomCrawler\Crawler instance with filtered content.

Node Traversing

Access node by its position on the list:

$crawler->filter('body > p')->eq(0);

Get the first or last node of the current selection:

$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();

Get the nodes of the same level as the current selection:

$crawler->filter('body > p')->siblings();

Get the same level nodes after or before the current selection:

$crawler->filter('body > p')->nextAll();
$crawler->filter('body > p')->previousAll();

Get all the child or parent nodes:

$crawler->filter('body')->children();
$crawler->filter('body > p')->parents();

Note

All the traversal methods return a new Symfony\Component\DomCrawler\Crawler instance.

Accessing Node Values

Access the value of the first node of the current selection:

$message = $crawler->filterXPath('//body/p')->text();

Access the attribute value of the first node of the current selection:

$class = $crawler->filterXPath('//body/p')->attr('class');

Extract attribute and/or node values from the list of nodes:

$attributes = $crawler
    ->filterXpath('//body/p')
    ->extract(array('_text', 'class'))
;

Note

Special attribute _text represents a node value.

Call an anonymous function on each node of the list:

$nodeValues = $crawler->filter('p')->each(function ($node, $i) {
    return $node->text();
});

The anonymous function receives the position and the node as arguments. The result is an array of values returned by the anonymous function calls.

Adding the Content

The crawler supports multiple ways of adding the content:

$crawler = new Crawler('<html><body /></html>');

$crawler->addHtmlContent('<html><body /></html>');
$crawler->addXmlContent('<root><node /></root>');

$crawler->addContent('<html><body /></html>');
$crawler->addContent('<root><node /></root>', 'text/xml');

$crawler->add('<html><body /></html>');
$crawler->add('<root><node /></root>');

Note

When dealing with character sets other than ISO-8859-1, always add HTML content using the addHTMLContent() method where you can specify the second parameter to be your target character set.

As the Crawler’s implementation is based on the DOM extension, it is also able to interact with native DOMDocument, DOMNodeList and DOMNode objects:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
$document = new \DOMDocument();
$document->loadXml('<root><node /><node /></root>');
$nodeList = $document->getElementsByTagName('node');
$node = $document->getElementsByTagName('node')->item(0);

$crawler->addDocument($document);
$crawler->addNodeList($nodeList);
$crawler->addNodes(array($node));
$crawler->addNode($node);
$crawler->add($document);

This work, including the code samples, is licensed under a Creative Commons BY-SA 3.0 license.