The DomCrawler Component
Warning: You are browsing the documentation for Symfony 2.x, which is no longer maintained.
Read the updated version of this page for Symfony 7.2 (the current stable version).
The DomCrawler component eases DOM navigation for HTML and XML documents.
Note
While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.
Installation
$ composer require symfony/dom-crawler
Alternatively, you can clone the https://github.com/symfony/dom-crawler repository.
Note
If you install this component outside of a Symfony application, you must require the vendor/autoload.php file in your code to enable the class autoloading mechanism provided by Composer. Read this article for more details.
Usage
See also
This article explains how to use the DomCrawler features as an independent component in any PHP application. Read the Symfony Functional Tests article to learn about how to use it when creating Symfony tests.
The Crawler class provides methods to query and manipulate HTML and XML documents.
An instance of the Crawler represents a set of DOMElement objects, which are basically nodes that you can traverse easily:
use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

foreach ($crawler as $domElement) {
    var_dump($domElement->nodeName);
}
Specialized Link and Form classes are useful for interacting with HTML links and forms as you traverse the HTML tree.
Note
The DomCrawler will attempt to automatically fix your HTML to match the official specification. For example, if you nest a <p> tag inside another <p> tag, it will be moved to be a sibling of the parent tag. This is expected and is part of the HTML5 spec. But if you're getting unexpected behavior, this could be a cause. And while the DomCrawler isn't meant to dump content, you can see the "fixed" version of your HTML by dumping it.
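As a rough illustration of this repair, the following sketch uses PHP's DOM extension directly (the Crawler is based on that extension) rather than the Crawler API. The sample markup and variable names are made up for the example, and the exact dumped output may vary with the installed libxml version:

```php
// Plain DOM-extension sketch; the Crawler itself is not used here.
$document = new DOMDocument();

// Collect parser warnings instead of emitting them: the markup is invalid on purpose
libxml_use_internal_errors(true);
$document->loadHTML('<html><body><p>outer<p>inner</p></p></body></html>');
libxml_clear_errors();

$xpath = new DOMXPath($document);
$nested = $xpath->query('//p/p')->length;      // <p> elements still nested inside a <p>
$topLevel = $xpath->query('//body/p')->length; // <p> elements directly under <body>

// Dump the repaired markup to inspect what the parser produced
echo $document->saveHTML();
```

If the parser repaired the nesting, $nested is 0 and the paragraphs show up as siblings in the dumped markup.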
Node Filtering
Using XPath expressions is really easy:
$crawler = $crawler->filterXPath('descendant-or-self::body/p');
Tip
DOMXPath::query is used internally to actually perform an XPath query.
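For comparison, here is a minimal sketch of running the same expression with the native DOM classes alone. It is not the Crawler's internal code; the markup is taken from the usage example above and the variable names are placeholders:

```php
// Plain DOM-extension sketch; no Symfony classes are involved.
$html = '<!DOCTYPE html><html><body><p class="message">Hello World!</p><p>Hello Crawler!</p></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);

// Same expression as the filterXPath() call above
$nodes = $xpath->query('descendant-or-self::body/p');

// Collect the text of each matched <p>
$texts = array();
foreach ($nodes as $node) {
    $texts[] = $node->nodeValue;
}
```

The Crawler wraps this kind of query and returns the matches as a new Crawler instance instead of a DOMNodeList.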
Filtering is even easier if you have the CssSelector component installed. This allows you to use jQuery-like selectors to traverse:
$crawler = $crawler->filter('body > p');
An anonymous function can be used to filter with more complex criteria:
use Symfony\Component\DomCrawler\Crawler;
// ...

$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // filters every other node
        return ($i % 2) == 0;
    });
To remove a node, the anonymous function must return false.
Note
All filter methods return a new Crawler instance with filtered content.
Both the filterXPath() and filter() methods work with XML namespaces, which can be either automatically discovered or registered explicitly.
Consider the XML below:
<?xml version="1.0" encoding="UTF-8"?>
<entry
    xmlns="http://www.w3.org/2005/Atom"
    xmlns:media="http://search.yahoo.com/mrss/"
    xmlns:yt="http://gdata.youtube.com/schemas/2007"
>
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <yt:accessControl action="comment" permission="allowed"/>
    <yt:accessControl action="videoRespond" permission="moderated"/>
    <media:group>
        <media:title type="plain">Chordates - CrashCourse Biology #24</media:title>
        <yt:aspectRatio>widescreen</yt:aspectRatio>
    </media:group>
</entry>
This can be filtered with the Crawler without needing to register namespace aliases both with filterXPath():
$crawler = $crawler->filterXPath('//default:entry/media:group//yt:aspectRatio');
and filter():
$crawler = $crawler->filter('default|entry media|group yt|aspectRatio');
Note
The default namespace is registered with a prefix "default". It can be changed with the setDefaultNamespacePrefix() method.
The default namespace is removed when loading the content if it's the only namespace in the document. This is done to simplify XPath queries.
Namespaces can be explicitly registered with the registerNamespace() method:
$crawler->registerNamespace('m', 'http://search.yahoo.com/mrss/');
$crawler = $crawler->filterXPath('//m:group//yt:aspectRatio');
Node Traversing
Access a node by its position in the list:
$crawler->filter('body > p')->eq(0);
Get the first or last node of the current selection:
$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();
Get the nodes of the same level as the current selection:
$crawler->filter('body > p')->siblings();
Get the same level nodes after or before the current selection:
$crawler->filter('body > p')->nextAll();
$crawler->filter('body > p')->previousAll();
Get all the child or parent nodes:
$crawler->filter('body')->children();
$crawler->filter('body > p')->parents();
Note
All the traversal methods return a new Crawler instance.
Accessing Node Values
Access the node name (HTML tag name) of the first node of the current selection (e.g. "p" or "div"):
// returns the node name (HTML tag name) of the first child element under <body>
$tag = $crawler->filterXPath('//body/*')->nodeName();
Access the value of the first node of the current selection:
$message = $crawler->filterXPath('//body/p')->text();
Access the attribute value of the first node of the current selection:
$class = $crawler->filterXPath('//body/p')->attr('class');
Extract attribute and/or node values from the list of nodes:
$attributes = $crawler
    ->filterXPath('//body/p')
    ->extract(array('_text', 'class'))
;
Note
The special attribute _text represents a node value.
Call an anonymous function on each node of the list:
use Symfony\Component\DomCrawler\Crawler;
// ...

$nodeValues = $crawler->filter('p')->each(function (Crawler $node, $i) {
    return $node->text();
});
2.3
As seen here, in Symfony 2.3, the each and reduce Closure functions are passed a Crawler as the first argument. Previously, that argument was a DOMNode.
The anonymous function receives the node (as a Crawler) and the position as arguments. The result is an array of values returned by the anonymous function calls.
Adding the Content
The crawler supports multiple ways of adding the content:
$crawler = new Crawler('<html><body /></html>');
$crawler->addHtmlContent('<html><body /></html>');
$crawler->addXmlContent('<root><node /></root>');
$crawler->addContent('<html><body /></html>');
$crawler->addContent('<root><node /></root>', 'text/xml');
$crawler->add('<html><body /></html>');
$crawler->add('<root><node /></root>');
Note
When dealing with character sets other than ISO-8859-1, always add HTML content using the addHtmlContent() method where you can specify the second parameter to be your target character set.
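A minimal sketch of that call, assuming the component was installed with Composer into ./vendor and that the source file itself is saved as UTF-8. The sample markup is a placeholder:

```php
// Assumption: Composer's autoloader lives in ./vendor next to this script
require __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();

// The second argument names the character set of the given markup
$crawler->addHtmlContent('<html><body><p>Grüße</p></body></html>', 'UTF-8');

$text = $crawler->filterXPath('//body/p')->text();
```

With the charset passed explicitly, the non-ASCII characters in $text should come back unchanged.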
As the Crawler's implementation is based on the DOM extension, it is also able to interact with native DOMDocument, DOMNodeList and DOMNode objects:
$domDocument = new \DOMDocument();
$domDocument->loadXml('<root><node /><node /></root>');
$nodeList = $domDocument->getElementsByTagName('node');
$node = $domDocument->getElementsByTagName('node')->item(0);
$crawler->addDocument($domDocument);
$crawler->addNodeList($nodeList);
$crawler->addNodes(array($node));
$crawler->addNode($node);
$crawler->add($domDocument);
Links
To find a link by name (or a clickable image by its alt attribute), use the selectLink() method on an existing crawler. This returns a Crawler instance with just the selected link(s). Calling link() gives you a special Link object:
$linksCrawler = $crawler->selectLink('Go elsewhere...');
$link = $linksCrawler->link();
// or do this all at once
$link = $crawler->selectLink('Go elsewhere...')->link();
The Link object has several useful methods to get more information about the selected link itself:
// returns the proper URI that can be used to make another request
$uri = $link->getUri();
Note
The getUri() method is especially useful as it cleans the href value and transforms it into how it should really be processed. For example, for a link with href="#foo", this would return the full URI of the current page suffixed with #foo. The return value of getUri() is always a full URI that you can act on.
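The following sketch shows that behavior for a fragment-only link. It assumes the component was installed with Composer into ./vendor; the page URI http://example.com/page and the markup are placeholders:

```php
// Assumption: Composer's autoloader lives in ./vendor next to this script
require __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><a href="#foo">Go elsewhere...</a></body></html>';

// The second constructor argument is the URI of the page the HTML came from;
// relative links are resolved against it
$crawler = new Crawler($html, 'http://example.com/page');

$uri = $crawler->selectLink('Go elsewhere...')->link()->getUri();
// $uri is "http://example.com/page#foo"
```

Without a current URI, the Crawler has nothing to resolve a relative href against, so pass one whenever you intend to follow links.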
Forms
Special treatment is also given to forms. A selectButton() method is available on the Crawler which returns another Crawler that matches <button> or <input type="submit"> or <input type="button"> elements (or an <img> element inside them). The string given as argument is looked for in the id, alt, name, and value attributes and the text content of those elements.
This method is especially useful because you can use it to return a Form object that represents the form that the button lives in:
// button example: <button id="my-super-button" type="submit">My super button</button>
// you can get button by its label
$form = $crawler->selectButton('My super button')->form();
// or by button id (#my-super-button) if the button doesn't have a label
$form = $crawler->selectButton('my-super-button')->form();
// or you can filter the whole form, for example a form has a class attribute: <form class="form-vertical" method="POST">
$crawler->filter('.form-vertical')->form();
// or "fill" the form fields with data
$form = $crawler->selectButton('my-super-button')->form(array(
    'name' => 'Ryan',
));
The Form object has lots of very useful methods for working with forms:
$uri = $form->getUri();
$method = $form->getMethod();
The getUri() method does more than just return the action attribute of the form. If the form method is GET, then it mimics the browser's behavior and returns the action attribute followed by a query string of all of the form's values.
You can virtually set and get values on the form:
// sets values on the form internally
$form->setValues(array(
    'registration[username]' => 'symfonyfan',
    'registration[terms]' => 1,
));
// gets back an array of values - in the "flat" array like above
$values = $form->getValues();
// returns the values like PHP would see them,
// where "registration" is its own array
$values = $form->getPhpValues();
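To make the difference between the two shapes concrete, here is an illustration that uses only plain PHP functions, not the Form class. It assumes that the "PHP values" follow PHP's own rule for bracketed field names, as the comment above describes; the variable names are placeholders:

```php
// Illustration with plain PHP functions, not the Form class itself
$flat = array(
    'registration[username]' => 'symfonyfan',
    'registration[terms]' => 1,
);

// Round-trip through a query string so PHP expands the bracketed names
parse_str(http_build_query($flat), $nested);

// $flat keeps one key per field name, e.g. $flat['registration[username]'];
// $nested groups the fields, e.g. $nested['registration']['username']
```

Note that values read back from a query string are strings, so the integer 1 becomes '1' in $nested.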
To work with multi-dimensional fields:
<form>
    <input name="multi[]" />
    <input name="multi[]" />
    <input name="multi[dimensional]" />
</form>
Pass an array of values:
// sets a single field
$form->setValues(array('multi' => array('value')));

// sets multiple fields at once
$form->setValues(array('multi' => array(
    1 => 'value',
    'dimensional' => 'an other value',
)));
This is great, but it gets better! The Form object allows you to interact with your form like a browser, selecting radio values, ticking checkboxes, and uploading files:
$form['registration[username]']->setValue('symfonyfan');
// checks or unchecks a checkbox
$form['registration[terms]']->tick();
$form['registration[terms]']->untick();
// selects an option
$form['registration[birthday][year]']->select(1984);
// selects many options from a "multiple" select
$form['registration[interests]']->select(array('symfony', 'cookies'));
// fakes a file upload
$form['registration[photo]']->upload('/path/to/lucas.jpg');
Using the Form Data
What's the point of doing all of this? If you're testing internally, you can grab the information off of your form as if it had just been submitted by using the PHP values:
$values = $form->getPhpValues();
$files = $form->getPhpFiles();
If you're using an external HTTP client, you can use the form to grab all of the information you need to create a POST request for the form:
$uri = $form->getUri();
$method = $form->getMethod();
$values = $form->getValues();
$files = $form->getFiles();
// now use some HTTP client and post using this information
One great example of an integrated system that uses all of this is Goutte. Goutte understands the Symfony Crawler object and can use it to submit forms directly:
use Goutte\Client;
// makes a real request to an external site
$client = new Client();
$crawler = $client->request('GET', 'https://github.com/login');
// select the form and fill in some values
$form = $crawler->selectButton('Sign in')->form();
$form['login'] = 'symfonyfan';
$form['password'] = 'anypass';
// submits the given form
$crawler = $client->submit($form);
Selecting Invalid Choice Values
By default, choice fields (select, radio) have internal validation activated to prevent you from setting invalid values. If you want to be able to set invalid values, you can use the disableValidation() method on either the whole form or specific field(s):
// disables validation for a specific field
$form['country']->disableValidation()->select('Invalid value');
// disables validation for the whole form
$form->disableValidation();
$form['country']->select('Invalid value');