Introducing four new PHP 5.3 components and Goutte, a simple web scraper
To support Symfony 2's development, Fabien Potencier, the lead developer of the symfony framework, has released four new PHP 5.3 based components: CssSelector, DomCrawler, Process, and BrowserKit.
Though these components will be used by Symfony 2, they’re built to be standalone components that can be easily used in any PHP 5.3 project. To prove that point, Fabien also released a new web scraper/crawler called Goutte which uses these four components, along with four additional components from Zend Framework. It’s a prime example of the flexibility and power that standalone components, along with a willingness to share, can provide.
CssSelector
The first new component, CssSelector, converts CSS selectors to XPath so that the power of XPath can be used with the familiarity of CSS selectors. The component is a port of the CSS selector code in the Python lxml library, translated from Python to PHP with some unit tests added.
Usage is simple and is covered in greater detail by Fabien on his blog. The following code, from Fabien's blog, loads a page, selects every anchor matching div.item > h4 > a, and prints each link's text and href attribute.
use Symfony\Components\CssSelector\Parser;

$document = new \DOMDocument();
$document->loadHTMLFile('http://fabien.potencier.org/articles');
$xpath = new \DOMXPath($document);

foreach ($xpath->query(Parser::cssToXpath('div.item > h4 > a')) as $node) {
    printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
}
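To make the conversion concrete, here is a rough sketch of what an XPath equivalent of that selector looks like when written by hand. It uses only PHP's built-in DOM extension; the inline markup is made up for illustration, and the expression the component actually generates may differ in form.

```php
<?php
// Made-up markup standing in for a real page.
$html = <<<HTML
<html><body>
  <div class="item first"><h4><a href="/one">One</a></h4></div>
  <div class="other"><h4><a href="/skip">Skip</a></h4></div>
  <div class="item"><h4><a href="/two">Two</a></h4><p><a href="/nested">Nested</a></p></div>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Hand-written equivalent of the CSS selector "div.item > h4 > a".
// The concat/normalize-space test matches "item" as a whole class name,
// and each "/" step mirrors the ">" child combinator.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]/h4/a";

$links = array();
foreach ($xpath->query($expression) as $node) {
    $links[] = sprintf('%s (%s)', $node->nodeValue, $node->getAttribute('href'));
}

echo implode("\n", $links), "\n";
```

Only the two anchors that sit directly inside an h4 of a div with the item class are matched, which is exactly the bookkeeping the CSS form hides.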
DomCrawler
After the CssSelector, the obvious next step is to create a component that allows you to take control of any HTML or XML content. The DomCrawler allows you to do just that. Though there’s not yet any real documentation, the unit tests reveal a powerful system for crawling the DOM.
use Symfony\Components\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addHtmlContent('<html><div class="foo"></div></html>');

$crawler->filter('div')->attr('class'); // returns foo
The component has a rich list of methods that can be called to perform tasks on your DOM such as filtering, returning attributes, returning text, calling methods iteratively on nodes, and manipulating link and form elements.
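The idea behind such a crawler, a wrapper around a list of DOM nodes whose filter methods return a new wrapper, can be sketched with PHP's built-in DOM classes. MiniCrawler and its filterTag method are hypothetical names invented for this illustration; this is not the DomCrawler source, and the real component's methods and signatures may differ.

```php
<?php
// Hypothetical toy class; NOT the DomCrawler implementation.
class MiniCrawler
{
    private $nodes;

    public function __construct(array $nodes)
    {
        $this->nodes = $nodes;
    }

    // Parse an HTML string and start from its root element.
    public static function fromHtml($html)
    {
        $document = new DOMDocument();
        $document->loadHTML($html);

        return new self(array($document->documentElement));
    }

    // Return a new crawler holding every descendant with the given tag name.
    public function filterTag($tag)
    {
        $found = array();
        foreach ($this->nodes as $node) {
            foreach ($node->getElementsByTagName($tag) as $match) {
                $found[] = $match;
            }
        }

        return new self($found);
    }

    // Attribute value of the first node, or null when the list is empty.
    public function attr($name)
    {
        return $this->nodes ? $this->nodes[0]->getAttribute($name) : null;
    }

    // Text content of the first node, or null when the list is empty.
    public function text()
    {
        return $this->nodes ? $this->nodes[0]->nodeValue : null;
    }

    // Call $callback on every node and collect the return values.
    public function each($callback)
    {
        $results = array();
        foreach ($this->nodes as $i => $node) {
            $results[] = call_user_func($callback, $node, $i);
        }

        return $results;
    }
}

$crawler = MiniCrawler::fromHtml(
    '<html><body><div class="foo">first</div><div class="bar">second</div></body></html>'
);
$divs = $crawler->filterTag('div');

$firstClass = $divs->attr('class');
$firstText = $divs->text();
$classes = $divs->each(function ($node, $i) {
    return $i . ':' . $node->getAttribute('class');
});

echo $firstClass, ' / ', $firstText, ' / ', implode(',', $classes), "\n";
```

Because every filter call returns a fresh wrapper, calls can be chained, which is what makes a one-liner like filter(...)->attr(...) possible.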
Process
The Process component tackles another issue entirely: it allows PHP scripts to be run in entirely separate processes. In other words, “PhpProcess runs a PHP script in a forked process.” This is done via a simple class wrapper around the proc_* functions.
use Symfony\Components\Process\PhpProcess;

// PhpProcess takes PHP source code as a string, so the script file is pulled in with require
$process = new PhpProcess('<?php require "/path/to/script.php"; ?>');
$process->run();

echo $process->getOutput();
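For a sense of what a wrapper around the proc_* functions has to do, here is a minimal sketch using proc_open directly. The function name run_php_code is hypothetical, and the sketch assumes a PHP command-line binary is available (PHP_BINARY where defined, otherwise php on the PATH); it is not the Process component's code.

```php
<?php
// Hypothetical helper for illustration; NOT the Process component itself.
function run_php_code($code)
{
    // Assumption: a PHP CLI binary is reachable.
    $binary = (defined('PHP_BINARY') && PHP_BINARY !== '')
        ? escapeshellarg(PHP_BINARY)
        : 'php';

    $descriptors = array(
        0 => array('pipe', 'r'), // child's stdin: we write the script here
        1 => array('pipe', 'w'), // child's stdout
        2 => array('pipe', 'w'), // child's stderr
    );

    $process = proc_open($binary, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Unable to start the child process.');
    }

    // With no script argument, the CLI reads the code to run from stdin.
    fwrite($pipes[0], $code);
    fclose($pipes[0]);

    // Reading the pipes one after the other is fine for small outputs;
    // a robust wrapper would read both without blocking.
    $output = stream_get_contents($pipes[1]);
    $errors = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    $exitCode = proc_close($process);

    return array('output' => $output, 'errors' => $errors, 'exit_code' => $exitCode);
}

$result = run_php_code('<?php echo 6 * 7; ?>');
echo $result['output'], "\n";
```

The child runs in its own process, so a fatal error or memory blow-up in the script does not take the calling script down with it.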
BrowserKit
Finally, the BrowserKit component brings all of the components together. The BrowserKit makes a request (via a method you define), and then allows you to interact with the page (e.g. click, submit) or retrieve information from the page (via the DomCrawler).
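The "method you define" part is a classic template method: the base class handles bookkeeping and parsing, while a subclass supplies the transport. The sketch below illustrates only that idea with made-up classes (SketchClient and ArrayClient are hypothetical names) and canned pages instead of real HTTP; the actual BrowserKit classes and signatures may differ.

```php
<?php
// Hypothetical classes for illustration; these are NOT the BrowserKit API.
abstract class SketchClient
{
    private $history = array();

    // Subclasses decide how a request is actually performed
    // and return the response body as an HTML string.
    abstract protected function doRequest($method, $uri);

    public function request($method, $uri)
    {
        $this->history[] = $method . ' ' . $uri;

        $document = new DOMDocument();
        $document->loadHTML($this->doRequest($method, $uri));

        return $document;
    }

    public function getHistory()
    {
        return $this->history;
    }
}

// A transport that serves canned pages from an array instead of the network.
class ArrayClient extends SketchClient
{
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    protected function doRequest($method, $uri)
    {
        return isset($this->pages[$uri])
            ? $this->pages[$uri]
            : '<html><body><h1>Not found</h1></body></html>';
    }
}

$client = new ArrayClient(array(
    '/' => '<html><body><a href="/plugins">Plugins</a></body></html>',
    '/plugins' => '<html><body><h1>Plugin list</h1></body></html>',
));

// "Click" the first link on the home page by requesting its href.
$home = $client->request('GET', '/');
$href = $home->getElementsByTagName('a')->item(0)->getAttribute('href');
$next = $client->request('GET', $href);
$title = $next->getElementsByTagName('h1')->item(0)->nodeValue;

echo $title, "\n";
```

Swapping ArrayClient for a subclass that performs real HTTP requests would leave the calling code unchanged, which is the flexibility this kind of design is after.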
The best way to understand the BrowserKit is to see it in action with Goutte.
Goutte – a screen scraping and web crawling library
Goutte combines the above four components along with Zend Framework’s Date, Uri, Http, and Validate components to form an easy and powerful way to programmatically crawl and interact with web pages.
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony-project.org/');

// Click on a link
$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);

// Read through a list of error messages
$nodes = $crawler->filter('ul.error_list');
foreach ($nodes as $node) {
    // Iterating a crawler yields DOM nodes, so read the text via nodeValue
    echo 'Error: ' . $node->nodeValue;
}
Responses and Pingbacks
April 23rd, 2010 at 4:38 pm
[…] the php|architect blog today there’s a new post from Ryan Weaver about some of the new components that’ve been added to the Symfony framework […]
April 24th, 2010 at 7:07 pm
I did know about the components but not about “Goutte” to see them working together, excellent info!
April 25th, 2010 at 10:08 am
thanks for the article, Ryan
April 26th, 2010 at 9:57 am
Thanks for the great news and article!
I’d really love to see a tutorial where a website (>20k pages) is crawled with a forked process using Goutte.
… or I’ll get there putting a lot of time and energy into it. 🙂
Thanks again!
Flem
April 27th, 2010 at 8:43 am
The crawling can be done really well with Query Path (jQuery like in PHP).
Glad to see this out there as well.