Introduction to Web Scraping With Laravel
We often need data from websites we didn’t build, whether for research, for analytics, or for more involved work like machine learning. This data can be obtained easily using a process called web scraping.
What Is Web Scraping?
Wikipedia defines web scraping as:
Web scraping is data scraping used for extracting data from websites.
I would say web scraping is the totality of the processes involved in getting data from another website. This data can be exported as CSV or JSON, processed before being rendered or returned, or simply stored in a database.
Knowing how to scrape data is a very useful skill, as it means you don’t have to wait for other websites to provide APIs just to get basic data from them. Web scraping is, however, limited to pages of a website that don’t require any authentication or authorization, so it is not a security risk. It can still be considered illegal and is heavily frowned upon, as some websites hold copyrighted data that should not be redistributed without permission.
Ensure the website being scraped doesn’t have any legal constraints against web scraping before scraping their data.
This article doesn’t show how to export or process the data. It focuses solely on techniques for getting data from websites, using Laravel and the Goutte PHP package.
Getting Started
To get started, we need to create our Laravel app. We can do this with the following command.
laravel new Scraper
This command creates a new folder called Scraper, where all the files our project needs to run are located. However, it only works if you have the Laravel installer installed globally on your computer. You can find how to set it up here.
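If the installer isn’t set up yet, it can be installed globally with Composer (this assumes Composer itself is already available on your machine):
composer global require laravel/installer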
Next, your Laravel application can be previewed by starting its server.
cd Scraper
php artisan serve
With the project set up and running, we can now install the Goutte package.
Install Goutte
Goutte is a PHP web scraping library, and the weidner/goutte package wraps it so that it integrates easily with Laravel.
composer require weidner/goutte
Goutte builds on three Symfony components: BrowserKit, which simulates the behavior of a web browser and lets you make requests, click on links, and submit forms programmatically; DomCrawler, which eases DOM navigation and selection for HTML and XML documents (it can also be used for DOM manipulation, although that isn’t advised); and HttpClient, which performs the actual HTTP requests.
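To get a feel for what Goutte is wrapping, here is a rough sketch of how the same kind of request could be made with the Symfony components directly. You won’t need this for the rest of the article, and the URL here is only a placeholder:
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// BrowserKit's HttpBrowser simulates a browser and uses HttpClient for the requests
$browser = new HttpBrowser(HttpClient::create());

// request() returns a DomCrawler Crawler, the same object Goutte's client gives you
$crawler = $browser->request('GET', 'https://example.com');

// The Crawler can then be filtered with CSS selectors
echo $crawler->filter('title')->text();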
Register Provider And Alias
To let Laravel know where to find the Goutte service, add the following lines to the app.php config file, which lives in the config folder.
'providers' => [
    // Some more providers
    Weidner\Goutte\GoutteServiceProvider::class,
],
'aliases' => [
    // Some other code
    'Goutte' => Weidner\Goutte\GoutteFacade::class,
],
Using The Goutte Package
Create a controller to hold all your scraping code. This makes the code easier to read and, hence, easier to manage and maintain.
php artisan make:controller ScraperController
With the package registered, it is ready for use. Import the Goutte client in the controller.
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use Goutte\Client;

class ScraperController extends Controller
{
    // Other code
}
Create an index method in the ScraperController, to test our installation and request.
public function index()
{
    $client = new Client();
    $website = $client->request('GET', 'https://www.businesslist.com.ng/category/interior-design/city:lagos');

    return $website->html();
}
This example simply fetches the page and renders its HTML, but it is enough to confirm that the package works.
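To actually reach this method in the browser, you also need a route pointing to it. A minimal sketch, assuming a recent Laravel version and a /scrape path of our choosing, would be to add this to routes/web.php:
use App\Http\Controllers\ScraperController;
use Illuminate\Support\Facades\Route;

// Visiting /scrape runs the index method and renders the scraped HTML
Route::get('/scrape', [ScraperController::class, 'index']);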
Accessing The DOM
In the example above, we saw that we had access to the HTML of the requested website. We can also access individual elements in the DOM.
You can select specific parts of the page using the filter() method. It returns a Symfony\Component\DomCrawler\Crawler object that has a nodes property, an array of the element nodes that match the given selector. To learn more about element nodes, see here.
For example, you can select all h2 tags like this:
$client = new Client();
$website = $client->request('GET', 'https://www.businesslist.com.ng/category/interior-design/city:lagos');
$companies = $website->filter('h2');
To be more specific, you can select elements nested inside other elements. For example, you can select all anchor elements (a) that are nested in an h4 tag.
$client = new Client();
$website = $client->request('GET', 'https://www.businesslist.com.ng/category/interior-design/city:lagos');
$companies = $website->filter('h4 > a');
You can iterate over the matched nodes by appending the each() method. It takes a callback, where you can do whatever you want with each node. In this example, the text in each node is dumped.
$client = new Client();
$website = $client->request('GET', 'https://www.businesslist.com.ng/category/interior-design/city:lagos');

$companies = $website->filter('h4 > a')->each(function ($node) {
    dump($node->text());
});
Alternatively, you can just return all these texts, so you’d have an array.
$client = new Client();
$website = $client->request('GET', 'https://www.businesslist.com.ng/category/interior-design/city:lagos');

$companies = $website->filter('h4 > a')->each(function ($node) {
    return $node->text();
});
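To see what was collected, you can simply return the array from the index method; Laravel serializes arrays returned from controllers as JSON, so visiting the route shows all the texts:
// Returning the array from the controller renders it as a JSON response
return $companies;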
Accessing Children Elements
Some elements represent whole blocks and have children inside them. For example, when viewing products on an e-commerce platform, each product has its own div. This div has child elements representing the name, price, ratings, and other information. These child elements can be easily accessed by calling the children() method on the node.
$companies = $website->filter('.company')->each(function ($node) {
    // Return the inner array so the outer each() actually collects it
    return $node->children()->each(function ($child) {
        return $child->text();
    });
});
In this example, divs with the company class are selected and iterated over. For each div, its child elements have their text returned.
Selecting Only Specific Elements
In the example above, we had access to all the child nodes, both the ones we want and the ones we don’t. We can easily select only the child nodes that match a particular class, element, or id.
$companies = $website->filter('.company')->each(function ($node) {
    return $node->children()->each(function ($child) {
        if ($child->matches('.address')) {
            return 'Address is ' . $child->text();
        }
    });
});
In this example, only children nodes with the address class are returned.
Alternatively, you may pass the selector straight to children(), which is a simpler way to select only the child elements that match the class or id.
$companies = $website->filter('.company')->each(function ($node) {
    return $node->children('.address')->each(function ($child) {
        return 'Address is ' . $child->text();
    });
});
Selecting Elements By Position
In the examples above, we filtered down the number of nodes we had. At other times, you want just the first element, the last element, or one selected by its position in the list.
$companies = $website->filter('.company')->each(function ($node) {
    return [
        'first_item' => $node->children()->eq(0)->text(),
        'first_item_again' => $node->children()->first()->text(),
        'second_item' => $node->children()->eq(1)->text(),
        'last_item' => $node->children()->last()->text(),
    ];
});
Accessing Node Values
You may access node properties like the node name, the text content (as seen multiple times above), and attributes such as class, id, src (for images), and href (for links).
$companies = $website->filter('.company')->each(function ($node) {
    return [
        'nodeName' => $node->nodeName(),
        'attributes' => [
            'class' => $node->attr('class'),
            'id' => $node->attr('id'),
        ],
        'html' => $node->html(),
        'outerHTML' => $node->outerHtml(),
    ];
});
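Attributes such as href and src are read the same way with attr(). As a sketch, assuming the company names on this listing page are links inside h4 tags and that each .company block contains an img tag:
$links = $website->filter('h4 > a')->each(function ($node) {
    return [
        'name' => $node->text(),
        // href of the anchor, handy for visiting the detail page later
        'url' => $node->attr('href'),
    ];
});

$logos = $website->filter('.company img')->each(function ($node) {
    // src of the image; note it may be a relative path on some sites
    return $node->attr('src');
});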
Clicking Links
The selectLink() method lets you select a link to click on or extract data from. Clicking the link returns the result of the click, usually another web page, or perhaps a file in the case of a download link.
// Send request to the website
$website = $client->request('GET', 'https://www.businesslist.com.ng/category/interior-design/city:lagos');
// Select the link
$link = $website->selectLink('Nekta Group')->link();
// Click the link
$result = $client->click($link);
return $result->html();
More Methods
There are many more methods available to access and manipulate the DOM. You can check them here. Using an IDE like PHPStorm lets you see all available methods easily.
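As a small taste, and by no means an exhaustive list, a few other Crawler methods you might reach for look like this:
$companies = $website->filter('.company');

// count() tells you how many nodes matched the selector
dump($companies->count());

// filterXPath() lets you use XPath expressions instead of CSS selectors
dump($website->filterXPath('//h4/a')->count());

// siblings(), nextAll() and previousAll() navigate around a node
dump($companies->first()->siblings()->count());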
Problems With Web Scraping
Sometimes you’ll notice that the scraper starts malfunctioning after a while. This could be due to any of the following:
- The target website could have changed its structure. This breaks your code, since the scraper pulls data out of specific element nodes that may no longer be present or may have moved; a defensive check for this case is sketched after this list.
- The target website could have blocked your IP. Running your scraper too many times may attract the attention of the target website, leading to them blocking you.
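One way to fail more gracefully when the markup changes is to check that a selector actually matched something before reading from it. A minimal sketch, reusing the selectors from the earlier examples:
$companies = $website->filter('.company')->each(function ($node) {
    $name = $node->filter('h4 > a');
    $address = $node->filter('.address');

    return [
        // Guard against elements having been renamed, moved, or removed
        'name' => $name->count() ? $name->text() : null,
        'address' => $address->count() ? $address->text() : null,
    ];
});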
Summary
Now you know how to get all sorts of data from web pages using Laravel.
If you have any questions or relevant advice, please get in touch with me to share them.
To read more of my articles or follow my work, you can connect with me on LinkedIn, Twitter, and Github. It’s quick, it’s easy, and it’s free!