Web Scraping with PHP: Library and Regex Examples
In this article, we'll cover what web scraping is and how to accomplish it in different scenarios using either a pre-built PHP scraping library or regular expressions.
What is Web Scraping?
Web scraping is a method used for extracting and collecting data from a web page, generally accomplished using a pre-written, automated script.
There are several scenarios in which web scraping with PHP can be useful, including gathering data, displaying images, running price comparisons, and much more. Whatever your end goal is, it's best to do this sparingly and with good intentions.
When scraping content from external web pages, it's best practice to do so in a non-invasive fashion and on a limited basis. If you're relying solely on external information that isn't yours to begin with, you may need to rethink your methods. Additionally, constantly scraping an external page could cause performance issues for the target website.
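One way to keep requests to a minimum is to cache each page locally and only re-fetch it once the copy is stale. The following is a minimal sketch of that idea, not a definitive implementation. The function name fetch_cached(), the cache directory, and the one-hour lifetime are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: return a cached copy of $url if it is younger than
// $ttl seconds, otherwise fetch the page once and store it on disk.
// $fetch is any callable that takes a URL and returns the body or false;
// it defaults to PHP's built-in file_get_contents().
function fetch_cached($url, $cacheDir, $ttl = 3600, $fetch = null)
{
    $fetch = $fetch ?? 'file_get_contents';
    $cacheFile = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';

    // Make sure we read the file's current modification time.
    clearstatcache(true, $cacheFile);

    // Serve the cached copy while it is still fresh.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    // Otherwise hit the target site once and remember the result.
    $body = $fetch($url);
    if ($body === false) {
        return false;
    }
    file_put_contents($cacheFile, $body);

    return $body;
}
```

Calling fetch_cached("https://www.google.com", "/tmp") twice within an hour should result in only one request to the external site, assuming the cache directory exists and is writable.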
Web Scraping PHP Library
If PHP is your primary language, the easiest and most effective way to scrape a web page is with a pre-built PHP library.
There are many useful libraries out there, but the one I've found the easiest to work with that also provides the most accurate data is the PHP Simple HTML DOM Parser on SourceForge. It's free to download and takes seconds to get going.
Once you've downloaded the ZIP file, extract the contents and look for the file named simple_html_dom.php. This is the only file you'll need to include to get it working. Place the file somewhere in your code. In this example, we'll place it in the same directory as our script:
require "simple_html_dom.php";
Now, the parser library is available for use within your PHP script.
Next, let's choose a URL to scrape data from and get that data using the library's built-in file_get_html() function:
$html = file_get_html("https://www.google.com");
Here, we're assigning the results pulled from Google's website to a new variable, $html, from which we can now parse DOM elements.
For example, we can now pull the target page title from the returned DOM elements stored in our $html variable and output it to the screen:
$title = $html->find("title", 0);
echo $title->plaintext;
The output is the title of Google's homepage, Google.
You can also scrape images using the library. This code snippet will pull the first img tag found and output its source to the screen:
$image = $html->find("img", 0);
echo $image->src;
// https://www.google.com/logos/doodles/2021/celebrating-hisaye-yamamoto-6753651837109044-l.png
To pull and loop through all images on the web page, you could use a foreach loop over the results of calling find() on the $html variable:
foreach($html->find("img") as $image) {
echo $image->src . "<br/>";
}
// image 1 source
// image 2 source
// ...
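If adding a third-party file to your project isn't an option, similar results can be had with PHP's bundled DOM extension. The sketch below swaps in DOMDocument in place of the Simple HTML DOM Parser, so it is an alternative technique and not part of that library. The function name extract_title_and_images() is hypothetical, and the code assumes you already have the page's HTML in a string.

```php
<?php
// Hypothetical helper: pull the page title and all image sources
// out of an HTML string using PHP's built-in DOMDocument class.
function extract_title_and_images(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, if the page has one.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : null;

    // Every <img> element that actually carries a src attribute.
    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }

    return ['title' => $title, 'images' => $sources];
}
```

The function returns an array with a title key and an images key, so the title and image sources can be echoed the same way as in the library examples.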
Web Scraping with Regular Expressions
If you're looking into web scraping with PHP using a custom solution, you can use regular expressions to parse the DOM elements.
Let's start by pulling the page contents with PHP's built-in file_get_contents() function and passing in a URL:
$data = file_get_contents("https://www.google.com");
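Because file_get_contents() returns false when a request fails, it can help to wrap the call. The following is a rough sketch under a few assumptions: the function name fetch_page(), the 10-second timeout, and the user agent string are all made up for illustration.

```php
<?php
// Hypothetical wrapper around file_get_contents() that sets a timeout
// and a user agent, and returns null instead of false on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            // Placeholder value; identify your own script here.
            'user_agent' => 'ExampleScraper/1.0',
        ],
    ]);

    // The @ operator suppresses the warning PHP raises on failure;
    // the return value is checked explicitly instead.
    $data = @file_get_contents($url, false, $context);

    return $data === false ? null : $data;
}
```

With this in place, a check such as if (($data = fetch_page($url)) === null) lets the script stop early instead of running regular expressions against an empty result.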
Next, we'll pull the contents in between the opening and closing title tags of the page output using the preg_match() function for regular expression matching:
preg_match("/<title>([^<]+)<\/title>/i", $data, $matches);
$title = $matches[1];
echo $title;
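The snippet above assumes a match is always found. If the page has no title tag, $matches[1] won't exist. Here is a slightly more defensive sketch of the same idea. The function name extract_title() is hypothetical, and the pattern is loosened a little to allow attributes on the tag and a title that spans several lines.

```php
<?php
// Hypothetical helper: return the page title, or null when none is found.
function extract_title(string $html): ?string
{
    // "i" ignores tag case; "s" lets "." match line breaks so a title
    // spread over several lines is still captured.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches) === 1) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }

    return null;
}
```

Checking the return value of preg_match() before reading $matches avoids an undefined index notice on pages that don't match.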
Similarly, we can use regular expressions for finding other tags, like images and their source values:
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$image = $matches[1];
echo $image;
// https://www.google.com/logos/doodles/2021/celebrating-hisaye-yamamoto-6753651837109044-l.png
The above code snippet pulls the first img tag's source attribute value and outputs it to the screen.
Using PHP's preg_match_all() function, we can find all img tag occurrences in the DOM and output their sources to the screen:
preg_match_all('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
foreach ($matches[1] as $src) {
echo $src . '<br/>';
}
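The same approach can be stretched to collect several attributes per image, such as alt text, width, and height. The sketch below is one possible way to do that in two passes. The function name extract_images() is hypothetical, and the patterns assume quoted attribute values that don't contain a > character.

```php
<?php
// Hypothetical helper: return one associative array of attributes
// per <img> tag that has a src attribute.
function extract_images(string $html): array
{
    $images = [];

    // First pass: grab each complete <img ...> tag.
    preg_match_all('/<img\b[^>]*>/i', $html, $tags);

    foreach ($tags[0] as $tag) {
        // Second pass: collect name="value" or name='value' pairs.
        // \2 requires the closing quote to match the opening one.
        preg_match_all('/([a-z-]+)\s*=\s*([\'"])(.*?)\2/i', $tag, $attrs, PREG_SET_ORDER);

        $image = [];
        foreach ($attrs as $attr) {
            // Normalize attribute names to lowercase.
            $image[strtolower($attr[1])] = $attr[3];
        }

        if (isset($image['src'])) {
            $images[] = $image;
        }
    }

    return $images;
}
```

Each entry can then be read by key, for example $images[0]['alt'], keeping in mind that an attribute missing from the tag is also missing from the array.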
Get Metadata with PHP
If you just want to pull metadata from a web page, no additional libraries or long code snippets are needed. You can simply use PHP's built-in get_meta_tags() function, which takes the URL of the external page as its argument:
$data = get_meta_tags("https://www.google.com");
The result is an array of key-value pairs, where the key name is the meta tag's name attribute value, and the key's value is the content attribute value.
In this case, Google only returns a meta description and robots information. Google doesn't have much metadata to view, but play around with this function on other websites to see what information you can get.
Basically, any properly written meta tag on a web page, meaning one with both a name and a content attribute, will be returned.
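To see the shape of the returned array without hitting an external site, get_meta_tags() can also be pointed at a local file. The sketch below writes a small sample page to a temporary file and reads it back. The sample markup and tag values are made up for illustration.

```php
<?php
// Write a small sample page to a temporary file.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, '<html><head>
<meta name="Description" content="A sample page">
<meta name="geo.position" content="49.33;-86.59">
<meta property="og:title" content="Uses property, not name">
</head><body></body></html>');

// get_meta_tags() accepts a local path as well as a URL.
$tags = get_meta_tags($file);
unlink($file);

// Print each key-value pair that was picked up.
foreach ($tags as $name => $content) {
    echo $name . ': ' . $content . PHP_EOL;
}
```

Running this should list a description entry and a geo_position entry: keys are lowercased and special characters like the dot are replaced with underscores. The third tag is skipped because it uses a property attribute instead of name.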
Conclusion
Using the PHP Simple HTML DOM Parser from the first example is, by far, the easiest solution when web scraping with PHP. The functionality comes built into the library, allowing for easy access with very little code.
Regular expressions are also useful when you're looking for a more customized solution. This solution could also be used in cases where you need multiple tag attribute values like image widths, heights, alt attribute values, etc.
Written by: Josh Rowe
Created: May 04, 2021