HTML Parsing in PHP using Simple HTML DOM parser

January 22, 2017

HTML parsing is the process of extracting relevant information, such as the page title, paragraphs, headings, and links, from a web page's HTML code. Scraping a website with regular expressions is harder and rarely practical, because real-world markup varies in attribute order, quoting, and nesting.
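To make the regex problem concrete, here is a small sketch in plain PHP. The pattern and the sample strings are made up for illustration; the point is only that a pattern written for one form of a link misses other, equally valid forms.

```php
<?php
// A naive pattern that expects exactly: <a href="...">
$pattern = '/<a href="([^"]*)">/';

// Four valid ways of writing a link (illustrative samples).
$samples = [
    '<a href="/one">One</a>',                 // the form the pattern expects
    "<a href='/two'>Two</a>",                 // single quotes
    '<a class="nav" href="/three">Three</a>', // another attribute comes first
    '<A HREF="/four">Four</A>',               // upper-case tag and attribute
];

$found = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        $found[] = $m[1];
    }
}

print_r($found); // only "/one" is extracted
```

You could keep extending the pattern for each variation, but a parser handles all of them without any extra work.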

HTML parsing is a very easy task with the help of the SimpleHtmlDom library. For readers who are unfamiliar with it, SimpleHtmlDom is a PHP library that lets you parse HTML files. The parser is very tolerant of malformed HTML, and it lets you find and manipulate elements easily using jQuery-like selectors.

In this article you will learn how to get started with HTML parsing and see some frequently used SimpleHtmlDom code snippets. The article also includes code that demonstrates how easily HTML parsing can be implemented in PHP using this library.

Download the library for HTML parsing:

The SimpleHtmlDom library is open source and can be downloaded for free from sourceforge.net. After downloading, extract the zip file. The directory contains several files, but the only one we need is simple_html_dom.php; the others are examples and documentation.

How to use SimpleHtmlDom library?

#1. Load the HTML and Create DOM object
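
A minimal sketch of this step, assuming simple_html_dom.php sits in the same directory as your script. It uses the library's str_get_html() and file_get_html() helper functions; the HTML string and the example.com URL are placeholders.

```php
<?php
// Assumption: simple_html_dom.php from the extracted archive is next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Create a DOM object from an HTML string.
$html = str_get_html('<html><body><h1>Hello</h1><p>World</p></body></html>');

// Alternatively, create it from a URL or a local file path (placeholder URL):
// $html = file_get_html('http://www.example.com/');

// Both helpers return false on failure, so check before using the object.
if ($html === false) {
    throw new RuntimeException('Could not load the HTML');
}

// Read something back to confirm the document was parsed.
$heading = $html->find('h1', 0)->plaintext;
echo $heading, "\n";

// Free the memory held by the DOM object when you are done with it.
$html->clear();
```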

#2. How to find HTML elements?

Using the find() method, you can find elements by tag name, ID, class, attribute, and so on. Below are some examples:
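
The following sketch runs a few common selectors against a small made-up HTML fragment. The IDs, class names, and file names in it are only for illustration; find() and the href, src, and plaintext properties come from the library.

```php
<?php
// Assumption: simple_html_dom.php is next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up markup to search in.
$html = str_get_html(
    '<div id="main" class="box">'
    . '<h2>News</h2>'
    . '<a href="/a.html" class="link">First</a>'
    . '<a href="/b.html">Second</a>'
    . '<img src="logo.png" alt="Logo">'
    . '</div>'
);

// By tag name: returns an array of all matching elements.
$links = $html->find('a');
foreach ($links as $a) {
    echo $a->href, ' => ', $a->plaintext, "\n";
}

// By ID: the second argument picks one element by index (null if no match).
$main = $html->find('div#main', 0);

// By class.
$first = $html->find('a.link', 0);

// By attribute: every <img> that has a src attribute.
$images = $html->find('img[src]');

// By attribute value.
$logo = $html->find('img[alt=Logo]', 0);

// Descendant selector: an <h2> anywhere inside div#main.
$news = $html->find('div#main h2', 0);

echo $first->href, "\n";
echo $logo->src, "\n";
echo $news->plaintext, "\n";
```

When you pass an index, always check the result for null before reading its properties, because the element may not exist on the page you are parsing.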

To learn more about HTML parsing and the SimpleHtmlDom library, you can explore the SimpleHtmlDom documentation.

Practical Example:

To see the library in action, we are going to extract information about the top 250 movies from the IMDB website. This is just an example.
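
Here is a sketch of the approach. It does not fetch the live page: it parses a small stand-in table, and the class names (movie, title, year, rating) and the rows themselves are made up for illustration. For the real chart you would load the page with file_get_html() and adapt the selectors to the markup you find there.

```php
<?php
// Assumption: simple_html_dom.php is next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Stand-in markup with hypothetical class names and made-up sample rows.
$sample = '<table><tbody>'
    . '<tr class="movie">'
    . '<td class="title"><a href="/title/1/">Example Movie A</a> <span class="year">(2001)</span></td>'
    . '<td class="rating">8.5</td>'
    . '</tr>'
    . '<tr class="movie">'
    . '<td class="title"><a href="/title/2/">Example Movie B</a> <span class="year">(1999)</span></td>'
    . '<td class="rating">8.1</td>'
    . '</tr>'
    . '</tbody></table>';

$html = str_get_html($sample);

$movies = [];
foreach ($html->find('tr.movie') as $row) {
    // Search inside the current row only.
    $link   = $row->find('td.title a', 0);
    $year   = $row->find('span.year', 0);
    $rating = $row->find('td.rating', 0);

    // Skip rows that do not have the expected cells.
    if ($link === null || $year === null || $rating === null) {
        continue;
    }

    $movies[] = [
        'title'  => trim($link->plaintext),
        'year'   => (int) trim($year->plaintext, ' ()'), // "(2001)" becomes 2001
        'rating' => (float) trim($rating->plaintext),
        'url'    => $link->href,
    ];
}

$html->clear();
print_r($movies);
```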

The code and comments are self-explanatory. After parsing the movie information, you can easily display it on a web page, store it in a database, or export it to a CSV/Excel spreadsheet. The code may stop working if the website's structure changes.
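
As a rough sketch of the CSV option, the helper below writes parsed rows to a file with PHP's built-in fputcsv(). The function name export_movies_csv, the array keys, the sample rows, and the output path are all assumptions for illustration.

```php
<?php
/**
 * Hypothetical helper: write movie rows to a CSV file with a header line.
 * Returns the number of data rows written.
 */
function export_movies_csv(array $movies, string $path): int
{
    $fh = fopen($path, 'w');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    // Header row, then one line per movie. fputcsv() handles quoting.
    fputcsv($fh, ['Title', 'Year', 'Rating'], ',', '"', '\\');
    foreach ($movies as $movie) {
        fputcsv($fh, [$movie['title'], $movie['year'], $movie['rating']], ',', '"', '\\');
    }

    fclose($fh);
    return count($movies);
}

// Made-up sample rows in the shape a parsing step might produce.
$movies = [
    ['title' => 'Example Movie A', 'year' => 2001, 'rating' => 8.5],
    ['title' => 'Example Movie B, Part 2', 'year' => 1999, 'rating' => 8.1],
];

$csvPath = sys_get_temp_dir() . '/movies.csv';
$written = export_movies_csv($movies, $csvPath);
echo "Wrote $written rows to $csvPath\n";
```

The resulting file opens directly in Excel or any other spreadsheet program. Titles that contain commas are quoted automatically, so they stay in one column.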

This is just basic information on SimpleHtmlDom, which is a recommended library for HTML parsing in PHP. If you have any questions, ask in the comment section.

Thanks for reading!
