
I have seen plenty of XML parsers, but until now I had never touched web programming. Now I want to work out, together with you, how to write a simple XML parser in PHP.

Why? Because we need to!

Seriously, though: XML files are very useful, and any professional not merely should but must know how to work with them. We want to become professionals, don't we? If you are reading my blog, you do.

I will assume you already know what XML is and will not describe it here. If you don't, it is easy to find out: http://ru.wikipedia.org/wiki/XML

While looking for ways to parse XML in PHP, I discovered a simple set of PHP functions for working with XML files called "XML Parser Functions". Parsing starts with parser initialization by calling the xml_parser_create function:

$xml_parser = xml_parser_create();

Then we need to tell the parser which functions will process the XML tags and the text it comes across while parsing. That is, we need to install some handlers:

xml_set_element_handler($xml_parser, "startElement", "endElement");

This function sets the handlers for the start and the end of an element. For example, if the XML file contains an opening tag followed by its closing tag, startElement is called when the parser reaches the opening tag, and endElement is called when it reaches the closing tag.

According to the PHP documentation, startElement receives three parameters (the parser, the element name, and an array of the element's attributes), while endElement receives two (the parser and the element name).

But how do we read data from a file? So far none of the functions has taken a parameter for that. The answer is that reading the file is the programmer's responsibility: we open it ourselves with the standard file functions such as fopen.

Once the file is open, we read it line by line and feed each line to the xml_parse function:

<?php
while (($data = fgets($fp)) !== false) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        echo "<br>XML Error: ".xml_error_string(xml_get_error_code($xml_parser));
        echo " at line ".xml_get_current_line_number($xml_parser);
        break;
    }
}
?>

Two very important points here. First, the third parameter of xml_parse is a flag that marks the last piece of data (true if this is the final piece, false otherwise). Second, as in any serious work, we must watch for errors. The xml_get_error_code and xml_error_string functions take care of this: the first returns the error code, and the second returns a text description for that code. What an error looks like in practice we will see a little later. The equally useful xml_get_current_line_number function reports the number of the line currently being processed.
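To look at the error functions in isolation, here is a minimal sketch that parses a deliberately broken XML string held in memory instead of a file. The helper name describeXmlError and the sample string are made up for illustration; the xml_* calls are the standard ones discussed above, and the sketch assumes the PHP xml extension is installed.

```php
<?php
// Hypothetical helper: returns null on success or a readable error message.
function describeXmlError(string $xml): ?string
{
    $parser = xml_parser_create();
    // The whole document is passed in one call, so the "final" flag is true.
    $ok = xml_parse($parser, $xml, true);
    $message = null;
    if (!$ok) {
        $message = sprintf(
            "XML Error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        );
    }
    // Drop our reference to the parser; on PHP versions before 8
    // xml_parser_free($parser) would be called here instead.
    unset($parser);
    return $message;
}

// The closing tag does not match the opening one.
$broken = "<root>\n<phone>123</telephone>\n</root>";
echo describeXmlError($broken) ?? "no errors", "\n";
```

Returning the message instead of echoing it makes the check reusable, for example to log the problem and skip a bad file.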

And, as always, we must release the resources we have taken. For XML parsing this is done by the xml_parser_free function:

xml_parser_free($xml_parser);

That covers the main functions. It is time to see them in action. For this I came up with an XML file with a very simple structure:




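<?xml version="1.0"?>
<root>
  <info who="mine">
    <address ulica="my street!!" kvartira="12" dom="15">123</address>
    <phone>71234567890</phone>
  </info>
</root>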

Let's call this file data.xml and try to parse it with the following code:

<?php
function startElement($parser, $name, $attrs) {
    global $depth;
    echo str_repeat("&nbsp;", $depth * 3); // indentation
    echo "<b>Element: $name</b><br>"; // element name
    $depth++; // increase depth so the browser shows indentation
    foreach ($attrs as $attr => $value) {
        echo str_repeat("&nbsp;", $depth * 3); // indentation
        // display the attribute name and its value
        echo "Attribute: ".$attr." = ".$value."<br>";
    }
}

function endElement($parser, $name) {
    global $depth;
    $depth--; // decrease depth
}

$depth = 0;
$file = "data.xml";
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
while (($data = fgets($fp)) !== false) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        echo "<br>XML Error: ";
        echo xml_error_string(xml_get_error_code($xml_parser));
        echo " at line ".xml_get_current_line_number($xml_parser);
        break;
    }
}
fclose($fp);
xml_parser_free($xml_parser);
?>

When this very simple script ran, the browser displayed the following:

Element: ROOT
Element: INFO
Attribute: WHO = mine
Element: ADDRESS
Attribute: ULICA = my street!!
Attribute: KVARTIRA = 12
Attribute: DOM = 15
Element: PHONE

Let's try to corrupt the XML file by replacing the opening <phone> tag with <telephone> while leaving the closing tag unchanged:

Element: ROOT
Element: INFO
Attribute: WHO = mine
Element: ADDRESS
Attribute: ULICA = my street!!
Attribute: KVARTIRA = 12
Attribute: DOM = 15
Element: TELEPHONE

XML Error: Mismatched tag at line 5

Wow! Error messages work! And quite informative.

Oh, I forgot one more thing... We never displayed the text contained inside the address and phone tags. Let's fix that by adding a text handler with the xml_set_character_data_handler function:

xml_set_character_data_handler($xml_parser, 'stringElement');

And add the handler function itself to the code.
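The handler itself is not shown in the text, so what follows is only a sketch of what it might look like, not the author's original function. It uses closures instead of global variables, collects lines into an array instead of echoing HTML, and parses a short in-memory string. The variable names $lines and $stringElement and the sample XML are assumptions made for the example.

```php
<?php
$depth = 0;
$lines = []; // collected output, one entry per line

$xml_parser = xml_parser_create();

xml_set_element_handler(
    $xml_parser,
    // Start of an element: record its name and go one level deeper.
    function ($parser, $name, $attrs) use (&$depth, &$lines) {
        $lines[] = str_repeat("  ", $depth) . "Element: $name";
        $depth++;
    },
    // End of an element: go one level back up.
    function ($parser, $name) use (&$depth) {
        $depth--;
    }
);

// Text handler: receives the character data found between tags.
// The parser may deliver the text of one element in several pieces.
$stringElement = function ($parser, $str) use (&$depth, &$lines) {
    $text = trim($str);
    if ($text !== '') { // ignore whitespace-only pieces between tags
        $lines[] = str_repeat("  ", $depth) . "Text: $text";
    }
};
xml_set_character_data_handler($xml_parser, $stringElement);

xml_parse($xml_parser, "<info><phone>71234567890</phone></info>", true);

echo implode("\n", $lines), "\n";
```

Element names come out in upper case because the parser folds case by default, just as in the output shown earlier.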

Now we will learn to work with XML. XML is a format for exchanging data between sites. It looks a lot like HTML, except that XML lets you define your own tags and attributes.

Why does parsing need XML? Sometimes the site you want to parse has an API that gives you what you need without much effort. So here is a piece of advice straight away: before parsing a site, check whether it has an API.

What is an API? It is a set of functions through which you can send a request to the site and get the response you need. That response most often comes in XML format. So let's start studying it.

Working with XML in PHP

Let's say you have XML. It can be in a string, stored in a file, or served on request to a specific URL.

Let the XML be stored in a string. In this case, create an object from that string using new SimpleXMLElement:

$str = '<worker><name>Kolya</name><age>25</age><salary>1000</salary></worker>';
$xml = new SimpleXMLElement($str);

The variable $xml now holds an object with the parsed XML. By accessing the properties of this object you can reach the contents of the XML tags. Exactly how is covered a little further down.

If the XML is stored in a file or is returned from a URL (which is the most common case), use the simplexml_load_file function, which produces the same $xml object:

<worker>
  <name>Kolya</name>
  <age>25</age>
  <salary>1000</salary>
</worker>

$xml = simplexml_load_file($path); // $path holds the file path or URL

Working methods

In the examples below, our XML is stored in a file or URL.

Let the following XML be given:

<worker>
  <name>Kolya</name>
  <age>25</age>
  <salary>1000</salary>
</worker>

Let's get the name, age and salary of an employee:

$xml = simplexml_load_file($path); // $path holds the file path or URL
echo $xml->name;   // displays "Kolya"
echo $xml->age;    // outputs 25
echo $xml->salary; // outputs 1000

As you can see, the $xml object has properties corresponding to the tags.

You may have noticed that the <worker> tag never appears when we access the data. That is because it is the root tag. You can rename it, for example to <root>, and nothing will change:

<root>
  <name>Kolya</name>
  <age>25</age>
  <salary>1000</salary>
</root>

$xml = simplexml_load_file($path); // $path holds the file path or URL
echo $xml->name;   // displays "Kolya"
echo $xml->age;    // outputs 25
echo $xml->salary; // outputs 1000

There can be only one root tag in XML, just like the <html> tag in plain HTML.

Let's modify our XML a bit:

<root>
  <worker>
    <name>Kolya</name>
    <age>25</age>
    <salary>1000</salary>
  </worker>
</root>

In this case, we get a chain of calls:

$xml = simplexml_load_file($path); // $path holds the file path or URL
echo $xml->worker->name;   // displays "Kolya"
echo $xml->worker->age;    // outputs 25
echo $xml->worker->salary; // outputs 1000

Working with Attributes

Let some data be stored in attributes:

<root>
  <worker name="Kolya" age="25" salary="1000">Number 1</worker>
</root>

$xml = simplexml_load_file($path); // $path holds the file path or URL
echo $xml->worker["name"];   // displays "Kolya"
echo $xml->worker["age"];    // outputs 25
echo $xml->worker["salary"]; // outputs 1000
echo $xml->worker;           // prints "Number 1"
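One detail worth a small sketch: what SimpleXML hands back for a tag or an attribute is an object, not a plain string, so it is usually cast before further use. The example below assumes the XML is in a string rather than a file, and the variable names $age, $label and $attrs are chosen only for illustration.

```php
<?php
$str = '<root><worker name="Kolya" age="25" salary="1000">Number 1</worker></root>';
$xml = new SimpleXMLElement($str);

// Properties and attributes are SimpleXMLElement objects; cast them
// to get ordinary PHP values.
$age   = (int) $xml->worker['age'];
$label = (string) $xml->worker;

// attributes() lets us walk all attributes without knowing their names.
$attrs = [];
foreach ($xml->worker->attributes() as $key => $value) {
    $attrs[$key] = (string) $value;
}

var_dump($age + 1); // arithmetic is safe once the value is an int
print_r($attrs);
```

Casting early avoids surprises when the values are later compared, used as array keys, or stored.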

Tags with hyphens

In XML, tags (and attributes) with a hyphen are allowed. Such tags are accessed using curly braces and a quoted name:

<root>
  <worker>
    <first-name>Kolya</first-name>
    <last-name>Ivanov</last-name>
  </worker>
</root>

$xml = simplexml_load_file($path); // $path holds the file path or URL
echo $xml->worker->{'first-name'}; // displays "Kolya"
echo $xml->worker->{'last-name'};  // displays "Ivanov"

Loop iteration

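Now suppose we have not one worker but several. In that case we can iterate over the object with a foreach loop:

<root>
  <worker><name>Kolya</name><age>25</age><salary>1000</salary></worker>
  <worker><name>Vasya</name><age>26</age><salary>2000</salary></worker>
  <worker><name>Petya</name><age>27</age><salary>3000</salary></worker>
</root>

$xml = simplexml_load_file($path); // $path holds the file path or URL
foreach ($xml as $worker) {
    echo $worker->name; // prints "Kolya", "Vasya", "Petya"
}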

From object to normal array

If you don't feel comfortable working with an object, you can convert it to a normal PHP array with the following trick:

$xml = simplexml_load_file($path); // $path holds the file path or URL
var_dump(json_decode(json_encode($xml), true));
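To make the shape of the resulting array concrete, here is a small sketch with the XML held in a string (an assumption for the sake of a self-contained example; the sample data is invented).

```php
<?php
$str = '<root>'
     . '<worker><name>Kolya</name><age>25</age></worker>'
     . '<worker><name>Vasya</name><age>26</age></worker>'
     . '</root>';
$xml = new SimpleXMLElement($str);

// Round trip through JSON to get plain nested arrays.
$arr = json_decode(json_encode($xml), true);

// Repeated <worker> tags become a list under the "worker" key,
// and every leaf value is a string.
echo $arr['worker'][1]['name'], "\n";
var_dump($arr['worker'][0]['age']);
```

One thing to keep in mind with this trick: if the document contained only a single worker tag, $arr['worker'] would be that worker's own array rather than a list of one, so code that consumes the result has to handle both shapes.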

More information

Parsing based on sitemap.xml

A site often has a sitemap.xml file. It stores links to all pages of the site so that search engines can index them conveniently (indexing is, in essence, Yandex and Google parsing the site).

In general we need not care much why this file exists. What matters is that if it does, you don't have to crawl the site's pages with any clever methods; you can simply use this file.

How to check whether the file exists: suppose we are parsing site.ru. Open site.ru/sitemap.xml in the browser. If you see something, the file is there; if not, alas.

If there is a sitemap, it contains links to all pages of the site in XML format. Feel free to take this XML, parse it, and pick out the links to the pages you need in any way convenient for you (for example, by analyzing the URL, as described for the spider method).

As a result you get a list of links to parse; all that remains is to visit them and parse the content you need.
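The steps above could be sketched roughly as follows. To keep the example self-contained, the sitemap is an inline string standing in for a downloaded file; the URLs and the "/catalog/" filter are invented for illustration, and a real script would load the file with simplexml_load_file instead.

```php
<?php
// A tiny stand-in for a downloaded sitemap.xml (invented URLs).
$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://site.ru/</loc></url>
  <url><loc>http://site.ru/catalog/1</loc></url>
  <url><loc>http://site.ru/about</loc></url>
</urlset>
XML;

$xml = simplexml_load_string($sitemap);

$links = [];
foreach ($xml->url as $url) {
    $loc = (string) $url->loc;
    // Keep only the pages we care about, here the catalog pages.
    if (strpos($loc, '/catalog/') !== false) {
        $links[] = $loc;
    }
}

print_r($links);
```

The resulting $links array is the list of pages to visit and parse next.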

Read more about the structure of sitemap.xml on Wikipedia.


An XML parser is a program that extracts data from a source XML file and saves it or uses it for further processing.

Why are xml parsers needed?

First of all, because the XML format itself is a popular standard. An XML file is built from tags, and there are rules about which tags may follow which.

XML files are popular because they are easy for a person to read and relatively easy for programs to process.

Cons of xml files.

The main downside is the amount of disk space the data takes up. Because the same tags are repeated over and over, large data sets grow by many megabytes, all of which have to be downloaded from the source and then processed. Are there alternatives? Of course there are, but XML, together with XML parsers, remains one of the simplest, most reliable and most widely used formats today.

How are XML parsers written?

Parsers are written in programming languages, and practically any language will do. Some languages already come with built-in libraries for parsing XML files. But even if yours does not, you can always find a suitable library and use it to extract data from a file.

Globally, there are 2 different approaches to parsing xml files.

The first is to load the xml file completely into memory and then do data extraction manipulations.

The second is the streaming approach. Here the parsing library notifies the functions of the parser you write whenever it meets particular tags, and the programmer decides what to do when a given tag is found.

The advantage of the first approach is speed: load the file once, run quickly through memory, find what you need, and, most importantly, the code is easy to write. But there is a drawback, and a very important one: it needs a lot of memory. Sometimes, I would even say often, it is simply impossible to parse an XML file this way. Why? For example, a 32-bit application under Windows can occupy at most 2 gigabytes of memory, no more.

The streaming approach, however, is harder to program. With any serious extraction task the complexity grows many times over, which affects both the schedule and the budget.
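As a rough illustration of the two approaches in PHP, the sketch below extracts the same names twice: once by loading the whole tree with SimpleXML, and once by streaming through the document with XMLReader. The choice of these two classes, the sample data and the variable names are assumptions for the example; the text above does not prescribe any particular library.

```php
<?php
$str = '<root>'
     . '<worker><name>Kolya</name></worker>'
     . '<worker><name>Vasya</name></worker>'
     . '</root>';

// Approach 1: load the whole document into memory, then navigate it.
$tree = simplexml_load_string($str);
$fromTree = [];
foreach ($tree->worker as $worker) {
    $fromTree[] = (string) $worker->name;
}

// Approach 2: stream through the document node by node and react
// only to the elements we are interested in.
$reader = new XMLReader();
$reader->XML($str);
$fromStream = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        $fromStream[] = $reader->readString();
    }
}
$reader->close();

var_dump($fromTree === $fromStream);
```

Even in this tiny case the streaming version needs explicit checks of node types and names, which hints at why its complexity grows faster as the extraction rules get more involved.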

Validity of xml files and parsers.

All would be well with XML files and XML parsers, but there is a problem. Because "any schoolboy" can create an XML file (and in reality a lot of code is indeed written by schoolchildren), invalid, that is, incorrect, files appear. What does this mean and what does it lead to? The biggest problem is that sometimes an invalid file simply cannot be parsed correctly: for example, its tags are not closed the way the standard requires, or the encoding is declared incorrectly. Another problem: if you write a parser in .NET, for instance, you can create so-called wrappers, and the most annoying thing is that you build such a wrapper and then read a file that a "student" created, and the file is invalid and cannot be read. So you get angry and resort to very unpopular ways of parsing such files, all because many people create XML files without using standard libraries and with complete disregard for every XML standard. This is hard to explain to customers. They are waiting for the result: an XML parser that converts data from the source file into another format.
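Before resorting to workarounds, it helps to detect a broken file and see exactly what is wrong with it. A minimal PHP sketch, assuming the libxml-based SimpleXML extension and an invented sample string:

```php
<?php
// Collect libxml errors ourselves instead of letting PHP emit warnings.
libxml_use_internal_errors(true);

// The <name> tag is never closed.
$broken = '<root><worker><name>Kolya</worker></root>';
$xml = simplexml_load_string($broken);

$problems = [];
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        $problems[] = sprintf("line %d: %s", $error->line, trim($error->message));
    }
    libxml_clear_errors();
}

print_r($problems);
```

With the list of problems in hand, it is at least possible to tell the customer precisely why a given file could not be converted.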

How to create xml parsers (first option)

There is a query language for XML data called XPath. The language exists in two editions; we will not go into the features of each. Examples of extracting data show best how it works. For example:

//div[@class="supcat guru"]/a[contains(@href, "catalog.xml?hid=")]

What does this query do? It selects every a tag whose href contains the text catalog.xml?hid= and which is a child of a div whose class is supcat guru.

Yes, at first it may not be entirely clear, but you can figure it out if you want to. My starting point was http://ru.wikipedia.org/wiki/XPath, and I recommend it to you too.
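To try a query of this kind in PHP, one option is the DOMXPath class. The sketch below runs the query described above (links inside the "supcat guru" div whose href contains catalog.xml?hid=) against a small invented piece of markup; the markup and variable names are assumptions for illustration only.

```php
<?php
// Invented markup that the query can be tested against.
$html = '<html><body>'
      . '<div class="supcat guru">'
      . '<a href="catalog.xml?hid=1">Phones</a>'
      . '<a href="about.html">About</a>'
      . '</div>'
      . '<div class="other"><a href="catalog.xml?hid=2">Hidden</a></div>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep HTML parser notices quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$query = '//div[@class="supcat guru"]/a[contains(@href, "catalog.xml?hid=")]';

$found = [];
foreach ($xpath->query($query) as $a) {
    $found[] = $a->getAttribute('href') . ' => ' . $a->textContent;
}

print_r($found);
```

Only the first link matches: the second has the wrong href, and the third sits in a div with a different class.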

The other day I started reworking my company's internal reporting system, whose general structure I wrote about not long ago. Without false modesty, I will say that I have outgrown my former self in PHP, and as a result I realized that the system's algorithm was crooked enough to deserve a rewrite.

Until now the XML document was parsed with functions borrowed from PHP 4. PHP 5, however, gave the world a very handy thing called SimpleXML. How to work with it is today's topic.

To begin with, SimpleXML is a separate extension, so it has to be available on the server you use (in standard PHP builds it is enabled by default).

Now we can work!

To process a document we use the simplexml_load_file() function. Its parameter is the address of a file in eXtensible Markup Language format (XML, as Captain Obvious would say).

The beauty of this function is that you can just as easily pass it a file from any other server. This lets us process external XML exports (for example, Yandex-XML or third-party RSS feeds).

The function returns an object that can be traversed much like an array. The pitfall I ran into is that XML can have a clumsy structure, so I advise you to start by dumping the result for debugging (with print_r, for example) to see how the function interpreted the document. After that you can start processing the data.

For example, I will take a simple construction from here:


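<?xml version='1.0' standalone='yes'?>
<movies>
 <movie>
  <title>PHP: The Parser Appears</title>
  <characters>
   <character>
    <name>Ms. Coder</name>
    <actor>Onlivia Actora</actor>
   </character>
   <character>
    <name>Mr. Coder</name>
    <actor>El ActÓr</actor>
   </character>
   <character>
    <name>Mr. Parser</name>
    <actor>John Doe</actor>
   </character>
  </characters>
  <plot>
   Thus, it is a language. It's still a programming language. Or
   is it a scripting language? It's all revealed in this documentary
   similar to a horror movie.
  </plot>
  <great-lines>
   <line>PHP solves all my problems on the web</line>
  </great-lines>
  <rating type="thumbs">7</rating>
  <rating type="stars">5</rating>
  <rating type="mpaa">PG</rating>
 </movie>
</movies>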

Let this be the export.xml file, lying in the root of my server next to the script that processes it.
The object mirrors the structure of the elements in the XML document. Processing starts from the root, and the root element <movies> is the $xml object itself. To get the name Ms. Coder, we build the following path: $xml->movie->characters->character[0]->name.
Note that we are selecting one specific value. That is where the character[0] notation comes from: do not forget that we are working with something that behaves like an array!

Like any array, our data can be processed with a foreach loop. The code looks like this:

$xml = simplexml_load_file("export.xml"); // load the file
$ttl = $xml->movie->title; // get the title; there is only one, so no index is needed

foreach ($xml->movie->characters->character as $crc) // and now let's work dynamically
{
    // print the names of the heroes
    $name = $crc->name;
    echo "$name<br>";
}

This code puts the text "PHP: The Parser Appears" into the $ttl variable and then prints the character names line by line: Ms. Coder, Mr. Coder, Mr. Parser.
