tig12.net
Published: 05/2008
Last modified: 07/2008

 Parsing PHP code

While developing a code documentation generator (phpSimpleDoc), I had to solve the following problem:

From a set of files containing PHP code, build data structures describing: classes, interfaces, functions and constants; class fields, class methods and class constants; function parameters. The data structures must store all the information about each element (doc comment, file, line...) and the relations between elements (extends, implements, overrides, etc.).

This can be broken down into two steps:
- Identify the elements and their characteristics.
- Build the data structures expressing the links between them.
This page deals with the first step.

Reflection API

I first tried the Reflection API; its appeal is that there is almost nothing to write. The Reflection API isolates all the PHP elements and provides a method to retrieve their comments (getDocComment()).
I found several (minor) problems with the Reflection API, described on the page Notes on PHP Reflection API. They can be fixed with regular expressions, but the Reflection API is not a suitable basis for a code documentation generator: the code to analyze has to be loaded and executed by the PHP interpreter, and in some cases this triggers fatal errors.
One way to handle this would have been to use the Runkit_Sandbox class and try to deal with the fatal errors, but it is only available through a PECL extension, and I wanted to rely on the standard PHP distribution alone.
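To illustrate both the convenience and the constraint, here is a minimal sketch of Reflection usage. The class Sample and its method are hypothetical names invented for this example; the point to note is that the class has to be declared (loaded) before Reflection can inspect it.

```php
<?php
// Hypothetical class, declared here only so that Reflection has something to inspect.
/** A sample class. */
abstract class Sample
{
    /** Does nothing. */
    public function run($times = 1) {}
}

// The class must already be loaded at this point: Reflection works on
// code known to the interpreter, not on source text.
$class = new ReflectionClass('Sample');
$classDoc = $class->getDocComment();
$methodDoc = $class->getMethod('run')->getDocComment();
$isAbstract = $class->isAbstract();

echo $classDoc, "\n", $methodDoc, "\n";
```

If the file containing Sample had to be included first, any fatal error raised by that include would stop the documentation tool itself.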

Pear's PHP_Parser

So I looked for a PHP parser and tried PEAR's PHP_Parser, version 0.2.1.
I ran into problems, listed on the page Using Pear PHP Parser.
Using PHP_Parser would have meant:
- writing code to patch it,
- writing code to adapt its structures to the needs of phpSimpleDoc.
It also has the drawback of loading a heavy file, Core.php (412 KB).
So I did not keep it, but it led me to the token array and to the PHP function token_get_all(), which converts a string containing PHP code into an array of tokens, the elementary pieces of PHP code. See the page Working with PHP tokens.
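As a quick illustration of what token_get_all() returns, the following sketch tokenizes a one-line snippet and lists the token names. The snippet and the variable names are made up for this example.

```php
<?php
$code = '<?php /** doc */ function f1($a) {}';

// Each token is either a single-character string ('(', '{', ...) or an
// array of the form array(token id, text, line number).
$names = array();
foreach (token_get_all($code) as $token) {
    $names[] = is_array($token) ? token_name($token[0]) : $token;
}

echo implode(' ', $names), "\n";
```

The code is only tokenized, never executed, so nothing in the analyzed source can trigger a fatal error.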

Writing the parsing

I first thought of parsing the code with regular expressions, but that looked difficult. Consider, for example, this code:
/** 
 A comment containing valid php code :
 function f1(){}
 define('CONSTANT1', 12);
*/
function f1($param1 = 'function f1(){}'){}
Comments and parameter default values can contain text that looks like a valid element declaration, and I did not have the courage to try to handle that with regular expressions.
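The difference can be checked on the sample above. This sketch compares a deliberately naive regular expression (an assumption made for the demonstration, not the regex used in phpSimpleDoc) with a count of T_FUNCTION tokens.

```php
<?php
// The sample from the text, as a string (nowdoc, so nothing is interpolated).
$code = <<<'PHP'
<?php
/**
 A comment containing valid php code :
 function f1(){}
 define('CONSTANT1', 12);
*/
function f1($param1 = 'function f1(){}'){}
PHP;

// Naive regex: counts every "function <name>(" in the raw text,
// including the ones inside the comment and inside the string.
$regexCount = preg_match_all('/\bfunction\s+\w+\s*\(/', $code, $matches);

// Token array: text inside comments and strings never becomes T_FUNCTION.
$tokenCount = 0;
foreach (token_get_all($code) as $token) {
    if (is_array($token) && $token[0] === T_FUNCTION) {
        $tokenCount++;
    }
}

echo "regex: $regexCount, tokens: $tokenCount\n"; // regex: 3, tokens: 1
```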

For the limited needs of a code documentation generator (there is no need to fully parse the code, only to retrieve the elements and their characteristics), the code finally turned out to be simple to write, using both the token array and regular expressions; no knowledge of compiler theory is needed.
The method is:
- Loop over the token array and identify the top-level elements: T_CLASS, T_INTERFACE, T_FUNCTION, T_STRING (for constants declared with define()).
- Also traverse classes and interfaces to retrieve the class fields, methods and constants (T_FUNCTION, T_VARIABLE, T_CONST).
- When a token marking an element declaration is found, rebuild the string used to declare the element.
- Use regular expressions to parse the element declaration and retrieve the desired information.
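The first step of this method can be sketched as follows. listTopLevel() is a hypothetical helper written for this page, not the code shipped with phpSimpleDoc; it tracks curly-brace depth to tell top-level declarations from nested ones, and it ignores constants and functions declared as returning by reference to stay short.

```php
<?php
/**
 * Hypothetical helper (not from phpSimpleDoc): lists the top-level classes,
 * interfaces and functions found in a token array.
 * Each result is array(token name, element name, line number).
 */
function listTopLevel(array $ta)
{
    $found = array();
    $depth = 0; // current curly-brace nesting level
    $n = count($ta);
    for ($i = 0; $i < $n; $i++) {
        $t = $ta[$i];
        if (!is_array($t)) {
            if ($t === '{') { $depth++; }
            if ($t === '}') { $depth--; }
            continue;
        }
        // Braces opened inside double-quoted strings close with a plain '}'.
        if ($t[0] === T_CURLY_OPEN || $t[0] === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            continue;
        }
        $isDecl = in_array($t[0], array(T_CLASS, T_INTERFACE, T_FUNCTION), true);
        if ($depth !== 0 || !$isDecl) {
            continue;
        }
        // The element name is the next non-whitespace token if it is a
        // T_STRING (anonymous functions and Foo::class have none).
        for ($j = $i + 1; $j < $n; $j++) {
            if (is_array($ta[$j]) && $ta[$j][0] === T_WHITESPACE) { continue; }
            if (is_array($ta[$j]) && $ta[$j][0] === T_STRING) {
                $found[] = array(token_name($t[0]), $ta[$j][1], $t[2]);
            }
            break;
        }
    }
    return $found;
}

$code = <<<'PHP'
<?php
interface I1 {}
class C1 implements I1 {
    function m1() {}
}
function f1() {}
PHP;

$elements = listTopLevel(token_get_all($code));
foreach ($elements as $e) {
    echo implode(' ', $e), "\n";
}
```

The method m1() is not reported because it sits at depth 1; retrieving class members would be a second pass over the tokens of each class body.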

Extracting the declarations from the token array was done with these functions:
  • getTABalance(): finds the sub-array containing the code between a balanced pair of symbols (e.g. opening and closing curly braces); this makes it possible to get the sub-array containing the code of a class or a function.
  • getTASequence(): isolates the code of a declaration.
    This function takes as parameters:
    - $ta: a token array
    - $x: the index in the token array from which the identification of the declaration starts; this is the index of the main token of the declaration (e.g. T_CLASS in a class declaration)
    - $before and $after: lists of tokens that can be found before and after $x
    - $stop: the token that ends the declaration
    Example of use: isolating the declaration of a class
        $before = array(T_ABSTRACT, T_FINAL);
        $after = array(T_STRING, T_EXTENDS, T_IMPLEMENTS, ',');
        $stop = '{';
        $seq = $this->getTASequence($ta, $x, $before, $after, $stop);
    
    For instance, this class declaration:
    /** a comment */
    abstract class Class1 extends Class2{}
    
    corresponds to the following portion of the token array:
        [1] => T_DOC_COMMENT
        [2] => T_WHITESPACE
        [3] => T_ABSTRACT
        [4] => T_WHITESPACE
        [5] => T_CLASS
        [6] => T_WHITESPACE
        [7] => T_STRING
        [8] => T_WHITESPACE
        [9] => T_EXTENDS
        [10] => T_WHITESPACE
        [11] => T_STRING
        [12] => {
        [13] => }
        [14] => T_WHITESPACE
    
    Here, $x = 5 (index of the element containing T_CLASS).
    From there, the function goes backwards in the array, skips the T_WHITESPACE tokens, and stops when it meets a token that is not in the array $before.
    If that token is a T_DOC_COMMENT, it is taken as part of the declaration.
    Then, starting again from 5, it goes forwards in the array, skips the T_WHITESPACE tokens, and stops either when it meets the $stop token or when it meets a token that is not in the array $after.
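The behaviour described for these two functions can be sketched as below. sketchBalance() and sketchSequence() are hypothetical re-implementations written from the description on this page; the real getTABalance() and getTASequence() are in the download and may differ in signature and details.

```php
<?php
/** Text of a token, whether it is an array token or a one-character string. */
function tokenText($t) { return is_array($t) ? $t[1] : $t; }

/** Id of an array token, or the character itself for a string token. */
function tokenKind($t)
{
    if (!is_array($t)) { return $t; }
    // Braces opened inside double-quoted strings close with a plain '}'.
    if ($t[0] === T_CURLY_OPEN || $t[0] === T_DOLLAR_OPEN_CURLY_BRACES) { return '{'; }
    return $t[0];
}

/** Sub-array from the first $open at or after $x to its matching $close, or null. */
function sketchBalance(array $ta, $x, $open = '{', $close = '}')
{
    $depth = 0;
    $start = null;
    for ($i = $x, $n = count($ta); $i < $n; $i++) {
        $kind = tokenKind($ta[$i]);
        if ($kind === $open) {
            if ($start === null) { $start = $i; }
            $depth++;
        } elseif ($kind === $close && $start !== null) {
            $depth--;
            if ($depth === 0) {
                return array_slice($ta, $start, $i - $start + 1);
            }
        }
    }
    return null;
}

/** Tokens of the declaration around index $x, doc comment included if present. */
function sketchSequence(array $ta, $x, array $before, array $after, $stop)
{
    $first = $x;
    for ($i = $x - 1; $i >= 0; $i--) {          // backwards from $x
        $kind = tokenKind($ta[$i]);
        if ($kind === T_WHITESPACE) { continue; }
        if (in_array($kind, $before, true)) { $first = $i; continue; }
        if ($kind === T_DOC_COMMENT) { $first = $i; }
        break;
    }
    $last = $x;
    for ($i = $x + 1, $n = count($ta); $i < $n; $i++) { // forwards from $x
        $kind = tokenKind($ta[$i]);
        if ($kind === $stop) { break; }
        if ($kind !== T_WHITESPACE && !in_array($kind, $after, true)) { break; }
        if ($kind !== T_WHITESPACE) { $last = $i; }
    }
    return array_slice($ta, $first, $last - $first + 1);
}

$code = "<?php\n/** a comment */\nabstract class Class1 extends Class2{}\n";
$ta = token_get_all($code);
foreach ($ta as $i => $t) {                     // locate T_CLASS (index 5 here)
    if (is_array($t) && $t[0] === T_CLASS) { $x = $i; break; }
}
$seq = sketchSequence($ta, $x, array(T_ABSTRACT, T_FINAL),
    array(T_STRING, T_EXTENDS, T_IMPLEMENTS, ','), '{');
$declaration = implode('', array_map('tokenText', $seq));
$body = implode('', array_map('tokenText', sketchBalance($ta, $x)));

echo $declaration, "\n", $body, "\n";
```

Here $declaration holds the doc comment followed by the class header, ready to be parsed with a regular expression, and $body holds the braces and everything between them.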
Once the declaration is isolated, parsing it with regular expressions is easy, except in one case: function parameters, which were handled using the token array.
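To show why tokens help with parameters, here is a minimal sketch that splits a parameter list on top-level commas only, so that commas and parentheses inside default values are left alone. sketchParameters() is a hypothetical function written for this page, not the one in the download, and it ignores type hints and by-reference markers.

```php
<?php
/**
 * Hypothetical sketch: returns array(parameter name => default value text
 * or null) for the function whose T_FUNCTION token is at index $x.
 */
function sketchParameters(array $ta, $x)
{
    $params = array();
    $depth = 0;       // nesting level of (), [] and {}
    $name = null;     // parameter currently being read
    $default = null;  // text after '=', or null if there is none
    for ($i = $x + 1, $n = count($ta); $i < $n; $i++) {
        $t = $ta[$i];
        $text = is_array($t) ? $t[1] : $t;
        if ($t === '(' || $t === '[' || $t === '{') {
            $depth++;
            if ($depth === 1) { continue; } // opening parenthesis of the list
        } elseif ($t === ')' || $t === ']' || $t === '}') {
            $depth--;
        }
        $endOfList = ($depth === 0 && $t === ')');
        if ($endOfList || ($depth === 1 && $t === ',')) {
            if ($name !== null) {
                $params[$name] = $default === null ? null : trim($default);
            }
            $name = $default = null;
            if ($endOfList) { break; }
            continue;
        }
        if ($depth < 1) { continue; }       // function name, before '('
        if ($default !== null) {
            $default .= $text;              // inside a default value
        } elseif ($depth === 1 && $t === '=') {
            $default = '';
        } elseif (is_array($t) && $t[0] === T_VARIABLE) {
            $name = $text;
        }
    }
    return $params;
}

$code = <<<'PHP'
<?php
function f1($p0, $param1 = 'function f1(){}', array $p2 = array(1, 2)){}
PHP;

$ta = token_get_all($code);
foreach ($ta as $i => $t) {
    if (is_array($t) && $t[0] === T_FUNCTION) { $x = $i; break; }
}
$params = sketchParameters($ta, $x);
print_r($params);
```

The string default 'function f1(){}' arrives as a single token, and the comma inside array(1, 2) is at depth 2, so neither confuses the split.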

Downloads

parsing-php-code.zip (11 KB) contains code to parse PHP code:
  • auTokenArray.php: code manipulating the token array
  • auParsePhpCode.php: code parsing the declarations
  • uTokenArray.php: utilities to dump token arrays
  • ldeMainLoader.php: example of code using the parsing (from phpSimpleDoc); not usable as is, but easy to modify for inclusion in a program
This code is released under the GPL.

