tig12.net
Published: 05/2008
Last modified: 03/2009

 Parsing PHP code

During the development of a code documentor (phpSimpleDoc), I had to solve this problem:

From a set of files containing PHP code, build the data structures describing: classes, interfaces, functions, constants; class fields, class methods, class constants; function parameters. The data structures must store all the information about the elements (doc comment, file, line...) and the relations between the elements (extends, implements, overrides, etc.).

This can be decomposed into two steps:
- Identify the elements and their characteristics.
- Build the data structures expressing the links.
This page deals with the first step.

Reflection API

I first tried the Reflection API; its appeal is that there is almost nothing to do! The Reflection API isolates all the PHP elements, and provides a method to retrieve the comments (getDocComment()).
I found several (small) problems with the Reflection API, described on the page Notes on PHP Reflection API; they can be fixed with regular expressions. But the Reflection API is not a solution for writing a code documentor, because the code to analyze needs to be loaded and interpreted by the PHP interpreter, and in certain cases this triggers fatal errors.
To handle this, one possibility was to use the Runkit_Sandbox class and try to deal with the fatal errors, but it is only available through a PECL extension, and I wanted to rely only on the standard PHP distribution.
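
To make the trade-off concrete, here is a minimal sketch of what the Reflection API offers. The function add() is a hypothetical example, not part of phpSimpleDoc. Note that the function has to be declared (that is, loaded by the interpreter) before it can be reflected, which is exactly the limitation described above.

```php
<?php
/** Adds two integers. */
function add($a, $b = 0) { return $a + $b; }

// Reflection can only describe code the interpreter has already loaded.
$rf = new ReflectionFunction('add');

echo $rf->getDocComment(), "\n";   // raw text of the doc comment
echo $rf->getStartLine(), "\n";    // line of the declaration in this file

// List the parameters, flagging those that have a default value.
foreach ($rf->getParameters() as $p) {
    echo $p->getName(), $p->isOptional() ? ' (optional)' : '', "\n";
}
```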

Pear's PHP_Parser

So I searched for a PHP parser, and tried PHP_Parser, version 0.2.1.
I ran into problems, listed on the page Using Pear PHP Parser.
Using PHP_Parser means having to
- write code to patch it,
- write code to adapt its structures to the needs of phpSimpleDoc.
It also has the drawback of loading a heavy file, Core.php (412 KB).
So I didn't keep it, but it led me to the token array and the PHP function token_get_all(), which converts a string containing PHP code into an array of tokens, the elementary pieces of PHP code. See below and the page Working with PHP tokens.
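
As an illustration of what token_get_all() returns, the following sketch dumps the tokens of a tiny piece of code. The helper describeTokens() is a name made up for this example. Each token is either an array (token id, text, line number) or a plain one-character string.

```php
<?php
$code = "<?php\n/** a comment */\nfunction f1() {}\n";

// Returns a readable description of each token: name, line, text.
function describeTokens(string $code): array
{
    $out = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens are [id, text, line number].
            [$id, $text, $line] = $token;
            $out[] = sprintf('%s (line %d): %s', token_name($id), $line, trim($text));
        } else {
            // Single characters such as '{' or ';' come back as plain strings.
            $out[] = $token;
        }
    }
    return $out;
}

print_r(describeTokens($code));
```

The line number carried by each array token is what makes this approach attractive for a code documentor.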

Regular expressions

I also thought of parsing the code with regular expressions, but it looks difficult;
consider for example this PHP code:
/** 
 A comment containing valid php code :
 function f1(){}
 define('CONSTANT1', 12);
*/
function f1($param1 = 'function f1(){}'){}
Comments and parameter default values can contain text that is itself a valid element declaration, and I didn't have the courage to try to handle that with regular expressions.
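
The difficulty can be shown on the code above. This sketch compares a deliberately naive regular expression (an assumption chosen for the example, not one used by phpSimpleDoc) with a count of T_FUNCTION tokens.

```php
<?php
$code = <<<'CODE'
<?php
/**
 A comment containing valid php code :
 function f1(){}
 define('CONSTANT1', 12);
*/
function f1($param1 = 'function f1(){}'){}
CODE;

// Naive regex: matches every "function name(" in the raw text,
// including those inside the comment and the string literal.
$regexHits = preg_match_all('/function\s+\w+\s*\(/', $code);

// Token-based count: only real T_FUNCTION tokens are counted.
$tokenHits = 0;
foreach (token_get_all($code) as $token) {
    if (is_array($token) && $token[0] === T_FUNCTION) {
        $tokenHits++;
    }
}

echo "regex: $regexHits, tokens: $tokenHits\n"; // regex: 3, tokens: 1
```

The tokenizer sees the comment as one T_DOC_COMMENT token and the default value as one string token, so the only T_FUNCTION token is the real declaration.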

Regular expressions pose another problem: they do not directly give line numbers (as far as I know), and I absolutely wanted this information for the code documentor. I thought of ways to handle that (using the flag PREG_OFFSET_CAPTURE; transforming the files by adding line numbers at the beginning of lines...), but none seemed appealing.
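
For reference, the offset-based workaround could look like the sketch below: the preg functions can return the byte offset of each match (flag PREG_OFFSET_CAPTURE), and the line number is then derived by counting the newlines before that offset. The function functionLines() and its regex are assumptions made for this example. It is not what phpSimpleDoc does, and it does not solve the problem of comments and strings described above.

```php
<?php
// Returns [name => line] for each "function name(" found by a naive regex.
function functionLines(string $code): array
{
    $lines = [];
    preg_match_all('/function\s+(\w+)\s*\(/', $code, $m, PREG_OFFSET_CAPTURE);
    // With PREG_OFFSET_CAPTURE, each match is a pair [text, byte offset].
    foreach ($m[1] as [$name, $offset]) {
        // Line number = 1 + number of newlines before the match offset.
        $lines[$name] = 1 + substr_count($code, "\n", 0, $offset);
    }
    return $lines;
}

$code = "<?php\n\nfunction a() {}\n\n\nfunction b() {}\n";
print_r(functionLines($code)); // a => 3, b => 6
```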

Writing the parsing

For the limited needs of a code documentor (no need to really parse the code, only to retrieve the elements and their characteristics), the code turned out to be simple to write: the token array is used to identify the code elements, and regular expressions are used to parse the declarations. No compiler theory is needed for that.
The job is done by 3 classes which are part of phpSimpleDoc's code: ldeMainLoader, ppTokenArray and ppParsePhpCode.
  • ldeMainLoader->loadData() uses token_get_all() to retrieve the token array, loops over it, and identifies the top-level elements: T_CLASS, T_INTERFACE, T_FUNCTION, T_STRING (for constants).
  • When a top-level element is found, its declaration is retrieved using ppTokenArray::getSequence() (see below); this method returns an array containing the index of the last token of the declaration, so the loop can continue from the end of the element just found.
    If the top-level element is a class or an interface, method ppTokenArray::getBalancedCurlyBraces() (see below) is used to find its last token.
  • Method ppTokenArray::getSequence() also returns the string containing the declaration.
    Once the declaration is isolated, parsing it with regular expressions is easy, except for one case: function parameters, which are parsed using the token array. Declaration parsing is done by methods of class ppParsePhpCode.
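
The loop described above can be sketched as follows. This is a simplified illustration, not the phpSimpleDoc code: findClosingBrace() and topLevelElements() are hypothetical names, constants are left out, and constructs such as anonymous functions or Foo::class are not handled.

```php
<?php
// Given the index of an opening '{', returns the index of the matching '}'
// (or null if the braces are unbalanced).
function findClosingBrace(array $tokens, int $open): ?int
{
    $depth = 0;
    for ($i = $open, $n = count($tokens); $i < $n; $i++) {
        $t = $tokens[$i];
        // "{$var}" and "${var}" inside strings open a brace through special tokens.
        if ($t === '{' || (is_array($t) && in_array($t[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true))) {
            $depth++;
        } elseif ($t === '}' && --$depth === 0) {
            return $i;
        }
    }
    return null;
}

// Returns [keyword, name, line] for each top-level class, interface or function.
function topLevelElements(string $code): array
{
    $tokens = token_get_all($code);
    $n = count($tokens);
    $found = [];
    for ($i = 0; $i < $n; $i++) {
        $t = $tokens[$i];
        if (!is_array($t) || !in_array($t[0], [T_CLASS, T_INTERFACE, T_FUNCTION], true)) {
            continue;
        }
        // Walk forward to the opening brace, remembering the first T_STRING (the name).
        $name = null;
        $j = $i + 1;
        while ($j < $n && $tokens[$j] !== '{') {
            if ($name === null && is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
                $name = $tokens[$j][1];
            }
            $j++;
        }
        $close = $j < $n ? findClosingBrace($tokens, $j) : null;
        if ($close === null) {
            break; // unbalanced code: stop rather than guess
        }
        $found[] = [$t[1], $name, $t[2]];
        $i = $close; // resume after the body, so nested methods are skipped
    }
    return $found;
}

$code = <<<'CODE'
<?php
class A { function m() { if (true) { return "{$this->x}"; } } }
function f() {}
interface I { function g(); }
CODE;

print_r(topLevelElements($code));
```

Resuming the loop after the closing brace is what keeps methods such as m() and g() from being reported as top-level functions.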
Extracting the declarations from the token array is done with these methods of class ppTokenArray:
  • getBalancedCurlyBraces(): finds the sub-array (of the token array) containing the code between a balanced pair of curly braces; this gives the sub-array containing the code of a class or a function.
  • getSequence(): isolates the code of a declaration.
    This function takes as parameters:
    - $ta: a token array
    - $x: the index in the token array from which the identification of the declaration starts; this is the index of the main element of the declaration (for example, T_CLASS in a class declaration)
    - $before and $after: lists of tokens that can be found before and after $x
    - $stop: the token that ends the declaration
    Example of use: isolate the declaration of a class
        $before = array(T_ABSTRACT, T_FINAL);
        $after = array(T_STRING, T_EXTENDS, T_IMPLEMENTS, ',');
        $stop = '{';
        $seq = getSequence($ta, $x, $before, $after, $stop);
    
    So this class declaration:
    /** a comment */
    abstract class Class1 extends Class2{}
    
    corresponds to the following portion of the token array:
        [1] => T_DOC_COMMENT
        [2] => T_WHITESPACE
        [3] => T_ABSTRACT
        [4] => T_WHITESPACE
        [5] => T_CLASS
        [6] => T_WHITESPACE
        [7] => T_STRING
        [8] => T_WHITESPACE
        [9] => T_EXTENDS
        [10] => T_WHITESPACE
        [11] => T_STRING
        [12] => {
        [13] => }
        [14] => T_WHITESPACE
    
    Here, $x = 5 (index of the element containing T_CLASS).
    From there, the method goes backward in the array, skips the T_WHITESPACE tokens, and stops when it meets a token that is not in the array $before.
    If that token is a T_DOC_COMMENT, it is also returned.
    Then, starting again from 5, it goes forward in the array, skips the T_WHITESPACE tokens, and stops either when it meets the $stop token, or when it meets a token that is not in the array $after.
The returned array looks like this:
Array(
    ['string'] => 'abstract class Class1 extends Class2'
    ['comment'] => 'a comment'
    ['commentLine'] => 2
    ['lastIndex'] => 12
)
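
The backward and forward scans can be sketched as below. This is my reading of the description, written as a standalone function; getSequenceSketch() and tokenKind() are hypothetical names, and the real ppTokenArray::getSequence() may differ in its details (in particular in how the comment text is cleaned).

```php
<?php
// Returns the token id for array tokens, or the character itself otherwise.
function tokenKind($token)
{
    return is_array($token) ? $token[0] : $token;
}

function getSequenceSketch(array $ta, int $x, array $before, array $after, string $stop): array
{
    $comment = null;
    $commentLine = null;
    $first = $x;
    // Backward scan: skip whitespace, keep the tokens listed in $before.
    for ($i = $x - 1; $i >= 0; $i--) {
        $kind = tokenKind($ta[$i]);
        if ($kind === T_WHITESPACE) {
            continue;
        }
        if (in_array($kind, $before, true)) {
            $first = $i;
            continue;
        }
        // First other token: keep it if it is a comment, then stop.
        if ($kind === T_DOC_COMMENT || $kind === T_COMMENT) {
            $comment = trim($ta[$i][1], "/* \t\r\n"); // crude removal of the delimiters
            $commentLine = $ta[$i][2];
        }
        break;
    }
    // Forward scan: stop at $stop, or at the first token not listed in $after.
    $n = count($ta);
    $last = $x;
    $end = $n - 1;
    for ($i = $x + 1; $i < $n; $i++) {
        $kind = tokenKind($ta[$i]);
        if ($kind === T_WHITESPACE) {
            continue;
        }
        if ($kind === $stop || !in_array($kind, $after, true)) {
            $end = $i;
            break;
        }
        $last = $i;
    }
    // Rebuild the declaration text from the retained tokens.
    $string = '';
    for ($i = $first; $i <= $last; $i++) {
        $string .= is_array($ta[$i]) ? $ta[$i][1] : $ta[$i];
    }
    return ['string' => $string, 'comment' => $comment, 'commentLine' => $commentLine, 'lastIndex' => $end];
}

$ta = token_get_all("<?php\n/** a comment */\nabstract class Class1 extends Class2{}\n");
$seq = getSequenceSketch($ta, 5, [T_ABSTRACT, T_FINAL], [T_STRING, T_EXTENDS, T_IMPLEMENTS, ','], '{');
print_r($seq);
```

In this sketch, lastIndex is the index at which the forward scan stopped (here the '{' token), which matches the value 12 in the example above.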

Note: token_get_all() has a behaviour that looks like a small bug (observed in PHP 5.2.3): if a comment does not start with exactly 2 asterisks (e.g. /*** a comment */), it is reported as a T_COMMENT instead of a T_DOC_COMMENT. For this reason, getSequence() treats both tokens as comments. As a consequence, if an element declaration has no doc comment but is preceded by a normal comment, the normal comment is taken as its doc comment.
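
The behaviour can be checked with a few lines. The helper firstCommentKind() is a name made up for this example. As far as I can tell, current PHP versions still classify the comments this way, since a doc comment has to start with /** followed by whitespace.

```php
<?php
// Returns the token name of the first comment found in $code, or null.
function firstCommentKind(string $code): ?string
{
    foreach (token_get_all($code) as $token) {
        if (is_array($token) && in_array($token[0], [T_COMMENT, T_DOC_COMMENT], true)) {
            return token_name($token[0]);
        }
    }
    return null;
}

echo firstCommentKind("<?php /** a comment */"), "\n";   // T_DOC_COMMENT
echo firstCommentKind("<?php /*** a comment */"), "\n";  // T_COMMENT
echo firstCommentKind("<?php // a comment"), "\n";       // T_COMMENT
```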

Downloads

If you want to use this parsing in your own program, you can retrieve the code of the classes through Subversion:
svn co https://phpsimpledoc.svn.sourceforge.net/svnroot/phpsimpledoc/trunk
You can also browse the svn repository and download the code from there.
The useful classes are:
  • helpers/dataLoaders/default/ldeMainLoader.php: example of code using the parsing (from phpSimpleDoc); not usable as is, but easy to modify for inclusion in a program
  • libext/parsers/php/ppTokenArray.php: code manipulating the token array
  • libext/parsers/php/ppParsePhpCode.php: code parsing the declarations
  • libext/utils/uTokenArray.php: utilities to dump token arrays
This code is released under the GPL.


Forum
Parsing PHP code
By osisus - 8 August 2008

Great!

I'm building a hook script to compute statistics on PHPUnit test cases at commit time, and I needed a PHP code parser...

But I ran into a difficulty: it is not possible to download your file. I get a 404 error.

Could you correct your URL?

Parsing PHP code
By osisus - 8 August 2008
Finally, I found a workaround. I downloaded your phpSimpleDoc from SourceForge, and collected the four files you mentioned from that archive...
Parsing PHP code
By tig12 - 23 August 2008
Yes, you're right, the best thing to do is to retrieve the files from SourceForge. I modified the download part of the page.