Published: 05/2008. Last modified: 07/2008.

Parsing PHP code

During the development of a code documentor (phpSimpleDoc), I had to solve this problem:
From a set of files containing PHP code, build the data structures describing: classes, interfaces, functions, constants; class fields, class methods, class constants; function parameters. The data structures must store all the information about the elements (doc comment, file, line...) and the relations between the elements (extends, implements, overrides, etc.). The problem can be decomposed into two steps:
- Identify the elements and their characteristics.
- Build the data structures expressing the links.
This page deals with the first step.

Reflection API

I first tried the Reflection API; its appeal is that there is almost nothing to do! The Reflection API isolates all the PHP elements and provides a method to retrieve the comments (getDocComment()).
I found several (minor) problems with the Reflection API, described on the page Notes on PHP Reflection API; they can be worked around with regular expressions. However, the Reflection API is not a suitable basis for a code documentor, because the code to analyze must be loaded and executed by the PHP interpreter, and in certain cases this triggers fatal errors. One possibility was to use the Runkit_Sandbox class and try to deal with the fatal errors, but it is only available through a PECL extension, and I wanted to rely only on the standard PHP distribution.
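As an illustration of the Reflection approach, here is a minimal sketch. The function add() is a made-up example and is not part of phpSimpleDoc. Note that the function has to be defined, that is, loaded by the interpreter, before it can be reflected, which is the limitation described above.

```php
<?php
/**
 * Adds two integers.
 */
function add($a, $b)
{
    return $a + $b;
}

// Reflection only works on code the interpreter has already loaded:
// add() is defined above, so ReflectionFunction can inspect it.
$function = new ReflectionFunction('add');

echo $function->getName(), "\n";
echo $function->getDocComment(), "\n";

// Each parameter is exposed as a ReflectionParameter object.
foreach ($function->getParameters() as $parameter) {
    echo '$', $parameter->getName(), "\n";
}
```

If the file defining add() had to be included first and that include failed with a fatal error, the documentor itself would stop, which is why this approach was not kept.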
Pear's PHP_Parser

So I searched for a PHP parser and tried PHP_Parser, version 0.2.1. I ran into problems, listed on the page Using Pear PHP Parser. Using PHP_Parser means:
- writing code to patch it,
- writing code to adapt its structures to the needs of phpSimpleDoc.
It also has the drawback of loading a heavy file, Core.php (412 KB).
So I didn't keep it, but this led me to the token array and the PHP function token_get_all(), which converts a string containing PHP code into an array of tokens, the elementary pieces of PHP code. See the page Working with PHP tokens.
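To give an idea of what the token array looks like, here is a small sketch. token_names() is a hypothetical helper written for this illustration; only token_get_all() and token_name() are PHP built-ins.

```php
<?php
/**
 * Returns a readable name for every token of $code (hypothetical helper).
 */
function token_names($code)
{
    $names = array();
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens: [0] => token id, [1] => source text, [2] => line number.
            $names[] = token_name($token[0]);
        } else {
            // Single-character tokens such as ( ) { } = ; are plain strings.
            $names[] = $token;
        }
    }
    return $names;
}

// The code to analyze is only a string: it is never executed.
echo implode(' ', token_names('<?php function f1($param1 = 12) {}')), "\n";
```

Unlike with Reflection, the analyzed code is never loaded by the interpreter, so it cannot trigger a fatal error in the documentor.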
Writing the parser

I first thought of parsing the code with regular expressions, but it looks difficult; consider for example this code:
/**
A comment containing valid php code :
function f1(){}
define('CONSTANT1', 12);
*/
function f1($param1 = 'function f1(){}'){}
Comments and parameter default values can contain text that looks like a valid element declaration, and I didn't have the courage to try to handle that with regexes.
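The difference can be checked on the example above. The following sketch compares a deliberately naive regular expression (chosen for this illustration only) with a count of T_FUNCTION tokens.

```php
<?php
// The example from the text, stored in a nowdoc so nothing is interpolated.
$code = <<<'PHP'
<?php
/**
A comment containing valid php code :
function f1(){}
define('CONSTANT1', 12);
*/
function f1($param1 = 'function f1(){}'){}
PHP;

// A naive regular expression also matches inside the comment and the string.
$regexMatches = preg_match_all('/function\s+\w+\s*\(/', $code, $unused);

// The tokenizer returns the whole comment as one T_DOC_COMMENT token and the
// whole default value as one T_CONSTANT_ENCAPSED_STRING token, so only the
// real declaration yields a T_FUNCTION token.
$tokenMatches = 0;
foreach (token_get_all($code) as $token) {
    if (is_array($token) && $token[0] === T_FUNCTION) {
        $tokenMatches++;
    }
}

echo "regex: $regexMatches, tokens: $tokenMatches\n"; // prints "regex: 3, tokens: 1"
```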
For the limited needs of a code documentor (no need to really parse the code, only to retrieve the elements and their characteristics), the code was finally simple to write, using both the token array and regular expressions; no compiler knowledge is needed for that. The method is:
- Loop over the token array and identify the top-level elements: T_CLASS, T_INTERFACE, T_FUNCTION, T_STRING (for constants).
- Also traverse classes and interfaces to retrieve the class fields, methods and constants (T_FUNCTION, T_VARIABLE, T_CONST).
- When a token marking an element declaration is found, reconstitute the string used to declare the element.
- Use regular expressions to parse the element declaration and retrieve the desired information.
Extracting the declarations from the token array was done with these functions:
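As a rough illustration of the steps listed above (not the phpSimpleDoc functions themselves), here is a minimal sketch limited to top-level functions, classes and interfaces. The names extract_declarations() and parse_function_declaration() are hypothetical, and the sketch ignores constants, class members, closures and the Foo::class syntax.

```php
<?php
/**
 * Returns the declaration strings of the top-level functions, classes and
 * interfaces found in $code (hypothetical helper, for illustration only).
 */
function extract_declarations($code)
{
    $tokens = token_get_all($code);
    $count = count($tokens);
    $declarations = array();
    $depth = 0; // current nesting level of { } blocks

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];
        if (!is_array($token)) {
            if ($token === '{') {
                $depth++;
            } elseif ($token === '}') {
                $depth--;
            }
            continue;
        }
        // "{$var}" and "${var}" inside strings are closed by a plain '}'.
        if ($token[0] === T_CURLY_OPEN || $token[0] === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            continue;
        }
        $isDeclaration = in_array($token[0], array(T_CLASS, T_INTERFACE, T_FUNCTION), true);
        if ($depth !== 0 || !$isDeclaration) {
            continue;
        }
        // Reconstitute the declaration: concatenate the token texts up to
        // the opening brace of the body (or a terminating ';').
        $declaration = '';
        for (; $i < $count; $i++) {
            $current = $tokens[$i];
            if (!is_array($current) && ($current === '{' || $current === ';')) {
                $i--; // let the outer loop see the brace and update $depth
                break;
            }
            $declaration .= is_array($current) ? $current[1] : $current;
        }
        $declarations[] = trim($declaration);
    }
    return $declarations;
}

/**
 * Parses a function declaration string with a regular expression and returns
 * its name and raw parameter list, or null if it does not match.
 */
function parse_function_declaration($declaration)
{
    if (!preg_match('/^function\s+&?\s*(\w+)\s*\((.*)\)/s', $declaration, $matches)) {
        return null;
    }
    return array('name' => $matches[1], 'parameters' => $matches[2]);
}

$code = <<<'PHP'
<?php
/**
function f1(){}
*/
function f1($param1 = 'function f1(){}'){}
interface Shape {}
class Circle implements Shape {
    function area() {}
}
PHP;

$declarations = extract_declarations($code);
print_r($declarations);
print_r(parse_function_declaration($declarations[0]));
```

Because the regular expression is applied only to the reconstituted declaration, the comment never reaches it, and the nested area() method is skipped because it is found at brace depth 1.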
Downloads

parsing-php-code.zip (11 KB) contains code to parse PHP code.