OPT Hacker's Guide
From Invenzzia wiki
- This article is a stub. You can help us by expanding it.
This article is an introduction to the OPT compiler. It is being written for everyone who are interested in how the compiler works and how to modify it. If you are an ordinary programmer, probably you won't never have to know all those things. However, if you are going to read this article, be sure you already know something about OPT itself, because I assume that you are familiar with its features.
Contents |
Compiler overview
In fact, the template compiler is the biggest part of Open Power Template. The end-user classes contain less than 10% of total OPT source code. The rest is the compiler, instructions and other things that are processed during the compilation. So, we have to deal with a pretty big stuff. The compiler files are located in several places:
-
/Opt/Compiler/- here you can find the main class (Opt_Compiler_Class) together with two extra base classes for instructions and data formats. -
/Opt/Xml- the XML tree classes. -
/Opt/Format- the data formats. -
/Opt/Instruction- the instruction processors.
The compiler contains its own XML parser and processes the templates in three stages:
- Parsing the input document.
- Processing the XML tree.
- Building the output document.
Some issues you have to remember:
- The recursion is your enemy! Avoid it everywhere you can, using stacks and queues.
- The recursion can be avoided everywhere.
- Pay attention to the system resources. It is very easy to write a script that is unrunnable on most of the servers.
- Hermetize the compiler data by storing them in private and protected fields.
- Read the coding style guide in the OPL manual.
Compiler initialization
When you create a compiler object, it must copy a lot of data from the main parser or the other compiler. The constructor reference is Opt_Compiler_Class([Opt_Class|Opt_Compiler_Class] $parent). Using the semi-private method Opt_Class::_getList, it gets the lists of registered instructions, functions, components, etc. Moreover, it initializes the instruction processor objects and stores them in the $this->_processors array. The references are also stored in the assotiative arrays $this->_instructions and $this->_attributes that map various XML tags and attributes to the proper processors.
Next, the compiler has to rebuild some regular expressions according to the current library configuration. At the end, it checks whether it can work correctly.
Compilation
The template compilation in OPT consists of three stages:
- Source code parsing and XML tree building.
- Processing the XML tree nodes.
- Linking the tree to the PHP output code.
The instructions do not generate the PHP code directly to the output. Instead, they assign the code snippets to various code buffers associated with each XML node. The programmer can control only the second stage, by writing new instructions. Stages 1 and 3 do not support any end-user add-ons.
The main compilation algorithm is contained in Opt_Compiler_Class::compile() method. At this level, the support for the template inheritance is created. The compiler repeats the stages 1 and 2 till the whole inheritance tree is processed. The opt:extend instruction sets the node parameters that notify the algorithm, what template or snippet must be loaded next. Note that the algorithm constantly overwrites the $tree variable that keeps the current XML tree:
$tree = $this->_stage1($code, $extend, $mode);
unset($code);
// Stage 2 - PHP tree processing
$this->_stack = array();
$this->_stage2($tree);
$this->set('escape', NULL);
unset($this->_stack);
Later, the compiler checks, whether the opt:extend instruction created a node variable snippet that specifies the next template or snippet in the inheritance chain. First, the algorithm tries to match any of the existing snippet names and if it fails, tries to load the template source from an external file using Opt_Class::_getSource() method.
In the inheritance chain, the last template becomes the main XML tree that will be used to generate the output document. The other ones were used only to register the snippets. So, after the loop ends its job, $tree variable points to the tree root node. Now the compiler must store some information in the output document, what templates it was built of:
if(sizeof($this->_dependencies) > 0)
{
$this->_addDependencies($tree);
}
And finally, the linking. For the dynamic template inheritance, OPT might create some extra directories in compileDir directory. Moreover, if the caching system supports dynamic cache (see opt:dynamic), the file with the PHP code of the dynamic parts is stored at $compiledTemplateName.'.dyn'.
The whole algorithm is surrounded with the recursion checker that protects against infinite recursions etc. Furthermore, it is enclosed within a huge try...catch block that captures all the exceptions, resets the compiler to the initial state and throws the captured exception again, back to the script.
Stage 1
Here, we'll take a closer look at the parsing algorithm. Basically, it takes the raw source XML code, finds the tags and returns an XML tree to the caller. In order to start, we must know, what nodes can compose a tree and what they represent:
-
Opt_Xml_Root- the root of every tree, represents a single XML document. It also holds the references to DTD and document prolog. -
Opt_Xml_Element- represents an XML tag, provides the interface to manage attributes (Opt_Xml_Attribute). -
Opt_Xml_Text- this node does not mean exactly the same, as in DOM. As the static text in the templates may contain expressions in curly brackets (i.e.{$foo}), it was decided to introduceOpt_Xml_Textnode to manipulate the text with the expressions (easier to move, easier to remove...). It may containOpt_Xml_Cdata,Opt_Xml_ExpressionandOpt_Xml_Comment. -
Opt_Xml_Cdata- any static text -
Opt_Xml_Comment- an XML comment. It is necessary to be kept, because the programmer may request to print it in the output document. -
Opt_Xml_Expressiom- represents an OPT expression in the static text.
Within the tree, we can also use Opt_Xml_Attribute (tag attributes), Opt_Xml_Dtd and Opt_Xml_Prolog, however they are not a part of the main tree structure.
Typical XML document is composed of the following parts:
<?xml PROLOG ?>
<!DOCTYPE ... >
white characters or comments
<root node>
the rest of the document
</root node>
white characters or comments
The parsing algorithm can be found in Opt_Compiler_Class::_stage1() method. It begins with scanning for the XML prolog and DTD. We do not use regular expressions, because they are not good enough to detect invalid items between the valid content. Instead, the methods make use of the simplest strpos() and loops. The prolog parsing is delegated to a separate method: Opt_Compiler_Class::_compileProlog() which returns an array of prolog attributes. Once we get it without any exception, we simply create an XML prolog object and register it in a tree. After scanning the white characters, we reach the DTD section. Contrary to other XML parsers, OPT currently ignores the content of DTD, but this behaviour will be changed in the future. For now, the DTD content is simply copied to the Opt_Xml_Dtd object.
Note that this code is executed only if we are not in the quirks mode. Depending on the state of the OPT directive, the compiler configures the rest of the parser to follow either quirks or XML mode.
Later, we get back to the regular expressions. Firstly, OPT looks for the CDATA sections with preg_split(). The returned array is scanned for the occurences of CDATA delimiters. Using them, the parser determines, which text is a part of CDATA and which is not. The latter one is scanned for comments then, using the same technique. The remaining code is scanned for tags with preg_match_all(). The regular expression is generated by the class constructor according to the OPT settings, and it returns the following columns (they are described in the source code, too):
// Find XML tags
preg_match_all($tagExpression, $subgroups[$i], $result, PREG_SET_ORDER);
/*
* Output field description for $result array:
* 0 - original content
* 1 - tag content (without delimiters)
* 2 - /, if enclosing tag
* 3 - name
* 4 - arguments (5 in quirks mode)
* 5 - /, if enclosing tag without subcontent (6 in quirks mode)
*/
The last two items have different indices in the quirks mode, so the access to them is done through the index variables $attributeCell and $endingSlashCell created at the beginning of the method. Using loops, we scan through the result and:
- For opening tag: we copy the remaining text to the last opened text node and create a new
Opt_Xml_Elementobject. If the slash occurs at the end of the tag, we also set thesinglenode variable to true in order to remember this fact and reproduce it in the output. - For closing tag: we have to check whether the tags are closed in the proper order, by simple comparison of the names. Once we are sure that this is all right, we jump out to the parent node.
As you take a closer look, you should notice that we are using some helper methods here, like _treeTextCompile() or _treeJumpOut(). They have been created, because there operations are a bit compilcated and must include several factors (well, jumping out is rather short, but... never mind :)). Another word of explaination should be said about the text between the tags. In order not to spam the limited memory resources, we do not order returning it via the regular expressions. Instead, we look for the position of the currently processed tag with strpos() and we cut off the text from the last position (pointed by $offset) to the new tag and put it as a new XML text node.
Finally, we run the checker that ensures that Opt_Xml_Root contains at least one Opt_Xml_Element node. Please note that this is done only if the suitable configuration directive is enabled. After the tests, we return the tree to the caller.
Stage 2
In the stage two, we go through the tree using the preorder traversal and perform various operations on the tree nodes. This stage can be partially controlled by the instruction processors and by the instruction programmers. The original algorithm uses recursion, and the early versions of OPT used it, too. However, due to the depth of the XML tree for the whole document, some users complained about Nesting level too deep errors and later the algorithm was rewritten. The current code is based on the queue and the stack represented by $queue and $stack respectively. The queue contains the nodes that need to be processed, whereas the stack allows to restore some settings once we get back to the upper tag.
For each node, we may define two actions:
- Preprocessing - executed before traversing through the node children.
- Postprocessing - executed after traversing through the node children.
Because we do not have a true recursion here, the latter action has been moved to _doPostprocess() method, because it needs to be used twice and the code is not too short. The code runs in an infinite loop which is broken manually. First, we dequeue the next item from the current queue, if possible. If the queue is not empty, we should get another XML node to be processed. With the switch statement, we decide, what action to perform, depending on the node type. For Opt_Xml_Element we check, whether it belongs to one of "restricted" namespaces. If it really belongs to one, we try to find a processor and return it the control over the parsing. Please note that if the element has some children, the following code is executed:
$stack->push(array($item, $queue, $pp));
$pp = NULL;
$queue = new SplQueue;
foreach($item as $child)
{
$queue->enqueue($child);
}
continue 2;
Basically, it pushes the current queue to the stack and puts the children of the current element to a new queue, which becomes the "main queue". The $pp variable contains the special tag attributes (in the restricted namespace, i.e. opt) that need to be run during the postprocessing. The only difference is the instruction tag parsing, because the processors generate their queues manually and we simply need to use their queues, instead of building a new one that does not necessarily suits the programmers.
In the stage two, we can notice lots of occurences of a mysterious node variable hidden. If set to true, it causes the node not to be displayed, together with the node children. By default, all the tree nodes are hidden, and the tree traversal makes them visible. The effect can be shown on the following example:
visible text
<opt:idontexist>
not visible text
</opt:idontexist>
visible text
The opt:idontexist tag is not recognized by any processor, so the stage 2 does not process it and its children, in this case a text node. The text node within the tag remains hidden and does not appear in the final output.
Once we have finished the tag, we call the postprocessing for it, and check the status of the current queue. If it is empty, it's time to pop the last element from the stack and restore the older queue. For the popped element, we call the postprocessing, too.
After the second stage, the compiler should have enough information about the visibility of the particular nodes in the output document, and the code buffers in the nodes should be populated with the PHP code snippets by the instruction processors.
In this stage, the instruction processors may manipulate the XML tree or generate the PHP code snippets which are appended to the tree nodes. Each node contains several code buffers representing the places, where the PHP code may appear around the node. The details can be found in the project documentation, so we will not mention them here.
Stage 3
The third compilation stage links the XML tree into a valid, compiled PHP script. The general algorithm concept is very similar to the previous stage - with stacks and queues we traverse through the tree, recognize the node type and create the output code for them. The code buffers are concatenated into single PHP code blocks:
$output .= $item->buildCode(Opt_Xml_Buffer::TAG_BEFORE, Opt_Xml_Buffer::TAG_OPENING_BEFORE,
Opt_Xml_Buffer::TAG_OPENING_AFTER, Opt_Xml_Buffer::TAG_CONTENT_BEFORE);
After finishing this stage, the $output variable contains the compiled template content, which is saved to a file and executed by OPT.
Expression parsing
OPT contains a powerful expression parser, which does not only replace the operators to their PHP equivalents, but also checks the syntax of the input expressions and performs some manipulations. Again, we use quite complex techniques to aviod the explicite recursion while parsing the sub-expressions. Let's take a look at a sample expression:
$a is 5 * ($b + $c)
Is is composed of smaller and simpler sub-expressions:
-
$a is EXPRESSION -
5 * EXPRESSION -
$b + $c
In the Opt_Compiler_Class::compileExpressions(), the algorithm tokenizes the input with a big regular expressions that recognizes operators, strings, numbers and identifiers and stores them in different array elements, so that they could be processed much easier:
preg_match_all('/(?:'.
$this->_rSingleQuoteString.'|'.
$this->_rBacktickString.'|'.
$this->_rHexadecimalNumber.'|'.
$this->_rDecimalNumber.'|'.
$this->_rLanguageVar.'|'.
$this->_rVariable.'|'.
$this->_rOperators.'|'.
$this->_rIdentifier.')/x', $expr, $match);
Then, we go to a loop which looks for the expression boundaries, like parentheses, or commas. Here, the concept of a translation unit appears for the first time, which is nothing more, but such a sub-expression. Each translation unit has its own, unique number which is used to address it later. The $tu array contains the arrays of tokens for each translation unit, and $tuid represents the number of the currently parsed translation unit. If we find a neutral token, we simply copy it to the translation unit array:
default:
$tu[$tuid][] = $match[0][$i];
The four symbols begin the new sub-expression:
case '[':
case '(':
case 'is':
case '=':
if($match[0][$i] == '=' || $match[0][$i] == 'is')
{
$assignments[] = $tuid;
}
$tu[$tuid][] = $match[0][$i];
++$maxTuid;
$tu[$tuid][] = $chr.$maxTuid; // A fake token that marks the translation unit which goes here.
$stack->push($tuid);
$tuid = $maxTuid;
$tu[$tuid] = array();
break;
Here, we must do the following things:
- Write the token to the "old" translation unit (
$tu[$tuid][] = $match[0][$i];), because it must appear in the output, too. - Register the new ID for the new translation unit (
++$maxTuid;) - Save the information in the "old" translation unit, that there is another translation unit in this place (
$tu[$tuid][] = $chr.$maxTuid;). Such "special" token is represented by the new translation unit ID followed by a non-printable ASCII symbol (currently ASCII code 18 is used). - Finally, we push the "old" translation unit ID to the stack (we may need it later), and start the new translation unit.
Closing parentheses reverse the process.
Once the expression is splitted into sub-expressions, we call Opt_Compiler_Class::_compileExpression() on each of them to compile it into the PHP code. This method contains a loop that scans the translation unit token list and replaces them with their PHP equivalents. The syntax controll is quite simple, but effective. $state['next'] is a bit field and contains the opcodes of the tokens expected in the next place. Once we recognize the token type, we check, if it is mentioned in this field. If not, we throw an exception, and if yes, we replace it with the PHP code and set the new value here for the next token. Some token parsers are delegated to separate methods.
As a result, we get the list of sub-expressions compiled into the PHP code. Now we must link them back into a single expression, using the special translation unit tokens left in the token list. Again, we use stacks. The loop scans the sub-expression and adds the tokens to the output string. The possible steps are:
- If we find the translation unit token, we send the current TU onto the stack and start linking the new translation unit.
- If we reach the end of the translation unit, we pop the previous TU identifier from the stack and get back to linking the previous translation unit. If there is nothing left on the stack, we have parsed the whole expression.
Changes in OPT 2.1
In OPT 2.1, the whole code has been modularized:
- The stage 1 was moved to external parsers. The programmer may write the new parsers and switch them freely (it replaces the current "mode" implementation).
- The stage 2 node-specific code has been moved to node classes. The
Opt_Compiler_Class::_stage2()is much cleaner and simpler now, however, the general concept remains the same. It also allows to introduce new node types. - The stage 3 - like stage 2.
The expression parsing has been also moved to external expression parser classes which may be extended. The Opt_Compiler_Class::compileExpression() now parses the expression modifiers and selects the external parser only.
Conclusion
As we can see, the algorithms behind Open Power Template are very complex, but powerful in order to compile the templates precisely.

