Number of BBCode params and permutation

Started by emanuele · July 31, 2015, 02:43:49 pm

Re: Number of BBCode params and permutation

Reply #45 – January 27, 2017, 03:52:49 am

I'm jumping in without a terrible amount of knowledge, but here's a thought.

BBCode is a massive task that many don't seem to appreciate; I've heard it said that the base of SMF was one of the most feature-rich. I'm wondering whether BBCode as a parser shouldn't become its own library. I haven't asked, but I'd imagine SMF/IP/Xen/vB/phpBB/MyBB all have the exact same headache.

Is it an avenue worth exploring?

Re: Number of BBCode params and permutation

Reply #46 – January 27, 2017, 07:51:27 am

The 1.1 is already a sort of library.

Re: Number of BBCode params and permutation

Reply #47 – January 27, 2017, 10:34:02 am

Quote from: Joshua Dickerson – January 26, 2017, 10:06:14 pm
Have you tried ordering the parameters? Just wondering if that had any positive effect.

I have not done this. I was considering it for sorting the required parameters ahead of the optional ones, and it may have some positive effect. Right now the first parameter order checked is the order in which the editor inserts the tag, so the first check is generally the correct one.

Quote from: Joshua Dickerson – January 26, 2017, 10:04:19 pm
I never did like matching to the end of the message. I think adding another regular expression in there to check for that will add overhead though. I think I tried it out and that's what it did.

I don't disagree, and I tried not to use a preg_match here. I did one version using an explode, but in various "trip up the parser" tests the preg_match seemed to work better. The preg_match in place is a lazy one, so it generally does as little work/processing as possible.

On a positive note, feeding a shorter string to the follow-on param parsers allows them to finish in fewer steps with less recursion, so perhaps the performance balances out. One extra preg here results in an easier-to-parse string for the next step.

Quote
I don't quite understand the whole thing with the optional stuff. I will have to read it again.

I'll try by example. Take the IMG tag: all of its parameters are defined as optional, so the regex that gets built would be something like

Code:

^(\s+width=(\d+))?(\s+height=(\d+))?(\s+alt=(.+?))?\]

Now pass that regex the following string:

Code:

 alt=something for alt width=100 height=100]http://www.somesite.tld.someimage.png[/img]

The parser would say it has found a match and stop looking, but it would match just the alt parameter, with the value "something for alt width=100 height=100", because of all those '?' quantifiers on the capture groups.

The change I made was to not add the ? quantifier in our regex_cache array and instead build a simple param_check. It gets used in a simple stripos search on the current tag being parsed. Based on that check it appends the ? quantifier, where needed, to the regex_cache entry that will be used in the order search. So in the above example none of the groups would get a ?, the match would not occur, and the next combination would be tried.
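To make the difference concrete, here is a minimal sketch of both behaviours. It is not the actual ElkArte code: buildParamRegex() and the $patterns array are names made up for illustration, and the presence test is only a stand-in for the param_check described above.

```php
<?php
// Illustration only: buildParamRegex() and $patterns are hypothetical names.
$tag = ' alt=something for alt width=100 height=100]http://www.somesite.tld.someimage.png[/img]';

// 1) Every group optional: the regex matches, but alt swallows the other parameters.
$allOptional = '~^(\s+width=(\d+))?(\s+height=(\d+))?(\s+alt=(.+?))?\]~';
preg_match($allOptional, $tag, $m);
$swallowed = $m[6]; // the captured alt value

// 2) Only mark a parameter optional when it is absent from the tag.
function buildParamRegex(array $order, string $tag): string
{
    $patterns = [
        'width'  => '(\s+width=(\d+))',
        'height' => '(\s+height=(\d+))',
        'alt'    => '(\s+alt=(.+?))',
    ];

    $regex = '';
    foreach ($order as $name) {
        // Simple presence check, standing in for the param_check idea.
        $present = stripos($tag, $name . '=') !== false;
        $regex .= $patterns[$name] . ($present ? '' : '?');
    }

    return '~^' . $regex . '\]~';
}

// The definition order no longer matches, because nothing is optional here...
$firstTry = preg_match(buildParamRegex(['width', 'height', 'alt'], $tag), $tag);

// ...so the next permutation gets tried, and it captures each value separately.
$secondTry = preg_match(buildParamRegex(['alt', 'width', 'height'], $tag), $tag, $found);
```

In this sketch $found[2] holds the alt text on its own, and $found[4] and $found[6] hold the width and height.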

Quote from: Joshua Dickerson – January 26, 2017, 10:04:19 pm
The preparser could add flags in there from the start of a possible tag to the end of a possible tag. Then we could either use explode() to create an array (I think that's actually slower) or use strpos() to find the next value (should be faster). All it would do is search for [a-Z to find the start and then find the next ] that's not in a quoted string. If no closer is found, it's not possible to be a tag and the opener flag should not be added. Obviously, that would be an upgrading headache.

If we could add proper start and end markers to the tags we would be in biscuit city!

Re: Number of BBCode params and permutation

Reply #48 – January 27, 2017, 10:40:26 am

Quote from: emanuele – January 27, 2017, 07:51:27 am
The 1.1 is already a sort of library.

Is it worth being its own 'independent library'? Like would it be useful for multiple systems to improve one, or are implementations far too different?

Re: Number of BBCode params and permutation

Reply #49 – January 27, 2017, 07:28:09 pm

Quote from: Trekkie101 – January 27, 2017, 10:40:26 am
Quote from: emanuele – January 27, 2017, 07:51:27 am
The 1.1 is already a sort of library.

Is it worth being its own 'independent library'? Like would it be useful for multiple systems to improve one, or are implementations far too different?

I wrote it with the idea that it would be useful for others. If someone wants to break it out and make it its own library, it's free and open source. I don't even know what license it is under. I'm not entirely sure if I can relicense it because it's a derivative of SMF and thus Elkarte, but if I can and someone needs me to, I'll sign off on whatever is necessary. Emanuele and Spuds are probably in the same boat; if it makes our lives easier to maintain, let's do it!

Re: Number of BBCode params and permutation

Reply #50 – January 27, 2017, 08:41:55 pm

Quote from: Spuds – January 27, 2017, 10:34:02 am
If we could add proper start and end markers to the tags we would be in biscuit city!

The rules are really simple for what the tags must be:

start with [
the next character must be a letter
then any alphanum character with an unlimited length
followed by a space, =, or ]

That gets us the start of the tags. Easy, but not good. We already do that in the parser and that process is a really small one.

The next part is to find the end. We start by having a beginning. That's easy. We could probably put this in a single regular expression, but it's not necessary. The rules are:

find the next occurrence of ]
must not be inside of quotes. Not sure where this process would be, but they are either " or &quot;. There are tons of examples of finding a character/string when it's not inside quotes, so I am not about to try writing it out in a post.

Closing tags:

starts with [
the next character must be a /
the next character after that must be a letter
then any alphanum character with an unlimited length
then a ]
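The three sets of rules above (opener, end of tag, closing tag) could be sketched roughly like this. The function names are made up for illustration and are not ElkArte code, and the quote handling assumes quotes show up either as a literal " or as the entity &quot;.

```php
<?php
// Sketch only: these helper names are hypothetical.

// Opening tag start: "[", a letter, any alphanumerics, then a space, "=" or "]".
function isPossibleOpener(string $msg, int $pos): bool
{
    // \G anchors the match at the offset passed as the fifth argument.
    return preg_match('~\G\[[a-z][a-z0-9]*[ =\]]~i', $msg, $unused, 0, $pos) === 1;
}

// Closing tag: "[/", a letter, any alphanumerics, then "]".
function isClosingTag(string $msg, int $pos): bool
{
    return preg_match('~\G\[/[a-z][a-z0-9]*\]~i', $msg, $unused, 0, $pos) === 1;
}

// Find the next "]" at or after $pos that is not inside a quoted string.
// Returns null when there is no closer, meaning this cannot be a tag.
function findTagEnd(string $msg, int $pos): ?int
{
    $inQuotes = false;
    $len = strlen($msg);

    for ($i = $pos; $i < $len; $i++) {
        if ($msg[$i] === '"') {
            $inQuotes = !$inQuotes;
        } elseif (substr_compare($msg, '&quot;', $i, 6) === 0) {
            $inQuotes = !$inQuotes;
            $i += 5; // skip the rest of the entity
        } elseif ($msg[$i] === ']' && !$inQuotes) {
            return $i;
        }
    }

    return null;
}

$msg = 'Hello [img alt="a]b"]world.png[/img]';
$start = strpos($msg, '[');
$end = isPossibleOpener($msg, $start) ? findTagEnd($msg, $start) : null;
```

Here $end points at the ] that closes the img tag, not the one inside the quoted alt value. The two regular expressions only look at a handful of characters each; the scan for the closer is a plain character loop.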

The message would look like

Code:

Hello \r[b]\rworld\r[/b]\r,<br>I'm Josh.

We don't need to know if it is a tag at the time of preparsing. In fact, we don't want to know. We just want to be able to say that it might be a tag. We add a flag like \r or \a or any character that we removed before. If I hear one person complain about the message size increase, I'm going to choke them. After we check if the first part is a tag, we can then do a substr() to the next occurrence of that flag.

For parameters, I would use a different flag, for example \a. They are a little bit trickier. Generally, they are pretty close to tag names. As an example, that would look like:
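As a rough sketch of that last step, pulling the flagged candidates out of a message needs nothing more than strpos() and substr(). The helper name nextCandidate() is invented for this example, and "\r" is simply the flag used in the sample message.

```php
<?php
// Illustration only: nextCandidate() is a hypothetical helper.
// Returns the text between the next pair of "\r" flags, plus where to resume.
function nextCandidate(string $msg, int $offset = 0): ?array
{
    $start = strpos($msg, "\r", $offset);
    if ($start === false) {
        return null;
    }

    $end = strpos($msg, "\r", $start + 1);
    if ($end === false) {
        return null;
    }

    return [
        'text' => substr($msg, $start + 1, $end - $start - 1),
        'next' => $end + 1,
    ];
}

$msg = "Hello \r[b]\rworld\r[/b]\r,<br>I'm Josh.";

$candidates = [];
$offset = 0;
while (($found = nextCandidate($msg, $offset)) !== null) {
    $candidates[] = $found['text'];
    $offset = $found['next'];
}
```

After the loop, $candidates holds the two possible tags from the sample message. Whether each one really is a tag is still left to the parser.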

Code:

Hello \r[img \aalt=&quot;Hello World&quot;\a]\rworld.png\r[/img]\r

Not sure if the flags should be on the inside of the [] and if there should be another flag around the tags and parameters themselves. As an example:

Code:

Hello [\rimg\f \aalt\a=&quot;\nHello World\n&quot;\r]\aworld.png\t[/img]\t

This removes the need for any regular expressions.

Ugh, now I really don't even want to use flags. I just want to do this the right way and make it into an array. That's the next step though and I don't want to go down a rabbit hole.

Re: Number of BBCode params and permutation

Reply #51 – January 27, 2017, 08:42:11 pm

I spent a few minutes writing out an example parser:

Code:

<?php

$msg = "Hello [\rimg\f \aalt\z=&quot;\nHello World\n&quot; \atitle\z=\nHi\n thisisempty\n\n\r]\qworld.png\t[/img]\t";

$nextChar = strpos($msg, "\r");
if ($nextChar === false) {
    // No flags at all, so there is nothing to parse. cleanString() returns the cleaned copy.
    $msg = cleanString($msg);
}

$newMsg = '';

// I don't think we need a do/while, actually. A while should work.
do {
    $tagEndPosition = strpos($msg, "\f", $nextChar);
    // Skip the one-character "\r" flag so $tag holds only the tag name.
    $tag = substr($msg, $nextChar + 1, $tagEndPosition - $nextChar - 1);
    if (isset($bbc[$tag])) {
        $checkCodes = $bbc[$tag];
        
        // Find the next \r
        $paramStringEndPos = strpos($msg, "\r", $tagEndPosition);
        
        // If the next \r is not 0 or 1 difference:
        // Find the next \a
        $paramEndPos = $tagEndPosition;
        $params = [];
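        // "\a" and "\z" are not PHP escape sequences, so each of those flags is two characters (backslash + letter).
        $flagLen = strlen("\a");
        while (false !== ($paramPos = strpos($msg, "\a", $paramEndPos)) && $paramPos < $paramStringEndPos) {
            // Find the param name between the \a and \z flags
            $paramEndPos = strpos($msg, "\z", $paramPos);
            $param = substr($msg, $paramPos + $flagLen, $paramEndPos - $paramPos - $flagLen);
            
            // Get the value between a pair of \n flags (empty values use \n\n to be easy)
            $valuePos = strpos($msg, "\n", $paramEndPos);
            $valueEndPos = strpos($msg, "\n", $valuePos + 1);
            $value = substr($msg, $valuePos + 1, $valueEndPos - $valuePos - 1);
            
            // Comma delimited strings are a bit different, but their key would just be 1, 2, 3, etc.
            $params[$param] = $value;
        }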
        
        ksort($params);
        
        $foundCode = false;
        foreach ($checkCodes as $code) {
            if (!checkRequiredParameters($code, $params)) {
                continue;
            }
            
            // Whatever was supplied beyond the required parameters has to be optional.
            $optionalParameters = array_diff_key($params, array_flip($code->getRequiredParameters()));
            
            if (!checkOptionalParameters($code, $optionalParameters)) {
                continue;
            }
            
            // Cool... now we have a tag and parameters.
            
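            // Next step is to get any content it has. You can guess how that will go.
            // (That step, not written out here, is what sets $content and $closingTagClosingPos.)
            // Finally, we parse the code.
            parseCode($code, $params, $content);
            
            // Stop at the first code that fits so $foundCode reflects the result.
            $foundCode = true;
            break;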
        }
        
        $nextChar = $foundCode ? $closingTagClosingPos : $valueEndPos;
    } else {
        cleanString($msg);
    }
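    // Look for the next flag after the current one; compare against false because position 0 is falsy.
} while (false !== ($nextChar = strpos($msg, "\r", $nextChar + 1)));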


function checkRequiredParameters($code, array $params)
{
    foreach ($code->getRequiredParameters() as $requiredParameter) {
        if (!isset($params[$requiredParameter])) {
            return false;
        }
    }
    
    return true;
}

function checkOptionalParameters($code, array $params)
{
    // Every supplied non-required parameter name must be one the code lists as optional.
    return array_diff(array_keys($params), $code->getOptionalParameters()) === array();
}

// Clean any remaining special characters
function cleanString($msg)
{
    return str_replace(["\r", "\f", "\a", "\z", "\n"], '', $msg);
}

Re: Number of BBCode params and permutation

Reply #52 – January 27, 2017, 08:42:56 pm

There is no need to do the ksort actually. All of that would happen in the preparser.