ElkArte Community

Elk Development => Bug Reports => Exterminated Bugs => Topic started by: emanuele on July 31, 2015, 02:43:49 pm

Title: Number of BBCode params and permutation
Post by: emanuele on July 31, 2015, 02:43:49 pm
Reported at sm.org https://github.com/SimpleMachines/SMF2.1/pull/2956
Basically parse_bbc creates all the possible permutations of the parameters of a bbocde in order to catch any possible order of them. In case a bbcode has more than 12 parameters the parser starts to create an array with more than 10k entries and after more than 14 the memory consumption goes above the 90MB.

The proposed solution is to dynamically re-order the parameters of a bbcode in order not to have to guess the order at all.

A couple of considerations.
1) Provide more than 5 parameters for a bbcode is insane and pretty much useless. Why? For example because nobody is going to use them anyway (people don't even remember the bbcode tags, let alone 15 parameters of a single bbcode).
2) The solution may actually be rather nice, though. But do it in parse_bbc looks kind of overkill to me: to fix a one in a 10 years problem, you are going to reorder each and every bbcode tag parameters out there forever. If there is any interest in applying this solution, it would be cool do it before save the text in the database so that there will be no need to do anything at parsing-time, though this solution may require some more work at "preparsecode-time" (like parse the text and replace it back with bbcode instead of HTML)... :-\
Title: Re: Number of BBCode params and permutation
Post by: Flavio93Zena on July 31, 2015, 05:12:17 pm
I get a white page if I apply those edits so you might review them in case you are getting inspiration.
Fatal error: Call-time pass-by-reference has been removed in Subs.php on line 887 but thinking about it it might be due to another edit I had done to the parser - http://custom.simplemachines.org/mods/index.php?mod=2191
Title: Re: Number of BBCode params and permutation
Post by: Spuds on July 31, 2015, 05:22:29 pm
I heard about this one as well, @dcmouser let me know that another ILA mod (not mine, I'm  O:-) ) was causing things to explode in that repetition without duplication function. 

Obviously its a corner case, it takes a lot of parameters to cause the issue, and I think its a better, or a more natural, UI to wrap BBcodes for certain effects since you can do that with the just the editor buttons.  Like if you want bold italic underline, you don't use [bold typeface=italic style=underline]bla etc.

That said, its still bad code, and the docblock for the function notes that fact.  Its a ticking time bomb for someone willing to add lots of bbc parameters. 

Having spent more than just a little "quality" time with parse_bbc, I also know you don't want to make changes unless you have a least a decent test bed of input vs expected results to verify changes,  it may be the single most important function.  You also want to benchmark a normal post code before and after any changel,  I've seen changes that fix an edge case at the expense of the other 99% in terms of speed / memory.  Not indicating thats the case here since I have not looked at the solution nor bench marked it, just saying its something that would have to be done.
Title: Re: Number of BBCode params and permutation
Post by: emanuele on July 31, 2015, 06:45:42 pm
Quote from: Spuds – let me know that another ILA mod (not mine, I'm  O:-) )
You have been just lucky this time! :P

Quote from: Spuds – That said, its still bad code, and the docblock for the function notes that fact.  Its a ticking time bomb for someone willing to add lots of bbc parameters. 
/me likes bombs!! :D KABOOOMM!!

Quote from: Spuds – Having spent more than just a little "quality" time with parse_bbc, I also know you don't want to make changes unless you have a least a decent test bed of input vs expected results to verify changes,  [...]
BIG NODS to everything.
I also think the proposed fix will not cover many edge cases: the reason (or at least one of the reasons) the "parsing" of the parameters is done with a regexp on all the permutations is because rely on spaces, square brackets and equal signs is not reliable (just an example, the proposed code will probably fail straight away quoting a message from @[SiNaN] [1], see the very first strpos).

@Flavio93Zena the reason it fails is because there is a typo:
Code: [Select]
function_param_order($message, &$parameters, &$out)
should be:
Code: [Select]
function fix_param_order($message, &$parameters, &$out)
Yeah, it's always your fault!! :P
Title: Re: Number of BBCode params and permutation
Post by: Spuds on July 31, 2015, 08:28:38 pm
LOL .. yes kaboooom indeed!

Of interest as I've been looking at that function, I think there is actually a bug when you get > 3 parameters as well .. I need to do some more testing to be sure so I'm doing some more digging.
Title: Re: Number of BBCode params and permutation
Post by: Flavio93Zena on August 01, 2015, 12:19:15 am
Well, seeing your reply I'd like a fix by someone with more experience (you :P). I don't think you changed the code too much as regards the BBCode stuff, otherwise this topic wouldn't have been here. So basically the fix you are going to apply could potentially be either similar or identical on Elk and SMF. But I think it also depends on which way you are going to choose.
/me crosses fingers hoping he got it right ;D
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 01, 2015, 11:11:22 am
So looked at this some more this AM, and the issue is not so much that the current function "choked" when given 14 elements, but that it appears to return wrong results for anything > 3 elements, at least if I understand what it is supposed to do.

- Given 3 params like array(1, 2, 3) it should return 6 combinations or 3! or 3*2*1 which it does.
- Given 4 params like array(1, 2, 3, 4) it should return 24 combinations or 4! or 4*3*2*1, instead it returns 18 combinations and some of them are duplicate sets, so not exactly what it should be doing. 
- Given 8 params it returns 770 combinations not the 40,320 that it should.

So the fact that it even got to 14 before it choked the system is an error, it should have done so much sooner  :P   (14 = 87,178,291,200 possible orders that 14 items could be arranged),  I did fix the algorithm to return the correct, non repeating, sets but the fixed function just exasperates the out of memory issue.

We could be smarter about the check as well.  Say a code has 14 potential parameters, but they only use 3.  In this case there would only be 2184 combinations to check, not the 14Billion, thats a huge reduction.  Of course nothing would stop the user from entering all 14 so they could a nice DoS

TLDR: Its fubar
Title: Re: Number of BBCode params and permutation
Post by: Flavio93Zena on August 01, 2015, 11:16:38 am
Good job! Will you apply such a fix in 1.0.5 (hopefully to me xD) or 1.1?
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 01, 2015, 01:27:02 pm
I think before starting, this needs tests.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 01, 2015, 01:54:42 pm
I don't think we have any solution to test at least I don't, and TBH I've not come up with anything. 

All I've found is that the permute function is broken and has been form shugs who knows how long.  I can fix that but it means even fewer parameters could be iterated on before you have some wsod.

To me the major stumbling block is parameters that contain ] characters, like author names.  Given that condition, its not easy to break down the regex to find / extract individual patterns, instead you need the full pattern containing all parameters to avoid missing the correct ]
Title: Re: Number of BBCode params and permutation
Post by: emanuele on August 01, 2015, 02:13:40 pm
Quote from: Spuds – So the fact that it even got to 14 before it choked the system is an error, it should have done so much sooner  :P
Cool!...

Quote from: Spuds – We could be smarter about the check as well.  Say a code has 14 potential parameters, but they only use 3.  In this case there would only be 2184 combinations to check, not the 14Billion, thats a huge reduction.  Of course nothing would stop the user from entering all 14 so they could a nice DoS
Though, we don't know in advance which are used in the body, because it should be that permutation giving use the result of which params are used...

Code: [Select]
(author)=(.*?)(?:author|link|date)|(link)=(.*?)(?:author|link|date)|(date)=(.*?)(?:author|link|date|\])
/me runs!

LOL nods
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 01, 2015, 02:39:09 pm
https://www.google.com/search?q=php+get+permutations+of+an+array

http://docstore.mik.ua/orelly/webprog/pcook/ch04_26.htm

The tests would require a pretty large array. Perhaps too large.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 01, 2015, 03:08:25 pm
I did fix the permute code and verified that the new code is returning the correct number of sets.  Past say a set of 5 (120 combinations) it gets to be a massive chore to validate the validity of the returned arrays, but I had that data available to me.

The problem is even with the fixed permute function, the system is badly broken, actually more broken.  Permutation without repetitions is N! so 14 is 14*13*12*11*10*9*8*7*6*5*4*3*2*1 = 87,178,291,200 possible combinations ... Obviously we can't have the system trying to (potentially) examine 87Billion combinations .... that is if it could even get to the point of being able to calculate the 87Billion combinations before running out of memory. 

Its honestly more of an addon issue, however the potential for the error is in the code for sure.   Even if we were to use a smarter approach based on the actual number of passed parameters (not easy to do either) it will still leave sites exposed to some miscreant entering all 14 or 10 or whatever the number, and it would choke the system to death.

I'm really not sure what to do to fix the potential issue.  Could put a simple "max" check on the permute function to prevent the runaway as it will only ever occur if a addon author adds a large enough number of parameters to cause an issue, number be > 6  (6=720 checks, 7=5,040 8=40,320 9=362,880 10=3,628,880  I think once past 6 you are going to notice the effect, certainly by 10 you are in to seconds. 

I did not look at the posted solution, but trying to order the parameters to match a "standard" may be the way to go, if you can actually extract the various parameters with success.
Title: Re: Number of BBCode params and permutation
Post by: emanuele on August 01, 2015, 07:02:26 pm
As a quick fix I would do exactly that: 1) fix the permutation code, 2) hard-limit to 5 or 6 the number of allowed parameter.

Past that, ordering needs IMO some benchmarking, I feel it's counter-productive with a reasonable amount of parameter (less than 5), but it's just a feeling.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 02, 2015, 01:36:46 pm
http://www.simplemachines.org/community/index.php?topic=538611.msg3827920#msg3827920

Without testing it, I think this saves a lot of RAM without losing any functionality. It will allow you to get a lot more parameters but not infinite.

In Subs.php, find:
Code: [Select]
			// This is long, but it makes things much easier and cleaner.
if (!empty($possible['parameters']))
{
$preg = array();
foreach ($possible['parameters'] as $p => $info)
$preg[] = '(\s+' . $p . '=' . (empty($info['quoted']) ? '' : '"') . (isset($info['match']) ? $info['match'] : '(.+?)') . (empty($info['quoted']) ? '' : '"') . ')' . (empty($info['optional']) ? '' : '?');

// Okay, this may look ugly and it is, but it's not going to happen much and it is the best way
// of allowing any order of parameters but still parsing them right.
$match = false;
$orders = permute($preg);
foreach ($orders as $p)
if (preg_match('~^' . implode('', $p) . '\]~i', substr($message, $pos1 - 1), $matches) != 0)
{
$match = true;
break;
}

// Didn't match our parameter list, try the next possible.
if (!$match)
continue;

Replace
Code: [Select]
			// This is long, but it makes things much easier and cleaner.
if (!empty($possible['parameters']))
{
// Didn't match our parameter list, try the next possible.
if (!find_bbc_parameters($possible['parameters'], substr($message, $pos1 - 1), $matches))
continue;

Add this to Subs.php
Code: [Select]
/**
 * Find the correct order for the BBC parameters in a string
 *
 * @param array $parameters
 * @param string $message_part
 * @param array &$matches
 *
 * @return bool
 */
function find_bbc_parameters(array $parameters, $message_part, &$matches)
{
$preg = array();
foreach ($parameters as $p => $info)
{
$preg[] = '(\s+' . $p . '=' . (empty($info['quoted']) ? '' : '"') . (isset($info['match']) ? $info['match'] : '(.+?)') . (empty($info['quoted']) ? '' : '"') . ')' . (empty($info['optional']) ? '' : '?');
}

// Okay, this may look ugly and it is, but it's not going to happen much and it is the best way
// of allowing any order of parameters but still parsing them right.
$match = false;
$orders = permute(array_keys($preg));

foreach ($orders as $keys)
{
$match_string = '~^';
foreach ($keys as $key)
{
$match_string .= $preg[$key];
}
$match_string .= '\]~i';

if (preg_match($match_string, $message_part), $matches) != 0)
{
$match = true;
break;
}
}

return $match;
}

As two added benefits, it gets rid of the $orders array and doesn't do so much string copying with the substr() in the loop.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 02, 2015, 03:09:51 pm
Cool, thanks for taking a look at the problem.  Can't test at the moment as I'm in addon mania right now :P  .... That and I don't have a bbc code with lots of parameters to test so need to create one of those. 

Are you thinking it will save memory because of the use of array keys instead of the full arrays?   I saw that cute little substr in the loop as well, was going to address that at some point too.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 02, 2015, 03:11:13 pm
Instead of using an array of strings, you're using small integers.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 02, 2015, 03:15:00 pm
That link I posted is a much better solution though. We don't need to figure out all of the permutations at once. We can do it one at a time and see if it matches. If it matches, just return the match. I suspect that most of the time, it will be complete within a few permutations. This should speed it up for everything.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 02, 2015, 03:42:25 pm
Quote from: Joshua Dickerson – Instead of using an array of strings, you're using small integers.
In my permute test file, thats how I've been testing.  Your fine through 8, but since the number grows exponentially, get to 9 will require ~225M of memory, at least on my old machine it did.  10 took almost 2G.  Thats with an input of array(1,2,3,4,5,6,7,8,9,10);

Quote from: Joshua Dickerson – That link I posted is a much better solution though. We don't need to figure out all of the permutations at once. We can do it one at a time and see if it matches. If it matches, just return the match. I suspect that most of the time, it will be complete within a few permutations. This should speed it up for everything.
Below say 5 (need to verify with "real" data), the permute overhead is measured in .00x of seconds, past that you do start to ramp up, again since its exponential.   Which link or do you mean the code you posted?  Either way, yes not having to figure all of the permutations up front will help in the cases where there are lots of params.  Actually thinking about that, the only permutes needed would be on the required params, if the others are optional, they can just be appended to the permuted regex of the required.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 03, 2015, 08:08:30 pm
http://docstore.mik.ua/orelly/webprog/pcook/ch04_26.htm

This is where generators would come in handy. We could just do the checks as we go along. Then, you can essentially have unlimited. I am going to work on that in Notepad really quick.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 04, 2015, 07:48:10 am
nods thats a very good idea, being able to use the intermediate results should be a help.   We are still dealing with massive sets though,  be interesting to see what happens to execution time when you get to a possible solution set of 36Billion, not that it should take that many to "find" a solution but still
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 04, 2015, 12:44:48 pm
Driving from FL to NJ so I won't have this until later. I still think 14 parameters is ridiculous and ordering them might be better. So ultimately, the proposal I'd make is using the permutation method on save to get the parameters and then ordering them for later.

I like what XF does with an AST. I think this is the future but that's another topic.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 05, 2015, 01:09:33 am
Be kind... I just got back from 15 hours of driving, I left my computer in FL, took someone else's by accident, and wrote this in Notepad with severely bad internet connection (storms I presume).

@Spuds, can you give this a shot? I think it will work. I think you'll understand how it should be placed in Subs.php. I just took the example from that book and it looks like it will fit perfectly. It does a check per permutation instead of getting them all before.

Code: [Select]
function pc_next_permutation($p, $size) {
    // slide down the array looking for where we're smaller than the next guy
    for ($i = $size - 1; $p[$i] >= $p[$i+1]; --$i) { }

    // if this doesn't occur, we've finished our permutations
    // the array is reversed: (1, 2, 3, 4) => (4, 3, 2, 1)
    if ($i == -1) { return false; }

    // slide down the array looking for a bigger number than what we found before
    for ($j = $size; $p[$j] <= $p[$i]; --$j) { }

    // swap them
    $tmp = $p[$i]; $p[$i] = $p[$j]; $p[$j] = $tmp;

    // now reverse the elements in between by swapping the ends
    for (++$i, $j = $size; $i < $j; ++$i, --$j) {
         $tmp = $p[$i]; $p[$i] = $p[$j]; $p[$j] = $tmp;
    }

    return $p;
}

$set = array();
foreach ($possible['parameters'] as $p => $info)
{
    $set[] = '(\s+' . $p . '=' . (empty($info['quoted']) ? '' : '"') . (isset($info['match']) ? $info['match'] : '(.+?)') . (empty($info['quoted']) ? '' : '"') . ')' . (empty($info['optional']) ? '' : '?');
}
$size = count($set) - 1;
$keys = range(0, $size);
$message_part = substr($message, $pos1 - 1);

do {
    $match_string = '~^';
    foreach ($keys as $key)
    {
        $match_string .= $set[$key];
    }
    $match_string .= '\]~i';
    $match = preg_match($match_string, $message_part), $matches) != 0;
} while (!$match && ($keys = pc_next_permutation($keys, $size)));
Title: Re: Number of BBCode params and permutation
Post by: Flavio93Zena on August 05, 2015, 01:30:34 am
I have no idea about what that means, and I have no idea how it is supposed to work either. What I do know is just one thing: you both scared and shocked me, you are amazing dude!
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 05, 2015, 06:17:27 am
thanks @Joshua Dickerson I'll take a look at this later today  :)
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 11, 2015, 09:33:43 am
Summary ...

The memory usage / time taken is only that used to compute the N!, it does nothing with the results for this timing and memory usage table. 

As expected the new function, since it computes a permutation at a time, has a generally constant memory footprint, but begins to ramp up in the time taken to compute the combinations since its returning the results on a per loop basis.

The old permute function although much faster for larger sets, is memory hungry as the set size increases.  The final set (10) timings and memory size are estimated as it ate all I had in my testbed (>1.8G when it failed).

So we can solve the memory issue with returning the intermediate values, however execution time will become an issue.    I think we could safely allow 8 params (40,320 combos)  in conjunction with the new function, but I should test the time for (potentially) 40320
preg_match calls as well.  Could also simply "clamp" the existing function to a maximum of 5 parameters.

The ultimate fix would be to order any passed parameters to what the function expects, that however may hold some unexpected consequences in the parser.  I also think its a bit of an example of a solution looking for a problem, as IMO that is simply to many parameters for a single bbc code.

# keysTime
(PC_NEXT)
Memory
(PC_NEXT)
Time
(PERMUTE)
Memory
(PERMUTE)
30166k0166k
40166k0174k
50166k0215k
6.015166k0498k
7.047166k.0112,740k
8.280166k.09222,685k
92.316166k.795231,605k
1023.17166k~72,000,000k
Title: Re: Number of BBCode params and permutation
Post by: emanuele on August 11, 2015, 11:47:38 am
Just thinking out loud, without any test or supporting data.
If I understood right, doing one permutation at a time means testing for example:
author
data
link
author + data
data + author
author + data + link
* author + link + data
etc.
one at each iteration.
But that means the first one could already match a tag
Code: [Select]
[quote author=emanuele data=12345678 link=topic=1] isn't it?
Unless we start from the latest one (the "longest") and come back to the shortest.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 11, 2015, 03:05:34 pm
The permutations will always return the same number of keys it was passed ...  so pass it 1,2,3 you will get
1,2,3
1,3,2
2,1,3
2,3,1
3,1,2
3,2,1

Not necessarily in that order but whatever the algo returns.
Title: Re: Number of BBCode params and permutation
Post by: emanuele on August 11, 2015, 04:08:12 pm
Told you I didn't look at the code. :P
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 11, 2015, 06:57:37 pm
LOL, I don't blame you.

Here is what the section in subs would look like if we move to the lexicographic permutation (above) which allows us to save memory.  This one I placed an iteration clamp on the function which will an exhaustive search of up to 7 parameters and a subset of more than that.  Could increase that but why?   The reality is the code was wrong for > 3 for what 10+ years?  I know that's not an excuse to ignore the error, its simply stated to frame the extent of the issue.

The benefits here are
- much lower memory requirements for the fringe cases that have > 3 parameters
- no real difference (memory or speed) for <= 3 parameters as compared to our current function
- actually works for >3 parameters, as I noted the current function has a simple error that makes it wrong for > 3
- even if 20 parameters are passed, it will still return some permutations, based on the max_iterations variable, I chose the number to allow an exhaustive search of 7 parameters which should be enough for 99.99 of the cases.

Code: [Select]
				// Okay, this may look ugly and it is, but it's not going to happen much and it is the best way
// of allowing any order of parameters but still parsing them right.
$param_size = count($preg) - 1;
$preg_keys = range(0, $param_size);
$message_stub = substr($message, $pos1 - 1);

// If an addon adds many parameters we can exceed max_execution time, lets prevent that
// 5040 = 7, 40,320 = 8, (N!) etc
$max_iterations = 5040;

// Step, one by one, through all possible permutations of the parameters until we have a match
do {
$match_preg = '~^';
foreach ($preg_keys as $key)
$match_preg .= $preg[$key];
$match_preg .= '\]~i';

// Check if this combination of parameters matches the user input
$match = preg_match($match_preg, $message_stub, $matches) !== 0;
} while (!$match && --$max_iterations && ($preg_keys = pc_next_permutation($preg_keys, $param_size)));
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 11, 2015, 07:26:51 pm
How does his improved function (http://www.simplemachines.org/community/index.php?topic=538611.msg3828139#msg3828139) work?

I wonder what exactly is taking so long in the pc_next_permutation function. I like your solution to max it.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 11, 2015, 07:45:54 pm
No idea on that function, but feel free  :D

TBH I'm not sure how would we verify 10 years of undocumented bbc cases plus whatever addons may have done.  I know some of pregs in the parameters are not really stand alone, they need the other parameters to limit greediness etc.

The other thing I keep getting stuck on is that for 99.99 of the cases what was in place simply worked*.  I'm comfortable changing from a recursive permute to the lexicographic version since its really the same functionality, just a memory "smart" version.  This leaves the overall parse function unaltered.

I'd imagine the slower is simply the do/while plus comparison overhead.  There is some of the same overhead in the old function, its just *after
the permutations are computed.  So those numbers are a bit unfair but so is life :P
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 11, 2015, 11:09:16 pm
Oh, wait, the comparison includes the preg_match in the pc_next_permutation test and not in the other?

Ugh, I really need to get my damn computer back. This is killing me. I know that the proposed permute function works but now I want to test it against dougiefresh's. If his works and does so with much less resources and a limitless number of parameters, seems like his would be the way to go.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 11, 2015, 11:16:18 pm
I really want to know, looking at a large number of posts with BBC, how often would the preg condition not match within X tries. Where X is whatever would add any measurable amount of time. Would be nice to have a couple of forums to be able to do some data analysis on.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 12, 2015, 08:40:25 am
I only did a simple test of performance on that, using quote blocks. 

Had a message with 10 quotes, some normal, some nested, with a couple of them using all 3 parameters out of order.  Ran 3 loops of 100.

Old permute function: Avg 3.73421 seconds in parse bbc and .19034 seconds in pre parse
New permute function: Avg 3.7125 seconds in parse bbc and  .18801 seconds in pre parse

Net: no difference, should have checked memory but for 3 parameters they should be about equal per previous postings based on 3 parameters.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on August 13, 2015, 07:24:53 pm
I finally am able to test this out.

I am getting issues with the first quote code. Since there is only 1 param, the size results in 0.

At
for ($i = $size - 1; $p[$i] >= $p[$i + 1]; --$i);
it makes $i = -1. Then it loops through $p and stops when the value of $p is >= the last one. Problem is, there is no $p[-1]. So, it is an endless loop with 2 warnings.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on August 13, 2015, 09:14:33 pm
I think I committed / tested that as
Code: [Select]
for ($i = $size - 1; isset($p[$i]) && $p[$i] >= $p[$i + 1]; --$i)
 
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 07, 2017, 03:13:08 pm
https://github.com/SimpleMachines/SMF2.1/commit/f62a164fa767948d5e73e56619bf70dea88a5fc3
Title: Re: Number of BBCode params and permutation
Post by: Spuds on January 07, 2017, 05:32:06 pm
Personally I like what we have, and its likely faster anyway.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 10, 2017, 01:02:05 am
The only thing I think might make sense to pull is the sorting of the parameters so there is no need to put the regular expression in a loop.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on January 25, 2017, 03:00:31 pm
Finally got to looking at this after getting nudged again, and using a test case that fails :P I think there are a couple of issues here, at least in 1.1, the 1.0 parser is what it is at this point.

First by the time we search for parameters, all \n's have been replaced by <br>. The issue is that without any \n's in the string the regex's tend to act as though the ~~s modifier is enabled.  This may cause the regex to capture past its parameter group.  We can fix this behavior by swapping out the  <br> with \n in the $message_stub variable but there is another "issue".

Second $message_stub starts at the current parameter marker and runs to the end of the message.  Doing the \n would fix most cases of regex greediness, but you can craft a message to cause problems.  I'd suggest clipping the $message_stub at the closing $possible[Codes::ATTR_TAG] value instead of doing any \n swap.  You can still craft a message to make the parser fail but its narrow.

Third the IMG code is defined with all optional parameters.  So if you supply an IMG  tag with all the available parameters, only (1) has to be in the right spot and the regex match is made, and very likely incomplete.  We need to define the code first as all required, which will properly parse a full loaded tag, and then have the current optional one, which will properly match a N-1 parameter list.  I added a very basic check to skip full parameter searching when the supplied tag does not contain the necessary DNA to warrant the search.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on January 26, 2017, 07:09:15 pm
After some more staring at my navel, Omphaloskepsis if you must know, the general issue is the use of the "optional" value on BBC parameters (as outlined above) ... The current way the search regex is built, it always adds the ? parameter to the regex which is not the right thing to do.

For example you use "width= alt= height=" in your img tag, all of those are optional ... if you build the regex as (width=)?(height=)?(alt=)? it will match, but only return width, alt and height are missed.   My first thought was to fix it by adding in a non optional code definition.  This does work, but it get progressively worse the more optional tags you create ( ie >3 )

My current plan is two fold. 

First is still trim down $message_stub so it just contains the tag in question (open to close) ... this helps prevent any overrun with nested tags etc, makes the regex nicer.  Currently its from the open tag to end of message and that can, in some cases, cause issues.

Second is to set the optional (?) modifier only on params that are optional AND not being used.  I added a simple (which can be made more robust if we find its needed) to check what parameters are being used on a current tag.  Then we add the optional modifier based on that condition.  So in the above example the user used the width= height= alt= parameters on a given tag.  Even though those are all optional, since they were used, they become required in the regex match.  The rest of the permutation stuff is the same and now runs to catch the "any" order.

https://github.com/elkarte/Elkarte/compare/development...Spuds:parseUpdates?expand=1
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 26, 2017, 10:04:19 pm
I never did like matching to the end of the message. I think adding another regular expression in there to check for that will add overhead though. I think I tried it out and that's what it did.

The preparser could add flags in there from the start of a possible tag to the end of a possible tag. Then we could either use explode() to create an array (I think that's actually slower) or use strpos() to find the next value (should be faster). All it would do is search for [a-Z to find the start and then find the next ] that's not in a quoted string. If no closer is found, it's not possible to be a tag and the opener flag should not be added. Obviously, that would be an upgrading headache.

I don't quite understand the whole thing with the optional stuff. I will have to read it again.

I don't know the minimum version of PHP, but the main loop is a perfect place to use a generator.

I wonder if combining the regular expressions instead of putting them in a loop will be faster. http://nikic.github.io/2014/02/18/Fast-request-routing-using-regular-expressions.html explains what I'm talking about.

In the end, if your solutions fixes the bug, I think that's the way to go. Then we can worry about optimizing it. Really, we just need to create an AST and not have to worry about performance anymore.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 26, 2017, 10:06:14 pm
Have you tried ordering the parameters? Just wondering if that had any positive effect.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 26, 2017, 10:41:14 pm
I wonder if we couldn't also apply the flagging technique to parameters. Consider this:


I think I'm missing something huge here, but this essentially removes the need to use any regular expressions to find the parameters at read-time.

This would be the start of an AST which is a plus IMO.

PS: https://github.com/Spuds/Elkarte/blob/1be23253fa5c6a1d8e7cc16336b04e6579d336d3/sources/subs/BBC/Codes.php#L267 :( sorry
Title: Re: Number of BBCode params and permutation
Post by: Trekkie101 on January 27, 2017, 03:52:49 am
I'm jumping in without a terrible amount of knowledge but as a thought.

BBCode is a massive task, that many don't seem to appreciate, I've heard it said that the base of SMF was one of the most feature rich. I'm wondering if BBCode as a parser shouldn't become its own library - I haven't asked but i'd imagine SMF/IP/Xen/vB/phpBB/MyBB all have the exact same headache.

Is it an avenue worth exploring?
Title: Re: Number of BBCode params and permutation
Post by: emanuele on January 27, 2017, 07:51:27 am
The 1.1 is already a sort of library.
Title: Re: Number of BBCode params and permutation
Post by: Spuds on January 27, 2017, 10:34:02 am
Quote from: Joshua Dickerson – Have you tried ordering the parameters? Just wondering if that had any positive effect.
I have not done this, was considering it for sorting the required then optional, it may have some positive effect.  Right now the initial check of parameters is how the editor inserts the tag,  so the first check is generally the correct one.

Quote from: Joshua Dickerson – I never did like matching to the end of the message. I think adding another regular expression in there to check for that will add overhead though. I think I tried it out and that's what it did.
I don't disagree and I tried not to using a preg_match here, did one version using an explode, but in various "trip up the parser" tests the preg_match seemed to work better.  The preg_match in place is done as a lazy one so it generally does as little work/processing as possible. 

On a positive note, feeding a shorter string to the follow on param parser allows them to perform in less steps with less recursion, so perhaps the performance balances out. One extra preg here results in an easier to parse string for the next step.

QuoteI don't quite understand the whole thing with the optional stuff. I will have to read it again.
I'll try by example.  Take the IMG tag, all of its parameters are defined as optional, so the regex that gets build would be something like
Code: [Select]
^(\s+width=(\d+))?(\s+height=(\d+))?(\s+alt=(.+?))?\]
Now pass that regex
Code: [Select]
 alt=something for alt width=100 height=100]http://www.somesite.tld.someimage.png[/img]
The parser would say it has found a match and stop looking, but it would match just the alt tag as "something for alt width=100 height=100" due to all those '?' alterations on the capture groups.

The change I made was to not add the ? alteration in our regex_cache array and instead build a simple param_check.  It gets used in a simple stripos search on the current tag being parsed.  Based on that check it appends, where needed, the ? alteration to regex_cache that will be used in the order search.   So in the above example, all of the ? alterations would be removed, the match would not occur, and the next combination would be tried.

Quote from: Joshua Dickerson – The preparser could add flags in there from the start of a possible tag to the end of a possible tag. Then we could either use explode() to create an array (I think that's actually slower) or use strpos() to find the next value (should be faster). All it would do is search for [a-Z to find the start and then find the next ] that's not in a quoted string. If no closer is found, it's not possible to be a tag and the opener flag should not be added. Obviously, that would be an upgrading headache.
If we could add proper start and end markers to the tags we would be in biscuit city!
Title: Re: Number of BBCode params and permutation
Post by: Trekkie101 on January 27, 2017, 10:40:26 am
Quote from: emanuele – The 1.1 is already a sort of library.

Is it worth being its own 'independent library'? Like would it be useful for multiple systems to improve one, or are implementations far too different?
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 27, 2017, 07:28:09 pm
Quote from: Trekkie101 –
Quote from: emanuele – The 1.1 is already a sort of library.

Is it worth being its own 'independent library'? Like would it be useful for multiple systems to improve one, or are implementations far too different?
I wrote it with the idea that it would be useful for others. If someone wants to break it out and make it its own library, it free and open source. I don't even know what license it is in. I'm not entirely sure if I can relicense it because it's a derivative of SMF and thus Elkarte, but if I can and someone needs me to, I'll sign off on whatever necessary. Emanuele and Spuds are probably in the same boat; if it makes our lives easier to maintain, let's do it!
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 27, 2017, 08:41:55 pm
Quote from: Spuds – If we could add proper start and end markers to the tags we would be in biscuit city!
The rules are really simple for what the tags must be: 

That gets us the start of the tags. Easy, but not good. We already do that in the parser and that process is a really small one.

The next part is to find the end. We start by having a beginning. That's easy. We could probably put this in a single regular expression but not necessary. The rules are:

Closing tags:

The message would look like 

Code: [Select]
Hello \r[b]\rworld\r[/b]\r,<br>I'm Josh.

We don't need to know if it is a tag at the time of 
preparsing. In fact, we don't want to know. We just want to be able to say that it might be a tag. We add a flag like \r or \a or any character that we removed before. If I hear one person complain the message size increase, I'm going to choke them. After we check if the first part is a tag, we can then do a substr() to the next occurrence of that flag.

For parameters, I would use a different flag. As an example \a. They are a little bit trickier. Generally, they are pretty close to tag names. As an example, that would look like:
Code: [Select]
Hello \r[img \aalt=&quot;Hello World&quot\a]\rworld.png\r[/img]\r

Not sure if the flags should be on the inside of the [] and if there should be another flag around the tags and parameters themselves. As an example:
Code: [Select]
Hello [\rimg\f \aalt\a=&quot;\nHello World\n&quot;\r]\aworld.png\t[/img]\t
This removes the need for any regular expressions.


Ugh, now I really don't even want to use flags. I just want to do this the right way and make it into an array. That's the next step though and I don't want to go down a rabbit hole.
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 27, 2017, 08:42:11 pm
I spent a few minutes writing out an example parser:

Code: [Select]
<?php

$msg = "Hello [\rimg\f \aalt\z=&quot;\nHello World\n&quot; \atitle\z=\nHi\n thisisempty\n\n\r]\qworld.png\t[/img]\t";

$nextChar = strpos($msg, "\r");
if ($nextChar === false) {
    cleanString($msg);
}

$newMsg = '';

// I don't think we need a do/while, actually. A while should work.
do {
    $tagEndPosition = strpos($msg, "\f", $nextChar);
    $tag = substr($msg, $nextChar, $tagEndPosition - $nextChar);
    if (isset($bbc[$tag])) {
        $checkCodes = $bbc[$tag];
       
        // Find the next \r
        $paramStringEndPos = strpos($msg, "\r", $tagEndPosition);
       
        // If the next \r is not 0 or 1 difference:
        // Find the next \a
        $paramEndPos = $tagEndPosition;
        $params = [];
        while(false !== ($paramPos = strpos($msg, "\a", $paramEndPos))) {
            // Find the param
            $paramEndPos = strpos($msg, "\z", $paramPos);
            $param = substr($msg, $paramPos, $paramEndPos - $paramPos);
           
            // Get the value (empty values use \n\n to be easy)
            $valuePos = strpos($msg, "\n", $paramEndPos);
            $valueEndPos = strpos($msg, "\n", $valuePos);
            $value = substr($msg, $valuePos, $valueEndPos - $valuePos);
           
            // Comma delimited strings are a bit different, but their key would just be 1, 2, 3, etc.
            $params[$param] = $value;
        }
       
        ksort($params);
       
        $foundCode = false;
        foreach ($checkCodes as $code) {
            if (!checkRequiredParameters($code, $params)) {
                continue;
            }
           
            $optionalParameters = array_diff_key($code->getRequiredParameters(), $params);
           
            if (!checkOptionalParameters($code, $optionalParameters)) {
                continue;
            }
           
            // Cool... now we have a tag and parameters.
           
            // Next step is to get any content it has. You can guess how that will go.
            // Finally, we parse the code.
            parseCode($code, $params, $content);
        }
       
        $nextChar = $foundCode ? $closingTagClosingPos : $valueEndPos;
    } else {
        cleanString($msg);
    }
} while ($nextChar = strpos($msg, "\r"));


function checkRequiredParameters($code, array $params)
{
    foreach ($code->getRequiredParameters() as $requiredParameter) {
        if (!isset($params[$requiredParameter])) {
            return false;
        }
    }
   
    return true;
}

function checkOptionalParameters($code, array $params)
{
    return array_intersect($params, $code->getOptionalParameters()) !== array();
}

// Clean any remaining special characters
function cleanString($msg)
{
    return str_replace(["\r", "\f", "\a", "\n"], '', $msg);
}
Title: Re: Number of BBCode params and permutation
Post by: Joshua Dickerson on January 27, 2017, 08:42:56 pm
There is no need to do the ksort actually. All of that would happen in the preparser.