ElkArte Community

General => Chit Chat => Topic started by: Joshua Dickerson on August 14, 2015, 05:56:08 pm

Title: BBC Parsing
Post by: Joshua Dickerson on August 14, 2015, 05:56:08 pm
Working on a new parser but not working entirely from the start. Just taking what is there now and "iteratively" changing it.

https://gist.github.com/joshuaadickerson/c528aae77d1cf0de029d

The goal is to make the code nicer with more ability to change things. The biggest thing is taking the code generation out of the parser. Also, taking it out of Subs.php. As a side effect, hopefully it is also faster and less resource intensive.

Everything works except itemcodes. They really are complex so I'm still trying to figure out where I broke them.

I would appreciate coming up with more messages. Especially ones that might break it.

Tester.php is the entrance to it but it really doesn't do much yet. I am going to make it much nicer shortly.
Title: Re: BBC Parsing
Post by: Spuds on August 14, 2015, 06:54:12 pm
Sounds awesome Josh  :D 

I've looked at, and implemented, two different "3rd" party parsers with good results.  Good in that it seemed to parse whatever I threw at it.  The downside was that there were both to slow as compared to the current function (which got some micro opto love to squeeze a bit more out of if). 

I think they just ended up with two many object or whatever floating around, the biggest disappointment  was they are also much slower for a post with no BBC, so plain text posts, lagged behind by a double digit percentage.

Anyway IMO its fine for a replacement to be a bit slower than what we have now if it gives us more readability, maintainability, extensibility, and any other ibilty's that I forgot :D

I have a few test cases that I can toss at it when you get a bit further along, looking forward to it !!!
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 14, 2015, 07:05:54 pm
Post (or gist) them and I'll add them. Besides itemcodes, which I know are totally screwed right now, I want to see what it takes to break it. I think the issue with itemcodes is outside of the itemcodes function itself. So, if I can find a message that breaks it, I might be able to fix itemcodes lol.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 15, 2015, 12:19:53 am
I hate Bootstrap so much right now. I just want code to look nice with overflow and highlighting. Once I get that part figured out, I will have a nice GUI for this thing. Right now, it looks okay, but is not that nice.

Oh, and as for benchmarks - old is killing new. Trying to figure that out now.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 15, 2015, 12:32:43 am
Just throwing something that might be dumb, but wouldn't it be possible to render those codes on client side instead of exclusively server-side? I know that php cannot be executed on a regular browser and needs to run on a server but what if there was a way to have the browser helping somehow?
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 15, 2015, 01:21:48 am
SMF has always worked the same with or without JS on (save a couple of things in admin). So, I guess Elkarte would be in the same boat.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 15, 2015, 02:18:19 pm
All tests are passing. Which is pretty awesome. Even added a bunch more messages to test.

Now, to figure out what is making it take nearly double the time.

Tester.php is no longer the entrance. It is now the index.php file.

It could certainly stand to look better but I am not in the mood to wrestle with Bootstrap's handling of
 and <code> anymore.
Title: Re: BBC Parsing
Post by: Spuds on August 15, 2015, 02:50:01 pm
Sounds like great progress, hope you can determine whats causing the speed issue.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 15, 2015, 03:40:42 pm
I set it up to show me what has the largest percentage time difference. Consistently, this string is the top of the list. Thought it was ironic.

Code: [Select]
[quote="Edsger Dijkstra"]If debugging is the process of removing software bugs, then programming must be the process of putting them in[/quote]
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 15, 2015, 05:03:43 pm
Code: [Select]
		foreach ($this->bbc as $k => $v)
{
if ($tag !== $v['tag'])
{
unset($this->bbc[$k]);
}
}

is 50% faster than

Code: [Select]
		$this->bbc = array_filter($this->bbc, function ($ele) use ($tag) {
return $ele['tag'] !== $tag;
});
On PHP 5.3.
Title: Re: BBC Parsing
Post by: Spuds on August 17, 2015, 06:59:43 am
Been poking through and playing with the code, thats a very nice refactor so far :D awesome job Josh!

In terms of speed, it looks like the biggest time delta is in the recursion call of handleParsedEquals, not sure how to improve that TBH.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 12:36:48 pm
Oh, I have made a ton of changes so far. I will upload them shortly. I am also, at the same time, working on a complete rewrite using preg_split(). I could probably use preg_match() but I want to keep it fairly simple.
Title: Re: BBC Parsing
Post by: Spuds on August 17, 2015, 03:38:42 pm
Cool ... look forward to what you have done. 

I may take a couple of the simple optimizations (substr_xxx) and stuff them back in 1.0.5 as well.   That way its as fast as possible and then you really have to work to beat it :P (well in terms of speed), the new structure is already leaps better !
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 05:54:48 pm
Yeah, working on the optimizations, I could just throw them back in to it. I could definitely this without using objects but I really think that is the way to write good code. I had to do some stuff so I got side tracked and I also broke it last night before I went to sleep so I didn't upload them yet. Just got my computer back, so I should (in theory) be able to work much faster now. @Flavio93Zena let me use his test site with his online control panel so I could upload things to it. Awesome but I have to upload the changes any time I want to test something. I am also working with 5.3 which is 6 years old. I want to test 5.4 and up. I imagine HHVM and PHP 7 are much faster, especially dealing with objects.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 06:27:45 pm
With the preg_split() parser, we could store the messages in their split form. Right now it is splitting by (\[/?$tag) and {\]) ($tag is the tag, not the regular expression). I guess if it splits by (\[/?.*)[\s\=\]]) (wrote that in this window so I don't know if it even makes sense) then it makes it so that any word or itemcode going forward can be added or removed.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 06:40:24 pm
@Spuds I updated it.

If anyone is looking at it, index.php is the entrance for testing. The controls window is screwed up but it isn't a concern for me because I know what the inputs are.

There are currently two types of tests: bench and test (index.php?type=bench or index.php?type=test).

You can select individual messages to test by clicking the checkbox next to the message and clicking the submit button at the bottom of the page. If you don't select any messages and hit submit, it will do all of the messages.

There are no checkboxes for the benchmark test but you can put the message(s) you want in the address bar and it will do only those messages (for instance ?type=bench&iterations=1000&msg=20). It will always run "code" and "all" tests because selecting messages was an after-thought. As you can see from the example, iterations=# is the way to input the number of iterations it will run.

In index.php there is a constant called SAVE_TOP_RESULTS. Right now this is setup to save the top 5 percentage of difference between the A and B test to a CSV file.  You can view the results of that from TopResults.php. This helps me determine what is the biggest slow ups. I could change this to the top 5 in time but if message 20 takes 5 seconds on A and 10 seconds on B that's a difference of 5 seconds which is a 100% increase in time but if message 10 takes 2.5 seconds on A and 7.5 seconds on B that's a difference of 5 seconds which is a 200% increase and tells me there's a bigger difference in the code. At least, that's my logic. Going to also try to look at times as a difference to try to pick any low hanging fruit.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 07:12:46 pm
Okay, cloning the parser isn't cheap but it certainly doesn't make up much of the difference in time. The only test that even calls that function is (currently) test 36
Code: [Select]
[quote=Edsger Dijkstra]If debugging is the process of removing software bugs, then programming must be the process of putting them in[/quote]
It only calls it once. On my last run of 1000 it only accounted for 0.12114 seconds difference out of 27.86 seconds for the entire test. 34, 40, 42, 31, and 39 make up the top 5 in terms of causing the most extra time. I think the pattern is nested tags are taking longer. I am going to add more messages to test that theory (that's usually how more messages have been added).

Still need to add more messages though. I know there are a bunch of issues from SMF and Elk with BBC that aren't being captured in these messages. There's no UTF-8 characters. I'm missing a lot of nesting scenarios.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 08:07:39 pm
One thing I really don't like about the benchmark is that it can't actually tell me the memory loads because it is using the same process. I am thinking about getting rid of that altogether.

I am going to post this here because I don't have a real git repo setup (not on my computer yet) so I can't branch this. This is an in-situ example of parser. I don't see any difference between it and the previous version. In theory it should be using less RAM but I honestly don't know. Especially with PHP. I know in other language it wouldn't be making a copy and would use the reference.

Then again, there are a LOT of functions that create copies of the message. I don't think PHP has any in-situ string methods. I'm actually thinking about making a patch to add some. Would give me time working with C - something I haven't done in over 10 years. What we should do is make any internal functions Elkarte uses that makes changes to a string use references. I don't think it will save much time, but on long messages I think it will save memory.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 10:31:18 pm
Some related things from the SMF bug tracker:
http://dev.simplemachines.org/mantis/view.php?id=4359
http://dev.simplemachines.org/mantis/view.php?id=1655
http://dev.simplemachines.org/mantis/view.php?id=2368
http://dev.simplemachines.org/mantis/view.php?id=3984
http://dev.simplemachines.org/mantis/view.php?id=3783
http://dev.simplemachines.org/mantis/view.php?id=3474
http://dev.simplemachines.org/mantis/view.php?id=2237
http://dev.simplemachines.org/mantis/view.php?id=3295 (I removed FTP from Parser.php but still)

Features
http://dev.simplemachines.org/mantis/view.php?id=3147
http://dev.simplemachines.org/mantis/view.php?id=3436
http://dev.simplemachines.org/mantis/view.php?id=1000 (might be better to handle with a new Codes? Would have to get more info)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 17, 2015, 11:36:52 pm
Finally figured out what the recursive parsing is for and added a new message for it. You are right, @Spuds, that takes a lot of time. Still not enough to account for the massive difference. Especially since it's such a rare thing to have bbc in the author tag of a quote.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 18, 2015, 01:58:24 am
Feel free to upgrade to 5.4 in the admin panel if you want to. It's the maximum they currently have :/ But AFAIK there's quite a big difference between the two.
Title: Re: BBC Parsing
Post by: emanuele on August 18, 2015, 02:22:51 am
Quote from: Joshua Dickerson – I am also working with 5.3 which is 6 years old. I want to test 5.4 and up. I imagine HHVM and PHP 7 are much faster, especially dealing with objects.
If the goal is Elk 1.1, then php 5.3 is the reference...
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 18, 2015, 02:47:46 am
Hmm, could I argue placing it 5.4 but keeping 5.3 compatible? Joshua still has a point there, 5.3 is old.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 18, 2015, 03:18:52 am
I don't know if I broke something or if 5.4 really is that much better but I saw it go from nearly twice as slow to almost the same speed using 5.4.

PHP 5.3
Code: [Select]
Messages: 73
Iterations: 500
Total Time In Tests: 14.77
Total Old Time: 5.72
Total New Time: 9.05
Diff Total Time: 3.33
Diff Total Time %: 8.42

PHP 5.4
Code: [Select]
Messages: 73
Iterations: 500
Total Time In Tests: 10.39
Total Old Time: 5.57
Total New Time: 4.81
Diff Total Time: 0.76
Diff Total Time %: 4.71
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 18, 2015, 03:25:04 am
That makes me feel a hell of a lot better. I was about to say screw using objects. An 84% increase in time isn't going to work. If you are still using 5.3, you deserve to have it take a little more time. Maybe it will push people to upgrade to >= 5.4. Especially since 5.3 EOL'd a year ago (last week).
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 18, 2015, 03:35:44 am
Wait!!! I was so used to old always beating new, I completely missed that new is faster now! :D :D :D I don't know why or how but this has made me so happy.

Consistently too! Just needed to use PHP 5.4 I guess.

Code: [Select]
Messages: 73
Iterations: 1000
Total Time In Tests: 19.85
Total Old Time: 10.6
Total New Time: 9.25
Diff Total Time: 1.35
Diff Total Time %: 9.73

Code: [Select]
Messages: 73
Iterations: 1000
Total Time In Tests: 20.27
Total Old Time: 10.78
Total New Time: 9.49
Diff Total Time: 1.28
Diff Total Time %: 9.9

Code: [Select]
Messages: 73
Iterations: 1000
Total Time In Tests: 20.22
Total Old Time: 10.74
Total New Time: 9.48
Diff Total Time: 1.26
Diff Total Time %: 9.86

There are still some changes that I can make and it needs more messages to test, but right now my code looks close to production for at least a mod. EDIT: forgot that it doesn't test for or handle disabled tags properly. Need to do that. before anything.

Now I have to edit the tests so that it shows me the difference between old and new and not just max/min.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 18, 2015, 04:51:29 am
No idea how you are doing it, but I feel you are doing an amazing job, and I am glad my test site is being useful for such an important task!
Title: Re: BBC Parsing
Post by: Spuds on August 18, 2015, 07:33:23 am
Just downloaded the latest to play around with :D  Josh you are some night owl, I crap out by 9:30-10, but I do get up at 5:15ish.

I added a few tests to the messages array, nothing fancy but just a few more (and added the needed funcs in the helpers file).  Running on 5.4 (on my windows machine, should try it on my vps as well, but thats another day) I'm getting

Messages: 100
Iterations: 100
Total Time In Tests: 18.4
Total Old Time: 8.48
Total New Time: 9.91
Diff Total Time: 1.43
Diff Total Time %: -16.84

Not as fast on my system but doing pretty damn well, may be due to the extra tests I placed in the file. 

Also I  changed the way you were calculating the %'s, you can't do the max( , ) / min( , ) for the percents, you need to maintain one as the baseline and then delta and percent off of that.   So I used the old / current value as a baseline ... its the  ((known - experimental) / known) calculation that you want, with a div by zero check. 

You can see the problem in your examples ... Total Old Time: 10.6, Total New Time: 9.25, Diff Total Time: 1.35, Diff Total Time %: 9.73  ... The % difference is either 12.7% or -14.6% depending on which you used as the baseline, no idea what 9.73 is.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 18, 2015, 07:37:34 am
Don't bother, I am just moral support :D You can do it guys!
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 18, 2015, 12:56:46 pm
Yeah, I realized it was screwed up as soon as I wrote it but it served its purpose in helping me.

Can you upload your changes? I want to work them in.
Title: Re: BBC Parsing
Post by: Spuds on August 18, 2015, 04:22:20 pm
@Joshua Dickerson here they are ... Not many changes at all. 

I added some additional test cases to message.php so you will see those.  One of them required me to add back a bbc code so that unparsed_commas could be tested, TBH I'm not sure that functionality is needed at all anymore, did anything other than glow use it? I guess its harmless anyway.

Changed the percent diff calc, to be based on the "old" code cheated on the divisor = 0 since I'm lazy.  A -% means the old is faster, a +% means the new is faster.

Allowed it to output the HTML to the existing tag output (in TestOuput) just so I could see what that formatted output.

Added those micro optimizations to ParseBBC.php just to make your goal harder :P

Added additional functions to helpers as needed by the new test cases.
Title: Re: BBC Parsing
Post by: radu81 on August 18, 2015, 04:49:54 pm
I don't know what you guys are doing but I'm glad that you are giving so much attention to speed.
If can be useful I can provide my login details to my linode vps where you can switch from php 5.4, 5.5 or 5.6 with just some clicks. I'm in holiday right now but I'll be back in a few days. Just let me know if could be useful for your tests
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 18, 2015, 05:06:03 pm
Cool. Taking a look at it now. Got a screw in my tire so I am going to run to get that fixed in a minute. I think I'm going to add the glow tag back in by using a hook. It would then be able to test that adding tags by hooks works. I was saddened that you didn't have as much fun as I did with the messages. It's a glimpse in to my mind as it goes throughout the day hehe
Title: Re: BBC Parsing
Post by: Spuds on August 18, 2015, 06:59:38 pm
QuoteI was saddened that you didn't have as much fun as I did with the messages
LOL yeah they were pretty bland, sorry about that :D .. but I did enjoy yours quite a lot !  Now if I had done them after my evening homebrew it would have been a different story  O:-)

Adding glow as a hook makes sense, I think thats the only type that is not tested but I need to do a second look.

QuoteI don't know what you guys are doing but I'm glad that you are giving so much attention to speed.
If can be useful I can provide my login details to my linode vps where you can switch from php 5.4, 5.5 or 5.6 with just some clicks. I'm in holiday right now but I'll be back in a few days. Just let me know if could be useful for your tests
Thank you.  If Josh needs another test area I'm sure he will be in contact.  And yes this function is one that we attempt to monitor very closely when we make changes.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 19, 2015, 09:26:30 pm
@Spuds, I updated the gist with your changes. You changed the constant for the tests - parse_bbc(). Going to rename it SpudsParseBBC.php and spuds_parse_bbc() and add different test options:

new parser vs parse_bbc() test
new parser vs parse_bbc() benchmark
Spuds parse_bbc vs parse_bbc() test
Spuds parse_bbc vs parse_bbc() benchmark
new parser vs Spuds parse_bbc() test
new parser vs Spuds parse_bbc() benchmark

And probably change the way I get the tests so when I add the regex parser it isn't more work.

Who is Nicolas Bonet? (I feel like this is going to become like "who is John Galt" but a joke)
Title: Re: BBC Parsing
Post by: Spuds on August 20, 2015, 07:06:30 am
Looks like a couple of minor things were missed ... a repo would help  O:-)

I had changed time_diff_perc => time_diff_percent (sorry) so that has to be updated in BenchOutput, also looks like the new % calc was missed there.  Also I had two more minor updates to ParseBBC ... so those two files are attached. 

In the previously attached BBCHelpers there were a couple of stub functions that are needed for quotes and footnotes.  I did not reattach that file since its the same as the previous one, but the gist needs to be updated with those.

I like the idea of adding in the additional comparisons, be interesting to see if back porting some of the tweaks made a difference or not.  I still need to check our parse "type" coverage to make sure we have test cases for each.

QuoteWho is Nicolas Bonet?
Why do you ask?  What did I miss?



Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 20, 2015, 03:37:29 pm
One of the messages contains an autolink to his FB profile.

My monitor keeps shutting down and Vagrant/VirtualBox are having problems with Win 10 but I initialized the repo in PhpStorm. So, at least I'm tracking changes now.
Title: Re: BBC Parsing
Post by: Spuds on August 20, 2015, 08:42:50 pm
QuoteOne of the messages contains an autolink to his FB profile.
Taken from Mantis  O:-) Be sure to like them :P
Title: Re: BBC Parsing
Post by: emanuele on August 21, 2015, 02:07:28 am
NiBo... :P
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 21, 2015, 03:26:57 am
Maybe I'll shamelessly add my LinkedIn profile for advertising lol
Title: Re: BBC Parsing
Post by: Spuds on August 21, 2015, 06:28:25 am
LOL
Title: Re: BBC Parsing
Post by: emanuele on August 21, 2015, 08:19:50 am
ROFL!

Why not a paypal account? :P
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 21, 2015, 04:19:24 pm
Ooo, going to test this as well: http://www.simplemachines.org/community/index.php?topic=539070.0
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 22, 2015, 04:41:27 am
Finally got it working with all of your messages. It added a ton of time to the tests but I finally wrangled it down. On 500 iterations, I am getting a difference of 0.61 seconds or 5.44%. That's completely negligible if we're parsing 500 crazy messages like these are.

I have to go away for the weekend so I am going to post this here. I have to leave in a few minutes so I don't have time to setup the Github repo.

Going with the iterative idea of this, before I switch all of my attention to the regular expression version, I want to change the codes array to use [1st char] => array([tag] => array(code, code, code...), tag => array(code) ) ). Also, instead of passing the disabled array to the validate/test functions, I want to pass the BBC object.

There a bunch of things I started working on so I don't forget them in the zip. I don't know if I applied all of your changes, but I purposefully skipped some of them.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 25, 2015, 08:46:22 pm
https://github.com/joshuaadickerson/BBC-Parser
Title: Re: BBC Parsing
Post by: Spuds on August 26, 2015, 10:01:01 am
Yea ... a repo !
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 27, 2015, 02:29:00 pm
I did some initial testing on the RegexParser and preg_split() is taking WAY too much time. RegexParser is taking 5x longer and I'm not even doing a fraction of the amount of things the original one does since it's just a proof of concept.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 27, 2015, 02:37:45 pm
Definitely a no-go, yeah
Title: Re: BBC Parsing
Post by: Spuds on August 27, 2015, 02:46:18 pm
I think thats how the other parsers I played with worked.  The benefit was it was extremely flexible and robust, the downside was the speed was subpar compared to what we have.

I'm still impressed with what you have achieved with the re-write, its only 5% slower then the procedural version and of course its significantly better in terms of structure and looks far less frail to updates..
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 27, 2015, 03:15:52 pm
I'm not happy. I was really hoping the preg_split() parser would be faster. I might try some other regular expressions to see if they work better but I doubt it. I think just creating the regex makes it much slower. A regular expression parser makes the parser so much prettier. One way would be to cache the tokenized version. Then again, why wouldn't you just cache the parsed version at that point?

Which... I think I'll make possible. I'll add an attribute to the BBC "ATTR_NO_CACHE" which, if the message contains a BBC with this attribute, it cannot be cached. The only time this would be set is if you create a BBC that depends on the user's permissions. Then the pre-parser would parse the message and cache it. The parser would have a method Parser::canCache() which would do the same thing.

Still going to do some more things. Going to make the smiley parser, HTML parser, and autolink parser separate classes so you can throw in your own. To reduce the complexity, if they aren't in the constructor it will just create them from default. Going to test and see how making each code an object works. Thinking I'll wind up sticking with an array though.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 27, 2015, 05:57:48 pm
A lot of the parser deals with issues that the preparser should pick up. I wonder how much and I wonder if removing that will make it any faster. If posts become mangled, we can always run the preparser on them again to fix it. It can be made part of the upgrade or a separate admin action.
Title: Re: BBC Parsing
Post by: Spuds on August 27, 2015, 06:23:06 pm
What things in the parser do you feel should be part of pre-parse?  Its funny how much that function has grown over time.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 27, 2015, 08:27:40 pm
Not sure yet. Just found out that the messages that we put in there have issues that preparsecode() would have fixed. PhpStorm was saving with \r but I fixed that. Then preparsecode() does a check for [] and replaces that so the parser doesn't deal with it.

Changing the preparser to also use a class. Going to break it down in to smaller pieces and see what can be changed. Already see some low-hanging fruit for performance but that's not the goal here either. The goal is more to make it more manageable and make it easier for mods to extend it.

I need messages to test it with :(
Title: Re: BBC Parsing
Post by: Spuds on August 27, 2015, 08:43:38 pm
Yup, speed is not the primary concern with preparse since its generally only called during a save.   Of course since it does alter a message you need to be sure it does its job very well or it becomes an oatmeal cartoon.

Just to note what you have in your repo is not the preparse from elkarte, I guess its smf 2.? but its not what Elk currently uses.  I think you used parse_bbc from elkare (which again is not the same as 2.?)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 27, 2015, 10:02:57 pm
Yeah, I messed up and picked the wrong preparsecode(). I changed it locally but haven't pushed it up yet.

Still trying to figure out how to make stupid PhpStorm not save the CRLF
Title: Re: BBC Parsing
Post by: Spuds on August 28, 2015, 07:30:56 am
Just thinking, be sure to grab the one from 1.0.5, there was an error in the table code and font codes that were fixed.

Posted in the other thread re lfcr vs lf
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 28, 2015, 03:55:05 pm
Quick update. The changes I just pushed are resulting in a regular 8-12% speedup in the new parser. Pretty much just trying to get the autolink method to be called a lot less. I think in a real site with regular messages the new parser would result in much faster parsing overall. The messages that we have try to get a lot of worst case scenarios.
Title: Re: BBC Parsing
Post by: Spuds on August 28, 2015, 04:59:47 pm
Very cool ...

In the real world most posts are plain text ... so testing with long, medium and longish posts is going to cover 60% of what parse_bbc really does.  Then 30% with quotes, and the remaining will be a mixture of bbc with most being the basic  b/i/u etc stuff.   Yes I pulled those numbers out of my plumbers crack, but I bet they are not that far off.

One area that often abuses bbc is signatures, thats an area that we could consider caching the output as well.

Can always add it to 1.1, I don't think anyone would complain  O:-)
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 28, 2015, 09:28:47 pm
@Joshua Dickerson
Post this if you need an abused message. Wall of code alert. BBC abuse alert.

EDIT: exceeded 20k characters limit, LOL.

Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 28, 2015, 09:44:59 pm
If someone posted a message like that, I'd just wind up deleting their message. Is that a message for the preparser to check? I don't see people posting anything like that all that often so I wouldn't say we need to check for anything like that.

It wouldn't be hard to add a mod in the preparser to do parsebbc() and check the time. If the time is greater than a limit (0.25 seconds?) it should kick it back with an error that there is too much BBC. Though, I'm not looking to waste my own time developing it because I don't see it happen... ever.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 28, 2015, 09:57:01 pm
It's a page, but it can serve as a really really abused message :P
Title: Re: BBC Parsing
Post by: emanuele on August 29, 2015, 01:18:10 am
Quote from: Joshua Dickerson – If someone posted a message like that, I'd just wind up deleting their message. Is that a message for the preparser to check? I don't see people posting anything like that all that often so I wouldn't say we need to check for anything like that.
I feel you never looked around that much in other boards. :P
That's not really unusual, it's one of the many possibility people come up with. It can be done with a bit of php? Most likely. But if people have to learn a "whole" programming language just to show a page with some staff or people, they tend to just put things together with bbc.
It's just normal, I would try to avoid make assumptions on what people could and could not write.
Have a look at wcrpg (or whatever is called today) and what kind of messages you can see there, the amount of bbc is completely different e from what you see here, and that's good.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 29, 2015, 02:07:39 am
However, it's still a PITA to keep it tidy and find occasional screw-ups ;D But yes sometimes people end up using tons of BBC for many many purposes that we may find "odd" or somewhat uncommon, but it's just "us". They'd be like "huh? I have been using this for ages! Why can't I use it anymore? F@&ck this, I am going to switch to something that allows me to use it the way I always did"
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 29, 2015, 02:14:02 am
Quote from: emanuele –
Quote from: Joshua Dickerson – If someone posted a message like that, I'd just wind up deleting their message. Is that a message for the preparser to check? I don't see people posting anything like that all that often so I wouldn't say we need to check for anything like that.
I feel you never looked around that much in other boards. :P
That's not really unusual, it's one of the many possibility people come up with. It can be done with a bit of php? Most likely. But if people have to learn a "whole" programming language just to show a page with some staff or people, they tend to just put things together with bbc.
It's just normal, I would try to avoid make assumptions on what people could and could not write.
Have a look at wcrpg (or whatever is called today) and what kind of messages you can see there, the amount of bbc is completely different e from what you see here, and that's good.
I try to avoid forums that use excessive formatting. If you need it, you obviously aren't saying anything important enough for me to read.

I can add that and see how it does but I will need to make changes to Nginx and PHP on my test server to allow it to run long enough to benchmark. As it is now, I have to reduce my benchmark to 500 iterations from 1000 because Spuds and I added so many messages.
Title: Re: BBC Parsing
Post by: Spuds on August 29, 2015, 08:10:57 am
Thats a fun stress test :D Its always good to have something extreme that works with the old parse to see if the new one can handle it (and seems to work fine BTW)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 30, 2015, 01:16:14 am
It works fine but gets destroyed on time :(

Attached a zip of the profile.

Sorry @Flavio93Zena, this is going to be hard for you to understand and you need a program that can read the file.
Title: Re: BBC Parsing
Post by: Flavio93Zena on August 30, 2015, 01:42:51 am
No worries, I entirely stopped trying to understand on the parser, way too complicated.
Title: Re: BBC Parsing
Post by: Spuds on August 30, 2015, 08:10:48 am
QuoteIt works fine but gets destroyed on time
Strange, when I ran it it was not much different, at least from the output of of the test parser program.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 30, 2015, 01:24:49 pm
What version of PHP?
Title: Re: BBC Parsing
Post by: Spuds on August 30, 2015, 03:59:22 pm
5.3 on my local
Code: [Select]
Messages: 128
Iterations: 100
Total Time In Tests: 56.17
Total Time (A): 27.68
Total Time (B): 28.5
Diff Total Time: 0.82204399999999
Diff Total Time %: -2.97
The abuse one is
Code: [Select]
pass 	21.574234 	21.923254 	-0.349 	-1.62
Title: Re: BBC Parsing
Post by: Joshua Dickerson on August 30, 2015, 07:56:40 pm
I guess I must have had the profiler on. I am regularly getting results like this with 1 out of 10 or so showing A to be faster than B.
Code: [Select]
Messages: 128
Iterations: 100
Total Time In Tests: 16.66
Total Time A (Old parse_bbc): 8.79
Total Time B (Parser): 7.87
Diff Total Time: 0.921209
Diff Total Time %: 10.48

Code: [Select]
pass	5.296251	4.543548	0.7527	14.21

So, it looks like 5.6 is way faster than 5.3 but that could just be our machines. What is almost certain is that 5.6 works way better with objects than 5.3. Also, I need to always make sure I have profiling turned off.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 03, 2015, 11:40:36 pm
Code: (message) [Select]
[img alt="My image" height="100" width="100"]http://www.google.com/img.png[/img]
Code: (Old) [Select]
<img src="http://www.google.com/img.png" alt="*quot;My image*quot; height=*quot;100*quot; width=*quot;100*quot;" style="" class="bbc_img resized" />
Code: (New) [Select]
<img src="http://www.google.com/img.png" alt="*quot;My image*quot; height=*quot;100*quot; width=*quot;100*quot;" style="" class="bbc_img resized" />

Replace * with & because of the bug.

[img alt="my image" height="10" width="10"]http://www.elkarte.net/community/themes/default/images/_besocial/logo_elk.png[/img]
There's a bug in the old parser which means there's one in the new. Quoted attributes don't seem to work.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 12:21:27 am
Okay, not really a bug in the parser. It's a bug in the BBC array. Doesn't make sense to have alt without quotes. Test 56 still works as expected.
Title: Re: BBC Parsing
Post by: live627 on September 04, 2015, 01:04:14 am
Would SplFixedArray help in any way?

http://php.net/manual/en/class.splfixedarray.php
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 01:07:05 am
For what? In earlier versions I toyed around with SplStack for the opened tags but I decided not to use it.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 01:13:47 pm
Code: [Select]
$possible[Codes::ATTR_BEFORE] = isset($possible[Codes::ATTR_DISALLOW_BEFORE]) ? $tag[Codes::ATTR_DISALLOW_BEFORE] : $possible[Codes::ATTR_BEFORE];
$possible[Codes::ATTR_AFTER] = isset($possible[Codes::ATTR_DISALLOW_AFTER]) ? $tag[Codes::ATTR_DISALLOW_AFTER] : $possible[Codes::ATTR_AFTER];

The isset() there is never true in any of the test messages. Not sure what message I would need to make it true. I am trying to figure out what $tag does there.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 03:04:33 pm
I am thinking about adding another BBC type for footnotes. It would be TYPE_PARSED_CONTENT_COUNTED. Footnotes would be the only type with this and I can't see another type having it, but it would make footnotes not so hacky.

For each tag it would add an element to $counted_bbc with the tag name as the key. For each tag found, it would save that code. So, instead of doing some kludgery with %fn%, you would just save the message stub in the code there and retrieve that later. It would save the count which would get reset for each message and save a total which wouldn't get reset at all. I guess I need to code it out to show you. Maybe "counted" isn't the best term.

hmm... I guess it could also be used it for images to make a gallery at the end of the message.
Title: Re: BBC Parsing
Post by: Spuds on September 04, 2015, 03:34:24 pm
Quote from: Joshua Dickerson –
Code: [Select]
$possible[Codes::ATTR_BEFORE] = isset($possible[Codes::ATTR_DISALLOW_BEFORE]) ? $tag[Codes::ATTR_DISALLOW_BEFORE] : $possible[Codes::ATTR_BEFORE];
$possible[Codes::ATTR_AFTER] = isset($possible[Codes::ATTR_DISALLOW_AFTER]) ? $tag[Codes::ATTR_DISALLOW_AFTER] : $possible[Codes::ATTR_AFTER];

The isset() there is never true in any of the test messages. Not sure what message I would need to make it true. I am trying to figure out what $tag does there.
I think those are only used when you trigger the disallow parents ... so nested codes that would render bad html.  Don't have an example off the top of my head though, but they are out there.

On footnotes, well you have that opportunity to try that with the class,  but if you can find a less kludgery way in the old procedural version have at it as well !


Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 04:00:09 pm
@Spuds, I think I know why it's not working. It is only for disallowed tags and I haven't tested that yet. I think the parser is very broken with that. I am going to add some stuff to allow us to check that more easily.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 06:41:05 pm
Quote from: Joshua Dickerson – @Spuds, I think I know why it's not working. It is only for disallowed tags and I haven't tested that yet. I think the parser is very broken with that. I am going to add some stuff to allow us to check that more easily.
Nope, that's not it either. Doesn't result in true for any of the messages with this: index.php?type=test&a=Old+parse_bbc&b=Parser&disabled_tags=code,quote,size,color,url,list,li,table,tr,td,th,b,s,i,u,
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 06:59:50 pm
Bah! I'm an idiot. I was reading it as disabled. It's disallow. Stupid me. Anyway, just pushed a message that checks it. Checked it on both new and old and it's broken on both.

The fix is here: https://github.com/joshuaadickerson/BBC-Parser/commit/2fbeeca132973844530d89af563d257850410b0e
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 08:09:30 pm
I am also thinking that the BBC parser shouldn't handle smileys at all. It should just add the markers for where the smiley parser should/shouldn't parse and then return it. This way you don't have to even include the BBC parser if you don't want to and makes it a lot easier to change the smiley parser and leaves less for the BBC parser to do.
Title: Re: BBC Parsing
Post by: Spuds on September 04, 2015, 08:22:08 pm
Quote from: Joshua Dickerson – Bah! I'm an idiot. I was reading it as disabled. It's disallow. Stupid me. Anyway, just pushed a message that checks it. Checked it on both new and old and it's broken on both.

The fix is here: https://github.com/joshuaadickerson/BBC-Parser/commit/2fbeeca132973844530d89af563d257850410b0e
Damn ... what does that make me then ...  https://github.com/elkarte/Elkarte/commit/651d137647ed2d4a2752d3f607142e0bca28d95c :P :P ... great catch BTW,  Now if I had any idea what I was thinking 2 years ago.  
Quote from: Joshua Dickerson – I am also thinking that the BBC parser shouldn't handle smileys at all. It should just add the markers for where the smiley parser should/shouldn't parse and then return it. This way you don't have to even include the BBC parser if you don't want to and makes it a lot easier to change the smiley parser and leaves less for the BBC parser to do.
Possibly .. yes .. hummm I feel I'm missing something though, about the smileys.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 09:33:22 pm
You just had a typo. I was looking at it for a while.

You probably think you're missing something because the old parser was so much more complex. I thought that it was so much more complicated than it is. The BBC parser adds \n around text that shouldn't be parsed. Then it explode()s the string on those and parses every other element in that array. If parsing is disabled, the smiley parser does the entire message.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 04, 2015, 09:52:03 pm
For historical reference:

I tried this out but it was much slower:
Code: [Select]
	protected function newParseSmileys()
{
$message = '';
$offset = -1;
$smiley_block = true;
do
{
$offset++;
$next_marker = strpos($this->message, "\n", $offset);
$length = max(0, $next_marker - $offset);
$stub = substr($this->message, $offset, $length);

if ($smiley_block)
{
parsesmileys($stub);
$message .= $stub;
}
else
{
$message .= $stub;
}

$offset = $next_marker === false ? $offset : $next_marker;
$smiley_block = !$smiley_block;
} while ($next_marker !== false);

if ($offset !== strlen($this->message))
{
$stub = substr($this->message, $offset);
parsesmileys($stub);
$message .= $stub;
}

$this->message = $message;
//return $message;
}
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 05, 2015, 12:43:51 am
I added the license stuff. Let me know if I missed anywhere. Can I license this project as

Changing out the smiley parser was really easy.

Now there are 4 separate classes for parsing: BBC, HTML, smileys, and links. There will need to be at least one new global. I hope it would be a DIC.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 06, 2015, 02:37:33 am
The latest commit (at the time of me writing this) is the base for a converter. I have had a bit much to drink tonight so I'm not going to continue with it, but I want to make it possible to convert the old format and create new bulletin board codes with a GUI. I realized, quickly, that an exporter will be very limited because I can't (or don't want to do all of the work to) parse the PHP out of the BBC array. What you see now is when I figured that it would be too complicated and needed to change my focus.
Title: Re: BBC Parsing
Post by: Spuds on September 06, 2015, 07:14:34 am
QuoteThe latest commit (at the time of me writing this) is the base for a converter. I have had a bit much to drink tonight so I'm not going to continue with it
ROFL ... I've made some of my best commits under those conditions :P  So the converter was to try and convert the old bbc array to the new config, to help convert addons or other?   That may be pretty difficult, maybe just a basic wiki page would be enough to help addon authors etc?

All of the license headers look correct to me as well, good job !

Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 06, 2015, 07:35:03 am
The converter would work fine if I didn't have to worry about anonymous functions and inline conditions. Actually, the converter will still work. I just can't export. So you can add a middleware for the BBC but that's not that same as exporting a new package.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 08, 2015, 09:29:16 pm
This idea is about performance. What if, instead of doing autolink() on run, we do autolink() on parsing. We could either add another tag ([autolink]) or add another parameter to url so we know it's an autolink. Then, when we unparse() just remove the tag. Autolink accounts for 7.2% of the total time to parse. That would mean just adding another tag to a message instead of a regular expression (regular expressions are killer on this parser). I think the preferred method is to use another tag because parameters, tests, and regular expressions in general are the slowest part of the entire parser.

Major problem with that is it requires running the regex on all messages when you upgrade. I guess you can do an instr() and look for www. or :// or @ and then return the ID to do the regular expression in PHP.
Title: Re: BBC Parsing
Post by: emanuele on September 09, 2015, 02:02:46 am
Upgrade and conversions...
Title: Re: BBC Parsing
Post by: Spuds on September 09, 2015, 08:02:23 am
Could you do it as part of preparse .. find autolink-able URLS and then wrap them in a [autolink] tag.  Then parsebbc works on those and avoids that preg.  We can strip the autolink tag on edit.  Or is that what you mean (sorry brain is a bit slow this AM). 
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 09, 2015, 04:52:07 pm
That's exactly what I mean ;)
Title: Re: BBC Parsing
Post by: Spuds on September 09, 2015, 09:21:10 pm
If that preg is that large a performance hit, then it makes a lot of sense to set it up so it only has to run once on save and/or modify
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 09, 2015, 09:46:06 pm
Okay, the upgrade/converter will take on a lot more load. A test for that would require a database dump to see how much longer it will take.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 27, 2015, 02:56:48 pm
Adding any codes significantly increases the amount of time it takes to parse. I added email_auto and url_auto to the parser and it increased the amount of time it takes to do the benchmark by 5%. I then changed it to zurl and zemail since there are no 'z' codes. Same result.
Title: Re: BBC Parsing
Post by: Spuds on September 28, 2015, 09:00:20 am
Is that 5% in comparison to the "old" parse_bbc or just your new version? 

So adding in autolinking detection is a 5% hit?

Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 28, 2015, 09:56:43 am
The new one. I am going to see what happens when I add 10 more. Looks like changing it from an autolink to a BBC in the preparser increases parsing time by 5% and decreases it by 7%. So, a 2% win. If that shows to be true, it's not worth it.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 28, 2015, 11:43:38 am
Ugh, the elusive benchmark results. Today I'm not getting any of the same and I don't know why. Adding 10 more codes didn't add any measurable time. I restarted my computer after it had been running for 2 weeks. Maybe the VM was screwed up? It is taking 1/10 the time it was taking yesterday. I tried it with url_auto and z_url (I think I'm going to use z_url just so I'm not adding to the u codes).
Title: Re: BBC Parsing
Post by: Flavio93Zena on September 28, 2015, 11:56:03 am
Can I just question that "2 weeks" thing? It will eventually break down if you brutalize it that way, and it's not good for any of its internal parts.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 28, 2015, 12:43:00 pm
Quote from: Flavio93Zena – Can I just question that "2 weeks" thing? It will eventually break down if you brutalize it that way, and it's not good for any of its internal parts.
I was away and left it in my backpack for most of the two weeks. I took it out once or twice. So, most of the time it was hibernating. If my computer dies because it's on for two weeks straight, I need to get a new computer because I am on my computer about 15 hours a day anyway.
Title: Re: BBC Parsing
Post by: Flavio93Zena on September 28, 2015, 06:40:52 pm
The 15 hours is ok-ish, but let it rest. You sleep as well right? :P
Title: Re: BBC Parsing
Post by: Joshua Dickerson on September 28, 2015, 07:58:14 pm
That's the 9 hours I didn't include in there.
Title: Re: BBC Parsing
Post by: Flavio93Zena on September 28, 2015, 09:13:06 pm
Yeah, then let it sleep by turning it off entirely :P
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 02:10:10 am
I just pushed a commit to make the new preparser working. I need more messages to test.

I also fixed a minor bug with TestBBC.php and started looking at the regex parser again. If you remember, tokenizing the string took a long time and I decided it wasn't worth trying to go down that route since it took so long. Well, if we did that in the preparser and stored it, it wouldn't matter. Also, my regex is very slow. I am sure I can make it faster. I am going to once again put it on the backburner, but it is still a thought in my head. Store the messages as an array or even an AST?

I guess the next step is to fix and update my main repo for Elkarte and commit some changes to get at least the parser working.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 04:48:05 pm
What are some common tags that get disabled in signatures?
Title: Re: BBC Parsing
Post by: emanuele on October 10, 2015, 05:12:26 pm
No idea...
I usually don't disable any. O:-)
Title: Re: BBC Parsing
Post by: Spuds on October 10, 2015, 05:50:49 pm
I guess its the ones that make them obnoxious ... font, size, color, image ... and I can't really see a need for footnote, spoiler or member in there either.  My .02
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 05:53:14 pm
I wonder what a footnote looks like in the news or signatures.

I guess like this...
Title: Re: BBC Parsing
Post by: Spuds on October 10, 2015, 05:58:28 pm
Don't you dare try size, font and color :P
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 06:04:41 pm
I already know what they look like. Try footnote and spoiler in the news ;)
Title: Re: BBC Parsing
Post by: emanuele on October 10, 2015, 06:05:03 pm
ROFL!
Well, spoiler can be used in signatures. From my experience it's not unusual.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 06:26:58 pm
Code: [Select]
		if (!isset($possible['regex_cache']))
{
$possible['regex_cache'] = array();
foreach ($possible[Codes::ATTR_PARAM] as $p => $info) {
// @todo there are 3 options for PARAM_ATTR_QUOTED: required, optional, and none. This doesn't represent that.
$quote = empty($info[Codes::PARAM_ATTR_QUOTED]) ? '' : '&quot;';

// No quotes
if (empty($info[Codes::PARAM_ATTR_QUOTED]) || $info[Codes::PARAM_ATTR_QUOTED] === Codes::NONE)
{
$quote = '';
$end_quote = '';
}
// Quotes are required
elseif ($info[Codes::PARAM_ATTR_QUOTED] === Codes::REQUIRED)
{
$quote = '&quot;';
$end_quote = '&quot;';
}
// Quotes are optional
elseif ($info[Codes::PARAM_ATTR_QUOTED] === Codes::OPTIONAL)
{
// This gets a little tricky. If there was an opening quote, there must be a closing quote.
// If there was no opening quote, there mustn't be a closing quote.
// But, quotes are optional
}

$possible['regex_cache'][] = '(\s+' . $p . '=' . $quote . (isset($info[Codes::PARAM_ATTR_MATCH]) ? $info[Codes::PARAM_ATTR_MATCH] : '(.+?)') . $end_quote. ')' . (empty($info[Codes::PARAM_ATTR_OPTIONAL]) ? '' : '?');
}
$possible['regex_size'] = count($possible['regex_cache']) - 1;
$possible['regex_keys'] = range(0, $possible['regex_size']);
}
I can't think right now because I haven't eaten anything but a bagel today (going to get wings and beer in a minute). This is going to cause me some grief. I think the optional regex is going to require a backref to see if the opening was there.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 10:21:54 pm
With the autolinking in the preparser, I just realized something. If a tag is add/changed that is a parent of the link, it needs to know that it is in a non-autolinked area. Not sure if I should hardcode this in to the parser or have the parser be able to send the code some way of knowing what tags are open and if it is in an autolink area. I am leaning towards hardcoding it right now and then seeing how I would do it better later.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 10, 2015, 10:38:27 pm
Actually, this is complex. I have to parse the entire message and see if the link is supposed to be an autolink. That means pretty much recreating the parser. I wanted to do the same with the itemcodes.

I think the best way to go about this is to make a mod. It would hook in to see if an autolink was done. It would change the message to add BBC markup at the parse time and would do a query to update the database. Next run wouldn't contain any autolinks/itemcodes for that message. It would still show them as regular uris and itemcodes when editing.

This is a micro-optimization and not really something I care that much about so I'm not going to invest anymore time in to it. Maybe add the hooks now but that's about it.
Title: Re: BBC Parsing
Post by: radu81 on October 11, 2015, 02:42:38 pm
I disabled font, size, email, ftp, iurl, tr, td, table & code into signatures
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 11, 2015, 02:46:09 pm
Why iurl and email?
Title: Re: BBC Parsing
Post by: radu81 on October 11, 2015, 02:49:59 pm
They can always use url instead of iurl, for the mail I don't want to encourage the conversations by mail, but use the forum instead. Don't know if this is a good idea
Title: Re: BBC Parsing
Post by: Flavio93Zena on October 11, 2015, 03:22:10 pm
Quote from: radu81 – They can always use url instead of iurl, for the mail I don't want to encourage the conversations by mail, but use the forum instead. Don't know if this is a good idea
Likely, since emails in siggies can probably be snatched by bots.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 14, 2015, 11:06:11 pm
Started work on submitting this for 1.1. There is one problem - there are a lot of different places where parse_bbc() is used. Controllers are easy because I can just add a method (or a bunch of methods) to the Action controller. It's all of the model/helper/*subs files with functions that make it hard.

There is no DIC variable so containing it somewhere means either adding a generic one or adding a variable for each of the 4 parsers plus codes. I am not a fan of using functions or a class with static methods.

I did a test of Display.controller.php to see if what I copied would work. It does. Here's what I did.

Display.controller.php
Code: [Select]
	public function parseMessageBody($message, $smileys_enabled)
{
global $modSettings, $context;

// If the load average is too high, don't parse the BBC.
if (!empty($modSettings['bbc']) && $modSettings['current_load'] >= $modSettings['bbc'])
{
$context['disabled_parse_bbc'] = true;
return $message;
}

if (empty($modSettings['enableBBC']))
{
if ($smileys_enabled)
{
$smiley_parser = $this->getSmileyParser();
$smiley_parser->parse($message);
}

return $message;
}

$bbc_parser = $this->getBBCParser();
$message = $bbc_parser->parse($message);

if ($smileys_enabled)
{
$smiley_parser = $this->getSmileyParser();
$smiley_parser->parse($message);
}

return $message;
}

Action.controller.php
Code: [Select]
	public function getBBCCodes()
{
global $modSettings;

if ($this->_bbc_codes === null)
{
$disabled = array();

if (!empty($modSettings['disabledBBC']))
{
$temp = explode(',', strtolower($modSettings['disabledBBC']));

foreach ($temp as $tag)
$disabled[trim($tag)] = true;
}

require_once(SUBSDIR . '/BBC/Codes.class.php');
$this->_bbc_codes = new \BBC\Codes(array(), $disabled);
}

return $this->_bbc_codes;
}

public function getBBCParser()
{
if ($this->_bbc_parser === null)
{
require_once(SUBSDIR . '/BBC/BBCParser.class.php');
$this->_bbc_parser = new \BBC\BBCParser($this->getBBCCodes(), $this->getAutolinkParser());
}

return $this->_bbc_parser;
}

public function getAutolinkParser()
{
if ($this->_autolink_parser === null)
{
require_once(SUBSDIR . '/BBC/Autolink.class.php');
$this->_autolink_parser = new \BBC\Autolink($this->getBBCCodes());
}

return $this->_autolink_parser;
}

public function getSmileyParser()
{
if ($this->_smiley_parser === null)
{
require_once(SUBSDIR . '/BBC/SmileyParser.class.php');
$this->_smiley_parser = new \BBC\SmileyParser;
}

return $this->_smiley_parser;
}

public function getHtmlParser()
{
if ($this->_html_parser === null)
{
require_once(SUBSDIR . '/BBC/HtmlParser.class.php');
$this->_html_parser = new \BBC\HtmlParser;
}

return $this->_html_parser;
}

I would prefer to put that in a DIC with a closure so you can change it with ease.
Title: Re: BBC Parsing
Post by: Spuds on October 15, 2015, 06:07:27 am
a nice DIC is https://r.je/dice.html ... lightweight and fast, plus its a single file to add, licensed as BSD.  Could consider that, and since we do injections in a few other areas it may not be a bad move.

Past that, I'd almost bite the bullet and static them or maybe get the auto loader working on them (does it already?) .. since parasebbc is so prevalent in the code, really there is darn little that works without it.

Oh .. the parser update is simply awesome sauce :D
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 15, 2015, 06:59:55 am
The autoloader would be fine, but the loading of the codes can take a little bit of time. So, I'd rather cache the instantiation of that.

I am about to commit and push some changes just so you can see what I'm working on. Namely, the /Maker directory. This is going to be a stepping stone for the BBC to be based in the database.
Title: Re: BBC Parsing
Post by: emanuele on October 15, 2015, 03:50:04 pm
Well, with the current codebase, static methods are not that bad, I see them as a step to start making the code look better.
At some point we may be able to remove them almost entirely, but even singleton are an intermediate step for future development.
It's impossible to rewrite everything to be "DIC-proof" in just one go, in my mind 1.x is a set of experiments and changes made to shape the code, then 2.x will be the moment to decide what path follow, and 3.x will be "The Version".
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 15, 2015, 05:16:25 pm
The problem is that one of the main features is that you can have different functions for different areas. I see a couple of parsers, but probably most of them time they'll share the same actual parser: message_parser, sig_parser, news_parser, package_parser, email_parser. Most of them will be the same most of the time, but the big difference would be in the first 3. Say you don't allow colors in messages and no colors or images in signatures but everything is allowed in the news. The way it would work is you'd set disabled in Codes to whatever you want to disable. Then you'd send that object to the parser. I guess you could change the disabled in Codes by BBCParser::getBBC()... hmm... idunno.

Still working on this maker. It is coming along and it is completely written in JS right now.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 16, 2015, 08:37:45 pm
This is the BBC Maker: http://joshuaadickerson.github.io/bbc-maker.html

I am still working on it, but since I'm doing everything the PHP backend will do in Javascript first, this is a good way to see it. I will be pushing to it slowly, but you can already start to use it now to convert your BBC. I am working on getting the default tags' parameters to show up. Next step after that is to get it doing the rules checking (there are actually a lot) and show errors. Then I want to have some help text for each. Finally, it should allow the user to pick whether a field is PHP/raw or a string instead of adding quotes around it with some hints if it can figure that out on its own.

I need help making it look nicer and then I'll need help making it work with Elkarte's templates since it is using Bootstrap right now.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 16, 2015, 09:21:37 pm
Okay, default tags and parameters now work.

Now the headaches with Bootstrap :(
Title: Re: BBC Parsing
Post by: emanuele on October 30, 2015, 04:58:07 pm
Out of curiosity: what would be left to do to bring this one into 1.1? O:-)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on October 30, 2015, 07:24:24 pm
As far as I can tell, nothing, just replacing parse_bbc() with it. I'm not satisfied with that though. I want it to do a lot more so I haven't gotten to do the simple part. If you squash the bugs for 1.1, I'll make it a feature ;)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 03, 2015, 07:07:29 pm
I am thinking about completely refactoring/redesigning the preparser.

Kind of YAML'd the idea. It would utilize much of what I did with the parser already. It would be copying, not extending, but it would take a lot of the same ideas. Instead of using regular expressions, use the parser. When it encounters a tag, it will handle it accordingly. In a future version, we might even be able to just combine the parser and preparser's codes.

Code: [Select]
NEXT_TAG_MUST_BE
TAGS_ONLY_CONTENT
REMOVE_EMPTY
ATTRIBUTE_IS_URL
EQUALS_IS_URL
NO_PARSE
FILTER_CONTENT
FILTER_EQUALS
FILTER_ATTRIBUTE
BLOCK_LEVEL
ADD_PARENT_IF_MISSING
REMOVE_EXTRA_CLOSING

----

b
 REMOVE_EMPTY
code
 NO_PARSE
 BLOCK_LEVEL
color
 FILTER_EQUALS
 search: '~\[color=(?:#[\da-fA-F]{3}|#[\da-fA-F]{6}|[A-Za-z]{1,20}|rgb\(\d{1,3}, ?\d{1,3}, ?\d{1,3}\))\]\s*\[/color\]~'
 replace: ''
li
 ADD_PARENT_IF_MISSING
 list
list
 BLOCK_LEVEL
 TAGS_ONLY_CONTENT
 NEXT_TAG_MUST_BE
 li
nobbc
 NO_PARSE
quote
 BLOCK_LEVEL
 REMOVE_EXTRA_CLOSING
 REMOVE_EMPTY
 ATTRIBUTES
 link
 ATTRIBUTE_IS_URL
s
 REMOVE_EMPTY
table
 BLOCK_LEVEL
 TAGS_ONLY_CONTENT
 NEXT_TAG_MUST_BE
 tr
td
 BLOCK_LEVEL
 ADD_PARENT_IF_MISSING
 tr
th
 BLOCK_LEVEL
 ADD_PARENT_IF_MISSING
 tr
tr
 BLOCK_LEVEL
 TAGS_ONLY_CONTENT
 NEXT_TAG_MUST_BE
 td
 th
 ADD_PARENT_IF_MISSING
 table
url
 EQUALS_IS_URL

This should make it considerably easier to add preparsing. It will also make it so we don't have to worry about changing nobbc to entities or worry about parsing in code tags.

Some other things I want to do with it is limit the number of parameters so we can catch that before they post. Say 5 or so. Also, add another tag for list items. So, when it saves, it will save an itemcode [ *] as [list][li]...[/li][/list]
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 05, 2015, 09:21:20 pm
Code: [Select]
<?php

namespace BBC;

class ParserWrapper
{
protected $disabled = array();
protected $codes;
protected $bbc_parser;
protected $smiley_parser;
protected $html_parser;
protected $autolink_parser;

protected static $instance;

public static function getInstance()
{

}

protected function checkLoad()
{
global $modSettings, $context;

if (!empty($modSettings['bbc']) && $modSettings['current_load'] >= $modSettings['bbc'])
{
$context['disabled_parse_bbc'] = true;
return false;
}

return true;
}

protected function isEnabled()
{
return empty($modSettings['enableBBC']);
}

public function enableSmileys($toggle)
{
$this->smileys_enabled = (bool) $toggle;
return $this;
}

protected function getParsersByArea($area)
{
$parsers = array(
'bbc' => false,
'smiley' => false,
);

// First see if any hooks set a parser.
foreach ($parsers as $parser_type => &$parser)
{
call_integration_hook('integrate_' . $area . '_' . $parser_type . '_parser', array(&$parser, $this));

// If not, use the default one
if ($parser === false)
{
$parser = call_user_func(array($this, 'get' . ucfirst($parser_type) . 'Parser'));
}
}

return $parsers;
}

public function getMessageParsers()
{
return $this->getParsersByArea('message');
}

public function getSignatureParser()
{
return $this->getParsersByArea('signature');
}

public function getNewsParser()
{
return $this->getParsersByArea('news');
}

protected function parse($area, $string)
{
global $modSettings;

// If the load average is too high, don't parse the BBC.
if (!$this->checkLoad())
{
return $message;
}

$parsers = $this->getParsersByArea($area);

if (!$this->isEnabled())
{
if ($this->smileys_enabled)
{
$parsers['smiley']->parse($message);
}

return $message;
}

$message = $parsers['bbc']->parse($message);

if ($smileys_enabled)
{
$parsers['smiley']->parse($message);
}

return $message;
}

public function parseMessage($message, $smileys_enabled)
{
return $this->enableSmileys($smileys_enabled)->parse('message', $message);
}

public function parseSignature($signature, $smileys_enabled)
{
return $this->enableSmileys($smileys_enabled)->parse('signature', $signature);
}

public function parseNews($news)
{
return $this->enableSmileys(true)->parse('news', $news);
}

public function parseEmail($message)
{
return $this->enableSmileys(false)->parse('email', $message);
}

public function parseCustomFields($field)
{
return $this->enableSmileys(true)->parse('customfields', $field);
}

public function parsePoll($field)
{
return $this->enableSmileys(true)->parse('poll', $field);
}

public function parseAgreement($agreement)
{
return $this->enableSmileys(true)->parse('agreement', $agreement);
}

public function parsePM($message)
{
return $this->enableSmileys(true)->parse('pm', $message);
}

public function parseReport($report)
{
return $this->enableSmileys(true)->parse('report', $report);
}

// $modSettings['disabledBBC']
public function setDisabled(array $disabled)
{
foreach ($disabled as $tag)
{
$this->disabled[trim($tag)] = true;
}

return $this;
}

public function getCodes()
{
global $modSettings;

if ($this->codes === null)
{
require_once(SUBSDIR . '/BBC/Codes.class.php');
$this->codes = new \BBC\Codes(array(), $this->disabled);
}

return $this->codes;
}


public function getBBCParser()
{
if ($this->bbc_parser === null)
{
require_once(SUBSDIR . '/BBC/BBCParser.class.php');
$this->bbc_parser = new \BBC\BBCParser($this->getBBCCodes(), $this->getAutolinkParser());
}

return $this->bbc_parser;
}

public function getAutolinkParser()
{
if ($this->autolink_parser === null)
{
require_once(SUBSDIR . '/BBC/Autolink.class.php');
$this->autolink_parser = new \BBC\Autolink($this->getBBCCodes());
}

return $this->autolink_parser;
}

public function getSmileyParser()
{
if ($this->smiley_parser === null)
{
require_once(SUBSDIR . '/BBC/SmileyParser.class.php');
$this->smiley_parser = new \BBC\SmileyParser;
}

return $this->smiley_parser;
}

public function getHtmlParser()
{
if ($this->html_parser === null)
{
require_once(SUBSDIR . '/BBC/HtmlParser.class.php');
$this->html_parser = new \BBC\HtmlParser;
}

return $this->html_parser;
}
}

I haven't completed it yet (busy) but here is what I plan on doing to implement this before we add a DIC.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 22, 2015, 02:32:32 pm
@Spuds, what do you think about the ParserWrapper? Do you think that's the way to go with this?
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 22, 2015, 09:40:56 pm
Too late. Already started. Committing soon. Not well tested but that's what a beta process is for ;)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 22, 2015, 11:13:09 pm
Just pushed the change (https://github.com/joshuaadickerson/Elkarte/commit/2c72157f5dbbdc02959854cea87f36aac80deb1e) to my Elkarte repo. Hopefully I didn't break too much. I didn't add any tests (yet) in the interest of meeting @Spuds goal to get it in 1.1b1.

It took me a lot more work than necessary due to the size of the controllers and I didn't want to add properties to them in this commit. I'd rather get everything working and then worry about getting it looking better.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 23, 2015, 01:26:15 pm
Debating whether or not I want to change the constants' values to be what they were previously so that there is less BC breakage. I mean, no matter what, there is a BC break but maybe it would be less?

Going along with that, I really want to get this "maker" finished" but the parameters stuff is killing me. I just committed some changes.
Title: Re: BBC Parsing
Post by: Flavio93Zena on November 24, 2015, 07:25:40 pm
Just don't get mad with that, nice and easy buddy, it was an untouched monster since smf 1.1, there were (are) reasons for it.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 24, 2015, 07:44:53 pm
Don't get mad at what? What's an untouched monster?
Title: Re: BBC Parsing
Post by: emanuele on November 25, 2015, 08:02:00 am
parse_bbc I guess. :P
Title: Re: BBC Parsing
Post by: Flavio93Zena on November 25, 2015, 12:38:59 pm
^
Title: Re: BBC Parsing
Post by: Joshua Dickerson on November 25, 2015, 11:41:39 pm
Trust me, I know all of the intricacies of parse_bbc ()
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 01, 2015, 11:25:06 pm
This might sound crazy, but BBC types should be classes. Their properties should be what they are allowed to do.

For instance:
Code: [Select]
final class \BBC\Types\ClosedType
{
 const HAS_PARAM    = false;
 const HAS_EQUALS    = false;
 const HAS_BEFORE    = true;
 const HAS_AFTER    = true;
 const HAS_CONTENT  = false;
 const PARSE_CONTENT = false;
 const PARSE_EQUALS  = false;
}

The idea would be to define the rules for each type in the code. This would be fleshed out more and the rules themselves would be defined (in documentation, don't need another class for that). One could then expand the types to make the individual codes lighter. With a syntax tree, it would also make parsing cheaper.
Title: Re: BBC Parsing
Post by: Spuds on December 02, 2015, 10:02:38 am
Thats a very interesting idea nods I think it would further clean things up  .... I'm not convinced it will make parsing cheaper but you did surprise me with how well your new parser works ;)

I saw you made the commit and then pulled it back, rebase problem or are you still doing some restructure?
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 02, 2015, 11:21:03 am
Totally screwed up the repo. Pulled commits from other branches, merged and committed after. Emanuele fixed it on his repo.
Title: Re: BBC Parsing
Post by: Spuds on December 02, 2015, 04:46:04 pm
git noob :P

I never mess up my repos.   OK truth is that I had to increase my beer budget after I started using git ... I'm hoping there is a link there :P
Title: Re: BBC Parsing
Post by: emanuele on December 02, 2015, 05:17:08 pm
I didn't send the PR because I thought you (Josh) wanted to remove the install-related commits, but if it is okay like that I can send it. ;D

ETA: dammit it's still broken! xD
And the tests fail too... :(

I'll try to rebase it again soon.

ETA2: this is the branch https://github.com/emanuele45/Dialogo/tree/BBC2
Title: Re: BBC Parsing
Post by: emanuele on December 06, 2015, 11:56:01 am
I pushed a version that is now able to merge:
https://github.com/elkarte/Elkarte/compare/development...emanuele45:BBC2?expand=1
the problem is that now tests are broken... what should we do?
Title: Re: BBC Parsing
Post by: emanuele on December 06, 2015, 02:36:26 pm
Ok, fixed the tests as well.
Now I only have one doubt: ILA (still) relies on the message id passed by the integrate_post_parsebbc to the ILA function, though with the new parser the id_msg is (correctly) not passed, so we are a bit left without a piece there. It should still work because 'msg' should be in $_REQUEST, but it's a weak assumption, it may work for beta, but a final solution should be found.

Apart from that I guess it could be ready for work. ;D
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 07, 2015, 10:25:45 am
Where does it need the message id? What does it do with it?
Title: Re: BBC Parsing
Post by: emanuele on December 07, 2015, 04:43:00 pm
To get the attachments of each message.
The first way I see to avoid this requirement is to change ILA behaviour using the id_attach instead of going with the number of attachments in the message (though, that would make kind of hard to guess attachments id in posts, unless we use an "unique" random number that is later replaced with the attachment id (sort of what github does).
The second is to attach ILA somewhere else where the id_msg is available.
Title: Re: BBC Parsing
Post by: radu81 on December 07, 2015, 04:52:23 pm
Quote from: emanuele – The first way I see to avoid this requirement is to change ILA behaviour using the id_attach instead of going with the number of attachments in the message

that will give you less headache in the future with openimporter, other forum scripts are using inline images with the id of attachment
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 07, 2015, 05:39:33 pm
Yeah, I would just put the attach id as the id used there. 1, 2, 3, etc is just as, if not more confusing than the attach id.
Title: Re: BBC Parsing
Post by: emanuele on December 07, 2015, 05:56:46 pm
As I mentioned, though, the attach id is available only if we rely on javascript, or if we replace the tag after the post is made...
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 07, 2015, 06:21:46 pm
Yeah, let's do that.
Title: Re: BBC Parsing
Post by: emanuele on December 08, 2015, 02:44:32 am
Argh... even more work before ILA is ready for beta! :'(
Title: Re: BBC Parsing
Post by: Spuds on December 08, 2015, 08:57:58 am
LOL I was thinking the same thing !  But a change like that would have to be in B1 for sure.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 08, 2015, 12:40:40 pm
It makes more sense than adding a message id to the BBC parser.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 10, 2015, 02:32:19 pm
When closing tags get done out of order, the parser silently fixes them. Should the preparser do this?

Example:
Code: [Select]
[b][i]test[/b][/i]

Result:
test[/i]

The parser silently changes that to:
Code: [Select]
[b][i]test[/b][/i]

Just wondering if the preparser should do that. Maybe even remove the extraneous closing i (can't do the tag because I just found a bug)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 10, 2015, 02:33:23 pm
Weird. When I pressed save, this is what it did.
Code: [Select]
When closing tags get done out of order, the parser silently fixes them. Should the preparser do this?

Example:
[*code][b][i]test[/b][i][/*code]
[i]
[i]

Result:
[b][i]test[/i][/b]

The parser silently changes that to:[/i]
[/i][*code][b][i]test[/i][/b][/*code]
[i]


Just wondering if the preparser should do that. Maybe even remove the extraneous closing [/i]
Title: Re: BBC Parsing
Post by: Spuds on December 10, 2015, 03:29:52 pm
Must be the preparser or editor playing around with the tags.

 I know the preparser already does a large host of fixing (tables, fonts, lists, quotes, empty tags, etc) so to answer the question I would think the fixing should be part of that code so its done once.  

Should at some point consider renaming it from preparsecode to preparse_bbc since it does a lot more than code stuff.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 10, 2015, 04:07:39 pm
The class I'm using locally is named \BBC\Preparser and the main method is parse(). It's a radically different preparser that is taking a lot of thinking to get right.
Title: Re: BBC Parsing
Post by: wintstar on December 10, 2015, 06:55:58 pm
After installing a white page with this error message appears:
Fatal error: Class 'BBC\Autolink' not found in C:\xampp\htdocs\elk11\sources\subs\BBC\ParserWrapper.php on line 380

Issues #2305 (https://github.com/elkarte/Elkarte/issues/2305)



Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 10, 2015, 06:57:09 pm
Now that I'm working on the preparser and I have commits for Elkarte, I want to track the changes all the way through the lifecycle and ensure that tests are working from repo to repo. I am moving the messages to a directory called "Messages." That directory contains a class per file which is of the MessageInterface type.

Each class has 4 methods: input, stored, output, and name. Name is just to get a human readable format of it. The input is what you expect to receive from the user when they are posting. This gets sent to the preparser. Stored is what it gets stored as and is the result of the preparser. It is also the input to the parser. Output is the expected result of the parser.

There are a lot of messages which means a lot of tests. I don't have time to convert all of the messages from Messages.php and PreparseMessages.php to classes. If there is someone that wants to help with this, this is a really easy way to get involved in helping Elkarte and learn more about PHP.

If you are interested, feel free to fork the BBC-Parser repo (https://github.com/joshuaadickerson/BBC-Parser), take a look at the most recent commit (https://github.com/joshuaadickerson/BBC-Parser/commit/83640cf9ff8faf5daae282a3e9005482f5439495) and dive in. You can submit pull requests or just create a zip file and upload it here. Feel free to add a comment at the top of the file giving yourself credit.

The structure to read the files hasn't been created yet. It will simply read the directories and create the classes as it needs.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 10, 2015, 07:01:18 pm
That's one for @emanuele. Looks like an autoloader issue.
Title: SPLIT: SCeditor bug handling bbcode inside code tags
Post by: emanuele on December 11, 2015, 08:38:35 am
The discussion about SCEditor bug has been moved to Bug Reports (http://www.elkarte.net/community/index.php?board=2.0)

http://www.elkarte.net/community/index.php?topic=3127.0
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 13, 2015, 04:01:27 pm
https://github.com/joshuaadickerson/BBC-Parser/commit/897ac719bed289ba7aa6a70938028c59c4397521
https://github.com/joshuaadickerson/BBC-Parser/commit/31ed7754154a7bd454b1c1caf720f0c4c9d809f5

The first one is the script to run to create the test message files. The second is the actual files. Feel free to take part in the process. All that's needed is to break them down in to directories (with their namespace) and name them accordingly.
Title: Re: BBC Parsing
Post by: Spuds on December 14, 2015, 09:20:04 am
Looks like the output lines are the same as the input?  Are those supposed to be the html output from the parser instead?
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 14, 2015, 06:30:57 pm
Yeah, I realized that. More commits followed. I forgot the autolink one which I'm not sure if I committed.
Title: Re: BBC Parsing
Post by: Spuds on December 15, 2015, 04:53:44 pm
On my 1.1 install (which may be at fault here) smileys do not parse in a message. 

In a quick look it seems its because $user_info['smiley_set'] is set to '' when you are viewing a topic (loadtheme sets that value) ... so the parser does work, but the image tags are wrong (they lack the smiley sub directory, like default). 

I'm still getting used to the new parser, and I'm not sure if this is because its instance is created before loadtheme is called or something else is going on?  Need some input form the parser master. 

Also in subs where old parse_bbc has been hotwired, there is a return statement which leaves some unreachable code, again could use some input on that.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 15, 2015, 10:55:55 pm
Smilies: I need to take out all globals from these things. They make my life so difficult. They should still parse but the directory doesn't appear to be correct. I don't know how the order of when it is getting the user's theme has changed at all though. From what I can tell - looking at the call tree - loadTheme() is called before parse() which is the only thing that should be setting up the smiley parser at this point. So, I have no idea how that is getting screwed up. I'm looking in to it but I might need some help there.

parse_bbc: I need to clean up the commits done to Elkarte repo. There is a lot of crap that I left that I shouldn't have. Bad programming on my part but I wanted to get it PR'd so we could start testing for beta.
Title: Re: BBC Parsing
Post by: Spuds on December 16, 2015, 09:02:14 am
Not sure on why that value is failing to load in some paths, its very odd ... board index has it loaded, but not when you go to view a topic.  Well its loaded, its just empty instead of = 'default' or whatever.   Your pending PR is likely what needs to be done, that's how I hotwired my local to render them.  I'll poke around a bit more and see if I can figure it out.

nods on the rest, thats why I'm starting to poke around at bit (and thanks for answering my question in your repo, that was the point I was thinking about when I saw that var)
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 16, 2015, 05:12:59 pm
You're right, I left parse_bbc() in there for legacy. I'm thinking I should just remove it altogether since you already need to change your codes to make it work. At least this way it will cause an error if you didn't change them. Then you won't have a mystery bug in your mod. What do you think?
Title: Re: BBC Parsing
Post by: Spuds on December 16, 2015, 06:27:09 pm
Maybe we should just depreciate it in 1.1 and then removed in 1.2 or 2.0, whichever is next, just to be proper. 
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 16, 2015, 11:48:39 pm
In the latest commit I added content tracking. This is inspired by footnotes but I wanted it to be much more versatile. It works for any type that can have content. If I'm not mistaken, that only leaves item codes and closed tags without this ability. It only gets the content though. If your code has before/after, it only tracks the end of the before and the start of the after as the content area.

Right now, all it is doing is tracking where that content is. If you go back in position and change things, it won't have that information. The ability to capture it is there, but I don't really want you to do that since it poses a huge memory consumption issue if abused. There is also no feature to add the footnote link with the count.

So, how would you do that? You get the tracked footnote content using $parser->getTrackedContent('footnote')... easier to explain with code...
Code: [Select]
<?php

$message = $parser->parse($message);

$list = '<ol class="footnotes">';

$pos_padding = 0;

foreach ($parser->getTrackedContent('footnote') as $i => $code)
{
$start = $pos_padding + $code[Codes::TRACKED_CONTENT]['start'];
$end = $pos_padding + isset($code[Codes::TRACKED_CONTENT]['end']) ? $code[Codes::TRACKED_CONTENT]['end'] : strlen($message);

$content = substr($message, $start, $end - $start);
$list .= '<li id="footnote' . $i . '>' . $content . '</li>';

$insert = '<sup class="bbc_footnotes">' . $i . '</sup>';
$message = substr_replace($message, $insert, $start, 0);
$pos_padding += strlen($insert);
}

$list .= '</ul>';

I don't think I'm happy with this solution, but it shows you how much more versatile tracked content is than just a footnotes tag. What I'd like to do is have a variable that you can use in before/after/content which is turned on using tag count tracking. Then inject the count so you are doing it in the parser, utilizing what's available. Or maybe change before/after/content to be a string OR a closure. So many ways to skin this cat.
Title: Re: BBC Parsing
Post by: Spuds on December 17, 2015, 08:38:19 am
Kind of cool ...

Sounds like those location points are static breadcrumbs so to speak? Are they established based on the passed bbc message (marking the bbc tags content start and end ?  so [bla]X.....Y[/bla] where X and Y are the strpos numerics?  If so then in your example you should not be placing them back into $message or after the first substitution, they are all wrong, unless I'm missing something, like my morning caffeine !

Yea some kind of counter would be nice, but I guess that can be done in that loop as well?
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 17, 2015, 07:06:06 pm
It's not tracking the BBC. It is tracking the output. If it tracked BBC, it would need to effectively parse the message twice. First time it would just find all of the codes. If you wanted to make changes it would do it then. The second time is when you'd parse any changes.

I think I want to change the option to "TRACK" and then have track: count, content, captured_content, codes (all of the codes that are found), params (must also track the codes but it will get the parameter key/vals), equals (same as params but just equals), changes (any change that is made in a log format: "found $tag, found $code. inserted after at $pos."). Maybe I'll do all of that anyway and just use the $code array instead of variables like $data.

This is really just leaving us with one path - have a parse tree. Create the tree and then iterate the nodes which will tell us how to parse it.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 18, 2015, 12:18:04 am
Figured out why it isn't putting the correct smiley set in.

loadBoard() gets called before loadTheme(). The board description gets loaded and is parsed for BBC before $user_info['smiley_set'] is set. That sets the path and it is then cached.

Not sure if this should be considered a bug for all SMF-like installs since you aren't going to see the correct smiley set on the board description. Obviously that's minor but not showing the images is major and I need to figure that out. I just moved the parsing to the only place where I found $board_info['description'] being used. I am guessing there are more but I don't know where.
Title: Re: BBC Parsing
Post by: emanuele on December 18, 2015, 01:54:48 am
The loadBoard before loadTheme is a "feature", because each board can have its own theme, so it has to be loaded first.
The smiley parsing then can be considered a bug because anyway it has been added recently.
At the moment I'm not sure how to fix that.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on December 18, 2015, 07:28:00 am
That's what I mean, a bug that smiley parsing happens before the theme is loaded. I fixed it and it is in a PR.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 11, 2017, 03:29:22 pm
For the longest time I've been talking about using an Abstract Syntax Tree. That involves creating a lexer/parser that creates an array of tokens. Those tokens are then read at runtime. I think what we want is a little different but I'm no expert on lexers.

Here's an example post:
Code: [Select]
Hello Emanuele,

I am writing you to tell you about my new site: [url=https://www.elkarte.net]ElkArte[/url]. I added a lot of features:
[*]create new posts
[*]two-factor authentication
[*]bbc parsing [i]we're going to call this ForumML or [b]FML[/b] :D[/i]

ttyl,
Josh
https://www.github.com/joshuaadickerson

This will create an array (JS object for brevity):
Code: (json) [Select]
[
    {node: 'text', value: 'Hello Emanuele,'},
    {node: 'new line'},
    {node: 'empty line'},
    {node: 'text', value: 'I am writing you to tell you about my new site: '},
    {node: 'tag', value: 'url',
        attr: [{node: 'url', value: 'https://www.elkarte.net'}],
        children: [{node: 'text', value: 'ElkArte'}]
    },
    {node: 'text', value: '. I added a lot of features: '},
    {node: 'new line'},
    // Maybe the parser should make this a new list?
    {node: 'itemcode', value: '*',
        children: [{node: 'text', value: 'create new posts'}]
    },
    {node: 'new line'},
    {node: 'itemcode', value: '*',
        children: [{node: 'text', value: 'two-factor authentication'}]
    },
    {node: 'new line'},
    {node: 'itemcode', value: '*',
        children: [
            {node: 'text', value: 'bbc parsing '},
            {node: 'tag', value: 'i', children: [
                {node: 'text', 'value': 'we\'re going to call this ForumML or '},
                {node: 'tag', value: 'b', children: [{node: 'text', 'value': 'FML'}]},
                {node: 'emoji', 'value': ':D'},
            ]}
        ]
    },
    {node: 'new line'},
    {node: 'empty line'},
    {node: 'text', value: 'ttyl,'},
    {node: 'new line'},
    {node: 'text', value: 'Josh'},
    {node: 'new line'},
    {node: 'tag', value: 'url',
        attr: [{node: 'url', value: 'https://www.github.com/joshuaadickerson'}],
        children: [{node: 'text', value: 'https://www.github.com/joshuaadickerson'}]
    }
]

I'm sure there's a lot that can be done to make that smaller and maybe make it faster, but I did this in this reply window and I'm not an expert. Baby steps.

I'll leave that there for now and work on a formatter for ForumML next.
Title: Re: BBC Parsing
Post by: emanuele on February 11, 2017, 04:23:11 pm
Just for reference, I did a quick google search https://www.google.com/search?q=bbcode+lexer+php
Some code:
http://nbbc.sourceforge.net/
https://github.com/codeconsortium/CCDNComponentBBCode
https://packagist.org/search/?tags=lexer
a presentation:
http://www.slideshare.net/auroraeosrose/lexing-and-parsing
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 11, 2017, 04:48:40 pm
I'm not really sure what's best. There's some lexers out there for markdown that I might look into. Or, I might just skip the research part and adopt what's there now to create a tree.

I think a big change is to make the editor do a lot more. For instance, smileys should be input with a smiley/emoji tag. Then the parser could add/remove that. Something like:
Code: [Select]
// Tag added with the "preparser"
[emoji preparsed=1]:D[/emoji]
// Tag added by the user
[emoji]:D[/emoji]

Same goes for autolink. It would add the url tags around the url when you submit. It would add a flag that represents it was added by the parser so if the autolink was inside of a tag that shouldn't create a link from the url, it would just remove that.

Since this represent a larger size for the messages table, I think the messages should be compressed. Not the AST since that's getting pulled constantly and we don't want to have it use more CPU. The admin would have the option to choose which compression algorithm to choose but what it was actually compressed with would be determined by the message. Then admins can change compression algorithms without having to recompress their entire database at once.

Same would go for the serializer for the AST. It should be intelligent about what method is used to serialize. Most likely one of JSON, igbinary, PHP.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 12, 2017, 10:16:19 am
Having some brain farts...
Code: [Select]
<?php

namespace Elkarte\Messages\Formatters;

class ForumML
{
    const KNOWN_NODE_TYPES = [
        'TEXT'          => 1,
        'NEW_LINE'      => 2,
        'EMPTY_LINE'    => 3,
        'TAG'           => 4,
        'EMOJI'         => 5,
        'LINK'          => 6,
    ];

    public function format($message)
    {
        foreach ($message as $node) {
            if (!$this->isKnownType($node->getType())) {
                throw new InvalidNodeTypeException;
            }

            $this->formatChildren($node);
        }
    }

    public function isKnownType($type)
    {
        $knownTypes = self::KNOWN_NODE_TYPES;
        return isset($knownTypes[$type]);
    }

    protected function formatChildren($node)
    {
        if ($node->hasChildren()) {
            foreach ($node->getChildren() as $childNode) {
                $this->format($childNode);
            }
        }
    }
}

class NodeFactory
{
    public function newNode($node, AbstractNode $parent = null)
    {
        if (!isset($node['type']) || !$this->nodeFactory->isKnownType($node['type'])) {
            throw new InvalidNodeTypeException;
        }

        $newNode = null;

        switch ($node['type']) {
            case NodeFactory::TEXT:
                $newNode = new TextNode($node, $this, $parent);
                break;
        }

        return $newNode;
    }
}

abstract class AbstractNode
{
    protected $type;
    protected $factory;
    protected $parent;
    protected $value;
    protected $children = [];
    protected $attributes = [];

    abstract public function render();
    abstract public function getType();

    /**
     * AbstractNode constructor.
     * @param array $node the unserialized array
     * @param NodeFactory $nodeFactory
     * @param AbstractNode|null $parent
     */
    public function __construct(array $node, NodeFactory $nodeFactory, AbstractNode $parent = null)
    {
        $this->nodeFactory = $nodeFactory;
        $this->type = $this->getType();

        if (isset($node['attributes'])) {
            $this->attributes = $node['attributes'];
        }

        if (isset($node['children'])) {
            $this->setChildren($node['children']);
        }

        if (isset($node['value'])) {
            $this->value = $node['value'];
        }
    }

    public function __toString()
    {
        $this->render();
    }

    public function __sleep()
    {
        return array_filter([
            'type'          => $this->type,
            'value'         => $this->value,
            'attributes'    => $this->attributes,
            'children'      => array_reduce($this->children, function ($children, $child) {
                $children[] = $child->__sleep;
                return $children;
            }, []),
        ]);
    }

    public function hasParent()
    {
        return $this->parent instanceof AbstractNode;
    }

    public function getParent()
    {
        return $this->parent;
    }

    public function hasParentOfType($type)
    {
        return $this->hasParent() && (
            $this->getParent()->getType() === $type
            || $this->getParent()->hasParentOfType($type));
    }

    public function hasChildren()
    {
        return !empty($this->children);
    }

    public function getChildren()
    {
        return $this->children;
    }

    public function hasAttributes()
    {
        return !empty($this->attributes);
    }

    public function getAttributes()
    {
        return $this->attributes;
    }

    protected function setParent(AbstractNode $parent)
    {
        $this->parent = $parent;
        return $this;
    }

    protected function setChildren(array $nodes)
    {
        $this->children = [];

        foreach ($nodes as $node) {
            $this->children[] = $this->factory->newNode($node, $this);
        }

        return $this;
    }
}

// namespace Nodes;
class Tag extends AbstractNode
{
    const TYPE = 'TAG';

    public function getType()
    {
        return self::TYPE;
    }
}
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 12, 2017, 01:05:43 pm
Let me know your thoughts on this.

As it is now, the BBC will always be parsed when a message is accessed. It will always have the most up-to-date settings from BBC. There's nothing to check to make sure that's still true.

There are two ways we can change that to make it better. Both are not perfect.

1) Check the tags on each interpretation.
This means we can not have to know what the BBC is beforehand. We just check for a standard regular expression of (\[[:alphanum:] ). Those alphanumeric characters are checked against the BBC that we have to see if one exists. If so, then we check to see if the attributes/arguments (whatever you want to call them) match the "code" we're testing. This is pretty close to what we're doing now. It saves a lot of string copying and some of the regular expressions that take up a lot of parsing time.

2) Cache since the last time the BBC settings were changed.
This way has the ability to do a lot more optimizations and makes it so you could essentially just cache the entire message barring any user or environmental variables. It would need to be invalidated after every change of the BBC settings. It wouldn't need to run any validation or testing for the BBC because we'd define which class/method is called for each "node" in the AST. When a message is requested, it would check if the time the BBC settings were updated (settingsTime) was >= the time the message was parsed at (parseTime) then it will reparse and insert that into the database. In pseudo: settingsTime >= parseTime ? msg.parse().save();

Perhaps it matters what the interpreters are going to be. I can see a need for 3: web, email, and print. You take the same syntax tree and can display it in different ways depending on the settings.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 12, 2017, 01:16:50 pm
I want to change how filters and validators work. I want to define a stack of validators like 'maxlen[25],minlen[1],alpha' and the same thing with filters 'truncate[25],alpha'

Validators have constraints that are checked. If any constraint returns false, it doesn't match the tag. It could have a warning that could be passed back to the user when they are posting. Validators are checked to see which code matches.

Filters are run after a code matches. They transform (maybe they should be called transformers?) and can be run on any input (attribute or content).

Then to create a new code, it would be simple by stacking together validators and filters/transformers you just name what you want. If you want to create a new one, it would require a new class with a method that matches and you register that (probably through a hook). This would make them DRY and make codes much more composable.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 12, 2017, 01:27:01 pm
Here's the problem with the first one from http://www.elkarte.net/community/index.php?topic=2833.msg30829#msg30829 "Check the tags on each interpretation."

Let's say my message is like this:
Code: [Select]
Hello [bbc I am "[i]writing[/i]" you a message to "[test"]

The AST would be something like:
Code: [Select]
[
    {type: 'text', value: 'Hello '},
    {type: 'tag', value: 'bbc', attr: [
        {type: 'I'},
        {type: 'am', value: '[i]writing[/i]', quoted: 1},
        {type: 'you'},
        {type: 'a'},
        {type: 'message'},
        {type: 'to'},
        {type: '"[test"', quoted: 1},
    ]}
]

It would get to the second node and find out it doesn't have a tag there so it has to go back and reparse all of its children as text. It will get to the [i]writing[/i] part and have to parse that as a tag node. I suspect that's actually worse than what we have now.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 12, 2017, 11:42:08 pm
I think this idea makes sense as a plugin and I think only the second option will work. It would be a huge task to keep it up-to-date, but I like it better that way.

Parsed messages would be in their own table. This way when you update BBC settings, you would just truncate that table. It could be cool to see which messages were seen since the last time you changed your BBC and what they were. Though, that's not the aim at all.

Obviously, this means a lot of changes to queries. Any table that requests the message would need to join this new table and select the parsed message to go with it.
Title: Re: BBC Parsing
Post by: Joshua Dickerson on February 13, 2017, 12:45:07 am
I don't have my BBC repo handy so I'm going off memory. One of the most resource intensive operations is handling itemcodes. I think the preparser should handle the creation of the list.

The preparser is exactly where the lexer should come in. I feel like it would be a lot easier to work with if it were tokens.

playing with my scratch some more: https://gist.github.com/joshuaadickerson/c3669645a6ae59e37a6d46b4efe1bed1