BBC Parsing

BBC Parsing Started by Joshua Dickerson · August 14, 2015, 05:56:08 pm · Read 57279 times 0 Members and 2 Guests are viewing this topic. previous topic - next topic

Re: BBC Parsing

Reply #45 – August 26, 2015, 10:01:01 am

Yea ... a repo !

Re: BBC Parsing

Reply #46 – August 27, 2015, 02:29:00 pm

I did some initial testing on the RegexParser and preg_split() is taking WAY too much time. RegexParser is taking 5x longer and I'm not even doing a fraction of the amount of things the original one does since it's just a proof of concept.

Re: BBC Parsing

Reply #47 – August 27, 2015, 02:37:45 pm

Definitely a no-go, yeah

Re: BBC Parsing

Reply #48 – August 27, 2015, 02:46:18 pm

I think thats how the other parsers I played with worked. The benefit was it was extremely flexible and robust, the downside was the speed was subpar compared to what we have.

I'm still impressed with what you have achieved with the re-write, its only 5% slower then the procedural version and of course its significantly better in terms of structure and looks far less frail to updates..

Re: BBC Parsing

Reply #49 – August 27, 2015, 03:15:52 pm

I'm not happy. I was really hoping the preg_split() parser would be faster. I might try some other regular expressions to see if they work better but I doubt it. I think just creating the regex makes it much slower. A regular expression parser makes the parser so much prettier. One way would be to cache the tokenized version. Then again, why wouldn't you just cache the parsed version at that point?

Which... I think I'll make possible. I'll add an attribute to the BBC "ATTR_NO_CACHE" which, if the message contains a BBC with this attribute, it cannot be cached. The only time this would be set is if you create a BBC that depends on the user's permissions. Then the pre-parser would parse the message and cache it. The parser would have a method Parser::canCache() which would do the same thing.

Still going to do some more things. Going to make the smiley parser, HTML parser, and autolink parser separate classes so you can throw in your own. To reduce the complexity, if they aren't in the constructor it will just create them from default. Going to test and see how making each code an object works. Thinking I'll wind up sticking with an array though.

Re: BBC Parsing

Reply #50 – August 27, 2015, 05:57:48 pm

A lot of the parser deals with issues that the preparser should pick up. I wonder how much and I wonder if removing that will make it any faster. If posts become mangled, we can always run the preparser on them again to fix it. It can be made part of the upgrade or a separate admin action.

Re: BBC Parsing

Reply #51 – August 27, 2015, 06:23:06 pm

What things in the parser do you feel should be part of pre-parse? Its funny how much that function has grown over time.

Re: BBC Parsing

Reply #52 – August 27, 2015, 08:27:40 pm

Not sure yet. Just found out that the messages that we put in there have issues that preparsecode() would have fixed. PhpStorm was saving with \r but I fixed that. Then preparsecode() does a check for [] and replaces that so the parser doesn't deal with it.

Changing the preparser to also use a class. Going to break it down in to smaller pieces and see what can be changed. Already see some low-hanging fruit for performance but that's not the goal here either. The goal is more to make it more manageable and make it easier for mods to extend it.

I need messages to test it with

Re: BBC Parsing

Reply #53 – August 27, 2015, 08:43:38 pm

Yup, speed is not the primary concern with preparse since its generally only called during a save. Of course since it does alter a message you need to be sure it does its job very well or it becomes an oatmeal cartoon.

Just to note what you have in your repo is not the preparse from elkarte, I guess its smf 2.? but its not what Elk currently uses. I think you used parse_bbc from elkare (which again is not the same as 2.?)

Re: BBC Parsing

Reply #54 – August 27, 2015, 10:02:57 pm

Yeah, I messed up and picked the wrong preparsecode(). I changed it locally but haven't pushed it up yet.

Still trying to figure out how to make stupid PhpStorm not save the CRLF

Re: BBC Parsing

Reply #55 – August 28, 2015, 07:30:56 am

Just thinking, be sure to grab the one from 1.0.5, there was an error in the table code and font codes that were fixed.

Posted in the other thread re lfcr vs lf

Re: BBC Parsing

Reply #56 – August 28, 2015, 03:55:05 pm

Quick update. The changes I just pushed are resulting in a regular 8-12% speedup in the new parser. Pretty much just trying to get the autolink method to be called a lot less. I think in a real site with regular messages the new parser would result in much faster parsing overall. The messages that we have try to get a lot of worst case scenarios.

Re: BBC Parsing

Reply #57 – August 28, 2015, 04:59:47 pm

Very cool ...

In the real world most posts are plain text ... so testing with long, medium and longish posts is going to cover 60% of what parse_bbc really does. Then 30% with quotes, and the remaining will be a mixture of bbc with most being the basic b/i/u etc stuff. Yes I pulled those numbers out of my plumbers crack, but I bet they are not that far off.

One area that often abuses bbc is signatures, thats an area that we could consider caching the output as well.

Can always add it to 1.1, I don't think anyone would complain

Re: BBC Parsing

Reply #58 – August 28, 2015, 09:28:47 pm

@Joshua Dickerson
Post this if you need an abused message. Wall of code alert. BBC abuse alert.

EDIT: exceeded 20k characters limit, LOL.

Re: BBC Parsing

Reply #59 – August 28, 2015, 09:44:59 pm

If someone posted a message like that, I'd just wind up deleting their message. Is that a message for the preparser to check? I don't see people posting anything like that all that often so I wouldn't say we need to check for anything like that.

It wouldn't be hard to add a mod in the preparser to do parsebbc() and check the time. If the time is greater than a limit (0.25 seconds?) it should kick it back with an error that there is too much BBC. Though, I'm not looking to waste my own time developing it because I don't see it happen... ever.