BBC Parsing

Re: BBC Parsing

Reply #165 – December 14, 2015, 06:30:57 pm

Yeah, I realized that. More commits followed. I forgot about the autolink one, though; I'm not sure whether I committed it.

Reply #166 – December 15, 2015, 04:53:44 pm

On my 1.1 install (which may be at fault here) smileys do not parse in a message.

At a quick look, it seems it's because $user_info['smiley_set'] is set to '' when you are viewing a topic (loadTheme() sets that value). So the parser does work, but the image tags are wrong (they lack the smiley subdirectory, like default).

I'm still getting used to the new parser, and I'm not sure if this is because its instance is created before loadTheme() is called or if something else is going on. Need some input from the parser master.

Also, in subs, where the old parse_bbc has been hotwired, there is a return statement which leaves some unreachable code. Again, I could use some input on that.

Reply #167 – December 15, 2015, 10:55:55 pm

Smilies: I need to take all of the globals out of these things. They make my life so difficult. They should still parse, but the directory doesn't appear to be correct. I don't see how the order in which it gets the user's theme has changed at all, though. From what I can tell, looking at the call tree, loadTheme() is called before parse(), which is the only thing that should be setting up the smiley parser at this point. So I have no idea how that is getting screwed up. I'm looking into it, but I might need some help there.

parse_bbc: I need to clean up the commits done to the ElkArte repo. There is a lot of crap that I left in that I shouldn't have. Bad programming on my part, but I wanted to get it PR'd so we could start testing for beta.

Reply #168 – December 16, 2015, 09:02:14 am

Not sure why that value is failing to load in some paths; it's very odd. The board index has it loaded, but not when you go to view a topic. Well, it's loaded, it's just empty instead of 'default' or whatever. Your pending PR is likely what needs to be done; that's how I hotwired my local to render them. I'll poke around a bit more and see if I can figure it out.

Nods on the rest; that's why I'm starting to poke around a bit (and thanks for answering my question in your repo, that was the point I was thinking about when I saw that var).

Reply #169 – December 16, 2015, 05:12:59 pm

You're right, I left parse_bbc() in there for legacy. I'm thinking I should just remove it altogether since you already need to change your codes to make it work. At least this way it will cause an error if you didn't change them. Then you won't have a mystery bug in your mod. What do you think?

Reply #170 – December 16, 2015, 06:27:09 pm

Maybe we should just deprecate it in 1.1 and then remove it in 1.2 or 2.0, whichever is next, just to be proper.
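
A minimal sketch of what "deprecate now, remove later" could look like. Everything here is hypothetical: `ExampleParser` is a stand-in for the new parser, and `parse_bbc_deprecated_example()` is a simplified wrapper, not the real parse_bbc() signature.

```php
<?php

// Hypothetical stand-in for the new parser; not the actual ElkArte class.
class ExampleParser
{
	public function parse($message)
	{
		// Only handles [b] so the example stays runnable.
		return preg_replace('~\[b\](.*?)\[/b\]~s', '<strong>$1</strong>', $message);
	}
}

/**
 * Legacy entry point kept for one release cycle.
 * Warns the caller, then forwards to the new parser.
 */
function parse_bbc_deprecated_example($message)
{
	trigger_error(__FUNCTION__ . '() is deprecated, use the parser object instead', E_USER_DEPRECATED);

	$parser = new ExampleParser();

	return $parser->parse($message);
}
```

Mods that still call the old function keep working but get a notice in the error log, which avoids the "mystery bug" while giving authors a release to migrate.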

Reply #171 – December 16, 2015, 11:48:39 pm

In the latest commit I added content tracking. This is inspired by footnotes but I wanted it to be much more versatile. It works for any type that can have content. If I'm not mistaken, that only leaves item codes and closed tags without this ability. It only gets the content though. If your code has before/after, it only tracks the end of the before and the start of the after as the content area.

Right now, all it is doing is tracking where that content is. If you go back in position and change things, it won't have that information. The ability to capture it is there, but I don't really want you to do that since it poses a huge memory consumption issue if abused. There is also no feature to add the footnote link with the count.

So, how would you do that? You get the tracked footnote content using $parser->getTrackedContent('footnote')... easier to explain with code...

Code:

<?php

$message = $parser->parse($message);

$list = '<ol class="footnotes">';

// Running offset: every marker inserted into $message shifts later positions.
$pos_padding = 0;

foreach ($parser->getTrackedContent('footnote') as $i => $code)
{
	$tracked = $code[Codes::TRACKED_CONTENT];

	// Keep the ternary separate from the addition: + binds tighter than ?:.
	// Only a tracked end needs the padding; strlen($message) already reflects earlier inserts.
	$start = $pos_padding + $tracked['start'];
	$end = isset($tracked['end']) ? $pos_padding + $tracked['end'] : strlen($message);

	$content = substr($message, $start, $end - $start);
	$list .= '<li id="footnote' . $i . '">' . $content . '</li>';

	// Insert the footnote marker in front of the tracked content.
	$insert = '<sup class="bbc_footnotes">' . $i . '</sup>';
	$message = substr_replace($message, $insert, $start, 0);
	$pos_padding += strlen($insert);
}

$list .= '</ol>';

I don't think I'm happy with this solution, but it shows you how much more versatile tracked content is than just a footnotes tag. What I'd like to do is have a variable that you can use in before/after/content which is turned on using tag count tracking. Then inject the count so you are doing it in the parser, utilizing what's available. Or maybe change before/after/content to be a string OR a closure. So many ways to skin this cat.
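
A rough sketch of the "string OR closure" idea, assuming the parser passes a running per-tag count to the closure. The function name, the array keys beyond before/after, and the generated markup are all made up for illustration.

```php
<?php

/**
 * Renders one tag whose 'before' / 'after' may be a plain string or a closure.
 * The closure receives the running count for that tag (1-based).
 * All names here are hypothetical.
 */
function render_tag_example(array $code, $content, $count)
{
	$resolve = function ($part) use ($count) {
		return $part instanceof Closure ? $part($count) : $part;
	};

	return $resolve($code['before']) . $content . $resolve($code['after']);
}

$footnote = array(
	// A closure can inject the count while the parser is running.
	'before' => function ($count) {
		return '<sup class="bbc_footnotes" id="ref' . $count . '">' . $count . '</sup><span class="footnote">';
	},
	// A plain string still works unchanged.
	'after' => '</span>',
);

echo render_tag_example($footnote, 'See the docs.', 1), "\n";
```

With this shape, existing codes that use plain strings need no changes, and only the codes that want the count pay for a function call.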

Reply #172 – December 17, 2015, 08:38:19 am

Kind of cool ...

Sounds like those location points are static breadcrumbs, so to speak? Are they established based on the passed BBC message, marking the start and end of the tag's content? So [bla]X.....Y[/bla], where X and Y are the strpos numerics? If so, then in your example you should not be placing them back into $message, or after the first substitution they are all wrong. Unless I'm missing something, like my morning caffeine!

Yeah, some kind of counter would be nice, but I guess that can be done in that loop as well?

Reply #173 – December 17, 2015, 07:06:06 pm

It's not tracking the BBC. It is tracking the output. If it tracked BBC, it would need to effectively parse the message twice. First time it would just find all of the codes. If you wanted to make changes it would do it then. The second time is when you'd parse any changes.
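
Since the positions refer to the parsed output, any insertion shifts every later offset. The earlier loop compensates with a padding counter; another option is to work from the last offset to the first. A small sketch, assuming the tracked data is a list of arrays with a 'start' key (that shape and the function name are assumptions):

```php
<?php

/**
 * Inserts a marker at each tracked start offset of an already parsed string.
 * Working from the highest offset down means earlier offsets never shift,
 * so no running padding counter is needed.
 * The shape of $tracked (a list of ['start' => int]) is an assumption.
 */
function insert_markers_example($output, array $tracked)
{
	// Sort descending by start offset, keeping the original index as the label.
	uasort($tracked, function ($a, $b) {
		return $b['start'] - $a['start'];
	});

	foreach ($tracked as $i => $span)
	{
		$marker = '<sup>' . ($i + 1) . '</sup>';
		$output = substr_replace($output, $marker, $span['start'], 0);
	}

	return $output;
}
```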

I think I want to change the option to "TRACK" and then have track: count, content, captured_content, codes (all of the codes that are found), params (must also track the codes but it will get the parameter key/vals), equals (same as params but just equals), changes (any change that is made in a log format: "found $tag, found $code. inserted after at $pos."). Maybe I'll do all of that anyway and just use the $code array instead of variables like $data.

This is really just leaving us with one path - have a parse tree. Create the tree and then iterate the nodes which will tell us how to parse it.

Reply #174 – December 18, 2015, 12:18:04 am

Figured out why it isn't putting the correct smiley set in.

loadBoard() gets called before loadTheme(). The board description gets loaded and is parsed for BBC before $user_info['smiley_set'] is set. That sets the path and it is then cached.

Not sure if this should be considered a bug for all SMF-like installs since you aren't going to see the correct smiley set on the board description. Obviously that's minor but not showing the images is major and I need to figure that out. I just moved the parsing to the only place where I found $board_info['description'] being used. I am guessing there are more but I don't know where.
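
To illustrate the ordering problem (not the fix that went into the PR), here is a sketch of a parser that reads the smiley set when it parses instead of caching it when it is created. The class name, the /smileys/ path, the grin.gif file and the fallback to 'default' are all assumptions.

```php
<?php

// Hypothetical smiley parser: the set name is read when parsing, not when constructing.
class LazySmileyParserExample
{
	private $setResolver;

	public function __construct(callable $setResolver)
	{
		$this->setResolver = $setResolver;
	}

	public function parse($message)
	{
		// Resolve late, so a value assigned after construction is still seen.
		$set = call_user_func($this->setResolver);
		$set = $set !== '' ? $set : 'default';

		$img = '<img src="/smileys/' . $set . '/grin.gif" alt=":D" />';

		return str_replace(':D', $img, $message);
	}
}

$user_info = array('smiley_set' => '');
$parser = new LazySmileyParserExample(function () use (&$user_info) {
	return $user_info['smiley_set'];
});

// Simulates the theme being loaded after the parser already exists.
$user_info['smiley_set'] = 'classic';
echo $parser->parse('Hi :D'), "\n";
```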

Reply #175 – December 18, 2015, 01:54:48 am

Calling loadBoard() before loadTheme() is a "feature": each board can have its own theme, so the board has to be loaded first.
The smiley parsing there can then be considered a bug, since it was only added recently anyway.
At the moment I'm not sure how to fix that.

Reply #176 – December 18, 2015, 07:28:00 am

That's what I mean, a bug that smiley parsing happens before the theme is loaded. I fixed it and it is in a PR.

Reply #177 – February 11, 2017, 03:29:22 pm

For the longest time I've been talking about using an Abstract Syntax Tree. That involves creating a lexer/parser that creates an array of tokens. Those tokens are then read at runtime. I think what we want is a little different but I'm no expert on lexers.
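
As a rough idea of the lexing step only, here is a tiny tokenizer that splits a message into open tags, close tags and text. It is a sketch under simple assumptions (tags look like [name], [name=value] or [/name]); the function name and token shape are invented for this example.

```php
<?php

/**
 * Splits a message into a flat list of tokens: open tags, close tags and text.
 * A deliberately tiny lexer; it knows nothing about which tags are valid.
 */
function tokenize_example($message)
{
	$parts = preg_split('~(\[/?[a-z*]+(?:=[^\]]*)?\])~i', $message, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
	$tokens = array();

	foreach ($parts as $part)
	{
		if (preg_match('~^\[/([a-z*]+)\]$~i', $part, $m))
			$tokens[] = array('type' => 'close', 'name' => strtolower($m[1]));
		elseif (preg_match('~^\[([a-z*]+)(?:=([^\]]*))?\]$~i', $part, $m))
			$tokens[] = array('type' => 'open', 'name' => strtolower($m[1]), 'value' => isset($m[2]) ? $m[2] : null);
		else
			$tokens[] = array('type' => 'text', 'value' => $part);
	}

	return $tokens;
}
```

A tree builder would then consume this list, pushing on open tokens and popping on close tokens.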

Here's an example post:

Code:

Hello Emanuele,

I am writing you to tell you about my new site: [url=https://www.elkarte.net]ElkArte[/url]. I added a lot of features:
[*]create new posts
[*]two-factor authentication
[*]bbc parsing [i]we're going to call this ForumML or [b]FML[/b] :D[/i]

ttyl,
Josh
https://www.github.com/joshuaadickerson

This will create an array (JS object for brevity):

Code: (json)

[
    {node: 'text', value: 'Hello Emanuele,'},
    {node: 'new line'},
    {node: 'empty line'},
    {node: 'text', value: 'I am writing you to tell you about my new site: '},
    {node: 'tag', value: 'url',
        attr: [{node: 'url', value: 'https://www.elkarte.net'}],
        children: [{node: 'text', value: 'ElkArte'}]
    },
    {node: 'text', value: '. I added a lot of features: '},
    {node: 'new line'},
    // Maybe the parser should make this a new list?
    {node: 'itemcode', value: '*',
        children: [{node: 'text', value: 'create new posts'}]
    },
    {node: 'new line'},
    {node: 'itemcode', value: '*',
        children: [{node: 'text', value: 'two-factor authentication'}]
    },
    {node: 'new line'},
    {node: 'itemcode', value: '*',
        children: [
            {node: 'text', value: 'bbc parsing '},
            {node: 'tag', value: 'i', children: [
                {node: 'text', value: 'we\'re going to call this ForumML or '},
                {node: 'tag', value: 'b', children: [{node: 'text', value: 'FML'}]},
                {node: 'emoji', value: ':D'},
            ]}
        ]
    },
    {node: 'new line'},
    {node: 'empty line'},
    {node: 'text', value: 'ttyl,'},
    {node: 'new line'},
    {node: 'text', value: 'Josh'},
    {node: 'new line'},
    {node: 'tag', value: 'url',
        attr: [{node: 'url', value: 'https://www.github.com/joshuaadickerson'}],
        children: [{node: 'text', value: 'https://www.github.com/joshuaadickerson'}]
    }
]

I'm sure there's a lot that can be done to make that smaller and maybe make it faster, but I did this in this reply window and I'm not an expert. Baby steps.

I'll leave that there for now and work on a formatter for ForumML next.
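
A formatter over a tree like this could be a small recursive walk. The sketch below uses the node and key names from the example tree, written as PHP arrays; the function name and the HTML chosen for each node are assumptions, and only a few node types are handled.

```php
<?php

/**
 * Recursively turns a node list (shaped like the example tree) into HTML.
 * The HTML chosen for each node type is an assumption for illustration.
 */
function format_nodes_example(array $nodes)
{
	$html = '';

	foreach ($nodes as $node)
	{
		$children = isset($node['children']) ? format_nodes_example($node['children']) : '';

		switch ($node['node'])
		{
			case 'text':
				$html .= htmlspecialchars($node['value'], ENT_QUOTES, 'UTF-8');
				break;
			case 'new line':
				$html .= '<br />';
				break;
			case 'tag':
				if ($node['value'] === 'url')
					$html .= '<a href="' . htmlspecialchars($node['attr'][0]['value'], ENT_QUOTES, 'UTF-8') . '">' . $children . '</a>';
				elseif ($node['value'] === 'b')
					$html .= '<strong>' . $children . '</strong>';
				else
					$html .= $children;
				break;
			default:
				// Other node types (emoji, itemcode, ...) are not handled here;
				// only their children, if any, are rendered.
				$html .= $children;
		}
	}

	return $html;
}
```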

Reply #178 – February 11, 2017, 04:23:11 pm

Just for reference, I did a quick Google search: https://www.google.com/search?q=bbcode+lexer+php
Some code:
http://nbbc.sourceforge.net/
https://github.com/codeconsortium/CCDNComponentBBCode
https://packagist.org/search/?tags=lexer
a presentation:
http://www.slideshare.net/auroraeosrose/lexing-and-parsing

Reply #179 – February 11, 2017, 04:48:40 pm

I'm not really sure what's best. There are some lexers out there for Markdown that I might look into. Or, I might just skip the research part and adapt what's there now to create a tree.

I think a big change is to make the editor do a lot more. For instance, smileys should be input with a smiley/emoji tag. Then the parser could add/remove that. Something like:

Code:

// Tag added with the "preparser"
[emoji preparsed=1]:D[/emoji]
// Tag added by the user
[emoji]:D[/emoji]

Same goes for autolink. It would add the url tags around the url when you submit. It would add a flag that represents it was added by the parser so if the autolink was inside of a tag that shouldn't create a link from the url, it would just remove that.
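
A sketch of the removal side, assuming preparser-added tags carry the same preparsed=1 attribute as the emoji example and that it would apply to url tags too. The function name is hypothetical.

```php
<?php

/**
 * Removes tags the preparser added (marked preparsed=1) and keeps their content,
 * leaving tags typed by the user alone. The attribute syntax follows the emoji
 * example; applying it to [url] is an assumption.
 */
function strip_preparsed_example($message, $tag)
{
	$tag = preg_quote($tag, '~');

	return preg_replace('~\[' . $tag . ' preparsed=1\](.*?)\[/' . $tag . '\]~s', '$1', $message);
}
```

The parser could call something like this on the content of tags that must not contain links, so only the automatic links disappear.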

Since this represents a larger size for the messages table, I think the messages should be compressed. Not the AST, since that's getting pulled constantly and we don't want it to use more CPU. The admin would have the option to choose which compression algorithm to use, but what a message was actually compressed with would be determined per message. Then admins can change compression algorithms without having to recompress their entire database at once.
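
One way to let each message record its own algorithm is to prefix the stored body with the algorithm name. This is only a sketch: the function names and the prefix format are invented, and it assumes the zlib extension is available for gzcompress().

```php
<?php

/**
 * Stores the algorithm name in front of the payload so each message records
 * how it was compressed. Only 'none' and 'gz' (zlib extension) are sketched.
 */
function pack_message_example($body, $algo)
{
	switch ($algo)
	{
		case 'gz':
			return 'gz:' . gzcompress($body);
		case 'none':
			return 'none:' . $body;
		default:
			throw new InvalidArgumentException('Unknown algorithm: ' . $algo);
	}
}

function unpack_message_example($stored)
{
	// Split only on the first colon; the payload may contain colons too.
	list($algo, $payload) = explode(':', $stored, 2);

	return $algo === 'gz' ? gzuncompress($payload) : $payload;
}
```

Changing the admin setting then only affects newly saved messages; old rows still unpack because the prefix says how they were written.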

Same would go for the serializer for the AST. It should be intelligent about what method is used to serialize. Most likely one of JSON, igbinary, PHP.