
Sphinx QL code

I was reviewing the Sphinx QL code and noticed a couple of things that I think can be improved.

First, a search often finds a topic that is very relevant overall, but the specific post from that topic that gets shown in the results is the wrong one. There are two options here: simply show and link to the first post in that topic, or show and link to the most relevant post inside that topic.
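To make the second option concrete, here is a rough sketch of collapsing matches to the single best post per topic. It assumes each match is a dict with `id_topic`, `id_msg` and `weight` keys; those names and the function itself are hypothetical, not taken from the existing code.

```python
def best_post_per_topic(matches):
    """Keep only the highest-weighted match for each topic.

    `matches` is assumed to be an iterable of dicts with the
    hypothetical keys 'id_topic', 'id_msg' and 'weight'.
    """
    best = {}
    for match in matches:
        current = best.get(match["id_topic"])
        # Replace the stored match only when this one ranks strictly higher.
        if current is None or match["weight"] > current["weight"]:
            best[match["id_topic"]] = match
    # Return one row per topic, strongest first.
    return sorted(best.values(), key=lambda m: m["weight"], reverse=True)
```

The first option would just swap the kept `id_msg` for the topic's first message id once the best topic is known.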

Next up is the ranker we use, which is currently PROXIMITY_BM25. If you don't know what that is, or don't use Sphinx, then don't worry about it. I've been playing with some improved ranker expressions, based on SPH04 but using bm25F values and a high ranking for exact or subject matches.

So if you use Sphinx and want to try a couple of things to see what you get, let me know.

Thanks in advance!!!

 

Re: Sphinx QL code

Reply #1

I use Sphinx on a couple of sites, so I have tinkered with it as well.

In the 2.0 branch I made several adjustments to the ranker and the query, trying to get more relevant results. One of the things it tries to weight is the reply number in a topic, giving newer replies more weight.

The current ideas I have in the 2.0 code are:
Code: (query) [Select]
SELECT \
m.id_msg, m.id_topic, m.id_board, CASE WHEN m.id_member = 0 THEN 4294967295 ELSE m.id_member END AS id_member, \
m.poster_time, m.body, m.subject, t.num_replies + 1 AS num_replies, t.num_likes, t.is_sticky, \
1 - ((m.id_msg - t.id_first_msg) / (t.id_last_msg - t.id_first_msg)) AS position, \
CEILING(10 * ( \
CASE WHEN m.id_msg < 0.6 * s.value THEN 0 ELSE (m.id_msg - 0.6 * s.value) / (0.4 * s.value) END * 25 + \
CASE WHEN t.num_replies < 50 THEN t.num_replies / 50 ELSE 1 END * 20 + \
CASE WHEN m.id_msg = t.id_first_msg THEN 1 ELSE 0 END * 10 + \
CASE WHEN t.num_likes < 20 THEN t.num_likes / 20 ELSE 1 END * 0 + \
CASE WHEN t.is_sticky = 0 THEN 0 ELSE 1 END * 0 \
) * 100/55) AS acprel \

The CASE expressions in that query do the following:
1) old topics get less weight
2) topics with more replies get more weight
3) the first message in a topic gets more weight
4) likes on the topic (aka the first message in the topic) give more weight
5) sticky topics get more weight
The combined value is scaled by 100/55 and returned to Sphinx as acprel. The 55 is the total of the weights applied to those CASE values in the query (25 + 20 + 10 + 0 + 0). I think that should give a value from 0 to 1000.
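As a sanity check on that range, here is a small Python sketch that mirrors the CASE arithmetic outside of SQL. It assumes `s.value` is the highest message id on the forum (the part of the query that defines `s` is not shown above), and the function and parameter names are mine. With every active term at its maximum the sum is 25 + 20 + 10 = 55, so the result tops out at 10 * 55 * 100 / 55 = 1000.

```python
import math


def acprel(id_msg, max_id_msg, num_replies, is_first_msg, num_likes, is_sticky):
    """Mirror the CASE terms of the query in plain Python.

    max_id_msg stands in for s.value, which is ASSUMED to be the
    highest message id on the forum.
    """
    # 1) Age: 0 for the oldest 60% of message ids, ramping up to 1 at the newest.
    threshold = 0.6 * max_id_msg
    age = 0.0 if id_msg < threshold else (id_msg - threshold) / (0.4 * max_id_msg)
    # 2) Replies: linear up to 50 replies, then capped at 1.
    replies = min(num_replies / 50, 1.0)
    # 3) First message in the topic.
    first = 1.0 if is_first_msg else 0.0
    # 4) Likes: linear up to 20 likes, then capped at 1 (weight is currently 0).
    likes = min(num_likes / 20, 1.0)
    # 5) Sticky flag (weight is currently 0).
    sticky = 1.0 if is_sticky else 0.0

    weighted = age * 25 + replies * 20 + first * 10 + likes * 0 + sticky * 0
    return math.ceil(10 * weighted * 100 / 55)
```

Since the likes and sticky terms are multiplied by 0, they have no effect until those weights are raised, and the 55 would need to change to match the new total.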

The SELECT also calculates position, which is the relative position of a given message within its topic. That value is returned to Sphinx as position, and it runs from 1 for the first message in a topic down to 0 for the last message.
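A quick sketch of the same position calculation in Python, to show the end points. As I read the formula, the first message comes out as 1 and the last message as 0. A topic with a single post makes the divisor zero, which in MySQL would normally yield NULL; returning 1.0 for that case here is my own assumption, not something the query does.

```python
def position(id_msg, id_first_msg, id_last_msg):
    """Relative position of a message in its topic: 1 at the start, 0 at the end."""
    span = id_last_msg - id_first_msg
    if span == 0:
        # Single-post topic: the SQL expression divides by zero here.
        # Treating the only post as position 1.0 is an ASSUMPTION.
        return 1.0
    return 1 - (id_msg - id_first_msg) / span
```

Because message ids are shared across the whole forum, this measures position by id distance, not by reply count, so a topic with a long quiet gap will have its replies bunched toward one end.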

The ranker then uses the following expression, built on the acprel and position values returned from the query:
Code: [Select]
ranker=expr('sum((4*lcs+2*(min_hit_pos==1)+word_count)*user_weight*position) + acprel + bm25')
which is a modification of SPH_RANK_SPH04 which I believe is
Code: [Select]
SPH_RANK_SPH04 = sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
So the first term keeps the SPH04-style per-field weighting (which, if I read correctly, will also be 0-1000) and scales it by our position value of 1 to 0, so earlier replies in the topic get more weight.
Next it adds acprel, the modifiers (those are the options you can set in the ACP).
Lastly it adds the bm25 value.

There are field weights applied to subject / body hits as well (as defined in the ACP)
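To see how the three parts add up, here is a plain Python sketch of the modified expression. The per-field factors (`lcs`, `min_hit_pos`, `word_count`, `user_weight`) are passed in as dicts I made up for illustration; in Sphinx they are computed by the engine per matched field. Note that, unlike the SPH04 formula quoted above, this expression has no `*1000` on the sum, so the size of the first term depends on the field weights set in the ACP.

```python
def rank(fields, position, acprel, bm25):
    """Sum the per-field term, scaled by position, then add acprel and bm25.

    `fields` is a list of dicts with hypothetical keys mirroring the
    Sphinx ranking factors of the same names.
    """
    field_sum = sum(
        (4 * f["lcs"] + 2 * (1 if f["min_hit_pos"] == 1 else 0) + f["word_count"])
        * f["user_weight"]
        * position
        for f in fields
    )
    return field_sum + acprel + bm25


# Example: a subject hit at the first position plus a weaker body hit.
example_fields = [
    {"lcs": 2, "min_hit_pos": 1, "word_count": 2, "user_weight": 10},
    {"lcs": 1, "min_hit_pos": 5, "word_count": 1, "user_weight": 3},
]
score = rank(example_fields, position=0.5, acprel=364, bm25=500)
```

With position at 0 the whole first term drops out, so such a message is ranked on acprel and bm25 alone.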

I'm very interested in what you have tried. The Sphinx documentation often leaves me confused, as you have to bounce around between versions and releases to get to the information. TBH I've not looked at this in some time, and I'm certainly not saying this is the best idea or even a good one.