
Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

On Tue, 31 Jan 2012 12:48:02 -0800, Paul Allan Hill <...@metajure.com

In Lucene 3.4, I recently implemented "Translating PhraseQuery to SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to matter.

Here is my exact code, called from getFieldsQuery once I know I'm looking at a PhraseQuery; I think it is taken exactly from the book.

static Query buildSpanNearQuery(PhraseQuery phraseQ, int slop) {
    // Wrap each phrase term in its own SpanTermQuery.
    Term[] terms = phraseQ.getTerms();
    SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
    for (int i = 0; i < terms.length; i++) {
        clauses[i] = new SpanTermQuery(terms[i]);
    }
    // The third argument is SpanNearQuery's inOrder flag.
    // Note: phraseQ.getPositions() is never consulted here.
    SpanNearQuery query = new SpanNearQuery(clauses, slop, PHRASE_ORDER_MATTERS);
    return query;
}

I put this in my own QueryParser and things looked good until I tried a phrase with stop words.
With the old PhraseQuery I got results on a phrase with stop words without extending the slop, but with SpanNearQuery nothing is found unless the query includes some slop.
This conflicts with the typical use case of a user taking a phrase, pasting it into the search bar in quotes, and expecting to find his document.
I can't just add a fixed amount of slop, because the amount needed depends on how many stop words are in any sequence in the phrase.

Any suggestions on how to combine the idea of SpanNear (so that words appearing in phrase order are preferred) with text that has stop words removed, so that I can support the simple use of quotes for exact quoted-text matching?

Any ideas?

-Paul



On Wed, 1 Feb 2012 09:30:52 +0200, Doron Cohen <...@gmail.com

Hi,

The code here ignores the PhraseQuery (PQ) positions:

int[] pp = PQ.getPositions();

These positions have extra gaps when stop words are removed.
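To make those gaps concrete, here is a minimal sketch in plain Java (no Lucene dependency) of how positions end up with holes when stop words are dropped but still counted. The class name, the whitespace tokenizer, and the two-word stop set are all hypothetical and chosen only for illustration; a real analyzer will tokenize differently and its stop list may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordPositions {
    // Hypothetical stop set for illustration; not Lucene's default list.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "of"));

    /** Positions of the surviving tokens; removed stop words still use up a position. */
    static int[] survivingPositions(String text) {
        List<Integer> kept = new ArrayList<Integer>();
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!STOP_WORDS.contains(tokens[pos])) {
                kept.add(pos);
            }
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] pp = survivingPositions("The Importance of Being Earnest");
        System.out.println(Arrays.toString(pp)); // prints [1, 3, 4]
    }
}
```

The hole between positions 1 and 3 is where "of" used to be; a query that treats the surviving terms as adjacent will not account for it.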

To accommodate this, the overall extra gap can be added to the slop:
int gap = (pp[pp.length] - pp[0]) - (pp.length - 1); // (+/- boundary cases)
slop += gap;
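The gap arithmetic can be sketched as a standalone method. This is only an illustration: the class and method names are hypothetical, it reads the last valid index (pp.length - 1), and it assumes positions are ascending with at most one term per position.

```java
class SlopGap {
    /**
     * Extra slop implied by removed stop words: the distance actually covered
     * by the positions minus the distance the terms would cover if adjacent.
     */
    static int extraSlop(int[] pp) {
        if (pp.length < 2) {
            return 0; // boundary case: a single term (or none) spans nothing
        }
        int last = pp.length - 1; // last valid index
        return (pp[last] - pp[0]) - last;
    }

    public static void main(String[] args) {
        System.out.println(extraSlop(new int[] {0, 1, 2})); // prints 0
        System.out.println(extraSlop(new int[] {1, 3, 4})); // prints 1
        System.out.println(extraSlop(new int[] {0, 1, 3})); // prints 1
    }
}
```

The result would then be added to whatever slop is passed to the SpanNearQuery constructor.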

I think this is less accurate than PQ:
It does not specify the exact position of the stop word.

For example, assume original text:
A B S D
and S is a stop word.

PQ:
A B S D would match
A S B D would not

Span Near query: both would match.
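The difference can be checked with a small model in plain Java. This is a simplified sketch, not Lucene's matching code: it assumes every query term occurs exactly once in the document, takes each term's document position as input, and treats the slop of an ordered near match as the total number of unmatched positions between consecutive terms. All names are hypothetical.

```java
class PhraseVsSpanNear {
    /** Exact phrase: relative offsets in the document must equal those in the query. */
    static boolean exactPhrase(int[] queryPos, int[] docPos) {
        for (int i = 1; i < queryPos.length; i++) {
            if (docPos[i] - docPos[0] != queryPos[i] - queryPos[0]) {
                return false;
            }
        }
        return true;
    }

    /** Ordered near: terms in query order, total gap between them at most slop. */
    static boolean orderedNear(int[] docPos, int slop) {
        int used = 0;
        for (int i = 1; i < docPos.length; i++) {
            int step = docPos[i] - docPos[i - 1];
            if (step < 1) {
                return false; // terms out of order
            }
            used += step - 1;
        }
        return used <= slop;
    }

    public static void main(String[] args) {
        int[] query = {0, 1, 3};   // query "A B S D" with S removed
        int[] docABSD = {0, 1, 3}; // positions of A, B, D in "A B S D"
        int[] docASBD = {0, 2, 3}; // positions of A, B, D in "A S B D"
        int slop = 1;              // original slop 0 plus the gap of 1

        System.out.println(exactPhrase(query, docABSD)); // prints true
        System.out.println(exactPhrase(query, docASBD)); // prints false
        System.out.println(orderedNear(docABSD, slop));  // prints true
        System.out.println(orderedNear(docASBD, slop));  // prints true
    }
}
```

Under this model the widened slop cannot tell where the hole sits, only how large the holes are in total.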

Perhaps there's a way around this too that I am not aware of.

Also, this code suggestion is a simplification: the analyzer in
effect may emit more than one term at the same position, for example when
expanding the query with synonyms, or when keeping originals and stemmed
forms. In that case just comparing pp[0] and pp[pp.length-1] is
insufficient, and the positions should be examined while looping over the
phrase terms, something like this:

int dpos = pp[i+1] - pp[i]; // (i < pp.length - 1)
if (dpos > 1) slop += (dpos - 1);
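A runnable sketch of the per-pair loop, next to the endpoint-only calculation for comparison. It is untested against Lucene; the class and method names are hypothetical, and the dpos > 1 condition is an assumption about the intent, namely that only steps larger than one position add slop while terms sharing a position add nothing.

```java
class PairwiseSlopGap {
    /** Endpoint-only calculation: compares just the first and last positions. */
    static int endpointGap(int[] pp) {
        if (pp.length < 2) {
            return 0;
        }
        return (pp[pp.length - 1] - pp[0]) - (pp.length - 1);
    }

    /** Per-pair calculation: sums the holes between consecutive positions. */
    static int pairwiseGap(int[] pp) {
        int gap = 0;
        for (int i = 0; i < pp.length - 1; i++) {
            int dpos = pp[i + 1] - pp[i];
            if (dpos > 1) {
                gap += dpos - 1; // dpos of 0 (same position) or 1 adds nothing
            }
        }
        return gap;
    }

    public static void main(String[] args) {
        int[] plain = {1, 3, 4};       // one term per position
        int[] stacked = {0, 0, 1, 3};  // two terms share position 0

        System.out.println(endpointGap(plain));   // prints 1
        System.out.println(pairwiseGap(plain));   // prints 1
        System.out.println(endpointGap(stacked)); // prints 0
        System.out.println(pairwiseGap(stacked)); // prints 1
    }
}
```

With stacked terms the endpoint formula undercounts, because the extra term inflates pp.length without covering any additional position.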

Haven't tested this - just to give you an idea what to try next.

Doron


On Wed, 1 Feb 2012 11:04:31 -0800, Paul Allan Hill <...@metajure.com

Thanks for the discussion, I really appreciate you pointing out that the

And by "here" you mean my original code, not your suggestion.

At first I was thinking my refinement of this would be to consider the original slop provided by the user and only extend it when necessary.
For example:
"The Importance of Being Earnest"~2
already has enough slop to account for the stop words 'the' and 'of', so there is no need to add more to the slop.
But a slop of 2 really means the user would accept
[The Importance of Really Truly Being Earnest], and I see that requires a slop of 3 to skip [of] [Really] [Truly].
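The two policies can be compared with a little arithmetic in plain Java. This is a sketch under stated assumptions: 'the' and 'of' are removed but keep their positions, the slop an ordered near match needs is the total number of skipped positions between consecutive terms, and all class and method names are hypothetical.

```java
class SlopPolicy {
    /** Slop an ordered near match needs: total skipped positions between terms. */
    static int requiredSlop(int[] docPos) {
        int needed = 0;
        for (int i = 1; i < docPos.length; i++) {
            needed += docPos[i] - docPos[i - 1] - 1;
        }
        return needed;
    }

    /** Policy 1: extend the user's slop only when the stop-word gap exceeds it. */
    static int extendOnlyWhenNeeded(int userSlop, int stopGap) {
        return Math.max(userSlop, stopGap);
    }

    /** Policy 2: always add the stop-word gap on top of the user's slop. */
    static int alwaysAdd(int userSlop, int stopGap) {
        return userSlop + stopGap;
    }

    public static void main(String[] args) {
        int userSlop = 2; // from "The Importance of Being Earnest"~2
        int stopGap = 1;  // the removed 'of' between 'importance' and 'being'

        // Positions of importance, being, earnest in
        // "The Importance of Really Truly Being Earnest"
        int[] docPos = {1, 5, 6};
        int needed = requiredSlop(docPos);

        System.out.println(needed);                                          // prints 3
        System.out.println(extendOnlyWhenNeeded(userSlop, stopGap) >= needed); // prints false
        System.out.println(alwaysAdd(userSlop, stopGap) >= needed);            // prints true
    }
}
```

Under these assumptions, only the additive policy keeps the user's two words of tolerance intact once the stop word is skipped.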

But I'm not sure I understand the 'edit distance' for a phrase with more than 2 words. Does it apply to _all_the_edits_combined needed to bring the quoted phrase to match the indexed phrase, as suggested by your calculation?

Also, do any "boundary cases" (as mentioned in your comment) come to mind?

I don't understand what you mean by "simplifies", since you already listed the simplification in your first example, which I think would work with or without synonyms, so there is no need to walk through each distance as shown in your later code.

Thanks for your input. I will experiment with some code that uses the original PQ positions when computing the slop value of any generated SpanNearQuery.

-Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java...@lucene.apache.org
For additional commands, e-mail: java...@lucene.apache.org

On Wed, 1 Feb 2012 13:37:17 -0800, Paul Allan Hill <...@metajure.com

int gap = (pp[pp.length-1] - pp[0]) - (pp.length - 1);

Don't want to cause an IndexOutOfBoundsException
-Paul


On Wed, 1 Feb 2012 23:44:35 +0200, Doron Cohen <...@gmail.com

Right... that's what I meant with "(boundary cases)"...