Query term counting, again...

On Wed, 25 Jan 2012 15:36:19 -0800 (PST), David Olson <...@proxemx.com>

Hi all,

After much code and forum searching, I've hit a frustrating point that
should be more obvious. I've trawled through a ton of postings and messages
on keyword counting and it seems like all the examples cover single word
terms. I've got several code bits I've written that can get me what I want
from a single term perspective but I have queries with several terms that
also mix in phrases. Ultimately I'd like to have output that says banana - 2
times, "chocolate chips" - 4 times, over a course of 1000+ documents.

Right now I walk through the query terms and match against the term vectors
from my hits. This, of course, makes the assumption chocolate and chips are
separate terms. Comparing positions seems like the only way.
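To make the position comparison concrete, here is a minimal sketch in plain Java. It assumes you have already pulled, for one document, a sorted array of positions per term (for example from a term vector stored with positions). The class name `PhraseCounter` and the method `countPhrase` are made up for illustration and are not part of Lucene. It only handles exact, in-order phrases with no slop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class PhraseCounter {

    /**
     * Counts exact, in-order occurrences of a phrase in one document.
     * positionsByTerm maps each term to its SORTED token positions (assumption).
     * A single-term "phrase" simply returns that term's frequency.
     */
    static int countPhrase(Map<String, int[]> positionsByTerm, List<String> phraseTerms) {
        if (phraseTerms.isEmpty()) {
            return 0;
        }
        int[] first = positionsByTerm.get(phraseTerms.get(0));
        if (first == null) {
            return 0;
        }
        int count = 0;
        for (int start : first) {
            boolean match = true;
            // The i-th term of the phrase must sit exactly i positions after the first.
            for (int i = 1; i < phraseTerms.size(); i++) {
                int[] positions = positionsByTerm.get(phraseTerms.get(i));
                if (positions == null || Arrays.binarySearch(positions, start + i) < 0) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Positions for: banana chocolate chips and more chocolate chips chips
        Map<String, int[]> doc = new HashMap<>();
        doc.put("banana", new int[] {0});
        doc.put("chocolate", new int[] {1, 5});
        doc.put("chips", new int[] {2, 6, 7});

        System.out.println(countPhrase(doc, Arrays.asList("chocolate", "chips"))); // 2
        System.out.println(countPhrase(doc, Arrays.asList("chips")));              // 3
    }
}
```

Summing the per-document result over all hits would give the aggregate count per keyword or phrase.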

The frustrating point is that I see the 2 query types in the clauses for the
query. And, more annoying is that explain() does show what I need and I
haven't had a lot of luck backtracking what it's doing. Spans didn't seem to
help either.

Any advice? I'm getting real good at single term counting :)

-DO

--
View this message in context: http://lucene.472066.n3.nabble.com/Query-term-counting-again-tp3689354p3689354.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java...@lucene.apache.org
For additional commands, e-mail: java...@lucene.apache.org



On Thu, 26 Jan 2012 08:44:11 -0500, Michael McCandless <...@mikemccandless.com>

You should be able to use the Scorer.visitSubScorers API? You'd do
this up front, to recursively gather all "interesting" scorers in the
Query, and then in a custom collector, in the collect method, you can
go and ask each subScorer whether it matched the current document
(call its .freq() and see if that is greater than 0).

This is very expert territory and not well explored... and there are
certain cases where it will fail because of how boolean scorers
work... but it should otherwise work and scale well.

Mike McCandless

http://blog.mikemccandless.com
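
The shape of that approach (gather the sub-scorers once, then ask each one for its frequency on every collected document) can be sketched in plain Java. This is only a model of the idea and not Lucene code: `ClauseScorer`, `fixed` and `ClauseCountCollector` are hypothetical stand-ins, and `freq(docId)` here just reads from a fixed array, whereas a real sub-scorer would report the frequency for the document it is currently positioned on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the collector idea; none of these types are Lucene's.
class ClauseCountCollector {

    /** Hypothetical stand-in for one "interesting" sub-scorer (a term or a phrase). */
    interface ClauseScorer {
        String label();
        int freq(int docId); // occurrences in docId, 0 if the clause did not match
    }

    /** Test double: a scorer whose per-document frequencies are given up front. */
    static ClauseScorer fixed(final String label, final int... freqByDoc) {
        return new ClauseScorer() {
            public String label() {
                return label;
            }
            public int freq(int docId) {
                return docId >= 0 && docId < freqByDoc.length ? freqByDoc[docId] : 0;
            }
        };
    }

    private final List<ClauseScorer> scorers;
    private final Map<String, Integer> totals = new LinkedHashMap<>();

    /** The scorers are gathered once, up front, before any document is collected. */
    ClauseCountCollector(List<ClauseScorer> scorers) {
        this.scorers = new ArrayList<>(scorers);
    }

    /** Called once per matching document: add each clause's frequency to its total. */
    void collect(int docId) {
        for (ClauseScorer scorer : scorers) {
            int freq = scorer.freq(docId);
            if (freq > 0) {
                totals.merge(scorer.label(), freq, Integer::sum);
            }
        }
    }

    Map<String, Integer> totals() {
        return totals;
    }

    public static void main(String[] args) {
        ClauseCountCollector collector = new ClauseCountCollector(Arrays.asList(
                fixed("banana", 1, 0, 1),
                fixed("chocolate chips", 2, 2, 0)));
        for (int doc = 0; doc < 3; doc++) {
            collector.collect(doc);
        }
        System.out.println(collector.totals().get("banana"));          // 2
        System.out.println(collector.totals().get("chocolate chips")); // 4
    }
}
```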


On Thu, 26 Jan 2012 09:31:10 -0500, "David Olson" <...@graniterabbit.com>

Thanks Mike - I spent a few hours tracing through the explain process last
night and could see all that and it looked like most was reachable without
having to alter core classes. The other thing I thought of: since I'm doing
this as a one-time shot as messages come in (persisting aggregate counts), I
could segregate the term queries from the phrase queries and have a more
predictable collection of scorers. But then I might as well do an individual
search for each keyword. That seems a bit off too.
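
For illustration, segregating terms from phrases could look like the sketch below if the keywords start out as a raw string with phrases in double quotes. That input format is an assumption (the thread only mentions query clauses), and `KeywordSplitter` is a made-up name. An unbalanced quote is simply treated as part of a single term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a keyword string into single terms and quoted phrases.
class KeywordSplitter {

    // Either a double-quoted phrase (group 1) or a run of non-whitespace (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]+)\"|(\\S+)");

    final List<String> terms = new ArrayList<>();
    final List<String> phrases = new ArrayList<>();

    static KeywordSplitter split(String keywords) {
        KeywordSplitter out = new KeywordSplitter();
        Matcher m = TOKEN.matcher(keywords);
        while (m.find()) {
            if (m.group(1) != null) {
                out.phrases.add(m.group(1)); // quoted phrase, quotes stripped
            } else {
                out.terms.add(m.group(2));   // bare single term
            }
        }
        return out;
    }

    public static void main(String[] args) {
        KeywordSplitter split = KeywordSplitter.split("banana \"chocolate chips\" apple");
        System.out.println(split.terms);   // [banana, apple]
        System.out.println(split.phrases); // [chocolate chips]
    }
}
```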

The basis of this function is to have near real-time performance of keywords
from incoming messages. Then we use those numbers for targeting. I index the
messages as they come in and then we can use all the great Lucene stuff for
searching and analysis after the fact. It's just the term/phrase thing
that's been frustrating me and I refuse to parse the output of explain. Just
something about that doesn't sit right. With a hundred vendors that could
have 30 keywords each, ouch.

Thanks again!

-David-

On Thu, 26 Jan 2012 15:13:41 +0100, "Uwe Schindler" <...@thetaphi.de>

You have to take care that BooleanScorer2 is used, by requesting
docsInOrder. Then it's very nice; I have a customer using this. The important
thing is that your Collector returns the right thing :-)

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uw...@thetaphi.de
