src/dict/dictionary: better filter mecab and deconjugator duplicates
Better filter queries that are produced by both the deconjugator and
MeCab. Previously, MeCab results were generally not filtered out when they
duplicated deconjugator results. This led to duplicate search results,
some with conjugation information and some without, which is generally
undesirable. This commit changes the behavior to filter out MeCab results
that duplicate deconjugator results when compiled with MECAB_SUPPORT.

The filtering algorithm is not trivial. It required adding metadata
indicating the source algorithm that each query came from. Without this
metadata, it would have been difficult to distinguish MeCab queries
from Exact queries. The algorithm first collects every deconj string
that the deconjugator came up with, then discards any MeCab query whose
deconj string the deconjugator already found. Unfortunately, this
behavior cannot easily be implemented with std::unique.
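
As a rough illustration of the two-pass filter described above, here is a minimal standalone sketch. It is not the code from this commit: Query, dropMecabDuplicates, uniqueByDeconj, sampleQueries and usageCheck are hypothetical names, std::string and std::unordered_set stand in for QString and QSet, and the romanized sample strings are made up. It also shows one reason std::unique is a poor fit: it only compares neighbouring elements and always keeps the first of each run, so it cannot express "prefer the deconjugator entry".

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for SearchQuery, reduced to the two fields the
// filter reads. std::string replaces QString so no Qt is needed.
enum class Source { exact, deconj, mecab };

struct Query
{
    Source source;
    std::string deconj;
};

// Pass 1 collects every deconj string produced by the deconjugator.
// Pass 2 erases MeCab queries whose deconj string is in that set.
void dropMecabDuplicates(std::vector<Query> &queries)
{
    std::unordered_set<std::string> seen;
    for (const Query &q : queries)
    {
        if (q.source == Source::deconj)
        {
            seen.insert(q.deconj);
        }
    }
    queries.erase(
        std::remove_if(
            queries.begin(), queries.end(),
            [&seen] (const Query &q) -> bool
            {
                return q.source == Source::mecab &&
                       seen.count(q.deconj) > 0;
            }
        ),
        queries.end()
    );
}

// For contrast: std::unique with a deconj-only comparison. It removes an
// element only when it directly follows an equal one.
void uniqueByDeconj(std::vector<Query> &queries)
{
    auto last = std::unique(
        queries.begin(), queries.end(),
        [] (const Query &lhs, const Query &rhs)
        {
            return lhs.deconj == rhs.deconj;
        }
    );
    queries.erase(last, queries.end());
}

// Made-up input: the MeCab and deconjugator entries for "taberu" are
// duplicates of each other but are not adjacent.
std::vector<Query> sampleQueries()
{
    return {
        { Source::mecab,  "taberu" },
        { Source::exact,  "tabeta" },
        { Source::deconj, "taberu" },
        { Source::mecab,  "tabe"   },
    };
}

void usageCheck()
{
    std::vector<Query> filtered = sampleQueries();
    dropMecabDuplicates(filtered);
    assert(filtered.size() == 3);                 // MeCab "taberu" is gone
    assert(filtered[0].source == Source::exact);

    std::vector<Query> viaUnique = sampleQueries();
    uniqueByDeconj(viaUnique);
    assert(viaUnique.size() == 4);                // nothing was adjacent
}
```

Calling usageCheck() from a main function exercises both paths. In this sketch the MeCab-only entry "tabe" survives because the deconjugator never produced that string, which matches the "prefer deconjugator results" rule stated in the commit message.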
ripose-jp committed Jul 2, 2024
1 parent 9cca61a commit a3a010d
Showing 6 changed files with 50 additions and 2 deletions.
3 changes: 2 additions & 1 deletion src/dict/deconjugationquerygenerator.cpp
@@ -79,10 +79,11 @@ std::vector<SearchQuery> DeconjugationQueryGenerator::generateQueries(
 else
 {
     result.emplace_back(SearchQuery{
+        SearchQuery::Source::deconj,
         info.base,
         info.conjugated,
         { rule },
-        info.derivationDisplay
+        info.derivationDisplay,
     });
 }

32 changes: 32 additions & 0 deletions src/dict/dictionary.cpp
@@ -281,6 +281,38 @@ void Dictionary::sortQueries(std::vector<SearchQuery> &queries)

 void Dictionary::filterDuplicates(std::vector<SearchQuery> &queries)
 {
+#ifdef MECAB_SUPPORT
+    /* Remove all duplicates that both MeCab and the deconjugator got.
+     * Prefer deconjugator results over MeCab. */
+    QSet<QString> deconjQueries;
+    for (const SearchQuery &query : queries)
+    {
+        if (query.source == SearchQuery::Source::deconj)
+        {
+            deconjQueries.insert(query.deconj);
+        }
+    }
+    queries.erase(
+        std::remove_if(
+            std::begin(queries), std::end(queries),
+            [&deconjQueries] (const SearchQuery &query) -> bool
+            {
+                switch (query.source)
+                {
+                case SearchQuery::Source::mecab:
+                    return deconjQueries.contains(query.deconj);
+
+                case SearchQuery::Source::deconj:
+                case SearchQuery::Source::exact:
+                    return false;
+                }
+                return false;
+            }
+        ),
+        std::end(queries)
+    );
+#endif // MECAB_SUPPORT
+
     auto last = std::unique(
         std::begin(queries), std::end(queries),
         [] (const SearchQuery &lhs, const SearchQuery &rhs) -> bool
1 change: 1 addition & 0 deletions src/dict/exactquerygenerator.cpp
@@ -31,6 +31,7 @@ std::vector<SearchQuery> ExactQueryGenerator::generateQueries(
     SearchQuery sq;
     sq.deconj = query;
     sq.surface = query;
+    sq.source = SearchQuery::Source::exact;
     queries.emplace_back(std::move(sq));
     query.chop(1);
 }
1 change: 1 addition & 0 deletions src/dict/mecabquerygenerator.cpp
@@ -153,6 +153,7 @@ MeCabQueryGenerator::generateQueriesHelper(const MeCab::Node *node)
     query.deconj = deconj;
     query.surface = surface;
     query.surfaceClean = surfaceClean;
+    query.source = SearchQuery::Source::mecab;
     queries.emplace_back(std::move(query));
 }

13 changes: 13 additions & 0 deletions src/dict/searchquery.h
@@ -30,6 +30,19 @@
  */
 struct SearchQuery
 {
+    /**
+     * Enumeration of SearchQuery sources.
+     */
+    enum class Source
+    {
+        exact,
+        deconj,
+        mecab,
+    };
+
+    /* The query algorithm this query comes from */
+    Source source;
+
     /* The deconjugated string */
     QString deconj;

2 changes: 1 addition & 1 deletion src/util/constants.h
@@ -151,7 +151,7 @@ namespace Constants

 #ifdef MECAB_SUPPORT
 constexpr const char *MECAB_IPADIC = "ipadic-matcher";
-constexpr bool MECAB_IPADIC_DEFAULT = false;
+constexpr bool MECAB_IPADIC_DEFAULT = true;
 #endif // MECAB_SUPPORT
 }
