Pass weights to jaccard_string_group() (or speed up the function) #126

@aidanhorn

Is your feature request related to a problem? Please describe.
jaccard_string_group() takes too long on 25 million rows that contain about 1000 dirty categories, which pare down to about 200 clean categories. It can, however, process the vector of unique dirty strings within minutes. The problem with that approach is that jaccard_string_group() does not pass a weights vector through to cluster_fast_greedy(), so every dirty string in the unique vector gets equal weight.
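The deduplicate-then-expand workaround described above can be sketched in base R. This is only an illustration: `group_via_unique()`, `string_counts()` and `toy_group()` are hypothetical names, and `toy_group()` is a stand-in for jaccard_string_group() so that the snippet runs without the package installed.

```r
# Hypothetical helper: run an expensive grouping function on the unique
# values only, then expand the result back to the full-length vector.
group_via_unique <- function(x, group_fun, ...) {
  ux <- unique(x)                 # distinct dirty strings
  cleaned <- group_fun(ux, ...)   # one cleaned label per distinct string
  cleaned[match(x, ux)]           # map every row back to its cleaned label
}

# Hypothetical helper: frequency of each distinct string, in the same
# order as unique(x), in case the counts are wanted as weights later.
string_counts <- function(x) {
  ux <- unique(x)
  tabulate(match(x, ux), nbins = length(ux))
}

# Stand-in grouping function (assumption, for illustration only):
# groups strings by their upper-cased first letter.
toy_group <- function(s) toupper(substr(s, 1, 1))

x <- c("apple", "appel", "apple", "banana", "apple")
cleaned <- group_via_unique(x, toy_group)
counts <- string_counts(x)
```

With the real function, the call would presumably look like `group_via_unique(x, jaccard_string_group, threshold = .7)`; the counts are what the unique vector loses and what a weights option would have to carry.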

Describe the solution you'd like
Please include an option to pass weights to jaccard_string_group().

Describe alternatives you've considered
I have copied the function and tried to include this option, but I do not have Rust installed and I'm not sure how to compile everything using Rust.

Additional context

jaccard_string_group <- function(   string,
                                    n_gram_width = 2,
                                    n_bands = 45,
                                    band_width = 8,
                                    threshold = .7,
                                    progress = TRUE,
                                    cluster_weights = NULL) {
  if (!requireNamespace("igraph")) {
    stop("library 'igraph' must be installed to run this function")
  }

  pairs <- rust_jaccard_join(string,
    string,
    ngram_width = n_gram_width,
    n_bands,
    band_size = band_width,
    threshold = threshold,
    progress = progress,
    seed = round(stats::runif(1, 0, 2^64))
  )

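  graph <- igraph::graph_from_edgelist(pairs)
  undirected <- igraph::as.undirected(graph)
  # cluster_weights are edge weights (one per edge of the undirected graph),
  # not per-string weights; NULL leaves the clustering unweighted.
  if (utils::packageVersion("igraph") < "2.0.0") {
    fc <- igraph::fastgreedy.community(undirected, weights = cluster_weights)
  } else {
    fc <- igraph::cluster_fast_greedy(undirected, weights = cluster_weights)
  }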
  groups <- igraph::groups(fc)
  lookup_table <- vapply(groups, "[[", integer(1), 1)
  membership <- igraph::membership(fc)
  return(string[lookup_table[membership]])
}
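One thing to keep in mind with the draft above: the `weights` argument of cluster_fast_greedy() is a vector of edge weights, so per-string frequencies would first have to be turned into one number per matched pair. The sketch below shows one possible way to do that with igraph alone. The toy `strings`, `counts` and `pairs` are made up, and summing the two endpoint counts is purely an assumption for illustration, not something the package does.

```r
# Made-up inputs: unique strings, how often each occurs, and matched
# index pairs of the kind a self-join returns (both directions, self-matches).
strings <- c("apple", "appel", "aple", "banana", "bananna")
counts  <- c(100, 3, 2, 50, 1)
pairs   <- rbind(c(1, 2), c(2, 1), c(1, 3), c(2, 3), c(4, 5), c(1, 1))

# Build an undirected graph and drop duplicate edges and self-loops, so
# that the edge list the weights must line up with is unambiguous.
g <- igraph::graph_from_edgelist(pairs, directed = FALSE)
g <- igraph::simplify(g, remove.multiple = TRUE, remove.loops = TRUE)

# One weight per remaining edge. ASSUMPTION: weight = sum of the two
# endpoint frequencies; other choices (product, minimum) are equally arbitrary.
e <- igraph::ends(g, igraph::E(g), names = FALSE)
edge_weights <- counts[e[, 1]] + counts[e[, 2]]

fc <- igraph::cluster_fast_greedy(g, weights = edge_weights)
memb <- as.integer(igraph::membership(fc))
```

The weights have to be computed after the graph is simplified, because the number of edges left at that point generally differs from `nrow(pairs)`.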
