text-embedding

Ollama, and hence rollama, can be used for text embedding. In short, text embedding uses the knowledge of the meaning of words inferred from the context that is saved in a large language model through its training to turn text into meaningful vectors of numbers. This technique is a powerful preprocessing step for supervised machine learning and often increases the performance of a classification model substantially. Compared to using rollama directly for classification, the advantage is that converting text into embeddings and then using these embeddings for classification is usually faster and more resource efficient – especially if you re-use embeddings for multiple tasks.

library(rollama)
library(tidyverse)

reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
                       show_col_types = FALSE)
glimpse(reviews_df)
#> Rows: 23,486
#> Columns: 11
#> $ ...1                      [3m[38;5;246m<dbl>[39m[23m 0, 1, 2, 3, 4, 5, 6,…
#> $ `Clothing ID`             [3m[38;5;246m<dbl>[39m[23m 767, 1080, 1077, 104…
#> $ Age                       [3m[38;5;246m<dbl>[39m[23m 33, 34, 60, 50, 47, …
#> $ Title                     [3m[38;5;246m<chr>[39m[23m NA, NA, "Some major …
#> $ `Review Text`             [3m[38;5;246m<chr>[39m[23m "Absolutely wonderfu…
#> $ Rating                    [3m[38;5;246m<dbl>[39m[23m 4, 5, 3, 5, 5, 2, 5,…
#> $ `Recommended IND`         [3m[38;5;246m<dbl>[39m[23m 1, 1, 0, 1, 1, 0, 1,…
#> $ `Positive Feedback Count` [3m[38;5;246m<dbl>[39m[23m 0, 4, 0, 0, 6, 4, 1,…
#> $ `Division Name`           [3m[38;5;246m<chr>[39m[23m "Initmates", "Genera…
#> $ `Department Name`         [3m[38;5;246m<chr>[39m[23m "Intimate", "Dresses…
#> $ `Class Name`              [3m[38;5;246m<chr>[39m[23m "Intimates", "Dresse…

Now this is a rather big dataset, and I don’t want to stress my GPU too much, so I only select the first 500 reviews for embedding. I also process the data slightly by combining the title and review text into a single column and turning the rating into a binary variable:

reviews <- reviews_df |>
  slice_head(n = 500) |>
  rename(id = ...1) |>
  mutate(rating = factor(Rating == 5, c(TRUE, FALSE), c("5", "<5"))) |>
  mutate(full_text = paste0(ifelse(is.na(Title), "", Title), `Review Text`))

embed_text(text = reviews$full_text[1:3])
#> ✔ embedded 3 texts [4s] 
#> # A tibble: 3 × 4,096
#>   dim_1  dim_2 dim_3  dim_4 dim_5  dim_6  dim_7  dim_8 dim_9 dim_10
#>   <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
#> 1  1.85 -1.71   1.47  0.478 -1.75  0.771  3.01   0.961 1.65   0.569
#> 2  1.14 -3.61   2.10 -0.385 -4.11 -3.09   0.990 -1.06  2.55   1.84 
#> 3 -3.35  0.172 -3.49 -0.569 -3.14  1.25  -0.102  1.15  0.575 -2.33 
#> # ℹ 4,086 more variables: dim_11 <dbl>, dim_12 <dbl>,
#> #   dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>,
#> #   dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>, dim_20 <dbl>,
#> #   dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#> #   dim_25 <dbl>, dim_26 <dbl>, dim_27 <dbl>, dim_28 <dbl>,
#> #   dim_29 <dbl>, dim_30 <dbl>, dim_31 <dbl>, dim_32 <dbl>,
#> #   dim_33 <dbl>, dim_34 <dbl>, dim_35 <dbl>, dim_36 <dbl>, …

To use this on the sample of reviews, I put the embeddings into a new column, before unnesting the resulting data.frame. The reason behind this is that I want to make sure the embeddings belong to the correct review ID. I also use a different model this time: nomic-embed-text. While models like llama3 are extremely powerful at handling conversations and natural language requests, they are also computationally intensive, and hence relatively slow. As of version 0.1.26, Ollama support using dedicated embedding models, which can perform the task a lot faster and with fewer resources. Download the model with pull_model("nomic-embed-text") then we can run:

reviews_embeddings <- reviews |>
  mutate(embeddings = embed_text(text = full_text, model = "nomic-embed-text")) |>
  select(id, rating, embeddings) |>
  unnest_wider(embeddings)
#> ✔ embedded 500 texts [7m 19.4s]

The resulting data.frame contains the ID and rating along the 768 embedding dimensions:

reviews_embeddings
#> # A tibble: 500 × 770
#>       id rating   dim_1 dim_2 dim_3  dim_4   dim_5 dim_6
#>    <dbl> <fct>    <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>
#>  1     0 <5      1.23   1.66  -3.35 -0.161 -0.0393 0.113
#>  2     1 5       0.881  0.696 -2.34 -0.798 -1.56   1.53 
#>  3     2 <5      0.507  1.03  -2.10 -0.398 -0.822  1.78 
#>  4     3 5      -0.253  0.866 -3.12 -0.225 -0.562  1.14 
#>  5     4 5       0.308  1.05  -2.80 -1.20  -0.466  0.477
#>  6     5 <5      0.860  0.894 -2.29 -0.642 -1.38   2.24 
#>  7     6 5       0.573  0.257 -2.26 -0.681 -0.592  0.461
#>  8     7 <5      0.199  0.628 -2.33 -0.374 -1.51   0.174
#>  9     8 5      -0.124  1.19  -3.17 -0.267 -0.705  0.683
#> 10     9 5      -0.0137 1.05  -2.54 -0.214 -1.48   0.721
#> # ℹ 490 more rows
#> # ℹ 762 more variables: dim_7 <dbl>, dim_8 <dbl>,
#> #   dim_9 <dbl>, dim_10 <dbl>, dim_11 <dbl>,
#> #   dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#> #   dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>,
#> #   dim_18 <dbl>, dim_19 <dbl>, dim_20 <dbl>,
#> #   dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, …

As said above, these embeddings are often used in supervised machine learning. I use part of a blog post by Emil Hvitfeldt show how this can be done using the data we embedded above in the powerful tidymodels collection of packages:

library(tidymodels)
# split data into training an test set (for validation)
set.seed(1)
reviews_split <- initial_split(reviews_embeddings)

reviews_train <- training(reviews_split)

# set up the model we want to use
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

# we specify that we want to do some hyperparameter tuning and bootstrapping
param_grid <- grid_regular(penalty(), levels = 50)
reviews_boot <- bootstraps(reviews_train, times = 10)

# and we define the model. Here we use the embeddings to predict the rating
rec_spec <- recipe(rating ~ ., data = select(reviews_train, -id))

# bringing this together in a workflow
wf_fh <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(lasso_spec)

# now we do the tuning
set.seed(42)
lasso_grid <- tune_grid(
  wf_fh,
  resamples = reviews_boot,
  grid = param_grid
)

# select the best model
wf_fh_final <- wf_fh |>
  finalize_workflow(parameters = select_best(lasso_grid, "roc_auc"))

# and train a new model + predict the classes for the test set
final_res <- last_fit(wf_fh_final, reviews_split)

# we extract these predictions
final_pred <- final_res |>
  collect_predictions()

# and evaluate them with a few standard metrics
my_metrics <- metric_set(accuracy, precision, recall, f_meas)

my_metrics(final_pred, truth = rating, estimate = .pred_class)
#> # A tibble: 4 × 3
#>   .metric   .estimator .estimate
#>   <chr>     <chr>          <dbl>
#> 1 accuracy  binary         0.752
#> 2 precision binary         0.694
#> 3 recall    binary         0.847
#> 4 f_meas    binary         0.763

# and the ROC curve
final_pred |>
  roc_curve(rating, .pred_5) |>
  autoplot()