---
title: "How to boost a DBSCAN task"
subtitle: "Utilizing High Performance Library from Python"
author: "Yonghun Suh"
date: "Sep. 9, 2025"
categories: [Code]
image: https://raw.githubusercontent.com/wangyiqiu/dbscan-python/0.0.12-dev/compare.png
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-copy: true
    code-fold: show
    code-tools: true
    code-overflow: scroll
    code-link: true
    lightbox: true
#execute:
# freeze: true
#comments: false
---
# Backdrop
>[Frodo, 2:37 PM] I'm working on finding representative points for census tracts using nationwide residential building data, weighted by the population of each tract.<br>
[Frodo, 2:38 PM] But DBSCAN is taking forever lol, so I'm running it in parallel.<br>
[Frodo, 2:38 PM] Using something like the future package in R.<br>
[Yong-Hun Suh, 2:41 PM] Yeah, DBSCAN’s time complexity isn’t great…<br>
[Yong-Hun Suh, 4:47 PM] Are you still working on that (DBSCAN)?<br>
[Yong-Hun Suh, 4:47 PM] If it’s a project you’re going to continue, I can give you a tip…<br>
[Frodo, 4:53 PM] Just fixed the code and running it now haha, results should come out in a **few days** if it’s fast.<br>
I was chatting with a colleague from grad school who mentioned that DBSCAN was running far too slowly on his project.
He was already using a fast C++ implementation of DBSCAN from R, yet it still left him waiting for what felt like forever.
I thought, 'Why not help him out and turn this into a blog article?'
## Analyzing complexity factors
::: {.callout-tip}
## What is Time Complexity?
In theoretical computer science, time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. In the plot below, $n$ is the size of the input and $N$ is the number of operations the algorithm performs; an algorithm's efficiency can be assessed by how $N$ grows as $n$ increases.
[Learn more](https://en.wikipedia.org/wiki/Time_complexity){target="_blank"}
:::
```{r, warning=FALSE}
#| fig-cap: "Comparison of Computational Complexity"
#| fig-subcap: "It is good to know whether an algorithm can handle your problem!"
#| fig-alt: "A line plot of computational complexity classes"
library(data.table)
library(ggplot2)

# Generate the sequence `n`
n <- seq(0, 100, by = 0.01) # a fine grid keeps the curves smooth

# One column per complexity class
df <- data.table(
  n = n,
  `O(1)` = 1,
  `O(log n)` = log2(n),
  `O(n)` = n,
  `O(n log n)` = n * log2(n),
  `O(n^2)` = n^2,
  `O(2^n)` = 2^n,
  `O(n!)` = factorial(n)
)
df_long <- data.table::melt(df, id.vars = "n", variable.name = "Complexity", value.name = "Time")

ggplot(df_long, aes(x = n, y = Time, color = Complexity)) +
  geom_line(linewidth = 1.2) +
  ylim(1, 100) +
  xlim(1, 100) +
  labs(
    # title = "Comparison of Computational Complexity",
    x = "Input Size (n)",
    y = "Number of Operations (N)",
    color = "Complexity"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )
```
## DBSCAN time complexity
DBSCAN’s runtime is dominated by how you perform neighborhood (range) queries. The classic claim that “it’s $O(n^2)$” is only the worst case of a spectrum that depends on the data distribution, the dimensionality, the index used, and the radius parameter $\varepsilon$.
---
### Core steps and cost drivers
- **Range queries:** For each point, find all neighbors within radius $\varepsilon$. Distance evaluation for one pair costs $O(d)$ in $d$-dimensional Euclidean space.
- **Cluster expansion:** A queue-based flood-fill that repeatedly issues range queries starting from core points (points with at least $\text{minPts}$ neighbors within $\varepsilon$).
Asymptotically, the number and cost of range queries dominate; the expansion logic is linear in the number of discovered neighbors but tied to the same range-query results.
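To make these cost drivers concrete, here is a minimal, naive DBSCAN sketch in R. It is an illustration only, not the `dbscan` package's implementation, and `naive_dbscan` / `region_query` are made-up names: every point issues a brute-force range query, and core points are expanded with a queue-based flood fill, which is exactly the $O(n^2 \cdot d)$ baseline discussed below.

```{r}
#| eval: false
# Naive DBSCAN sketch: brute-force range queries + queue-based expansion
naive_dbscan <- function(X, eps, minPts) {
  n <- nrow(X)
  labels <- rep(0L, n)                   # 0 = not yet assigned (noise at the end)
  cluster_id <- 0L

  # Brute-force range query: distances to every point, O(n * d) per call
  region_query <- function(i) {
    d2 <- rowSums((X - matrix(X[i, ], n, ncol(X), byrow = TRUE))^2)
    which(d2 <= eps^2)                   # includes the point itself
  }

  for (i in seq_len(n)) {
    if (labels[i] != 0L) next            # already claimed by a cluster
    neighbors <- region_query(i)
    if (length(neighbors) < minPts) next # not a core point
    cluster_id <- cluster_id + 1L
    labels[i] <- cluster_id

    # Cluster expansion: flood fill over density-reachable points
    queue <- setdiff(neighbors, i)
    while (length(queue) > 0L) {
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] == 0L) {
        labels[j] <- cluster_id
        nb_j <- region_query(j)
        if (length(nb_j) >= minPts) {    # j is a core point: keep expanding
          queue <- union(queue, nb_j[labels[nb_j] == 0L])
        }
      }
    }
  }
  labels                                 # 0 = noise, 1..k = clusters
}
```

With an index, only `region_query()` changes; the expansion logic stays the same.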
---
### Complexity by neighborhood search strategy
#### Brute-force (no index)
- Per range query: compute distance to all points ⇒ $O(n \cdot d)$.
- One query per point in the simplest implementation ⇒ $O(n^2 \cdot d)$.
- Cluster expansion may reuse queries or cause repeats; asymptotically the bound remains $\Theta(n^2 \cdot d)$ in the worst case.
#### Space-partitioning trees (kd-tree, ball tree, R-tree-like)
- Index build: typically $O(n \log n)$ time, $O(n)$ space.
- Range query cost:
  - Best/average (well-behaved low–moderate $d$, balanced tree, moderate $\varepsilon$): $O(\log n + k)$, where $k$ is the number of reported neighbors.
  - Worst case (high $d$, large $\varepsilon$, or adversarial data): degenerates to $O(n)$.
- Overall:
  - Best/average: $O(n \log n + \sum \limits_{i=1}^{n} (\log n + k_i)) = O(n \log n + K)$, where $K = \sum k_i$ is the total number of neighbor reports. If the density per query is bounded, $K = O(n)$, giving $O(n \log n)$ plus the distance cost factor $O(d)$.
  - Worst: $O(n^2 \cdot d)$ (a quick timing comparison follows this list).
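To see what an index buys in practice, here is a small timing sketch using `dbscan::frNN()` (fixed-radius nearest neighbors) on toy data, assuming its `search` argument accepts `"kdtree"` and `"linear"` as in `dbscan::dbscan()`.

```{r}
#| eval: false
library(dbscan)
set.seed(1)
X <- matrix(runif(2e4 * 2), ncol = 2)   # 20,000 points in the unit square

# The same fixed-radius query with two search strategies
system.time(nn_tree  <- frNN(X, eps = 0.01, search = "kdtree"))  # indexed range queries
system.time(nn_brute <- frNN(X, eps = 0.01, search = "linear"))  # brute force, ~O(n^2)

# The neighbor sets agree; only the query strategy (and the runtime) differs
all(lengths(nn_tree$id) == lengths(nn_brute$id))
```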
#### Uniform grid (fixed-radius hashing) in low dimensions
- Build grid hashing once: expected $O(n)$.
- Range query: constant number of adjacent cells, expected $O(1 + k)$ in 2D/3D if $\varepsilon$ is aligned with cell size and data are not pathologically skewed.
- Overall in practice: near-linear $O(n + K)$, again with distance cost $O(d)$, but this approach becomes brittle as $d$ grows (a toy sketch follows this list).
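For intuition, a toy 2D version of this idea might look like the sketch below (my own illustration; `build_grid()` and `grid_neighbors()` are made-up helpers). With the cell side equal to $\varepsilon$, any neighbor within $\varepsilon$ must lie in the query point's cell or one of its 8 adjacent cells, so only those candidates need an exact distance check.

```{r}
#| eval: false
# Build the hash once: expected O(n)
build_grid <- function(X, eps) {
  keys <- paste(floor(X[, 1] / eps), floor(X[, 2] / eps))
  split(seq_len(nrow(X)), keys)          # cell key -> point indices
}

# Range query: candidates from the 3 x 3 block of cells, then the exact check
grid_neighbors <- function(X, i, eps, cells) {
  cx <- floor(X[i, 1] / eps)
  cy <- floor(X[i, 2] / eps)
  cand <- integer(0)
  for (dx in -1:1) for (dy in -1:1) {
    cand <- c(cand, cells[[paste(cx + dx, cy + dy)]])  # missing cells yield NULL
  }
  d2 <- (X[cand, 1] - X[i, 1])^2 + (X[cand, 2] - X[i, 2])^2
  cand[d2 <= eps^2]
}

set.seed(1)
X <- matrix(runif(2000), ncol = 2)       # 1,000 points in the unit square
cells <- build_grid(X, eps = 0.05)
length(grid_neighbors(X, i = 1, eps = 0.05, cells = cells))
```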
> In all cases, include the distance computation factor $O(d)$. For high dimensions, tree and grid pruning effectiveness collapses, pushing complexity toward $\Theta(n^2 \cdot d)$.
---
## What influences the runtime
- **Dimensionality $d$:** Distance costs scale with $O(d)$, and index pruning degrades with the curse of dimensionality, often turning tree queries into $O(n)$.
- **Neighborhood radius $\varepsilon$:** Larger $\varepsilon$ increases average neighbor count $k$, raises $K=\sum k_i$, and triggers more expansions; in the limit, most points neighbor each other ⇒ near $O(n^2 \cdot d)$ (see the quick check after this list).
- **Data distribution and density:** Well-separated, roughly uniform, low-density data favor subquadratic performance with indexes. Dense clusters or large connected components increase expansions and repeats.
- **minPts:** Affects how many points become core (thus how much expansion occurs). It changes constants and practical behavior but not the worst-case big-O bound.
- **Distance metric:** Non-Euclidean metrics can alter pruning efficacy and per-distance cost.
- **Implementation details:** Caching of range queries, deduplication, and “expand only once per point” optimizations materially reduce constants.
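To isolate the $\varepsilon$ effect mentioned in the list above, the quick check below (toy data, again assuming `dbscan::frNN()`) reports how the average neighbor count, and with it the total work $K=\sum k_i$, grows as $\varepsilon$ increases.

```{r}
#| eval: false
library(dbscan)
set.seed(42)
X <- matrix(rnorm(5e4), ncol = 2)        # 25,000 points

for (eps in c(0.05, 0.1, 0.2, 0.4)) {
  elapsed <- system.time(nn <- frNN(X, eps = eps))["elapsed"]
  cat(sprintf("eps = %.2f | mean neighbors per point = %7.1f | elapsed = %.2fs\n",
              eps, mean(lengths(nn$id)), elapsed))
}
```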
---
## Practical scenarios
| Scenario | Assumptions | Typical complexity |
|---|---|---|
| Brute-force baseline | Any $d$, no index | $\Theta(n^2 \cdot d)$ |
| Tree index, low–moderate $d$ | Balanced tree, moderate $\varepsilon$, bounded neighbor counts | $O(n \log n \cdot d)$ |
| Tree index, high $d$ or large $\varepsilon$ | Poor pruning, dense neighborhoods | $\Theta(n^2 \cdot d)$ |
| Grid index (2D/3D) | Well-chosen cell size, non-pathological data | Near $O(n \cdot d)$ |
| Approximate NN/range | ANN structure (e.g., HNSW), approximate neighbors | Subquadratic wall-time; formal bounds vary |
::: {.callout-note}
- The index build cost $O(n \log n)$ (trees) or $O(n)$ (grid) is typically amortized across all range queries.
- If each point’s neighbor count $k_i$ is bounded by a small constant on average, and pruning works, the total neighbor reports $K$ is $O(n)$, leading to near $O(n \log n)$ behavior with trees.
:::
> But... what if you can neither reduce your dataset nor change the DBSCAN parameters?
## Solution...? At least for this case :)
**Use a more efficient algorithm that takes full advantage of your hardware!**
For this case, I used `dbscan-python`, a high-performance parallel implementation of DBSCAN based on the SIGMOD 2020 paper “[Theoretically Efficient and Practical Parallel DBSCAN](https://github.com/wangyiqiu/dbscan-python){target="_blank"}.”
This work achieves theoretically efficient clustering by minimizing total work while keeping the parallel depth polylogarithmic, which enables scalable performance on large datasets.
Compared to the naive $O(n^2)$ approach, the proposed algorithm achieves $O(n \log n)$ work and $O(\log n)$ depth in 2D, and sub-quadratic work in higher dimensions, through grid-based partitioning, parallel union-find, and spatial indexing techniques.
As the name of the package implies, I need to use Python again for this experiment.
# Experiments
## Setups
### Used R Packages
```{r warning=FALSE}
library(dbscan)
library(data.table)
library(arrow)
library(future.apply)
library(ggplot2)
```
### Python Setup
```{r}
library(reticulate)
if (Sys.info()[[1]] == "Windows") {
  # My local Windows environment
  use_condaenv("C:/Users/dydgn/miniforge3/envs/dbscan/python.exe")
} else {
  # GitHub Actions environment (CI/CD)
  system("micromamba install -n baseline -c conda-forge dbscan -y")
  use_condaenv("/home/runner/micromamba/envs/baseline/bin/python", required = TRUE)
}
sys <- import("sys")
sys$executable
```
## Generated Lab Dataset
First, I replicated a well-known synthetic dataset for demonstrating DBSCAN: two overlapping semicircles (the "two moons" pattern).
```{r}
generate_random_semicircle <- function(center_x, center_y, radius, start_angle, end_angle, n_points = 100, noise = 0.08) {
  angles <- runif(n_points, min = start_angle, max = end_angle)
  x <- center_x + radius * cos(angles) + rnorm(n_points, 0, noise)
  y <- center_y + radius * sin(angles) + rnorm(n_points, 0, noise)
  return(as.data.table(list(x = x, y = y)))
}

set.seed(123)
N <- 1e+5L
semicircle1 <- generate_random_semicircle(center_x = 0, center_y = 0.25, radius = 1,
                                          start_angle = pi, end_angle = 2 * pi, n_points = N)
semicircle1[, group := "Semicircle 1"]
semicircle2 <- generate_random_semicircle(center_x = 1, center_y = -0.25, radius = 1,
                                          start_angle = 0, end_angle = pi, n_points = N)
semicircle2[, group := "Semicircle 2"]
df <- rbindlist(list(semicircle1, semicircle2))

# visualization
ggplot(df, aes(x = x, y = y, color = group)) +
  geom_point(size = 2) +
  coord_fixed(ratio = 1) +
  theme_minimal() +
  labs(title = "Randomized Overlapping Semicircles", x = NULL, y = NULL) +
  theme(legend.position = "none")

lab_data <- df[, !c("group"), with = FALSE]
```
### R - Naive
```{r}
lab_data_r <- lab_data
epsilon <- .08
minpts <- 200L # should be an integer for the Python env
stime <- Sys.time()
db_result_r <- dbscan(lab_data_r, eps = epsilon, minPts = minpts)
etime <- Sys.time()
delta_t_r <- etime - stime; delta_t_r

# result
table(db_result_r$cluster)
lab_data_r$group <- as.factor(db_result_r$cluster)

# visualization
ggplot(lab_data_r, aes(x = x, y = y, color = group)) +
  geom_point(size = 2) +
  coord_fixed(ratio = 1) +
  theme_minimal() +
  labs(title = "Randomized Overlapping Semicircles", x = NULL, y = NULL) +
  theme(legend.position = "none")
```
### Python - `wangyiqiu/dbscan-python`
```{r}
lab_data_py <- lab_data
py$lab_data <- as.matrix(lab_data_py)
py$epsilon <- epsilon
py$minpts <- minpts
```
```{python}
import numpy as np
from dbscan import DBSCAN
import time
print("type(X):", type(lab_data))
print("shape(X):", getattr(lab_data, 'shape', None))
start = time.time()
labels, core_samples_mask = DBSCAN(lab_data, eps = epsilon, min_samples = minpts)
end = time.time()
delta_t = end - start
print(f"Elapsed time: {delta_t:.4f} seconds")
# Return the results to R (not needed here; they are read via `py$labels` from R below)
#r.labels = labels
#r.core_mask = core_samples_mask
```
```{r}
# result
table(py$labels)
lab_data_py$group <- as.factor(py$labels + 1) # Python cluster indices start from 0

# visualization
ggplot(lab_data_py, aes(x = x, y = y, color = group)) +
  geom_point(size = 2) +
  coord_fixed(ratio = 1) +
  theme_minimal() +
  labs(title = "Randomized Overlapping Semicircles", x = NULL, y = NULL) +
  theme(legend.position = "none")
```
### Results
```{r}
# speed-up (convert the R timing to seconds to match the Python timing)
cat(as.numeric(delta_t_r, units = "secs") / py$delta_t, " times speed-up\n", sep = "")
```
## Some Real World Dataset
Source: [NYC Taxi Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
```{r}
months <- 1
urls <- paste0("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-", sprintf("%02d", 1:months), ".parquet")
plan(multisession, workers = 2)
taxi_list <- future_lapply(urls, function(url) as.data.table(read_parquet(url))) |> rbindlist()
gc()
head(taxi_list)

# keep the two numeric columns and filter out non-positive values and NAs
taxi_numeric_dt <- taxi_list[
  trip_distance > 0 & fare_amount > 0 & !is.na(trip_distance) & !is.na(fare_amount),
  .(trip_distance, fare_amount)
]
rm(taxi_list)
gc()
taxi_numeric_dt_200k <- taxi_numeric_dt[sample(.N, 200000)]
#taxi_numeric_dt_10m <- taxi_numeric_dt[sample(.N, 10000000)]
py$taxi_numeric_dt_200k <- as.matrix(taxi_numeric_dt_200k)
#py$taxi_numeric_dt_10m <- as.matrix(taxi_numeric_dt_10m)
```
### R - Naive
```{r}
epsilon <- 0.3
minpts <- 10L
stime <- Sys.time()
db_result_r <- dbscan(taxi_numeric_dt_200k, eps = epsilon, minPts = minpts)
etime <- Sys.time()
delta_t_r <- etime-stime; delta_t_r
table(db_result_r$cluster)
```
### Python - `wangyiqiu/dbscan-python`
```{r}
py$taxi_numeric_dt_200k <- as.matrix(taxi_numeric_dt_200k)
py$epsilon <- epsilon
py$minpts <- minpts
```
```{python}
import numpy as np
from dbscan import DBSCAN
import time
print("type(X):", type(taxi_numeric_dt_200k))
print("shape(X):", getattr(taxi_numeric_dt_200k, 'shape', None))
start = time.time()
labels, core_samples_mask = DBSCAN(taxi_numeric_dt_200k, eps = epsilon, min_samples = minpts)
end = time.time()
delta_t = end - start
print(f"Elapsed time: {delta_t:.4f} seconds")
```
### Results
The code below shows that the results agree: it counts how many points are assigned to each label, sorts the counts in ascending order, and takes the difference between the two label distributions.
```{r}
sort(as.vector(table(as.factor(py$labels)))) - sort(as.vector(table(as.factor(db_result_r$cluster))))
```
A slight difference is observed, but it appears to stem from floating-point arithmetic or the grid-based partitioning, so it is negligible.
```{r}
# speed-up (convert the R timing to seconds to match the Python timing)
cat(as.numeric(delta_t_r, units = "secs") / py$delta_t, " times speed-up\n", sep = "")
```
What a whopping improvement, isn't it?
# Take-home message
::: {.callout-tip title="DBSCAN is a slow algorithm"}
- Worst-case time: $\Theta(n^2 \cdot d)$ regardless of indexing.
- Well-behaved low-dimensional data with effective indexing can approach $O(n \log n \cdot d)$.
- As $d$ or $\varepsilon$ grow, expect degradation toward quadratic behavior.
- Tuning $\varepsilon$, choosing appropriate indexes, and reducing $d$ (e.g., via PCA) often matter more than micro-optimizations.
- If you cannot apply the above strategies, **use a high-performance implementation**.
:::
# Environment info
```{r}
sessionInfo()
```
```{r}
if (Sys.info()[[1]] == "Windows") {
  # For my Windows environment
  system("systeminfo", intern = T)
} else {
  system("lscpu; free -h", intern = T)
}
```