Sampling Variance

Variance estimation with bootstrap chain and tree methods. Although resampling is incorporated within the estimation functions, users who wish to perform resampling separately can use RDSboot or RDSBootOptimizedParallel. After preprocessing with RDSdata, ensure the presence of at least four variables: respondent ID, seed ID, seed indicator, and recruiter ID. Note that the sampling of respondents (seeds and recruits) is conducted with replacement, and the resulting data frame will contain duplicates.

There are six bootstrap methods available: ‘chain1’, ‘chain2’, ‘tree_uni1’, ‘tree_uni2’, ‘tree_bi1’, ‘tree_bi2’

In all bootstrap methods, versions 1 and 2 differ as version 1 sets the number of seeds in a given resample to be consistent with the number of seeds in the original sample (\(s\)), while version 2 sets the sample size of a given resample (\(n_r\)) to be at least equal to or greater than the original sample (\(n_s\)).

‘chain1’ selects \(s\) seeds using SRSWR from all seeds in the original sample and then all nodes in the chains created by each of the resampled seeds are retained. With ‘chain2’, 1 seed is sampled using SRSWR from all seeds in the original sample, and all nodes from the chain created by this seed are retained. It then compares \(n_r\) against \(n_s\), and, if \(n_r < n_s\), continues the resampling process by drawing 1 seed and its chains one by one until \(n_r \geq n_s\).

In the ‘tree_uni1’ method, \(s\) seeds are selected using Simple Random Sampling with Replacement (SRSWR) from all seeds. For each selected seed, this method (A) checks its recruit counts, (B) selects SRSWR of the recruits counts from all recruits identified in (A), and (C) for each sampled recruit, this method repeats Steps A and B. (D) Steps A, B, and C continue until reaching the last wave of each chain. In ‘tree_uni2’, instead of selecting \(s\) seeds, it selects one seed, performs Steps B and C for the selected seed. It compares the size of the resample (\(n_r\)) and the original sample (\(n_s\)), and, if \(n_r < n_s\), it continues the resampling process by drawing 1 seed, performs Steps B and C and checks \(n_r\) against \(n_s\). If \(n_r < n_s\), the process continues until the sample size of a given resample (\(n_r\)) is at least equal to the original sample size (\(n_s\)), i.e., \(n_r \geq n_s\).

‘tree_bi1’ selects \(s\) nodes from the recruitment chains using SRSWR. For each selected node, it (A) checks its connected nodes (i.e., both recruiters and recruits) and their count, (B) from all connected nodes identified in (A), performs SRSWR of the same node count, and (C) for each selected node, performs steps A and B, but does not resample already resampled nodes. (D) Steps A, B, and C are repeated until the end of the chain. In ‘tree_bi2’, instead of \(s\) nodes, it selects 1 node using SRSWR from anywhere in all recruitment chains and repeats steps (B),(C), and (D) until \(n_r \geq n_s\).

RDSboot - Standard Bootstrap

Bootstrap Resampling for Respondent Driven Sampling (RDS). This function performs resampling RDS sample data by bootstrapping edges in recruitment trees or bootstrapping recruitment chains as a whole.

Usage

RDSboot(data, respondent_id_col, seed_id_col, seed_col, recruiter_id_col, type, resample_n)

Arguments

data: pd.DataFrame. The input DataFrame containing RDS data.
respondent_id_col: str. Name of the column containing respondent IDs - A variable indicating respondent ID.
seed_id_col: str. Name of the column containing seed IDs - A variable indicating seed ID.
seed_col: str. Name of the column containing seed indicators - A variable indicating whether a particular respondent is seed or not.
recruiter_id_col: str. Name of the column containing recruiter IDs - A variable indicating recruiter ID.
type: str. One of the six types of bootstrap methods: (1) ‘chain1’, (2) ‘chain2’, (3) ‘tree_uni1’, (4) ‘tree_uni2’, (5) ‘tree_bi1’, (6) ‘tree_bi2’.
resample_n: int. Specifies the number of resamples.

Returns

pd.DataFrame

Returns a data frame consisting of the following elements:

RESPONDENT_ID: A variable indicating respondent ID
RESAMPLE.N: An indicator variable for each resample iteration

Example

from RDSTools import RDSboot

# Bootstrap resampling
boot_results = RDSboot(
    data=rds_data,
    respondent_id_col='ID',
    seed_id_col='S_ID',
    seed_col='SEED',
    recruiter_id_col='R_ID',
    type='tree_uni1',
    resample_n=1000
)

RDSBootOptimizedParallel - Parallel Bootstrap

Parallelized Bootstrap Resampling for Respondent Driven Sampling (RDS). This function performs resampling RDS sample data by bootstrapping edges in recruitment trees or bootstrapping recruitment chains as a whole with parallel processing.

Combines: 1. Dictionary-based lookups for 1.2-1.6x speedup 2. Multi-core parallelization

Usage

RDSBootOptimizedParallel(data, respondent_id_col, seed_id_col, seed_col, recruiter_id_col, type, resample_n, n_cores=2)

Arguments

data: pd.DataFrame. The input DataFrame containing RDS data.
respondent_id_col: str. Name of the column containing respondent IDs - A variable indicating respondent ID.
seed_id_col: str. Name of the column containing seed IDs - A variable indicating seed ID.
seed_col: str. Name of the column containing seed indicators - A variable indicating whether a particular respondent is seed or not.
recruiter_id_col: str. Name of the column containing recruiter IDs - A variable indicating recruiter ID.
type: str. One of the six types of bootstrap methods: (1) ‘chain1’, (2) ‘chain2’, (3) ‘tree_uni1’, (4) ‘tree_uni2’, (5) ‘tree_bi1’, (6) ‘tree_bi2’.
resample_n: int. Specifies the number of resamples.
n_cores: int, optional. Number of cores to use for parallel processing. If None, uses all available cores. Default is 2.

Returns

pd.DataFrame

Returns a data frame consisting of the following elements:

RESPONDENT_ID: A variable indicating respondent ID
RESAMPLE.N: An indicator variable for each resample iteration

Example

from RDSTools import RDSBootOptimizedParallel

# Parallel bootstrap resampling with 4 cores
boot_results = RDSBootOptimizedParallel(
    data=rds_data,
    respondent_id_col='ID',
    seed_id_col='S_ID',
    seed_col='SEED',
    recruiter_id_col='R_ID',
    type='tree_uni1',
    resample_n=1000,
    n_cores=4
)

Working with Results

The bootstrap results can be merged with the original data to examine resampled characteristics:

# Get first bootstrap sample
sample_1 = boot_results[boot_results['RESAMPLE.N'] == 1]

# Merge with original data
merged = pd.merge(sample_1, rds_data,
                 left_on='RESPONDENT_ID', right_on='ID')

# Check characteristics
print(f"Original sample size: {len(rds_data)}")
print(f"Bootstrap sample size: {len(merged)}")
print(f"Original seeds: {rds_data['SEED'].sum()}")
print(f"Bootstrap seeds: {merged['SEED'].sum()}")

Performance Considerations

For large datasets or high numbers of resamples, consider using the parallel version:

# For large-scale bootstrap operations
boot_results = RDSBootOptimizedParallel(
    data=rds_data,
    respondent_id_col='ID',
    seed_id_col='S_ID',
    seed_col='SEED',
    recruiter_id_col='R_ID',
    type='tree_uni1',
    resample_n=10000,  # Large number of resamples
    n_cores=8  # Use 8 cores for parallel processing
)