Estimation
RDS Tools package has 3 estimation functions: (1) RDSmean, (2) RDStable, (3) RDSlm. Users can control the use of weights, selection of variance estimation, as well as the number of resamples if one of the variance estimation resampling approaches is used. Note, before using the estimation functions, please make sure that you preprocess data with RDSdata function.
RDSmean - Descriptive Statistics
Estimating mean with respondent driven sampling sample data. This function calculates weighted or unweighted means for either a continuous or a categorical variable. For continuous variables, it returns a single mean and standard error. For categorical variables, it returns one proportion and standard error per level. Standard errors are calculated using naive (delta-method) or resampling approaches from ‘RDSboot’.
Usage
RDSmean(x, data, weight=None, var_est=None, resample_n=None, n_cores=None, na_rm=True, return_bootstrap_means=False, return_node_counts=False)
Arguments
- x
str. A variable of interest. Continuous variables (numeric dtypes) return a single mean and SE. Categorical variables (object, string, bool, or pandas Categorical dtypes) return one proportion and SE per level. Note that integer-coded categoricals such as
Race=1,2,3are treated as numeric by default; convert them withdata[col] = data[col].astype('category')before callingRDSmeanto get per-category output.- data
pandas.DataFrame. The output DataFrame from RDSdata.
- weight
str, optional. Name of the weight variable. User specified weight variable for a weighted analysis. When set to None, the function performs an unweighted analysis. Default is None.
- var_est
str, optional. One of the six bootstrap types or the delta (naive) method. By default the function calculates naive standard errors. Variance estimation options include ‘naive’ or bootstrap methods like ‘chain1’, ‘chain2’, ‘tree_uni1’, ‘tree_uni2’, ‘tree_bi1’, ‘tree_bi2’. Default is None (naive).
- resample_n
int, optional. Specifies the number of resample iterations. Note that this argument is None when var_est = ‘naive’. Required for bootstrap methods, default 300.
- n_cores
int, optional. Number of CPU cores to use for parallel bootstrap processing. If specified, uses optimized parallel bootstrap. If None, uses standard sequential bootstrap. Default is None.
- na_rm
bool, optional. If True (default), observations with missing values in
x(or in the weight column, when supplied) are removed before estimation. If False, missing values are retained and the estimator returnsNaNwhenever NAs are present, mirroring R’ssvymean(..., na.rm = FALSE)behaviour. Default is True.- return_bootstrap_means
bool, optional. If True, return the per-iteration estimates along with the main results (only for bootstrap methods). For continuous variables this is a list of scalar means; for categorical variables it is a list of proportion arrays aligned with the level order. Default is False.
- return_node_counts
bool, optional. If True, return sample size per iteration along with main results (only for bootstrap methods). Default is False.
Returns
- RDSResult or tuple
An RDSResult object containing the following elements:
- results
DataFrame; A tidy results table. For continuous variables, columns are
MeanandSEwith a single row. For categorical variables, columns areCategory,Mean, andSEwith one row per level. For categorical variables, the reported “Mean” for each level is the estimated proportion of observations in that level.- additional_info
Information about the estimation: (1) SE method: variance estimation method (2) Weight: indicator of whether weighted analysis was used (3) n_Data: total number of observations in the input data (4) n_Analysis: number of observations used in the analysis (after NA removal when
na_rm=True) (5) n_Iteration: number of resampling iterations (if SE method is not ‘naive’) (6) n_Dropped: number of bootstrap iterations skipped due to errors (if SE method is not ‘naive’)- resample_summary
Descriptive summary of resamples if var_est is not ‘naive’: mean, SD, min, quartiles, and max of resample sizes
- resample_estimates
Per-iteration estimates if var_est is not ‘naive’. For continuous variables, a list of scalar means (one per iteration). For categorical variables, a list of proportion arrays (one per iteration, aligned with the level order).
- When return_bootstrap_means=False and return_node_counts=False (default):
Returns RDSResult object only
- When return_bootstrap_means=True and return_node_counts=False:
Returns (RDSResult, bootstrap_estimates_list)
- When return_bootstrap_means=False and return_node_counts=True:
Returns (RDSResult, node_counts_list)
- When return_bootstrap_means=True and return_node_counts=True:
Returns (RDSResult, bootstrap_estimates_list, node_counts_list)
Notes
- The RDSResult object is a pandas DataFrame subclass that:
Retains all DataFrame functionality for analysis
Has custom print formatting for clean display
Exposes the tidy results table via
result.resultsand the underlying bootstrap estimates and node counts as attributes
For categorical variables, the reported “Mean” for each level is the estimated proportion of observations in that level. Each level has its own standard error
Integer-coded categorical variables (such as Race=1,2,3) are treated as numeric by default and will produce a single mean rather than per-category proportions. To obtain per-category output, convert the column with data[col] = data[col].astype('category') before calling RDSmean.
Examples
from RDSTools import RDSmean
# Basic mean with naive variance estimation (continuous variable)
result = RDSmean(x='Age', data=rds_data, var_est='naive')
# Weighted analysis with inverse weights
result = RDSmean(x='Age', data=rds_data, weight='WEIGHT')
# Categorical variable: convert to category dtype first
rds_data['Race'] = rds_data['Race'].astype('category')
result = RDSmean(x='Race', data=rds_data, weight='WEIGHT')
# Output is a tidy table with one row per Race level
# Retain NAs and propagate to NaN (instead of dropping)
result = RDSmean(x='Age', data=rds_data, na_rm=False)
# Bootstrap method with resampling
result = RDSmean(
x='Age',
data=rds_data,
weight='WEIGHT',
var_est='chain1',
resample_n=1000
)
# Categorical bootstrap: per-category proportions and bootstrap SEs
rds_data['Race'] = rds_data['Race'].astype('category')
result = RDSmean(
x='Race',
data=rds_data,
weight='WEIGHT',
var_est='chain1',
resample_n=300
)
# Parallel processing with 4 cores
result = RDSmean(
x='Age',
data=rds_data,
var_est='tree_uni1',
resample_n=1000,
n_cores=4
)
# Return bootstrap estimates and node counts
result, bootstrap_estimates, node_counts = RDSmean(
x='Age',
data=rds_data,
var_est='tree_uni1',
resample_n=1000,
return_bootstrap_means=True,
return_node_counts=True
)
RDStable - Contingency Tables
Estimating one and two-way tables with respondent driven sampling sample data. One-way tables are constructed by specifying a categorical variable for x argument only. Two-way tables are constructed by specifying two categorical variables for x and y arguments. Standard errors of proportions are calculated using naive or resampling approaches from ‘RDSboot’.
Usage
RDStable(x, y=None, data=None, weight=None, var_est=None, resample_n=None, margins=3, n_cores=None, return_bootstrap_tables=False, return_node_counts=False)
Arguments
- x
str. Column name; For a 1-way table, specify one categorical variable. By default the function returns a 1-way table.
- y
str, optional. Column name; Optional, for 2-way tables specify the second categorical variable of interest. Default is None.
- data
pandas.DataFrame. The output DataFrame from RDSdata.
- weight
str, optional. Name of the weight variable. User specified weight variable for a weighted analysis. When set to NULL, the function performs an unweighted analysis. Default is None.
- var_est
str, optional. One of the six bootstrap types or the delta (naive) method. By default the function calculates naive standard errors. Variance estimation options include ‘naive’ or bootstrap methods like ‘chain1’, ‘chain2’, ‘tree_uni1’, ‘tree_uni2’, ‘tree_bi1’, ‘tree_bi2’. Default is None (naive).
- resample_n
int, optional. Specifies the number of resample iterations. Note that this argument is None when var_est = ‘naive’. Required for bootstrap methods, default 300.
- margins
int, optional. For two-way tables: 1=row proportions, 2=column proportions, 3=cell proportions (default). Default is 3.
- n_cores
int, optional. Number of CPU cores to use for parallel bootstrap processing. If specified, uses optimized parallel bootstrap. If None, uses standard sequential bootstrap. Default is None.
- return_bootstrap_tables
bool, optional. If True, return bootstrap table estimates along with main results (only for bootstrap methods). Default is False.
- return_node_counts
bool, optional. If True, return sample size per iteration along with main results (only for bootstrap methods). Default is False.
Returns
- RDSTableResult or tuple
An RDSTableResult object containing the following elements:
- formula
Formula; Variable(s) used for the estimation
- results
DataFrame or tables; Weighted or unweighted proportions (prop_table) and their standard errors (se_table)
- additional_info
Information about the estimation: (1) SE method: variance estimation method (2) Weight: indicator of whether weighted analysis was used (3) n_Data: total number of observations in the input data (4) n_Analysis: number of observations used in the analysis (after NA removal on the table variable(s)) (5) n_Iteration: number of resampling iterations (if SE method is not ‘naive’) (6) n_Dropped: number of bootstrap iterations skipped due to errors (if SE method is not ‘naive’)
- resample_summary
Descriptive summary of resamples if var_est is not ‘naive’: mean, SD, min, quartiles, and max of resample sizes
- resample_estimates
Proportions calculated for each resampling iteration if var_est is not ‘naive’
- When return_bootstrap_tables=False and return_node_counts=False (default):
Returns RDSTableResult object only
- When return_bootstrap_tables=True and return_node_counts=False:
Returns (RDSTableResult, bootstrap_tables_list)
- When return_bootstrap_tables=False and return_node_counts=True:
Returns (RDSTableResult, node_counts_list)
- When return_bootstrap_tables=True and return_node_counts=True:
Returns (RDSTableResult, bootstrap_tables_list, node_counts_list)
Notes
- The RDSTableResult object is a custom class that:
Provides formatted display of contingency tables
Includes cell counts, proportions, and standard errors
Supports different margin calculations (row, column, cell)
Provides access to bootstrap tables and node counts
Examples
from RDSTools import RDStable
# One-way table
result = RDStable(x="Sex", data=rds_data)
# Two-way table with bootstrap variance estimation
result = RDStable(
x="Sex",
y="Race",
data=rds_data,
var_est='chain1',
resample_n=100
)
# Two-way table with row proportions and parallel processing
result = RDStable(
x="Sex",
y="Race",
data=rds_data,
var_est='tree_uni1',
resample_n=1000,
margins=1, # row proportions
n_cores=4
)
# Return bootstrap tables and node counts
result, bootstrap_tables, node_counts = RDStable(
x="Sex",
data=rds_data,
var_est='tree_uni1',
resample_n=1000,
return_bootstrap_tables=True,
return_node_counts=True
)
RDSlm - Linear and Logistic Regression
Linear and Logistic Regression Modeling with Respondent Driven Sampling (RDS) Sample Data. This function mimics the lm function in R stats package with capabilities to handle RDS data in model estimation. Standard errors of regression coefficients are calculated using naive or resampling approaches from ‘RDSboot’.
Usage
RDSlm(data, formula, weight=None, var_est=None, resample_n=None, n_cores=None, return_bootstrap_estimates=False, return_node_counts=False)
Arguments
- data
pandas.DataFrame. The output DataFrame from RDSdata.
- formula
str. Description of the model with dependent and independent variables. (e.g., “y ~ x1 + x2”). Note that the function performs linear regression when the dependent variable is numeric and logistic regression with binomial link function when the dependent variable is either character or factor.
- weight
str, optional. Name of the weight variable. User specified weight variable for a weighted analysis. When set to NULL, the function performs an unweighted analysis. Default is None.
- var_est
str, optional. One of the six bootstrap types or the delta (naive) method. By default, the function calculates naive standard errors. Variance estimation options include ‘naive’ or bootstrap methods like ‘chain1’, ‘chain2’, ‘tree_uni1’, ‘tree_uni2’, ‘tree_bi1’, ‘tree_bi2’. Default is None (naive).
- resample_n
int, optional. Specifies the number of resample iterations. Note that this argument is None when var_est = ‘naive’. Required for bootstrap methods, default 300.
- n_cores
int, optional. Number of CPU cores to use for parallel bootstrap processing. If specified, uses optimized parallel bootstrap. If None, uses standard sequential bootstrap. Default is None.
- return_bootstrap_estimates
bool, optional. If True, return bootstrap coefficient estimates along with main results (only for bootstrap methods). Default is False.
- return_node_counts
bool, optional. If True, return sample size per iteration along with main results (only for bootstrap methods). Default is False.
Returns
- RDSRegressionResult or tuple
An RDSRegressionResult object containing the following elements:
- formula
Formula; Variable(s) used for the estimation
- coefficients
DataFrame; Point estimates, standard errors, t-values (or z-values for logistic), and p-values for each coefficient
- model_fit
Model fit statistics; For linear regression: R-squared, Adjusted R-squared, F-statistic, and residual standard error. For logistic regression: null deviance, residual deviance, and AIC
- additional_info
Information about the estimation: (1) SE method: variance estimation method (2) Weight: indicator of whether weighted analysis was used (3) n_Data: total number of observations in the input data (4) n_Analysis: number of observations used in the analysis (after NA removal across all formula terms) (5) n_Iteration: number of resampling iterations (if SE method is not ‘naive’) (6) n_Dropped: number of bootstrap iterations skipped due to errors (if SE method is not ‘naive’)
- resample_summary
Descriptive summary of resamples if var_est is not ‘naive’: mean, SD, min, quartiles, and max of resample sizes
- resample_estimates
Coefficient estimates for each resampling iteration if var_est is not ‘naive’
- When return_bootstrap_estimates=False and return_node_counts=False (default):
Returns RDSRegressionResult object only
- When return_bootstrap_estimates=True and return_node_counts=False:
Returns (RDSRegressionResult, bootstrap_estimates_list)
- When return_bootstrap_estimates=False and return_node_counts=True:
Returns (RDSRegressionResult, node_counts_list)
- When return_bootstrap_estimates=True and return_node_counts=True:
Returns (RDSRegressionResult, bootstrap_estimates_list, node_counts_list)
Notes
In all bootstrap methods, versions 1 and 2 differ as version 1 sets the number of seeds in a given resample to be consistent with the number of seeds in the original sample (\(s\)), while version 2 sets the sample size of a given resample (\(n_r\)) to be at least equal to or greater than the original sample (\(n_s\)).
‘chain1’ selects \(s\) seeds using SRSWR from all seeds in the original sample and then all nodes in the chains created by each of the resampled seeds are retained. With ‘chain2’, 1 seed is sampled using SRSWR from all seeds in the original sample, and all nodes from the chain created by this seed are retained. It then compares \(n_r\) against \(n_s\), and, if \(n_r < n_s\), continues the resampling process by drawing 1 seed and its chains one by one until \(n_r \geq n_s\).
In the ‘tree_uni1’ method, \(s\) seeds are selected using Simple Random Sampling with Replacement (SRSWR) from all seeds. For each selected seed, this method (A) checks its recruit counts, (B) selects SRSWR of the recruits counts from all recruits identified in (A), and (C) for each sampled recruit, this method repeats Steps A and B. (D) Steps A, B, and C continue until reaching the last wave of each chain. In ‘tree_uni2’, instead of selecting \(s\) seeds, it selects one seed, performs Steps B and C for the selected seed. It compares the size of the resample (\(n_r\)) and the original sample (\(n_s\)), and, if \(n_r < n_s\), it continues the resampling process by drawing 1 seed, performs Steps B and C and checks \(n_r\) against \(n_s\). If \(n_r < n_s\), the process continues until the sample size of a given resample (\(n_r\)) is at least equal to the original sample size (\(n_s\)), i.e., \(n_r \geq n_s\).
‘tree_bi1’ selects \(s\) nodes from the recruitment chains using SRSWR. For each selected node, it (A) checks its connected nodes (i.e., both recruiters and recruits) and their count, (B) from all connected nodes identified in (A), performs SRSWR of the same node count, and (C) for each selected node, performs steps A and B, but does not resample already resampled nodes. (D) Steps A, B, and C are repeated until the end of the chain. In ‘tree_bi2’, instead of \(s\) nodes, it selects 1 node using SRSWR from anywhere in all recruitment chains and repeats steps (B),(C), and (D) until \(n_r \geq n_s\).
Examples
from RDSTools import RDSlm
# Linear regression (continuous dependent variable)
result = RDSlm(
data=rds_data,
formula="Age ~ C(Sex)",
weight='WEIGHT',
var_est='chain1',
resample_n=1000
)
# Logistic regression (categorical dependent variable)
# Make sure to convert to binary (0,1) if doesn't work.
result = RDSlm(
data=rds_data,
formula="Employed ~ Age + C(Sex)",
weight='WEIGHT',
var_est='chain1',
resample_n=100
)
# Parallel regression with multiple predictors
result = RDSlm(
data=rds_data,
formula="Income ~ Age + C(Education) + C(Race)",
var_est='tree_uni1',
resample_n=1000,
n_cores=4
)
# Return bootstrap estimates and node counts
result, bootstrap_estimates, node_counts = RDSlm(
data=rds_data,
formula="Age ~ C(Sex)",
var_est='tree_uni1',
resample_n=1000,
return_bootstrap_estimates=True,
return_node_counts=True
)