| Title: | Dataset Comparison with 'CDISC' Validation for Clinical Trial Data |
|---|---|
| Description: | A general-purpose toolkit for comparing any two data frames with optional 'CDISC' (Clinical Data Interchange Standards Consortium) validation for clinical trial data. Core comparison functions work on arbitrary datasets: variable-level and observation-level comparison, data type checking, metadata attribute analysis (types, labels, lengths, formats), missing value handling, key-based row matching, tolerance-based numeric comparisons, and group-wise comparisons. Optional z-score outlier detection is available when enabled. When working with clinical data, the package additionally validates 'SDTM' (Study Data Tabulation Model) and 'ADaM' (Analysis Data Model) datasets against CDISC standards (SDTM IG 3.3/3.4, ADaM IG 1.1/1.2/1.3), automatically detecting domains and flagging non-conformant variables. Generates unified comparison reports in text or HTML format with interactive dashboards. For CDISC standards, see <https://www.cdisc.org/standards>. |
| Authors: | Siddharth Lokineni [aut, cre] |
| Maintainer: | Siddharth Lokineni <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0 |
| Built: | 2026-05-21 08:43:37 UTC |
| Source: | https://github.com/siddharthlokineni/clincompare |
A comprehensive toolkit for comparing clinical trial datasets. Provides functions for dataset comparison including variable-level and observation-level differences, data type checking, and missing value analysis. Integrates CDISC validation for SDTM and ADaM datasets.
compare_datasets(): High-level comparison of two datasets
compare_variables(): Compare variable names and types
compare_observations(): Row-wise value comparison
cdisc_compare(): Compare datasets with CDISC validation
validate_cdisc(): Validate a dataset against CDISC standards
detect_cdisc_domain(): Auto-detect CDISC domain or ADaM dataset
DM, AE, LB, VS, EX, CM, MH, DS, SV, TA, TE domains
ADSL, ADAE, ADLB, ADTTE, ADEFF datasets
Maintainer: Siddharth Lokineni [email protected]
Useful links:
Report bugs at https://github.com/siddharthlokineni/clinCompare/issues
Flagship function that compares two datasets AND runs CDISC validation on both. Combines dataset comparison with CDISC conformance analysis to provide comprehensive insights into both differences and regulatory compliance.
    cdisc_compare(
      df1,
      df2,
      domain = NULL,
      standard = NULL,
      id_vars = NULL,
      vars = NULL,
      ts_data = NULL,
      detect_outliers = FALSE,
      tolerance = 0,
      where = NULL
    )
df1 |
First data frame to compare, or a file path (character string
ending in |
df2 |
Second data frame to compare, or a file path. |
domain |
Optional character string specifying the CDISC domain code or dataset name (e.g., "DM", "AE", "ADSL"). Strongly recommended – auto-detection can be ambiguous for datasets with common columns. If NULL, auto-detected from df1. |
standard |
Optional character string: "SDTM" or "ADaM". If NULL, auto-detected from df1. |
id_vars |
Optional character vector of ID variable names (e.g.,
|
vars |
Optional character vector of variable names to compare. Only these columns are included in value comparison. Structural and CDISC validation still covers all columns. |
ts_data |
Optional data frame of the TS (Trial Summary) domain. When provided, CDISC standard versions (e.g., SDTM IG 3.4, ADaM IG 1.3) are extracted and included in the results and reports. If NULL (default), version information is omitted. |
detect_outliers |
Logical. When TRUE, runs z-score outlier detection on numeric columns and includes results in the output. Defaults to FALSE. |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
where |
Optional filter expression as a string (e.g., "AESEV == 'SEVERE'"). Applied to both datasets before comparison. Equivalent to a WHERE clause. |
A list containing:
domain |
Character: detected or supplied CDISC domain |
standard |
Character: detected or supplied CDISC standard (SDTM/ADaM) |
nrow_df1 |
Integer: number of rows in df1 |
ncol_df1 |
Integer: number of columns in df1 |
nrow_df2 |
Integer: number of rows in df2 |
ncol_df2 |
Integer: number of columns in df2 |
id_vars |
Character vector of ID variables used for matching (NULL if positional matching was used) |
comparison |
Result of |
variable_comparison |
Result of |
metadata_comparison |
List of metadata differences: type_mismatches, label_mismatches, length_mismatches, format_mismatches, column ordering |
observation_comparison |
Result of |
unified_comparison |
Data frame combining attribute and value differences per variable. Columns: variable, attribute, base_value, compare_value, and optionally id columns and row when value differences exist |
unmatched_rows |
List with df1_only and df2_only data frames of rows that could not be matched by id_vars (NULL when id_vars is not used) |
cdisc_validation_df1 |
CDISC validation results for df1 |
cdisc_validation_df2 |
CDISC validation results for df2 |
cdisc_conformance_comparison |
Data frame showing which CDISC issues are unique to df1, unique to df2, or common to both |
outlier_notes |
Data frame of z-score outliers (|z| > 3) found in numeric columns of either dataset (NULL when detect_outliers is FALSE) |
cdisc_version |
List of CDISC version information extracted from TS
domain (NULL when ts_data is not provided). See |
    # Create sample SDTM DM domains
    dm1 <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = c("SUBJ001", "SUBJ002"),
      DMSEQ = c(1, 1),
      RACE = c("WHITE", "BLACK OR AFRICAN AMERICAN"),
      stringsAsFactors = FALSE
    )
    dm2 <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = c("SUBJ001", "SUBJ003"),
      DMSEQ = c(1, 1),
      RACE = c("WHITE", "ASIAN"),
      ETHNIC = c("NOT HISPANIC", "NOT HISPANIC"),
      stringsAsFactors = FALSE
    )

    # Positional matching (default)
    result <- cdisc_compare(dm1, dm2, domain = "DM", standard = "SDTM")

    # Key-based matching by ID variables
    result <- cdisc_compare(dm1, dm2, domain = "DM", id_vars = c("USUBJID"))
    names(result)
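The outlier_notes element is described as holding z-score outliers with |z| > 3. As a rough illustration of that rule only, the base-R sketch below flags such values in the numeric columns of a data frame. The helper name find_zscore_outliers() and its output columns are hypothetical and are not taken from the package source.

```r
# Hypothetical helper (not the package's implementation): flag values whose
# z-score exceeds the threshold in absolute value, per numeric column.
find_zscore_outliers <- function(df, threshold = 3) {
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  hits <- lapply(num_cols, function(col) {
    x <- df[[col]]
    s <- sd(x, na.rm = TRUE)
    # Skip columns where a z-score is undefined (constant or too few values)
    if (is.na(s) || s == 0) return(NULL)
    z <- (x - mean(x, na.rm = TRUE)) / s
    idx <- which(abs(z) > threshold)
    if (length(idx) == 0) return(NULL)
    data.frame(variable = col, row = idx, value = x[idx], z = z[idx])
  })
  # Returns NULL when nothing is flagged
  do.call(rbind, hits)
}

# Illustrative data: nineteen identical results and one extreme value
lb <- data.frame(LBSEQ = 1:20, LBSTRESN = c(rep(5, 19), 500))
outliers <- find_zscore_outliers(lb)
print(outliers)
```

With small samples a single extreme value inflates the standard deviation, so a fixed |z| > 3 cutoff can miss outliers that are obvious by eye; that is a property of the z-score rule itself rather than of any particular implementation.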
Removes duplicate rows, standardizes column names and text values to uppercase or lowercase, and performs basic data cleaning on a data frame.
    clean_dataset(
      df,
      variables = NULL,
      remove_duplicates = TRUE,
      convert_to_case = NULL
    )
df |
A data frame to be cleaned. |
variables |
Optional; a vector of variable names to specifically clean. If NULL, applies cleaning to all variables. |
remove_duplicates |
Logical; whether to remove duplicate rows. |
convert_to_case |
Optional; convert character variables to "lower" or "upper" case. |
A cleaned data frame.
    df <- data.frame(name = c("Alice", "Bob", "Alice"),
                     score = c(90, 85, 90),
                     stringsAsFactors = FALSE)
    clean_dataset(df, remove_duplicates = TRUE, convert_to_case = "upper")
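For readers who want to see what this kind of cleaning amounts to, here is a minimal base-R sketch of the two steps named above: dropping duplicate rows and converting character columns to upper case. The order of the steps is an assumption; the package may apply them differently.

```r
df <- data.frame(name = c("Alice", "Bob", "Alice"),
                 score = c(90, 85, 90),
                 stringsAsFactors = FALSE)

# Step 1 (assumed order): drop rows that are exact duplicates
cleaned <- unique(df)

# Step 2: convert character columns to upper case, leave other types alone
is_chr <- vapply(cleaned, is.character, logical(1))
cleaned[is_chr] <- lapply(cleaned[is_chr], toupper)

# Reset row names after rows were dropped
rownames(cleaned) <- NULL
print(cleaned)
```

Converting case before removing duplicates would additionally merge rows that differ only in letter case, so the order matters when the input mixes cases.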
Compares two datasets within subgroups defined by grouping variables. Performs separate comparisons for each group and returns results organized by group.
    compare_by_group(df1, df2, group_vars)
df1 |
A data frame representing the first dataset. |
df2 |
A data frame representing the second dataset. |
group_vars |
A character vector of column names to group by. |
A list of comparison results for each group.
    df1 <- data.frame(region = c("A", "A", "B"),
                      value = c(10, 20, 30),
                      stringsAsFactors = FALSE)
    df2 <- data.frame(region = c("A", "A", "B"),
                      value = c(10, 25, 30),
                      stringsAsFactors = FALSE)
    compare_by_group(df1, df2, group_vars = "region")
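The idea of a group-wise comparison can be sketched in base R as "subset both datasets by each group value, then compare within the subset". The helper compare_groups_sketch() below is hypothetical, handles a single grouping column and a single value column, and only counts positional differences; the package's own result structure is richer.

```r
# Hypothetical helper: count differing values per group for one column.
compare_groups_sketch <- function(df1, df2, group_var, value_var) {
  groups <- union(unique(df1[[group_var]]), unique(df2[[group_var]]))
  out <- lapply(groups, function(g) {
    a <- df1[df1[[group_var]] == g, value_var]
    b <- df2[df2[[group_var]] == g, value_var]
    # Positional comparison needs equal group sizes; otherwise report NA
    if (length(a) != length(b)) return(NA_integer_)
    sum(a != b)
  })
  names(out) <- groups
  out
}

df1 <- data.frame(region = c("A", "A", "B"), value = c(10, 20, 30),
                  stringsAsFactors = FALSE)
df2 <- data.frame(region = c("A", "A", "B"), value = c(10, 25, 30),
                  stringsAsFactors = FALSE)
res <- compare_groups_sketch(df1, df2, "region", "value")
print(res)
```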
Compares two datasets at three levels in a single call:
Dataset level – dimensions, column overlap, missing-value totals.
Variable level – column name discrepancies and data-type
mismatches (delegates to compare_variables()).
Observation level – row-by-row value differences on common
columns. Uses positional matching by default, or key-based matching when
id_vars is provided.
The return value is a list with class "dataset_comparison", which has
a tidy print() method. The same object is accepted by
generate_summary_report(), generate_detailed_report(), and
compare_by_group().
    compare_datasets(df1, df2, tolerance = 0, vars = NULL, id_vars = NULL)
df1 |
A data frame (the base dataset). |
df2 |
A data frame (the compare dataset). |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
vars |
Optional character vector of variable names to compare. When provided, only these columns are included in the observation-level comparison. Structural comparison (extra columns, type mismatches) still covers all columns. Default is NULL (compare all common columns). |
id_vars |
Optional character vector of column names to use as matching
keys. When provided, rows are matched by these key columns instead of by
position. This allows comparison of datasets with different row counts or
different row orders. Rows that exist in only one dataset are reported in
|
A dataset_comparison list containing:
nrow_df1, ncol_df1
|
Dimensions of df1. |
nrow_df2, ncol_df2
|
Dimensions of df2. |
common_columns |
Character vector of columns present in both. |
extra_in_df1 |
Columns only in df1. |
extra_in_df2 |
Columns only in df2. |
type_mismatches |
Data frame of columns whose class differs
(columns: |
missing_values |
Data frame summarising NA counts per column per
dataset (columns: |
variable_comparison |
Output of |
observation_comparison |
Output of |
id_vars |
Character vector of key columns used for matching, or
|
unmatched_rows |
List with |
    # Positional matching (default)
    df1 <- data.frame(id = 1:3, val = c(10, 20, 30))
    df2 <- data.frame(id = 1:3, val = c(10, 25, 30))
    result <- compare_datasets(df1, df2)
    result

    # Key-based matching (for different row counts or row orders)
    df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
    df2 <- data.frame(id = c(2, 3, 4), val = c(20, 35, 40))
    result <- compare_datasets(df1, df2, id_vars = "id")
    result
    result$unmatched_rows
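To make the difference between positional and key-based matching concrete, the following base-R sketch reproduces the key-based case by hand with merge(). It is an illustration of the concept under simple assumptions (one key column, no duplicate keys), not the package's internal logic; the names matched, df1_only, and df2_only are chosen here for readability.

```r
df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), val = c(20, 35, 40))

# Inner join on the key: only rows present in both datasets are compared
matched <- merge(df1, df2, by = "id", suffixes = c("_df1", "_df2"))
value_diffs <- matched[matched$val_df1 != matched$val_df2, ]

# Rows whose key appears in only one dataset
df1_only <- df1[!df1$id %in% df2$id, , drop = FALSE]
df2_only <- df2[!df2$id %in% df1$id, , drop = FALSE]

print(value_diffs)
print(df1_only)
print(df2_only)
```

A positional comparison of the same two data frames would pair id 1 with id 2, id 2 with id 3, and so on, reporting every row as different, which is why keys are preferable whenever row counts or row order can differ.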
Performs row-by-row comparison of two datasets on common columns, identifying specific value differences at the cell level. Returns discrepancy counts and details showing which rows differ and how their values diverge.
    compare_observations(df1, df2, tolerance = 0)
df1 |
A data frame representing the first dataset. |
df2 |
A data frame representing the second dataset. |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
A list containing discrepancy counts and details of row differences.
    df1 <- data.frame(id = 1:3, value = c(1.0, 2.0, 3.0))
    df2 <- data.frame(id = 1:3, value = c(1.0, 2.5, 3.0))
    compare_observations(df1, df2)
    compare_observations(df1, df2, tolerance = 0.00001)
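The tolerance rule described for the tolerance argument (numeric values are equal when their absolute difference is within the threshold; character values always use exact matching) can be sketched as a small vectorised helper. The function values_differ() is hypothetical, and its treatment of missing values is an assumption made for this sketch only.

```r
# Hypothetical helper: TRUE where two vectors differ under the tolerance rule.
values_differ <- function(x, y, tolerance = 0) {
  if (is.numeric(x) && is.numeric(y)) {
    # Numeric: equal when the absolute difference is within the tolerance
    differ <- abs(x - y) > tolerance
  } else {
    # Non-numeric: exact matching, tolerance is ignored
    differ <- as.character(x) != as.character(y)
  }
  # Assumption: NA vs NA counts as equal, NA vs a value counts as different
  differ[is.na(x) & is.na(y)] <- FALSE
  differ[xor(is.na(x), is.na(y))] <- TRUE
  differ
}

exact <- values_differ(c(1.0, 2.0, 3.0), c(1.0, 2.5, 3.0))
loose <- values_differ(c(1.0, 2.0, 3.0), c(1.0, 2.5, 3.0), tolerance = 0.5)
chars <- values_differ(c("A", "B"), c("A", "b"), tolerance = 0.5)
print(exact)
print(loose)
print(chars)
```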
Scans two directories for matching dataset files, runs cdisc_compare()
on each pair, and optionally generates a consolidated Excel report.
    compare_submission(
      base_dir,
      compare_dir,
      format = NULL,
      id_vars = NULL,
      tolerance = 0,
      output_file = NULL
    )
base_dir |
Path to directory containing base/reference files. |
compare_dir |
Path to directory containing comparison files. |
format |
File format to match: "xpt", "sas7bdat", "csv", or "rds". When NULL (default), auto-detected from the most common file type in base_dir. |
id_vars |
Optional character vector of ID variables (passed to each comparison). When NULL, CDISC-standard keys are auto-detected per domain. |
tolerance |
Numeric tolerance for floating-point comparisons (default 0). |
output_file |
Optional path to Excel (.xlsx) file for consolidated report. |
Named list of cdisc_compare() results, one per matched domain.
    ## Not run:
    # Auto-detects format from directory contents
    results <- compare_submission("v1/", "v2/", output_file = "submission_diff.xlsx")

    # Explicit format
    results <- compare_submission("v1/", "v2/", format = "csv")
    ## End(Not run)
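The directory scan can be pictured as pairing files that share a name in both directories. The sketch below shows one way to do that in base R; pair_files() is a hypothetical helper, it matches on exact file names only, and it assumes the dataset name can be read from the file name.

```r
# Hypothetical helper: pair files with the same name in two directories.
pair_files <- function(base_dir, compare_dir, format = "csv") {
  pattern <- paste0("\\.", format, "$")
  common <- intersect(list.files(base_dir, pattern = pattern),
                      list.files(compare_dir, pattern = pattern))
  data.frame(
    domain = toupper(tools::file_path_sans_ext(common)),
    base = file.path(base_dir, common),
    compare = file.path(compare_dir, common),
    stringsAsFactors = FALSE
  )
}

# Build two small throwaway directories to demonstrate the pairing
base_dir <- file.path(tempdir(), "v1_sketch")
compare_dir <- file.path(tempdir(), "v2_sketch")
dir.create(base_dir, showWarnings = FALSE)
dir.create(compare_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1), file.path(base_dir, "dm.csv"), row.names = FALSE)
write.csv(data.frame(x = 1), file.path(base_dir, "ae.csv"), row.names = FALSE)
write.csv(data.frame(x = 2), file.path(compare_dir, "dm.csv"), row.names = FALSE)

pairs <- pair_files(base_dir, compare_dir)
print(pairs$domain)
```

Files present in only one directory (ae.csv here) drop out of the pairing, so a real workflow would probably want to report them separately.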
Compares the structural attributes of two datasets including column names, data types, and variable ordering. Identifies common columns and reports columns that exist in only one dataset.
    compare_variables(df1, df2)
df1 |
A data frame representing the first dataset. |
df2 |
A data frame representing the second dataset. |
A list containing variable comparison details and discrepancy count.
    df1 <- data.frame(id = 1:3, name = c("A", "B", "C"))
    df2 <- data.frame(id = 1:3, name = c("A", "B", "C"), score = c(90, 80, 70))
    compare_variables(df1, df2)
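The structural checks described here reduce to set operations on column names plus a class comparison on the shared columns. A minimal base-R sketch, using made-up data in which score is stored as text in one dataset and as a number in the other:

```r
df1 <- data.frame(id = 1:3, score = c("90", "80", "70"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:3, score = c(90, 80, 70),
                  grade = c("A", "B", "C"), stringsAsFactors = FALSE)

# Column overlap via set operations on the names
common <- intersect(names(df1), names(df2))
extra_in_df1 <- setdiff(names(df1), names(df2))
extra_in_df2 <- setdiff(names(df2), names(df1))

# Compare the primary class of each shared column
class1 <- vapply(df1[common], function(x) class(x)[1], character(1))
class2 <- vapply(df2[common], function(x) class(x)[1], character(1))
type_mismatches <- common[class1 != class2]

print(extra_in_df2)
print(type_mismatches)
```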
Detects whether a data frame looks like an SDTM domain or ADaM dataset by comparing column names against known CDISC standards. Calculates a confidence score based on the percentage of expected variables present.
Auto-detection is a convenience for exploratory use. For anything important –
validation reports, regulatory submissions, scripted pipelines – always pass
domain and standard explicitly. Datasets with common columns
(STUDYID, USUBJID, etc.) can match multiple domains, and a warning is issued
when the top two candidates score within 10 percentage points of each other.
    detect_cdisc_domain(df, name_hint = NULL)
df |
A data frame to analyze. |
name_hint |
Optional character string with the dataset name (e.g., "DM", "ADLB", or a filename like "adlb.xpt"). When provided and it matches a known CDISC domain, that candidate receives a strong confidence boost. This makes detection much more accurate when the filename is available. |
A list containing:
standard |
Character: "SDTM", "ADaM", or "Unknown" |
domain |
Character: domain code (e.g., "DM", "AE") or dataset name (e.g., "ADSL"), or NA |
confidence |
Numeric between 0 and 1 indicating match quality |
message |
Character: human-readable explanation |
    # Create a sample SDTM DM domain
    dm <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = "SUBJ001",
      SUBJID = "001",
      DMSEQ = 1,
      RACE = "WHITE",
      ETHNIC = "NOT HISPANIC OR LATINO",
      ARMCD = "ARM01",
      ARM = "Treatment A",
      stringsAsFactors = FALSE
    )
    result <- detect_cdisc_domain(dm)
    print(result)
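The confidence score is described as the percentage of expected variables present, with a warning when the top two candidates score within 10 percentage points of each other. The sketch below illustrates that scoring idea only. The variable lists are heavily abbreviated for illustration (the standards define many more variables per domain), and score_domains() is a hypothetical helper, not a package function.

```r
# Illustrative, abbreviated variable lists (an assumption for this sketch)
expected_vars <- list(
  DM = c("STUDYID", "DOMAIN", "USUBJID", "SUBJID",
         "RACE", "ETHNIC", "ARMCD", "ARM"),
  AE = c("STUDYID", "DOMAIN", "USUBJID", "AESEQ",
         "AETERM", "AEDECOD", "AESEV", "AESTDTC")
)

# Hypothetical helper: share of each domain's expected variables found in df
score_domains <- function(df, expected = expected_vars) {
  scores <- vapply(expected, function(v) mean(v %in% names(df)), numeric(1))
  sort(scores, decreasing = TRUE)
}

dm <- data.frame(
  STUDYID = "STUDY001", USUBJID = "SUBJ001", SUBJID = "001",
  RACE = "WHITE", ETHNIC = "NOT HISPANIC OR LATINO",
  ARMCD = "ARM01", ARM = "Treatment A",
  stringsAsFactors = FALSE
)

scores <- score_domains(dm)
# Ambiguous when the best two candidates are within 10 percentage points
ambiguous <- length(scores) > 1 && (scores[[1]] - scores[[2]]) < 0.10
print(scores)
print(ambiguous)
```

Because STUDYID and USUBJID appear in nearly every domain, a dataset holding little more than those shared identifiers would score similarly against many candidates, which is the situation the ambiguity warning is meant to catch.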
Exports a dataset comparison or CDISC comparison result to a file in one of three formats. The format is detected automatically from the file extension (.html, .txt, .xlsx).
    export_report(result, file, format = NULL)
result |
A list from |
file |
Character string specifying the output file path. File extension determines format: .html, .txt, or .xlsx. |
format |
Character string specifying output format: "html", "text", or "excel". If NULL (default), format is auto-detected from file extension. |
Supported formats:
HTML (.html): Self-contained HTML report with styling and interactive charts.
Text (.txt): Plain text report suitable for console review.
Excel (.xlsx): Multi-sheet workbook with tabbed data:
"Summary": Dataset dimensions, domain, standard, matching type, tolerance
"Variable Diffs": Metadata attribute differences
"Value Diffs": Unified diff data frame from get_all_differences()
"CDISC Validation": Combined validation results (for CDISC comparisons only)
The result object can be either a dataset_comparison (from compare_datasets())
or a cdisc_comparison (from cdisc_compare()). All features are supported for both.
Invisibly returns the input result (useful for piping).
    # Create sample datasets
    df1 <- data.frame(
      ID = c(1, 2, 3),
      NAME = c("Alice", "Bob", "Charlie"),
      AGE = c(25, 30, 35)
    )
    df2 <- data.frame(
      ID = c(1, 2, 3),
      NAME = c("Alice", "Bob", "Charles"),
      AGE = c(25, 30, 36)
    )

    # Compare datasets
    result <- compare_datasets(df1, df2)

    # Export to different formats (write to tempdir)
    export_report(result, file.path(tempdir(), "report.html"))
    export_report(result, file.path(tempdir(), "report.txt"))

    # Explicit format specification
    export_report(result, file.path(tempdir(), "report.xlsx"), format = "excel")
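The extension-to-format mapping described above (.html, .txt, .xlsx mapping to "html", "text", "excel", with an explicit format argument taking precedence) could be written along the following lines. detect_report_format() is a hypothetical name used for illustration; how the package handles unknown extensions is not stated, so stopping with an error is an assumption.

```r
# Hypothetical helper: resolve the report format from an explicit argument
# or, failing that, from the file extension.
detect_report_format <- function(file, format = NULL) {
  if (!is.null(format)) {
    return(match.arg(format, c("html", "text", "excel")))
  }
  ext <- tolower(tools::file_ext(file))
  switch(ext,
    html = "html",
    txt = "text",
    xlsx = "excel",
    stop("Unsupported file extension: ", ext)  # assumed behaviour
  )
}

print(detect_report_format("out/report.HTML"))
print(detect_report_format("out/report.txt"))
print(detect_report_format("out/report.out", format = "excel"))
```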
Generates a formatted report from the results of cdisc_compare(). Supports both
text-based console output and HTML reports with professional styling and color-coding.
    generate_cdisc_report(cdisc_results, output_format = "text", file_name = NULL)
cdisc_results |
A list output from |
output_format |
Character string: either "text" (default) for console output or "html" for HTML report. |
file_name |
Optional character string specifying the output file path. For text format, the report is appended to this file. For HTML format, a file path must be supplied explicitly. If NULL, output is not written to file. |
The report includes:
Dataset Comparison Summary
CDISC Compliance for each dataset
CDISC Conformance Comparison
For text output, formatting uses console-friendly layout. For HTML output, a self-contained report is generated with color-coded severity levels: red for ERROR, orange for WARNING, blue for INFO.
Invisibly returns the input cdisc_results (useful for piping).
    ## Not run:
    # Create sample datasets
    dm1 <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = c("SUBJ001", "SUBJ002"),
      DMSEQ = c(1, 1),
      RACE = c("WHITE", "BLACK OR AFRICAN AMERICAN")
    )
    dm2 <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = c("SUBJ001", "SUBJ003"),
      DMSEQ = c(1, 1),
      RACE = c("WHITE", "ASIAN")
    )
    result <- cdisc_compare(dm1, dm2, domain = "DM")

    # Generate text report to console
    generate_cdisc_report(result, output_format = "text")

    # Generate HTML report to file
    out <- file.path(tempdir(), "report.html")
    generate_cdisc_report(result, output_format = "html", file_name = out)
    ## End(Not run)
Creates a detailed report outlining all the differences found in the comparison, including variable differences, observation differences, and group-based discrepancies.
    generate_detailed_report(
      comparison_results,
      output_format = "text",
      file_name = NULL
    )
comparison_results |
A list containing the results of dataset comparisons. |
output_format |
Format of the output ('text' or 'html'). |
file_name |
Name of the file to save the report to (applicable for 'html' format). |
The detailed report. For 'text', prints to console. For 'html', writes to file.
    ## Not run:
    generate_detailed_report(comparison_results, output_format = "text")
    ## End(Not run)
Provides a summary of the comparison results, highlighting key points such as the number of differing observations and variables.
    generate_summary_report(
      comparison_results,
      detail_level = "high",
      output_format = "text",
      file_name = NULL
    )
comparison_results |
A list containing the results of dataset comparisons. |
detail_level |
The level of detail ('high', 'medium', 'low') for the summary. |
output_format |
Format of the output ('text' or 'html'). |
file_name |
Name of the file to save the report to (applicable for 'html' format). |
The summary report. For 'text', prints to console. For 'html', writes to file.
    ## Not run:
    generate_summary_report(comparison_results, detail_level = "high", output_format = "text")
    ## End(Not run)
Converts per-variable observation differences into a single long-format
data frame suitable for filtering with dplyr, writing to CSV, or
programmatic analysis. This is the R equivalent of SAS PROC COMPARE's
OUT= dataset with _TYPE_ and _DIF_ variables.
Accepts output from compare_datasets(), cdisc_compare(), or any list
containing an observation_comparison element with the standard
discrepancies / details / id_details structure.
    get_all_differences(comparison_results)
comparison_results |
A |
A data frame with one row per differing cell. Columns:
Character: column name where the difference was found.
Integer: row index in df1 (positional matching).
The value in df1 (base dataset).
The value in df2 (compare dataset).
Numeric: Base - Compare (NA for character columns).
Numeric: absolute percentage difference relative to Base (NA when Base is 0 or column is character).
When key-based matching was used (id_vars), the ID columns are prepended to the left of the data frame.
Returns an empty data frame with the expected columns when no differences exist or observation comparison was skipped.
    df1 <- data.frame(id = 1:3, value = c(10, 20, 30), name = c("A", "B", "C"))
    df2 <- data.frame(id = 1:3, value = c(10, 25, 30), name = c("A", "B", "D"))
    result <- compare_datasets(df1, df2)
    diffs <- get_all_differences(result)
    head(diffs)
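The two derived numeric columns described above (Base minus Compare, and the absolute percentage difference relative to Base, NA when Base is 0) are simple to compute by hand. The sketch below builds one long-format row per differing cell for a single numeric column. The column names used here (variable, row, base_value, compare_value, diff, pct_diff) are placeholders chosen for this illustration and may not match the names the package emits.

```r
base <- c(10, 20, 30)
compare <- c(10, 25, 30)

# Positions where the two vectors disagree
idx <- which(base != compare)

diffs <- data.frame(
  variable = rep("value", length(idx)),
  row = idx,
  base_value = base[idx],
  compare_value = compare[idx],
  # Base - Compare
  diff = base[idx] - compare[idx],
  # Absolute percentage difference relative to Base; NA when Base is 0
  pct_diff = ifelse(base[idx] == 0, NA_real_,
                    abs(base[idx] - compare[idx]) / abs(base[idx]) * 100)
)
print(diffs)
```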
Prepares two datasets for comparison by optionally sorting by specified columns and filtering rows based on a condition.
    prepare_datasets(df1, df2, sort_columns = NULL, filter_criteria = NULL)
df1 |
First dataset to be prepared. |
df2 |
Second dataset to be prepared. |
sort_columns |
Columns to sort the datasets by. |
filter_criteria |
Criteria for filtering the datasets. |
A list containing two prepared datasets.
    df1 <- data.frame(id = c(3, 1, 2), score = c(70, 90, 80))
    df2 <- data.frame(id = c(2, 3, 1), score = c(80, 75, 90))
    prepare_datasets(df1, df2, sort_columns = "id", filter_criteria = "score > 75")
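As a rough picture of what sorting and string-based filtering involve, the following base-R sketch applies a filter expression given as a string and then orders the rows by the requested columns. prepare_one() is a hypothetical helper that handles a single data frame; evaluating the string with eval(parse()) is one possible approach and is an assumption about how such a filter could be implemented.

```r
# Hypothetical helper: filter by a string expression, then sort.
prepare_one <- function(df, sort_columns = NULL, filter_criteria = NULL) {
  if (!is.null(filter_criteria)) {
    # Evaluate the expression with the data frame's columns in scope
    keep <- eval(parse(text = filter_criteria), envir = df)
    df <- df[!is.na(keep) & keep, , drop = FALSE]
  }
  if (!is.null(sort_columns)) {
    ord <- do.call(order, unname(as.list(df[sort_columns])))
    df <- df[ord, , drop = FALSE]
  }
  rownames(df) <- NULL
  df
}

df1 <- data.frame(id = c(3, 1, 2), score = c(70, 90, 80))
df2 <- data.frame(id = c(2, 3, 1), score = c(80, 75, 90))

prepared <- list(
  df1 = prepare_one(df1, sort_columns = "id", filter_criteria = "score > 75"),
  df2 = prepare_one(df2, sort_columns = "id", filter_criteria = "score > 75")
)
print(prepared)
```

Since eval(parse()) runs arbitrary code, a filter string of this kind should only come from a trusted source.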
Pretty-prints CDISC validation results to the console with a summary and grouped output by category. Displays counts of errors, warnings, and info messages.
    print_cdisc_validation(validation_result)
validation_result |
A data frame from |
Output includes:
Summary counts of errors, warnings, and info messages
Issues grouped by category
Each issue displayed with its variable name and message
Invisibly returns the input (useful for piping).
    ## Not run:
    # Validate a dataset
    dm <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = c("SUBJ001", "SUBJ002"),
      DMSEQ = c(1, 1),
      RACE = c("WHITE", "BLACK OR AFRICAN AMERICAN")
    )
    validation_result <- validate_cdisc(dm, domain = "DM", standard = "SDTM")
    print_cdisc_validation(validation_result)
    ## End(Not run)
Prints a concise summary of CDISC comparison results. Shows dataset dimensions, domain, number of differences, and a pass/fail verdict based on CDISC validation errors.
    ## S3 method for class 'cdisc_comparison'
    print(x, ...)
x |
A cdisc_comparison object returned by |
... |
Additional arguments (ignored). |
Invisibly returns x.
Print Dataset Comparison Results
    ## S3 method for class 'dataset_comparison'
    print(x, ...)
x |
A |
... |
Ignored. |
Invisibly returns x.
Returns a concise one-row data frame summarizing the comparison: domain, standard, row/col counts, number of differences, and CDISC error/warning counts.
    ## S3 method for class 'cdisc_comparison'
    summary(object, ...)
object |
A cdisc_comparison object returned by |
... |
Additional arguments (ignored). |
A one-row data frame with summary metrics.
Main validation entry point that checks whether a data frame conforms to CDISC standards.
If domain and standard are not provided, they are automatically detected via
detect_cdisc_domain(). Dispatches to validate_sdtm() or validate_adam() as appropriate.
    validate_cdisc(df, domain = NULL, standard = NULL)
df |
A data frame to validate. |
domain |
Optional character string specifying the CDISC domain code (e.g., "DM", "AE") or ADaM dataset name (e.g., "ADSL", "ADAE"). If NULL, auto-detected. |
standard |
Optional character string: "SDTM" or "ADaM". If NULL, auto-detected. |
A data frame with columns:
category |
Character: type of validation issue ("Missing Required Variable", "Missing Expected Variable", "Type Mismatch", "Non-Standard Variable", "Variable Info") |
variable |
Character: variable name |
message |
Character: description of the issue |
severity |
Character: "ERROR", "WARNING", or "INFO" |
    # Auto-detect domain
    dm <- data.frame(
      STUDYID = "STUDY001",
      USUBJID = "SUBJ001",
      DMSEQ = 1,
      RACE = "WHITE",
      stringsAsFactors = FALSE
    )
    results <- validate_cdisc(dm)
    print(results)

    # Validate with explicit domain specification
    results <- validate_cdisc(dm, domain = "DM", standard = "SDTM")
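To show the shape of the returned data frame (category, variable, message, severity), here is a minimal sketch of a single check: reporting required variables that are absent. The required-variable list is an illustrative subset, check_required() is a hypothetical helper, and assigning "ERROR" to this category is an assumption; the package's own rules and messages may differ.

```r
# Illustrative subset of required variables (an assumption for this sketch)
required_vars <- c("STUDYID", "DOMAIN", "USUBJID")

# Hypothetical helper: one result row per missing required variable
check_required <- function(df, required) {
  missing <- setdiff(required, names(df))
  data.frame(
    category = rep("Missing Required Variable", length(missing)),
    variable = missing,
    message = sprintf("Required variable %s is not present", missing),
    severity = rep("ERROR", length(missing)),  # assumed severity mapping
    stringsAsFactors = FALSE
  )
}

dm <- data.frame(
  STUDYID = "STUDY001", USUBJID = "SUBJ001", DMSEQ = 1, RACE = "WHITE",
  stringsAsFactors = FALSE
)
issues <- check_required(dm, required_vars)
print(issues)
```

A result in this shape is easy to filter, for example issues[issues$severity == "ERROR", ], which is the kind of downstream use the four-column layout supports.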