Example Scripts to batch reduce HLA typings from a CSV File
pyard-reduce-csv
command can be used with a config file(that describes ways to reduce the file) to take a
CSV file with HLA typing data and reduce certain columns and produce a new CSV or an Excel file.
Steps on batch processing a CSV file.
- Install
py-ard
- Specify the configuration on how the file should be processed in a JSON
.json
config file. - Run
pyard-reduce-csv -c <config-file>
to produce a processed file based on the configuration in the config file.
To help with creating configuration file, you can use -g
or --generate-sample
option to pyard-reduce-csv
and
generate a sample configuration and a sample CSV file.
These files should be used as a template for your own data.
Once the configuration file is created, use -c
option to specify the configuration file to be used for batch
processing.
In the following example, we generate a sample configuration and CSV file.
$ pyard-reduce-csv --generate-sample
Created sample_reduce_conf.json
Created sample.csv
We specify the config file with -c
and a -q
to suppress verbose log messages.
$ pyard-reduce-csv -c sample_reduce_conf.json -q
Using config file: reduce_conf.json
Failed reducing 'C*02:85:02' in column r_c_typ2
Failed reducing 'DRB1*14:167:01' in column r_drb1_typ2
...
Summary
-------
16 alleles failed to reduce.
| Column Name | Allele | Did you mean ?
| --------------- | ---------------- | -------------------------
| r_c_typ2 | C*02:85:02 | NA
| r_drb1_typ2 | DRB1*14:167:01 | NA
...
Saved result to file:clean_sample.csv.gz
The configuration file provides the following options to modify how the reduction happens.
in_csv_filename
Directory path and file name of the Input CSV file
out_csv_filename
Directory path and file name of the Reduced Output CSV file
columns_from_csv
The column names to read from CSV file
[
"nmdp_id",
"r_a_typ1",
"r_a_typ2",
"r_b_typ1",
"r_b_typ2",
"r_c_typ1",
"r_c_typ2",
"d_a_typ1",
"d_a_typ2",
"d_b_typ1",
"d_b_typ2",
"d_c_typ1",
"d_c_typ2"
]
locus_column_mapping
Mapping of subject types (eg. Recipient, Donor) to their loci and the corresponding columns with
typings for those loci.
The column names corresponding to the loci will be reduced and must appear in the list of columns_from_csv
.
"locus_column_mapping": {
"recipient": {
"A": [
"r_a_typ1",
"r_a_typ2"
],
"B": [
"r_b_typ1",
"r_b_typ2"
],
"C": [
"r_c_typ1",
"r_c_typ2"
]
},
"donor": {
"A": [
"d_a_typ1",
"d_a_typ2"
],
"B": [
"d_b_typ1",
"d_b_typ2"
],
"C": [
"d_c_typ1",
"d_c_typ2"
]
}
}
Instead of providing single locus alleles per column with locus_column_mapping
, a GL String describing the whole
genotype can be provided per column. Use glstring_columns
to provide a list of GL String columns to reduce.
"glstring_columns": [
"donor_gl",
"recip_gl"
],
Depending upon the data, only one of locus_column_mapping
or glstring_columns
needs to be provided.
redux_type
Reduction Type
Valid Options are:
Reduction Type | Description |
---|---|
G |
Reduce to G Group Level |
P |
Reduce to P Group Level |
lg |
Reduce to 2 field ARD level (append g ) |
lgx |
Reduce to 2 field ARD level |
W |
Reduce/Expand to 3 field WHO nomenclature level |
exon |
Reduce/Expand to exon level |
U2 |
Reduce to 2 field unambiguous level |
When processing a large file, it's helpful to cache results of previous reductions, the default is to cache
only 1,000 but this can be increased with the redux_cache_size
option.
"redux_cache_size": 5000,
Pick and choose which of the typings to reduce.
"reduce_serology": false,
"reduce_v2": true,
"convert_v2_to_v3": false,
"reduce_3field": true,
"reduce_P": true,
"reduce_XX": false,
"reduce_MAC": true,
Valid options: true
or false
map_drb345_to_drbx
Map to DRBX Typings based on DRB3, DRB4 and DRB5 typings
using WMDA method.
Valid options: true
or false
locus_in_allele_name
Is locus name present in allele ? E.g. A*01:01
vs 01:01
Valid options: true
or false
keep_locus_in_allele_name
Should the reduced version have locus name present in allele ? E.g. A*01:01
vs 01:01
Valid options: true
or false
new_column_for_redux
Add a separate column for processed column or replace the current column. Creates a reduced_
version of the column. Otherwise, the same column is replaced with the reduced version.
Valid options: true
, false
Specify the prefix for the new column with reduced_column_prefix
.
"reduced_column_prefix": "reduced_",
Generate a GL String column with reduced typings from each subject.
"generate_glstring": true,
Valid options: true
, false
output_file_format
Format of the output file
Valid options: csv
or xlsx
For Excel output, openpyxl
library needs to be installed. Install with:
pip install openpyxl
apply_compression
Compression to use for output file. Applies only to CSV files.
Valid options: 'gzip'
, 'zip'
or null
verbose_log
Show verbose log ?
Valid options: true
or false