Thanks for creating this tool! The workflow is clear and easy to follow.
I've noticed some concerning variability when I set different random seeds. I understand from #53 that preProcSample randomly selects SNPs from the input matrix, which affects downstream results. I ran 10 tests on the same SNP matrix file with seeds set from 1 to 10 and collected summary statistics after running the whole workflow. The results are below:
| seed | n_snps | n_hets | pct_hets | n_segs | dipLogR | purity | ploidy |
|------|--------|--------|----------|--------|---------|--------|--------|
| 1 | 422984 | 33283 | 0.07868619 | 386 | -0.8568592 | 0.8300489 | 3.954321 |
| 2 | 422849 | 33294 | 0.07873733 | 350 | -0.9517581 | 0.8853103 | 4.110511 |
| 3 | 422796 | 33265 | 0.07867861 | 368 | -0.9274981 | 0.8898704 | 4.027205 |
| 4 | 422947 | 33281 | 0.07868835 | 356 | -0.6327828 | 0.9080758 | 3.212570 |
| 5 | 422948 | 33294 | 0.07871890 | 335 | -0.9326662 | 0.8872234 | 4.048639 |
| 6 | 422914 | 33272 | 0.07867321 | 363 | -0.9533264 | 0.8823996 | 4.122240 |
| 7 | 422891 | 33281 | 0.07869877 | 357 | -0.9079667 | 0.9072686 | 3.931950 |
| 8 | 422910 | 33294 | 0.07872597 | 358 | -0.6944477 | 0.9000032 | 3.373917 |
| 9 | 422837 | 33274 | 0.07869226 | 352 | -0.7299356 | 0.9003956 | 3.462835 |
| 10 | 422731 | 33292 | 0.07875457 | 351 | -0.9501981 | 0.8891565 | 4.096679 |
The minimal variation in the first four columns makes sense to me as a result of randomizing the SNP selection. However, the ranges of dipLogR, purity, and ploidy are somewhat concerning (similar to the results noted in #59). The spread in ploidy is the hardest to interpret; I'm not sure how much confidence to place in any single result when the estimates vary this much.
My instinct is to use these random tests to infer a distribution of values for purity and ploidy, and report the mean or median as the most likely result. Any other suggestions for rationalizing these differences?
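To make the idea concrete, here is a minimal sketch (in Python, just for illustration) that summarizes the purity and ploidy columns from the table above with the median, mean, and range. It only aggregates the ten reported values; it doesn't address whether any single fit is correct, and averaging across runs that landed on different dipLogR values may mix genuinely distinct fit modes.

```python
from statistics import mean, median

# purity and ploidy estimates from the 10 seeded runs reported above
purity = [0.8300489, 0.8853103, 0.8898704, 0.9080758, 0.8872234,
          0.8823996, 0.9072686, 0.9000032, 0.9003956, 0.8891565]
ploidy = [3.954321, 4.110511, 4.027205, 3.212570, 4.048639,
          4.122240, 3.931950, 3.373917, 3.462835, 4.096679]

for name, values in [("purity", purity), ("ploidy", ploidy)]:
    print(f"{name}: median={median(values):.4f}, "
          f"mean={mean(values):.4f}, "
          f"range={max(values) - min(values):.4f}")
```

The median seems preferable to the mean here, since the low-ploidy runs (seeds 4, 8, 9) look like a separate cluster of fits rather than noise around a single value.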
On the surface, these ranges may not seem that significant, but I'm also noticing pretty big differences in the CN/CF plots with different random seeds (e.g., large chromosomal segments being called at very disparate CNs across different iterations). I can share these plots if that would be helpful. The resulting spider plots are pretty similar overall; does it make sense to use those to select the "best" fit and consider that iteration the most likely "true" result for downstream analysis? Just want to make sure I'm making accurate inferences in light of this apparent variability.
Any insights would be greatly appreciated! Thanks.