I imagine @SebKrantz has considered this, so this question is mainly out of curiosity:
Is there a reason that multithreading is not possible for fvar() and fsd()?
For example, could Chan's parallel algorithm be used to combine single-pass results computed on separate chunks of x?
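
To make the idea concrete, here is a minimal sketch in Python rather than collapse's C code. The function names `welford`, `combine`, and `chunked_var` are hypothetical and not part of collapse. I'm assuming each chunk is summarised in a single pass as (count, mean, M2), where M2 is the sum of squared deviations from the chunk mean, and that the summaries are then merged pairwise with Chan's update. The chunks are processed sequentially here; in a threaded version each chunk would go to its own thread and only the merge step would be serial.

```python
from functools import reduce


def welford(chunk):
    """Single-pass summary (count, mean, M2) of one chunk.

    M2 is the sum of squared deviations from the chunk mean.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for v in chunk:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return n, mean, m2


def combine(a, b):
    """Merge two (count, mean, M2) summaries with Chan's pairwise update."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def chunked_var(x, n_chunks=4):
    """Sample variance of x, computed chunk by chunk and then merged."""
    size = max(1, -(-len(x) // n_chunks))  # ceiling division, at least 1
    # Each summary below is independent, so this is the part that
    # could run on separate threads.
    parts = [welford(x[i:i + size]) for i in range(0, len(x), size)]
    n, _, m2 = reduce(combine, parts, (0, 0.0, 0.0))
    return m2 / (n - 1) if n > 1 else float("nan")
```

Since `combine` only needs the three numbers per chunk, the merge cost is proportional to the number of threads, not to the length of x. Grouped computations, weights, and missing values would each need extra handling that this sketch leaves out.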