Replies: 2 comments 1 reply
-
I am thinking about using a graph database to accommodate the entity-centric approach that Dr. Brent brought up -- each regulator/experiment can be a node, and the p-value/effect value can be an edge between two nodes, e.g. a regulator and a gene. This seems to make sense for what Dr. Brent prefers to have.
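To make the idea concrete, here is a minimal sketch of the node/edge model: regulators and genes as nodes, with the statistics stored as attributes on the edge between them. The class, the gene name MET6, and the example values are all illustrative assumptions, not an actual schema.

```python
class Graph:
    """Minimal adjacency-dict graph with attribute-bearing edges."""

    def __init__(self):
        self.nodes = {}   # node id -> attribute dict
        self.edges = {}   # (source, target) -> attribute dict

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, target, **attrs):
        self.edges[(source, target)] = attrs


# One regulator node, one gene node, and an edge carrying the statistics.
g = Graph()
g.add_node("CBF1", kind="regulator")
g.add_node("MET6", kind="gene")  # hypothetical target gene for illustration
g.add_edge("CBF1", "MET6", p_value=1e-4, effect_value=2.3)

print(g.edges[("CBF1", "MET6")]["p_value"])  # 0.0001
```

A real graph database (or a library like networkx) would replace this class, but the shape of the data -- statistics living on edges rather than in per-dataset tables -- is the point.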
-
The first method of filtering will be dataset-centric: the user needs to know what data is available. Remember that each repository might store more than one dataset, and all are classified by type. Note that this typing and naming is meant for the developer -- we might need to rename things for a user; for the time being, just use these types directly.

Each of the datasets has metadata. Some of the datasets in a repo share the same metadata (e.g. maybe an annotated feature dataset and a genome map dataset share the same metadata), and sometimes they have different metadata (Mahendrawada has the ChEC data and RNAseq data in the same repo, and the metadata are unique to each of those).

The flow might look like this:

1. At the first level, the user selects what type of data they want to look at (e.g., annotated feature).
2. Then they select from the set of binding data and the set of perturbation data. It shouldn't be required that they select from both.
3. After that, there will be a set of fields that are available from all of the datasets. Those should now be available for the user to select among.

Among those fields will be quantitative metrics, e.g. rank response and/or DTO. Both are calculated between a binding dataset and a perturbation dataset, so this type of filter should only be available if at least one binding dataset and at least one perturbation dataset is available. Note -- these data aren't on Hugging Face yet because I'm not sure how to add them. They might go in their own repository, though we don't have a datatype specification for something like that yet. (If we get to this point, we may need to mock this data; I'm not sure when I'm going to be able to figure out how to actually add it.)
I believe that everything you need about the collection as a whole, each repo, and the individual datasets in each repo can already be retrieved with hfqueryapi, from functions that return metadata at the appropriate level (meaning at the repo level or at the dataset level, depending on what you want to know). See the hfqueryapi tutorial.
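As a rough picture of what "metadata at the appropriate level" means, here is a mocked-up lookup over a tiny in-memory collection. The function name, the collection layout, and the metadata contents are all invented for illustration -- this is not hfqueryapi's actual interface; consult the hfqueryapi tutorial for the real one.

```python
# Invented stand-in for a repo/dataset collection; not the real layout.
COLLECTION = {
    "callingcards": {
        "repo_metadata": {"assay": "calling cards"},
        "datasets": {
            "annotated_features": {
                "fields": ["regulator_locus_tag", "batch", "replicate"],
            },
        },
    },
}


def get_metadata(repo, dataset=None):
    """Repo-level metadata by default; dataset-level when a dataset is named."""
    entry = COLLECTION[repo]
    if dataset is None:
        return entry["repo_metadata"]
    return entry["datasets"][dataset]


print(get_metadata("callingcards"))                         # repo level
print(get_metadata("callingcards", "annotated_features"))   # dataset level
```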
-
In the meeting yesterday, Michael was thinking about the data differently than I do. My conceptual model is typically a database where each dataset is a table.
Michael had a more computer science-y approach, which is to think about what each actual entity is. In this case, each entity is a single experiment on a single regulator (we frequently say transcription factor, but not all of the proteins in all of the datasets are actually DNA binding proteins, so "regulator" is meant to be broad enough to describe all of the proteins).
That means that each item is basically a set of specifications, e.g. a calling cards entity (repo: callingcards, dataset: annotated_features, regulator_locus_tag: CBF1, batch: run_1234, replicate: 1, ... ) and a McIsaac entity (repo: hackett_2020, dataset: hackett_2020, regulator_locus_tag: CBF1, mechanism: ZEV, restriction: P, time: 15, ... ).
Thinking about it this way makes me think that possibly there is a data structure we could use on this set that is different from the more SQL/table-centric structure I have been imagining.
I don't want to see code for this, but I'd love to have a discussion on it -- diagrams, pictures, etc. would help.