Replies: 2 comments 1 reply
-
I am thinking about using a graph database to accommodate the entity-centric approach that Dr. Brent brought up -- each regulator/experiment can be a node, and the p-value/effect value can be an edge between two nodes, e.g. a regulator and a gene. This seems to make sense for what Dr. Brent prefers to have.
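To make the idea concrete, here is a minimal sketch of the node/edge model: regulators and genes as nodes, with the statistics stored as attributes on the edge between them. The class, the gene name MET6, and the example values are all illustrative assumptions, not an actual schema.

```python
class Graph:
    """Minimal adjacency-dict graph with attribute-bearing edges."""

    def __init__(self):
        self.nodes = {}   # node id -> attribute dict
        self.edges = {}   # (source, target) -> attribute dict

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, target, **attrs):
        self.edges[(source, target)] = attrs


# One regulator node, one gene node, and an edge carrying the statistics.
g = Graph()
g.add_node("CBF1", kind="regulator")
g.add_node("MET6", kind="gene")  # hypothetical target gene for illustration
g.add_edge("CBF1", "MET6", p_value=1e-4, effect_value=2.3)

print(g.edges[("CBF1", "MET6")]["p_value"])  # 0.0001
```

A real graph database (or a library like networkx) would replace this class, but the shape of the data -- statistics living on edges rather than in per-dataset tables -- is the point.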
-
The first method of filtering will be dataset-centric: the user needs to know what data is available. Remember that each repository might store more than one dataset, and all are classified by type. Note that this typing and naming is meant for the developer -- we might need to rename things for a user; for the time being, just use these types directly.

Each of the datasets has metadata. Some of the datasets in a repo share the same metadata (e.g. maybe an annotated feature dataset and a genome map dataset share the same metadata), and sometimes they have different metadata (Mahendrawada has the ChEC data and RNAseq data in the same repo, and the metadata are unique to each of those).

The flow might look like this:

1. At the first level, the user selects what type of data they want to look at (e.g., annotated feature).
2. Then they select from the set of binding data and the set of perturbation data. It shouldn't be required that they select from both.
3. After that, there will be a set of fields that are available from all of the datasets. Those should now be available for the user to select among.

Among those fields will be quantitative metrics, e.g. rank response and/or DTO. Both are calculated between a binding dataset and a perturbation dataset, so this type of filter should only be available if at least one binding dataset and at least one perturbation dataset is available. Note -- these data aren't on Hugging Face yet because I'm not sure how to add them. They might go in their own repository, though we don't have a datatype specification for something like that yet. (If we get to this point, we may need to mock this data; I'm not sure when I'm going to be able to figure out how to actually add it.)
I believe that everything you need about the collection as a whole, each repo, and the individual datasets in each repo can already be retrieved with hfqueryapi, from functions that return metadata at the appropriate level (meaning at the repo level or at the dataset level, depending on what you want to know). See the hfqueryapi tutorial.
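As a rough picture of what "metadata at the appropriate level" means, here is a mocked-up lookup over a tiny in-memory collection. The function name, the collection layout, and the metadata contents are all invented for illustration -- this is not hfqueryapi's actual interface; consult the hfqueryapi tutorial for the real one.

```python
# Invented stand-in for a repo/dataset collection; not the real layout.
COLLECTION = {
    "callingcards": {
        "repo_metadata": {"assay": "calling cards"},
        "datasets": {
            "annotated_features": {
                "fields": ["regulator_locus_tag", "batch", "replicate"],
            },
        },
    },
}


def get_metadata(repo, dataset=None):
    """Repo-level metadata by default; dataset-level when a dataset is named."""
    entry = COLLECTION[repo]
    if dataset is None:
        return entry["repo_metadata"]
    return entry["datasets"][dataset]


print(get_metadata("callingcards"))                         # repo level
print(get_metadata("callingcards", "annotated_features"))   # dataset level
```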
-
In the meeting yesterday, Michael was thinking about the data differently than I do. My conceptual model is typically a database where each dataset is a table.
Michael had a more computer science-y approach, which is to think about what each actual entity is. In this case, each entity is a single experiment on a single regulator (we frequently say transcription factor, but not all of the proteins in all of the datasets are actually DNA binding proteins, so "regulator" is meant to be broad enough to describe all of the proteins).
That means that each item is basically a set of specifications, e.g. a calling cards entity (repo: callingcards, dataset: annotated_features, regulator_locus_tag: CBF1, batch: run_1234, replicate: 1, ... ) and a McIsaac entity (repo: hackett_2020, dataset: hackett_2020, regulator_locus_tag: CBF1, mechanism: ZEV, restriction: P, time: 15, ... ).
Thinking about it this way makes me think that possibly there is a data structure we could use on this set that is different from the more SQL/table-centric structure I have been imagining.
I don't want to see code for this, but I'd love to have a discussion on it -- diagrams, pictures, etc. would help.