Potts Mixture Models for Protein Residue Direct Coupling Analysis
Direct coupling analysis (DCA) is used to identify coevolving protein residues and to understand structural and functional constraints in proteins. In such methods, a Potts model is fit to a multiple sequence alignment (MSA) of the proteins, and all sequences in the MSA are assumed to come from a single underlying Potts model. However, MSAs can be heterogeneous: for example, in an MSA of the murA-lpxC system, only a subset of the sequences interact. As an undergraduate researcher in the Marks Lab at Harvard Medical School, I implemented a mixture model with the EVcouplings pipeline, using an expectation-maximization algorithm to simultaneously infer the class assignment of each sequence in an MSA and the Potts model parameters of each class. However, the algorithm failed to infer clusters correlated with interaction. This work was supported by the Harvard College Research Program (HCRP).
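
For concreteness, the mixture formulation described above can be sketched in standard DCA notation. This is a generic textbook form of expectation-maximization for a mixture of Potts models, not necessarily the exact objective used in this work. All symbols are assumptions introduced here for illustration: K classes, mixing weights \pi_k, fields h_i^{(k)} and couplings J_{ij}^{(k)} for class k, and responsibilities \gamma_{nk} for sequence \sigma^{(n)} of length L in an MSA of N sequences.

```latex
% Potts model for class k (parameters \theta_k = \{h^{(k)}, J^{(k)}\})
P(\sigma \mid \theta_k) = \frac{1}{Z_k}
  \exp\!\Big( \sum_{i=1}^{L} h_i^{(k)}(\sigma_i)
            + \sum_{1 \le i < j \le L} J_{ij}^{(k)}(\sigma_i, \sigma_j) \Big)

% Mixture over K classes
P(\sigma) = \sum_{k=1}^{K} \pi_k \, P(\sigma \mid \theta_k),
  \qquad \sum_{k=1}^{K} \pi_k = 1

% E-step: posterior class assignment of sequence n
\gamma_{nk} = \frac{\pi_k \, P(\sigma^{(n)} \mid \theta_k)}
                   {\sum_{l=1}^{K} \pi_l \, P(\sigma^{(n)} \mid \theta_l)}

% M-step: update mixing weights and per-class Potts parameters
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk},
  \qquad
\theta_k = \arg\max_{\theta} \sum_{n=1}^{N} \gamma_{nk} \, \log P(\sigma^{(n)} \mid \theta)
```

The partition function Z_k is intractable for realistic sequence lengths, so an exact E-step and M-step are not generally feasible. Practical implementations typically substitute an approximation such as the pseudo-likelihood; which approximation, if any, was used here is not stated above.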