Loops are Hard [Python]
An answer to this question on Stack Overflow.
Question
Not a programmer by vocation, please excuse if this is obvious. I cannot loop :/ ...
I have 3 lists:
gene_concepts[0] = ['+0|+77|CFTR', '+12|+77|CYP2C19']
genes = ['CFTR', 'CFTR', 'CFTR', 'CFTR', 'CFTR', 'CFTR', 'CFTR',
         'CFTR', 'CFTR', 'CFTR', 'CFTR', 'CFTR', 'CYP2C19', 'CYP2C19',
         'CYP2C19', 'CYP2C19', 'CYP2C19', 'CYP2C19', 'CYP2C19', 'CYP2C19']
haplotypes = ['CFTR F508del(CTT)', 'CFTR F508del(TCT)', 'CFTR G1244E',
              'CFTR G1349D', 'CFTR G178R', 'CFTR G551D', 'CFTR G551S', 'CFTR S1251N',
              'CFTR S1255P', 'CFTR S549N', 'CFTR S549R(A>C)', 'CFTR S549R(T>G)',
              'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10',
              'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10']
Note that the haplotypes and genes match (i.e., the first term of the string in the haplotype list is "CFTR", and matches the first element of the list in genes list...so these are ordered)
I want to build a new list, or just output a set of strings, in which every haplotype belonging to the same gene (matched either through the genes list or through the leading substring of the haplotype string, whichever is easier) is assigned that gene's code. The code comes from the gene_concepts list: it is the first field, before the first "|" delimiter, of the matching string.
Output desired is:
+21|+0|CFTR F508del(CTT)
+22|+0|CFTR F508del(TCT)
+23|+0|CFTR G1244E
+24|+0|CFTR G1349D
+25|+0|CFTR G178R
+26|+0|CFTR G551D
+27|+0|CFTR G551S
+28|+0|CFTR S1251N
+29|+0|CFTR S1255P
+30|+0|CFTR S549N
+31|+0|CFTR S549R(A>C)
+32|+0|CFTR S549R(T>G)
+33|+12|CYP2C19 *10
+34|+12|CYP2C19 *10
+35|+12|CYP2C19 *10
+36|+12|CYP2C19 *10
+37|+12|CYP2C19 *10
+38|+12|CYP2C19 *10
+39|+12|CYP2C19 *10
+40|+12|CYP2C19 *10
So the first part of each line above ("+21" through "+40") is temp_code_2...this is just an arbitrary id I've assigned to keep track. The part between the delimiters is the code I'm trying to assign to the matching genes. The last part, after the 2nd delimiter, is the haplotype.
Here's my code so far...
def generate_haplotype_concepts(gene_concepts[0], haplotypes):
    temp_code_2 = 20
    index = 0
    for batch_line in gene_concepts[0]:
        gene_parent_code = batch_line.split("|")[0]
        gene_parent_medcodes.append(gene_parent_code)
    index_gene = 0
    index_parent_code = 0
    for gene in genes:
        if (index_gene == 0):
            print("+" + str(temp_code_2) + "|"
                  + gene_parent_medcodes[index_parent_code] + "|"
                  + haplotypes[index_gene])
            index_gene += 1
        elif (genes[index_gene] == genes[index_gene-1]):
            print("+" + str(temp_code_2) + "|"
                  + gene_parent_medcodes[index_parent_code] + "|"
                  + haplotypes[index_gene-1])
        else:
            index_parent_code += 1
            print("+" + str(temp_code_2) + "|"
                  + gene_parent_medcodes[index_parent_code] + "|"
                  + haplotypes[index_gene])
        index_gene += 1
        temp_code_2 += 1

generate_haplotype_concepts(gene_concepts[0], haplotypes)
My output is this:
+21|+0|CFTR F508del(CTT)
+22|+0|CFTR F508del(TCT)
+23|+0|CFTR G1244E
+24|+0|CFTR G1349D
+25|+0|CFTR G178R
+26|+0|CFTR G551D
+27|+0|CFTR G551S
+28|+0|CFTR S1251N
+29|+0|CFTR S1255P
+30|+0|CFTR S549N
+31|+0|CFTR S549R(A>C)
+32|+12|CYP2C19 *10
+33|+12|CYP2C19 *10
+34|+12|CYP2C19 *10
+35|+12|CYP2C19 *10
+36|+12|CYP2C19 *10
+37|+12|CYP2C19 *10
+38|+12|CYP2C19 *10
+39|+12|CYP2C19 *10
2 problems I see...I'm missing the last CFTR haplotype (+32|+0|CFTR S549R(T>G) should be there instead) and I'm getting a "list index out of range" error.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-16-1410b2513457> in <module>()
     55
     56
---> 57 generate_haplotype_concepts(gene_concepts[0], haplotypes)

<ipython-input-16-1410b2513457> in generate_haplotype_concepts(temp_code_2, haplotypes)
     30                 # + "\n" )
     31             index_gene += 1
---> 32         elif (genes[index_gene] == genes[index_gene-1]):
     33             print("+" + str(temp_code_2) + "|"
     34                 + gene_parent_medcodes[index_parent_code] + "|"

IndexError: list index out of range
Apologies for any typos I've made...I've tried posting simpler code than what I'm actually doing but the issue is the same...any help is appreciated!
Answer
The following may be helpful (note the importance of checking for unexpected conditions):
haplotypes = ['CFTR F508del(CTT)', 'CFTR F508del(TCT)', 'CFTR G1244E', 'CFTR G1349D', 'CFTR G178R', 'CFTR G551D', 'CFTR G551S', 'CFTR S1251N', 'CFTR S1255P', 'CFTR S549N', 'CFTR S549R(A>C)', 'CFTR S549R(T>G)', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10', 'CYP2C19 *10']
gene_concepts = {'CFTR':0, 'CYP2C19':12} #Dictionaries are useful
for x in haplotypes:
    prefix = x.split()[0] #Get prefix by splitting on spaces and looking at substring before first space
    if prefix in gene_concepts: #Do we recognize this gene concept?
        print("{0}|{1}".format(gene_concepts[prefix],x))
    else: #If not, inform the user
        print('Gene with unknown concept: "{0}"'.format(x))
Gives output:
0|CFTR F508del(CTT)
0|CFTR F508del(TCT)
0|CFTR G1244E
0|CFTR G1349D
0|CFTR G178R
0|CFTR G551D
0|CFTR G551S
0|CFTR S1251N
0|CFTR S1255P
0|CFTR S549N
0|CFTR S549R(A>C)
0|CFTR S549R(T>G)
12|CYP2C19 *10
12|CYP2C19 *10
12|CYP2C19 *10
12|CYP2C19 *10
12|CYP2C19 *10
12|CYP2C19 *10
12|CYP2C19 *10
12|CYP2C19 *10
This may not be exactly what you are looking for, but it is, I think, closer. By changing the values in the dictionary you should be able to achieve what you want.
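To get closer to the exact format requested in the question, the dictionary can be built from the strings in gene_concepts[0] rather than typed by hand, and enumerate() can supply the running id. The following is only a sketch under some assumptions: the gene name is taken to be the last "|"-separated field and the code the first, the ids are assumed to start at 21 as in the desired output, and the parameter names concept_lines and first_id are made up for illustration. The haplotype list is shortened to keep the example small.

```python
# Data from the question (haplotype list shortened for brevity).
gene_concepts = [['+0|+77|CFTR', '+12|+77|CYP2C19']]
haplotypes = ['CFTR F508del(CTT)', 'CFTR F508del(TCT)', 'CYP2C19 *10']


def generate_haplotype_concepts(concept_lines, haplotypes, first_id=21):
    """Return one 'id|parent_code|haplotype' string per haplotype.

    concept_lines and first_id are hypothetical names for this sketch.
    """
    # Map each gene name (assumed: last field) to its code (assumed: first field).
    parent_codes = {}
    for line in concept_lines:
        fields = line.split("|")
        parent_codes[fields[-1]] = fields[0]

    results = []
    # enumerate() provides the running id, so no manual index bookkeeping is needed.
    for temp_id, haplotype in enumerate(haplotypes, start=first_id):
        gene = haplotype.split()[0]  # text before the first space
        if gene not in parent_codes:
            # Fail loudly on an unrecognized gene instead of guessing a code.
            raise ValueError('Gene with unknown concept: "{0}"'.format(haplotype))
        results.append("+{0}|{1}|{2}".format(temp_id, parent_codes[gene], haplotype))
    return results


for line in generate_haplotype_concepts(gene_concepts[0], haplotypes):
    print(line)
```

With the shortened list this prints +21|+0|CFTR F508del(CTT), +22|+0|CFTR F508del(TCT) and +23|+12|CYP2C19 *10, one per line. Iterating over the haplotypes directly also avoids what looks like the cause of both problems in the question: as far as the posted code and traceback show, index_gene is incremented twice on the first pass through the loop, so it runs one ahead of the loop from then on. That would explain why one haplotype is skipped when the gene changes and why the final pass indexes past the end of genes.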