*http://spssx-discussion.1045642.n5.nabble.com/Analyzing-Sequences-help-with-LOOPS-amp-VECTORS-td5729034.html.
*XOXO subsequences.

DATASET CLOSE ALL.
OUTPUT CLOSE ALL.

BEGIN PROGRAM Python.
from datetime import datetime
begin = datetime.now()
print(begin)
END PROGRAM.

*****************************************.
*Simulate 100,000 sequences of length 20.
INPUT PROGRAM.
LOOP #i = 1 TO 1e5.
  COMPUTE SEQNO = #i.
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Test.
STRING S (A20).
LOOP #i = 1 TO 20.
  DO IF (RV.BERNOULLI(0.7) = 0).
    COMPUTE S = CONCAT(RTRIM(S),"O").
  ELSE.
    COMPUTE S = CONCAT(RTRIM(S),"X").
  END IF.
END LOOP.
EXECUTE.
*****************************************.

*******************************************************.
*This splits the string into all potential substrings.
VECTOR StrSub(210,A20).
COMPUTE #Ord = 1.
LOOP #i = 1 TO 20.
  LOOP #j = 1 TO (21-#i).
    COMPUTE StrSub(#Ord) = CHAR.SUBSTR(S,#j,#i).
    COMPUTE #Ord = #Ord + 1.
  END LOOP.
END LOOP.
VARSTOCASES /MAKE StrSub FROM StrSub1 TO StrSub210.
*This works fine for the 1 million cases.
*******************************************************.

*******************************************************.
*Above works fine for 1 million cases, below is the bottleneck though.
*My machine runs out of memory when doing the aggregate for over ~30 million cases.
*Not sure for memory management if better to sort and use PRESORTED or no sort.
DATASET DECLARE Agg.
AGGREGATE OUTFILE='Agg'
  /BREAK SEQNO StrSub
  /NSub = N.
DATASET ACTIVATE Agg.
*Use CASESTOVARS here if you want to calculate correlation between different substrings within.
*With each unique SEQNO as an observation.
*******************************************************.

*******************************************************.
*This counts population wise.
DATASET DECLARE AggAll.
AGGREGATE OUTFILE='AggAll'
  /BREAK StrSub
  /TotalSub = SUM(NSub)
  /TotalSeq = N.
DATASET ACTIVATE AggAll.
*******************************************************.
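*For comparison, a minimal plain-Python sketch of the same substring counting.
*It is an illustration only, not part of the SPSS job: all_substrings, count_substrings and the toy data are hypothetical names.
*It assumes all sequences fit in memory, and it mirrors the VECTOR loop and the two AGGREGATE steps above.

```python
from collections import Counter


def all_substrings(s):
    """Yield every contiguous substring of s, shortest first.

    Mirrors the nested LOOP over length (#i) and start position (#j);
    a string of length 20 yields 210 substrings.
    """
    n = len(s)
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            yield s[start:start + length]


def count_substrings(sequences):
    """Return (total_sub, total_seq) as two Counters keyed by substring.

    total_sub: total occurrences across all sequences (like SUM(NSub)).
    total_seq: number of sequences containing the substring (like N).
    """
    total_sub = Counter()
    total_seq = Counter()
    for s in sequences:
        per_seq = Counter(all_substrings(s))  # counts within one sequence
        total_sub.update(per_seq)             # add the occurrence counts
        total_seq.update(per_seq.keys())      # add 1 per distinct substring
    return total_sub, total_seq


# Hypothetical toy data, much shorter than the simulated length-20 strings.
example = ["XOXO", "XXOX"]
total_sub, total_seq = count_substrings(example)
```

*For the toy data, total_sub["XO"] is 3 (two hits in XOXO, one in XXOX) and total_seq["XO"] is 2.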
*******************************************************.
*Now need to make a dataset of all the potential permutations.
*Only need to do this once.
*See http://spssx-discussion.1045642.n5.nabble.com/How-to-enumerate-all-the-combinations-in-SPSS-td5727284.html.
*For several different ways to do this.
DEFINE !Combinations (Set = !TOKENS(1) /Len = !TOKENS(1) )
INPUT PROGRAM.
!LET !Str = " ".
!LET !LisVar = "".
!DO !I = 1 !TO !Len
  !LET !Ind = !CONCAT("#",!LENGTH(!Str))
  LOOP !Ind = 1 TO !Set.
  !LET !Str = !CONCAT(!Str," ")
  !LET !LisVar = !CONCAT(!LisVar," ",!Ind)
!DOEND
VECTOR X(!Len).
DO REPEAT L = !LisVar /X = X1 TO !CONCAT("X",!Len).
  COMPUTE X = L.
END REPEAT.
END CASE.
!DO !I = 1 !TO !Len
  END LOOP.
!DOEND
END FILE.
END INPUT PROGRAM.
DATASET NAME Comb.
EXECUTE.
!ENDDEFINE.

DEFINE !SubComb2 (!POSITIONAL = !TOKENS(1) )
*Make base dataset.
!Combinations Set = 2 Len = 1.
DATASET ACTIVATE Comb.
DATASET NAME Base.
!DO !I = 2 !TO !1
  !Combinations Set = 2 Len = !EVAL(!I).
  DATASET ACTIVATE Base.
  ADD FILES FILE = * /FILE = 'Comb'.
  DATASET CLOSE Comb.
!DOEND
!ENDDEFINE.

!SubComb2 20.
DATASET ACTIVATE Base.
STRING StrSub (A20).
DO REPEAT X = X1 TO X20.
  DO IF X = 1.
    COMPUTE StrSub = CONCAT(RTRIM(StrSub),"O").
  ELSE.
    COMPUTE StrSub = CONCAT(RTRIM(StrSub),"X").
  END IF.
END REPEAT.
*******************************************************.

*******************************************************.
*Add in observed permutations.
SORT CASES BY StrSub.
MATCH FILES FILE = *
  /FILE = 'AggAll'
  /BY StrSub
  /DROP X1 TO X20.
RECODE TotalSub TotalSeq (SYSMIS = 0)(ELSE = COPY).
*******************************************************.

*Takes a little over 4 minutes on my machine for 100,000 cases.
BEGIN PROGRAM Python.
end = datetime.now()
print(begin)
print(end)
print(end-begin)
END PROGRAM.

*Now need to make a dataset of all the potential permutations.
*Using Python, this is taking forever.
* DATASET DECLARE AllPerm.
* BEGIN PROGRAM Python.
* import spss
import itertools as it

* #opening data and appending variables
spss.StartDataStep()
datasetObj = spss.Dataset(name='AllPerm')
datasetObj.varlist.append('StrSub',20)

* #now adding rows
YourSet = ['X','O']
NG = range(1,21)

* for N in NG:
    x = it.product(YourSet,repeat=N)
    for i in x:
        b = "".join(i)
        datasetObj.cases.append([b])

* spss.EndDataStep()
END PROGRAM.

* DATASET ACTIVATE AllPerm.
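
*A minimal sketch of the enumeration on its own, in plain Python without the spss module.
*all_xo_strings and expected_count are hypothetical helper names, not part of any SPSS API.
*The closed-form count gives 2,097,150 strings for lengths 1 to 20, which is the number of rows the commented-out block above would append one at a time.

```python
import itertools as it


def all_xo_strings(max_len, alphabet=("O", "X")):
    """Lazily yield every string over `alphabet` with length 1..max_len.

    Being a generator, it never holds the full list in memory unless the
    caller chooses to materialize it.
    """
    for n in range(1, max_len + 1):
        for combo in it.product(alphabet, repeat=n):
            yield "".join(combo)


def expected_count(max_len, k=2):
    """Number of strings the generator yields: sum of k**n for n = 1..max_len."""
    return sum(k ** n for n in range(1, max_len + 1))


# Small demonstration; max_len=3 keeps the list tiny (2 + 4 + 8 = 14 strings).
small = list(all_xo_strings(3))
```

*Checking expected_count(20) before running the full enumeration shows how many rows the target dataset has to hold.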