*http://spssx-discussion.1045642.n5.nabble.com/Analyzing-Sequences-help-with-LOOPS-amp-VECTORS-td5729034.html.
*XOXO subsequences.
DATASET CLOSE ALL.
OUTPUT CLOSE ALL.

BEGIN PROGRAM Python.
from datetime import datetime
begin =  datetime.now()
print(begin)
END PROGRAM.


*****************************************. 
*Simulate 100,000 sequences of length 20. 
INPUT PROGRAM. 
LOOP #i = 1 TO 1e5. 
  COMPUTE SEQNO = #i. 
  END CASE. 
END LOOP. 
END FILE. 
END INPUT PROGRAM. 
DATASET NAME Test. 
STRING S (A20).
LOOP #i = 1 TO 20.
  DO IF (RV.BERNOULLI(0.7) = 0).
    COMPUTE S = CONCAT(RTRIM(S),"O").
  ELSE.
    COMPUTE S = CONCAT(RTRIM(S),"X").
  END IF.
END LOOP.
EXECUTE. 
*****************************************. 

*******************************************************.
*This splits the string into all contiguous substrings.
*There are 20 of length 1, 19 of length 2, and so on, so 20 + 19 + ... + 1 = 210 in total.
VECTOR StrSub(210,A20).
COMPUTE #Ord = 1.
LOOP #i = 1 TO 20.
  LOOP #j = 1 TO (21-#i).
    COMPUTE StrSub(#Ord) = CHAR.SUBSTR(S,#j,#i).
    COMPUTE #Ord = #Ord + 1.
  END LOOP.
END LOOP.
VARSTOCASES /MAKE StrSub FROM StrSub1 TO StrSub210.
*The restructuring above works fine for 1 million cases.
*******************************************************.

*******************************************************.
*The AGGREGATE below is the bottleneck.
*My machine runs out of memory when aggregating more than roughly 30 million cases.
*Not sure whether, for memory management, it is better to sort first and use PRESORTED or to skip the sort.
DATASET DECLARE Agg.
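*A sketch of the PRESORTED alternative mentioned above, left commented out.
*It would replace the AGGREGATE command that follows. Whether it actually lowers memory use is untested here.
 * SORT CASES BY SEQNO StrSub.
 * AGGREGATE OUTFILE='Agg'
 *   /PRESORTED
 *   /BREAK SEQNO StrSub
 *   /NSub = N.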
AGGREGATE OUTFILE='Agg'
  /BREAK SEQNO StrSub
  /NSub = N.
DATASET ACTIVATE Agg.
*Use CASESTOVARS here to correlate the counts of different substrings within sequences.
*Each unique SEQNO is then one observation.
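*A minimal sketch of that idea, left commented out and untested.
*It assumes only a few substrings are of interest. XO, OX and XOXO are chosen purely for illustration.
*The dataset name Wide and the NSub_ variable names are assumptions. Check the names CASESTOVARS actually creates.
*Sequences containing none of the selected substrings drop out at the SELECT IF.
*The copy keeps Agg intact, and Agg is reactivated at the end because the next step aggregates it.
 * DATASET COPY Wide.
 * DATASET ACTIVATE Wide.
 * SELECT IF ANY(StrSub,"XO","OX","XOXO").
 * CASESTOVARS /ID = SEQNO /INDEX = StrSub /SEPARATOR = "_".
 * RECODE NSub_XO NSub_OX NSub_XOXO (SYSMIS = 0).
 * CORRELATIONS /VARIABLES = NSub_XO NSub_OX NSub_XOXO.
 * DATASET ACTIVATE Agg.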
*******************************************************.

*******************************************************.
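*This counts across the whole population of sequences.
*TotalSub is the total number of occurrences of a substring.
*TotalSeq is the number of sequences that contain it at least once.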
DATASET DECLARE AggAll.
AGGREGATE OUTFILE='AggAll'
  /BREAK StrSub
  /TotalSub = SUM(NSub)
  /TotalSeq = N.
DATASET ACTIVATE AggAll.
*******************************************************.


*******************************************************.
*Now make a dataset of all possible X/O strings of length 1 to 20.
*That is 2 + 4 + ... + 2**20 = 2,097,150 strings. This only needs to be done once.

*Several different ways to do this are discussed in the thread below.
*http://spssx-discussion.1045642.n5.nabble.com/How-to-enumerate-all-the-combinations-in-SPSS-td5727284.html.
DEFINE !Combinations (Set = !TOKENS(1) 
                     /Len = !TOKENS(1) ) 
INPUT PROGRAM. 
!LET !Str = " ". 
!LET !LisVar = "". 
!DO !I = 1 !TO !Len 
  !LET !Ind = !CONCAT("#",!LENGTH(!Str)) 
  LOOP !Ind = 1 TO !Set. 
  !LET !Str = !CONCAT(!Str," ") 
  !LET !LisVar = !CONCAT(!LisVar," ",!Ind) 
!DOEND 
VECTOR X(!Len). 
DO REPEAT L = !LisVar /X = X1 TO !CONCAT("X",!Len). 
  COMPUTE X = L. 
END REPEAT. 
END CASE. 
!DO !I = 1 !TO !Len 
  END LOOP. 
!DOEND 
END FILE. 
END INPUT PROGRAM. 
DATASET NAME Comb. 
EXECUTE. 
!ENDDEFINE. 

DEFINE !SubComb2 (!POSITIONAL = !TOKENS(1) )
*Make base dataset.
!Combinations Set = 2 Len = 1.
DATASET ACTIVATE Comb.
DATASET NAME Base.

!DO !I = 2 !TO !1
  !Combinations Set = 2 Len = !EVAL(!I).
  DATASET ACTIVATE Base.
  ADD FILES FILE = * /FILE = 'Comb'.
  DATASET CLOSE Comb.
!DOEND
!ENDDEFINE.

!SubComb2 20.
DATASET ACTIVATE Base.
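*Convert the numeric codes to a string: 1 becomes O and 2 becomes X.
*Shorter rows have missing values in the trailing X variables, and DO IF skips those.
STRING StrSub (A20).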
DO REPEAT X = X1 TO X20.
  DO IF X = 1.
    COMPUTE StrSub = CONCAT(RTRIM(StrSub),"O").
  ELSE.
    COMPUTE StrSub = CONCAT(RTRIM(StrSub),"X").
  END IF.
END REPEAT.
*******************************************************.

*******************************************************.
*Add in the observed counts. Strings that were never observed get a count of 0.
SORT CASES BY StrSub.
MATCH FILES FILE = *
  /FILE = 'AggAll'
  /BY StrSub
  /DROP X1 TO X20.
RECODE TotalSub TotalSeq (SYSMIS = 0)(ELSE = COPY).
*******************************************************.


*Takes a little over 4 minutes on my machine for 100,000 cases.
BEGIN PROGRAM Python.
end =  datetime.now()
print(begin)
print(end)
print (end-begin)
END PROGRAM.

*Alternative way to make the dataset of all possible strings, using Python.
*It is commented out because it takes forever.
 * DATASET DECLARE AllPerm.
 * BEGIN PROGRAM Python.
 * import spss
import itertools as it

 * #opening data and appending variables
spss.StartDataStep()
datasetObj = spss.Dataset(name='AllPerm')
datasetObj.varlist.append('StrSub',20)

 * #now adding rows
YourSet = ['X','O']
NG = range(1,21)

 * for N in NG:
  x = it.product(YourSet,repeat=N)
  for i in x:
    b = "".join(i)
    datasetObj.cases.append([b])

 * spss.EndDataStep()
END PROGRAM.
 * DATASET ACTIVATE AllPerm.
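
*A possible faster variant of the Python approach above, left commented out and untimed.
*It is only a sketch. Python writes every X/O string of length 1 to 20 to a text file.
*DATA LIST then reads the file back in one pass instead of appending cases one at a time.
*The temp-file location and the file name AllPerm.txt are assumptions.
 * BEGIN PROGRAM Python.
 * import itertools as it
 * import os
 * import tempfile
 * import spss
 * # Assumed scratch location for the roughly 2.1 million strings.
 * path = os.path.join(tempfile.gettempdir(), 'AllPerm.txt')
 * with open(path, 'w') as fh:
 *     for n in range(1, 21):
 *         for combo in it.product('XO', repeat=n):
 *             fh.write(''.join(combo) + '\n')
 * # Read the file as one fixed-width string column of width 20.
 * spss.Submit(["DATA LIST FILE='%s' FIXED /StrSub 1-20 (A)." % path,
 *              "DATASET NAME AllPerm."])
 * END PROGRAM.
 * DATASET ACTIVATE AllPerm.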