Calling all experts
Suppose I have 10,000 phrases, e.g. "coffee workshop", "cat owner meeting", and I want to find similar phrases and group them into clusters.
Step 1: English normalisation, i.e. reduce "meeting", "met" to "meet", and so on.
Step 2: for each unique normalised word, build a dictionary of how often it occurs: "meet": 6%, "coffee": 0.5%, "workshop": 5%, ...
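Step 3: build a co-exist-probability matrix for the phrases, where each entry is the product of the probabilities of the normalised words the two phrases share.
"cat ownership talk" vs "cat owner meeting" = prob of "cat" * prob of "own"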
..........
Step 4: build a graph where each phrase is a node; if the co-exist probability of two nodes is smaller than x% (i.e. the words they share are rare enough to be a meaningful match), add an edge between the two nodes.
Step 5: use a graph algorithm to find all the clusters.
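To make the five steps concrete, here is a rough sketch in plain Python, not a definitive implementation. Several things in it are my assumptions and not part of the steps above: the normalize function is a naive, hypothetical stand-in for a real stemmer or lemmatiser; a word's probability is taken as the fraction of phrases containing it; two phrases with no shared words get a co-exist probability of 1.0 so they never link; and "find all clusters" is read as connected components, done here with a small union-find.

```python
from collections import Counter
from itertools import combinations
from math import prod

# Hypothetical stand-in for a real stemmer/lemmatiser (assumption, not from the post).
IRREGULAR = {"met": "meet"}
SUFFIXES = ("ship", "ing", "er")

def normalize(word):
    """Step 1: lower-case, map irregular forms, strip a few suffixes."""
    word = IRREGULAR.get(word.lower(), word.lower())
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Keep at least 3 characters so short words are left alone.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

def tokenize(phrase):
    return frozenset(normalize(w) for w in phrase.split())

def word_probabilities(token_sets):
    """Step 2: fraction of phrases that contain each normalised word."""
    counts = Counter(w for tokens in token_sets for w in tokens)
    return {w: c / len(token_sets) for w, c in counts.items()}

def coexist_probability(a, b, prob):
    """Step 3: product of the probabilities of the shared words."""
    shared = a & b
    if not shared:
        return 1.0  # assumption: no shared words means no evidence of similarity
    return prod(prob[w] for w in shared)

def cluster(phrases, threshold):
    """Steps 4 and 5: link pairs below the threshold, return connected components."""
    token_sets = [tokenize(p) for p in phrases]
    prob = word_probabilities(token_sets)
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if coexist_probability(token_sets[i], token_sets[j], prob) < threshold:
            parent[find(i)] = find(j)  # edge between i and j: merge their clusters

    groups = {}
    for i, phrase in enumerate(phrases):
        groups.setdefault(find(i), []).append(phrase)
    return list(groups.values())

if __name__ == "__main__":
    phrases = [
        "coffee workshop",
        "cat owner meeting",
        "cat ownership talk",
        "coffee tasting workshop",
        "dog walk",
    ]
    for group in cluster(phrases, threshold=0.2):
        print(group)
```

With these five sample phrases and a threshold of 0.2, the two coffee phrases end up in one cluster, the two cat phrases in another, and "dog walk" on its own. The pairwise loop is quadratic, so 10,000 phrases means roughly 50 million comparisons; indexing phrases by word so that only phrases sharing a word are compared would probably be the first thing to change.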
Am I overcomplicating this, or is there already a library that does this kind of thing?