[Question] How can I keep using the original data when computing clusters?

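Author: piacere (Beol)   2023-04-18 14:06:07
Hello everyone,
I have some code that uses KNN to compute vector similarity and then clusters the results.
However, I found that once a point has been clustered with its nearest neighbor, the code uses the new averaged coordinates when it evaluates the next-nearest neighbor.
Sometimes 1 and 3 were similar to begin with, but after 1 has been clustered with 2 it is no longer similar to 3 (and the order of the data also changes the result QQ).
How should I modify the code so that every point i first finds its K nearest points using the original coordinates, and the clustering and coordinate averaging happen together only after all the data has been processed? QQ
The functions that might be causing the problem are below: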
import numpy as np

def KNN(i, k, data_mid_point, tree):
    # Ask for k+1 neighbors because the query point itself is normally among the results.
    dist, ind = tree.query(np.expand_dims(data_mid_point[i], axis=0), k=k+1)
    nearest_ids = list(ind[0])
    if i in nearest_ids:
        nearest_ids.remove(i)
    else:
        nearest_ids = nearest_ids[:-1]
    distances = []
    for j in nearest_ids:
        distance = ((data_mid_point[j][0] - data_mid_point[i][0])**2 +
                    (data_mid_point[j][1] - data_mid_point[i][1])**2)**0.5
        distances.append(distance)
    print(f"The {k} nearest IDs to ID {i} are:")
    for j in range(len(nearest_ids)):
        # 0.000009 coordinate units are treated as one meter
        print(f"ID: {nearest_ids[j]}, Distance: {(distances[j]/0.000009)} meters")
    return nearest_ids
def calcClusterFlow(c, data):
    # Weighted mean of columns 0-3 over the members of cluster c, weighted by column 8.
    ox = 0
    oy = 0
    dx = 0
    dy = 0
    for k in c:
        ox += data[k][0]*data[k][8]
        oy += data[k][1]*data[k][8]
        dx += data[k][2]*data[k][8]
        dy += data[k][3]*data[k][8]
    d = 0
    for k in c:
        d += data[k][8]
    ox /= d
    oy /= d
    dx /= d
    dy /= d
    return ox, oy, dx, dy
import math

# Compute similarity
def flowSim(vi, vj, alpha):
    leni = math.sqrt((vi[0]**2+vi[1]**2))
    lenj = math.sqrt((vj[0]**2+vj[1]**2))
    dv = math.sqrt((vi[0] - vj[0]) ** 2 + (vi[1] - vj[1]) ** 2)
    # Divide by the longer of the two vectors (raises ZeroDivisionError if both have length 0).
    if leni > lenj:
        return dv/(alpha*leni)
    else:
        return dv/(alpha*lenj)
# Compute the similarity of the two clusters whose cluster IDs are ci and cj
def clusterSim(i, j, ci, cj, data, alpha):
    # ci and cj are member lists; columns 4-7 of the first member hold the cluster's flow.
    oix, oiy, dix, diy = data[ci[0]][4], data[ci[0]][5], data[ci[0]][6], data[ci[0]][7]
    ojx, ojy, djx, djy = data[cj[0]][4], data[cj[0]][5], data[cj[0]][6], data[cj[0]][7]
    vi = [dix-oix, diy-oiy]
    vj = [djx-ojx, djy-ojy]
    sim = flowSim(vi, vj, alpha)
    return sim
# Merge two clusters
def merge(c, ci_ID, cj_ID, l):
    # Keep the smaller cluster ID
    if ci_ID > cj_ID :
        ci_ID, cj_ID = cj_ID, ci_ID
    for l_ID in c[cj_ID]:
        l[l_ID] = ci_ID
        c[ci_ID].append(l_ID)
    c.pop(cj_ID)
The main loop is here:
from tqdm import tqdm

for i in tqdm(range(dataLen)):
    neighbors = KNN(i, K, data_mid_point, tree)
    for j in neighbors:
        # Skip neighbors farther away than Radius meters
        if (data_mid_point[i][0]-data_mid_point[j][0])**2+(data_mid_point[i][1]-data_mid_point[j][1])**2>(Radius*0.000009)**2:
            continue
        if l[i] != l[j]:
            if clusterSim(i, j, c[l[i]], c[l[j]], data, alpha) <= 1:
                new_cluster_ID = min(l[i],l[j])
                num_of_flow_in_cluster=0
                merge(c, l[i], l[j], l)
                for m in c[new_cluster_ID]:
                    num_of_flow_in_cluster+=data[m][8]
                # The mean is the same for every member, so compute it once.
                cox, coy, cdx, cdy = calcClusterFlow(c[new_cluster_ID],data)
                for m in c[new_cluster_ID]:
                    # Overwrite columns 4-7 and 9 of every member with the merged cluster's values.
                    data[m][4], data[m][5], data[m][6], data[m][7], data[m][9] = cox, coy, cdx, cdy, num_of_flow_in_cluster
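Right now I suspect the merge step is the most likely culprit. I asked ChatGPT, but it didn't seem to understand the result I want either.
Any help would be greatly appreciated QQ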
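One way to get the behavior asked for above can be sketched as three phases: decide every pair from the original vectors, merge afterwards, then average once at the end. This is only a sketch under assumptions, not the code from the post: `cluster_two_phase`, `flow_sim`, and `find` are hypothetical names, each flow is assumed to be `(ox, oy, dx, dy)` with a separate weight list, the neighbor lists are assumed to be precomputed, and a union-find replaces the `merge` function.

```python
import math

def flow_sim(vi, vj, alpha):
    """Same ratio as flowSim in the post: difference over alpha * longer length."""
    longest = max(math.hypot(vi[0], vi[1]), math.hypot(vj[0], vj[1]))
    if longest == 0:
        return 0.0  # assumption: two zero-length flows count as identical
    return math.hypot(vi[0] - vj[0], vi[1] - vj[1]) / (alpha * longest)

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_two_phase(flows, weights, neighbors, alpha):
    """flows[i] = (ox, oy, dx, dy); neighbors[i] = candidate ids for flow i."""
    # Frozen original vectors; nothing below ever writes to them.
    vec = [(f[2] - f[0], f[3] - f[1]) for f in flows]
    # Phase 1: decide every pair from the original vectors only.
    pairs = [(i, j) for i in range(len(flows)) for j in neighbors[i]
             if flow_sim(vec[i], vec[j], alpha) <= 1]
    # Phase 2: merge afterwards; the smaller id stays as the cluster id.
    parent = list(range(len(flows)))
    for i, j in pairs:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    clusters = {}
    for i in range(len(flows)):
        clusters.setdefault(find(parent, i), []).append(i)
    # Phase 3: weighted mean flow per cluster, computed once at the end.
    means = {}
    for cid, members in clusters.items():
        total = sum(weights[m] for m in members)
        means[cid] = tuple(sum(flows[m][k] * weights[m] for m in members) / total
                           for k in range(4))
    return clusters, means
```

Because the pair decisions no longer depend on earlier merges, the result does not change with the order of the data. The trade-off is chaining: if 1 is similar to both 2 and 3, all three end up in one cluster even when 2 and 3 are not similar to each other, which may or may not be what is wanted.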
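Author: wuyiulin (龍破壞劍士-巴斯達布雷達)   2023-04-18 16:17:00
Suppose you run KNN on point X1 and get the first layer of neighbors x_1j. You need to store the coordinates of those x_1j and pass them down for the second layer. So there is probably a mean somewhere; take it out or adjust it and you should be fine. Also, why are you implementing this with a queue... that seems odd.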
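To make the "there is a mean somewhere" point concrete: in the posted loop, columns 4-7 of `data` are overwritten with the cluster's weighted mean after each merge, and `clusterSim` reads those same columns. The numbers below are made up, and equal weights plus `alpha = 0.7` are assumptions; the sketch only shows how an averaged vector can fail a test that the original vector passes.

```python
import math

def flow_sim(vi, vj, alpha):
    # Same formula as flowSim in the post.
    longest = max(math.hypot(*vi), math.hypot(*vj))
    return math.hypot(vi[0] - vj[0], vi[1] - vj[1]) / (alpha * longest)

alpha = 0.7                                          # hypothetical value
v1, v2, v3 = (10.0, 0.0), (10.0, 8.0), (10.0, -8.0)  # made-up flow vectors

# 1 vs 3 on the original vectors
before = flow_sim(v1, v3, alpha)

# After 1 and 2 merge with equal weights, the stored vector is their mean.
mean12 = ((v1[0] + v2[0]) / 2, (v1[1] + v2[1]) / 2)

# Cluster {1, 2} vs 3 on the averaged vector
after = flow_sim(mean12, v3, alpha)

print(before <= 1, after <= 1)   # prints: True False
```

Keeping an untouched copy of the original vectors and reading similarities from that copy, rather than from the columns that receive the mean, would avoid this drift.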
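Author: piacere (Beol)   2023-04-18 20:40:00
To the poster above: thank you for the answer, but I don't understand it.... Right now I simply can't find where the coordinates also get merged after clustering TT. By the way, I am using a ball-tree.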
Author: wuyiulin (龍破壞劍士-巴斯達布雷達)   2023-04-19 08:03:00
I was talking about brute force. If it's a ball-tree, I'd have to think about it.
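For comparison, a brute-force version of the neighbor search could look like the sketch below. It builds every point's neighbor list from the original midpoints before any merging happens. `brute_knn_all` is a hypothetical name, and `radius` is assumed to be in the same units as the coordinates (the post converts meters with `Radius*0.000009`). As far as the posted snippets show, `data_mid_point` is never written to, so a ball-tree built on it should return the same neighbors; the tree mainly changes the speed.

```python
import math

def brute_knn_all(points, k, radius):
    """For every point, ids of up to k nearest other points within radius."""
    result = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to every other point, nearest first.
        cand = sorted((math.hypot(xj - xi, yj - yi), j)
                      for j, (xj, yj) in enumerate(points) if j != i)
        # Keep the k nearest, then drop any that lie outside the radius.
        result.append([j for d, j in cand[:k] if d <= radius])
    return result

# Made-up midpoints for illustration
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (1.5, 0.0)]
neighbors = brute_knn_all(points, k=2, radius=2.0)
```

This is O(n^2) in the number of points, so it is only practical for small data sets, but the lists it produces are fixed up front and cannot be affected by later averaging.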
