Logistic regression model

Preface

Starting in on logistic regression, which mainly tackles classification problems. Here is a simple binary-classification example I worked through; the code as printed in the book had serious problems, a lot of them, and the pitfalls nearly buried me...

Test code

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import matplotlib.pyplot as plt

# Read the comma-separated data file: two features and a 0/1 label per line
with open('data.txt', 'r') as f:
    data_list = f.readlines()
data_list = [i.split('\n')[0] for i in data_list]
data_list = [i.split(',') for i in data_list]
data = [(float(i[0]), float(i[1]), float(i[2])) for i in data_list]
data = torch.Tensor(data)
x_data = data[:, 0:2]  # features
y_data = data[:, 2]    # labels

# Logistic regression: a single linear layer followed by a sigmoid
class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.lr = nn.Linear(2, 1)
        self.sm = nn.Sigmoid()

    def forward(self, x):
        x = self.lr(x)
        x = self.sm(x)
        return x

logistic_model = LogisticRegression()
loss_fn = nn.BCELoss()
optimizer = optim.SGD(logistic_model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(50000):
    x = Variable(x_data)
    y = Variable(y_data.unsqueeze(1))  # (N,) -> (N, 1) to match the model output

    out = logistic_model(x)
    loss = loss_fn(out, y)
    print_loss = loss.data.item()
    mask = out.ge(0.5).float()  # threshold at 0.5: probabilities -> 0/1 predictions
    correct = (mask == y).sum()
    acc = correct.data.item() / x.size(0)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 5000 == 0:  # report every 5000 epochs (matches the log below)
        print('*' * 10)
        print('epoch{}'.format(epoch + 1))
        print('loss is:{:.4f}'.format(print_loss))
        print('acc is:{:.4f}'.format(acc))

# Scatter the two classes separately
x0 = list(filter(lambda x: x[-1] == 0.0, data))
x1 = list(filter(lambda x: x[-1] == 1.0, data))
plot_x0_0 = [i[0] for i in x0]
plot_x0_1 = [i[1] for i in x0]
plot_x1_0 = [i[0] for i in x1]
plot_x1_1 = [i[1] for i in x1]

plt.plot(plot_x0_0, plot_x0_1, 'ro', label='x_0')
plt.plot(plot_x1_0, plot_x1_1, 'bo', label='x_1')
plt.legend(loc='best')

# Draw the learned decision boundary w0*x + w1*y + b = 0
w0, w1 = logistic_model.lr.weight[0]
w0 = w0.item()
w1 = w1.item()
b = logistic_model.lr.bias.item()
plot_x = np.arange(30, 100, 0.1)
plot_y = (-w0 * plot_x - b) / w1
plt.plot(plot_x, plot_y)
plt.show()
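
A caveat about the code above: torch.autograd.Variable has been deprecated since PyTorch 0.4 (tensors track gradients on their own now), so on a current PyTorch the two Variable(...) wrappers in the loop can simply be dropped. A minimal sketch of the modernized loop body:

# Modern PyTorch (>= 0.4): no Variable wrapper needed, tensors work directly
x = x_data
y = y_data.unsqueeze(1)  # (N,) -> (N, 1), matching the model's output shape

out = logistic_model(x)
loss = loss_fn(out, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()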

Test set (data.txt)

34.62365962451697,78.0246928153624,0
30.2867107622687,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.3855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750265,1
69.07014406283025,52.74046973016765,1
67.9468554771161746,67.857410673128,0

Test results

**********
epoch5000
loss is:0.5205
acc is:0.8889
**********
epoch10000
loss is:0.4586
acc is:0.8889
**********
epoch15000
loss is:0.4240
acc is:0.8889
**********
epoch20000
loss is:0.4026
acc is:0.8333
**********
epoch25000
loss is:0.3885
acc is:0.7778
**********
epoch30000
loss is:0.3786
acc is:0.7778
**********
epoch35000
loss is:0.3714
acc is:0.7778
**********
epoch40000
loss is:0.3661
acc is:0.7778
**********
epoch45000
loss is:0.3620
acc is:0.7778
**********
epoch50000
loss is:0.3589
acc is:0.7778

Result plot

Code analysis

1. The sigmoid() activation function

The core of logistic regression is the sigmoid activation. The model itself is about as simple as it gets: a single linear layer whose output is passed through a sigmoid and sent straight to the output. The function is

sigmoid(x) = 1 / (1 + e^(-x))

After sigmoid activation, every output value lies between 0 and 1. The function changes very quickly around the origin and flattens out rapidly toward 0 and 1 along both tails, which makes it a natural choice of activation for a classification problem.
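
A quick numerical check (a minimal sketch, nothing assumed beyond plain torch):

import torch

# sigmoid(x) = 1 / (1 + exp(-x)): steep around 0, flat at both tails
xs = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
print(torch.sigmoid(xs))
# roughly [4.5e-05, 0.119, 0.500, 0.881, 0.99995]: everything squashed into (0, 1)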

2. The BCELoss() binary cross-entropy function

The BCELoss() formula, for predictions x_i and labels y_i averaged over N samples:

loss = -(1/N) * sum_i [ y_i * log(x_i) + (1 - y_i) * log(1 - x_i) ]

The first argument to BCELoss() must lie between 0 and 1, so it is normally used together with sigmoid().
The practical difference from MSELoss() is how confident mistakes are punished: the log terms in BCELoss() blow up as a prediction approaches the wrong extreme, while MSELoss() caps each sample's error at 1, so badly wrong predictions get diluted away once the errors are averaged. For classification, BCELoss is therefore the better choice.
One more point: because the sigmoid rushes toward 0 and 1 in its tails, its derivative there is tiny, which can leave the gradient weak or make it vanish outright, and that is bad news for backpropagation. With cross-entropy as the loss, the gradient that flows back past the sigmoid no longer carries the sigmoid's derivative as a factor, which largely sidesteps the vanishing-gradient problem.
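
To make that concrete, here is a small sketch of my own (not from the book) comparing the gradient that reaches the pre-sigmoid value z under each loss, at a badly misclassified point:

import torch
import torch.nn as nn

# One badly misclassified sample: the true label is 1 but the logit z is far negative
y = torch.tensor([1.0])

for name, loss_fn in [('BCE', nn.BCELoss()), ('MSE', nn.MSELoss())]:
    z = torch.tensor([-8.0], requires_grad=True)
    out = torch.sigmoid(z)
    loss_fn(out, y).backward()
    print(name, z.grad)

# BCE: dloss/dz = sigmoid(z) - y, roughly -0.9997 (a strong, useful gradient)
# MSE: dloss/dz = 2*(sigmoid(z) - y)*sigmoid'(z), roughly -0.0007 (nearly vanished)
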
But (but, but, but) on my feeble little dataset, MSELoss() actually pulled off an upset:

**********
epoch5000
loss is:0.2167
acc is:0.6111
**********
epoch10000
loss is:0.1857
acc is:0.8333
**********
epoch15000
loss is:0.1663
acc is:0.8889
**********
epoch20000
loss is:0.1538
acc is:0.8889
**********
epoch25000
loss is:0.1453
acc is:0.8889
**********
epoch30000
loss is:0.1392
acc is:0.8889
**********
epoch35000
loss is:0.1348
acc is:0.8889
**********
epoch40000
loss is:0.1314
acc is:0.8889
**********
epoch45000
loss is:0.1288
acc is:0.8889
**********
epoch50000
loss is:0.1267
acc is:0.8889

Result plot:

In fact, accuracy did reach 0.8889 at one point during the BCELoss() training run:
**********
epoch5000
loss is:0.5468
acc is:0.7778
**********
epoch10000
loss is:0.4722
acc is:0.8889
**********
epoch15000
loss is:0.4320
acc is:0.8889

Result plot

But its loss was much larger at that point, so the score later got "corrected" away...

I cannot pin down the exact reason yet; my guess is that the test set is simply too small.
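
For anyone reproducing the comparison, swapping the loss construction should be the only change needed, since the sigmoid output feeds MSELoss() directly (a hypothetical one-line swap; the rest of the test code stays as-is):

loss_fn = nn.MSELoss()  # in place of nn.BCELoss(); out is already squashed into (0, 1)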

3. The ge() function

This appears in the line mask = out.ge(0.5).float() of the test code. It is not a built-in Python function but one of PyTorch's element-wise tensor comparison methods, which mirror Python's comparison operators:
lt(a, b) is equivalent to a < b
le(a, b) is equivalent to a <= b
eq(a, b) is equivalent to a == b
ne(a, b) is equivalent to a != b
gt(a, b) is equivalent to a > b
ge(a, b) is equivalent to a >= b
So mask = out.ge(0.5).float() yields 1 where the prediction is at least 0.5 and 0 otherwise. Note that ge() itself returns a boolean tensor; the .float() call converts it to 0.0/1.0 values so it can be compared directly against y.
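
A tiny demonstration:

import torch

out = torch.tensor([[0.2], [0.7], [0.5]])
mask = out.ge(0.5)   # boolean tensor: [[False], [True], [True]] (0.5 >= 0.5 counts)
print(mask.float())  # tensor([[0.], [1.], [1.]]), ready to compare against y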

4. Slacking off? Never heard of it

The matplotlib plotting code I copied wholesale, so I am not entirely clear on how it works; once I have studied it properly I will write it up separately.
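
That said, the boundary line itself falls straight out of the math: the model predicts class 1 exactly when sigmoid(w0*x + w1*y + b) >= 0.5, that is, when w0*x + w1*y + b >= 0, so the decision boundary is the line

w0*x + w1*y + b = 0, or equivalently y = (-w0*x - b) / w1

which is exactly what plot_y = (-w0 * plot_x - b) / w1 draws across the feature range.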