Things haven't been going well lately and my mood is off, but let's keep working on text generation. The Char-RNN I used before has a drawback: you have to provide a prefix, and the model then predicts word by word from that prefix, so the generated text ends up very random. What we want instead is to generate sentences conditioned on our keywords or topics. I read a few papers; at the framework level they are all attention-based, differing only in small details. I plan to implement the frameworks from two of them: Harbin Institute of Technology's Topic-to-Essay Generation with Neural Networks, and Baidu's Chinese Poetry Generation with Planning based Neural Networks.
Update, November 8, 2018: I read the Baidu paper more carefully. The model isn't much different from TAV: it first encodes the keywords bidirectionally with an RNN and then feeds the result in as the first word during training. I've lost interest in implementing it.
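A rough sketch of that encoding step as I read it (the module name and shapes are mine, not the paper's): run a bidirectional GRU over the keyword embeddings, concatenate the two final hidden states, and project back to the embedding size so the result can stand in for the first word.

```python
import torch
import torch.nn as nn

class KeywordEncoder(nn.Module):
    """Encode a keyword sequence bidirectionally into a single embedding-sized vector."""
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden_dim * 2, embed_dim)

    def forward(self, keyword_embeds):
        # keyword_embeds: (batch, num_keywords, embed_dim)
        _, h = self.rnn(keyword_embeds)        # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=1)     # concatenate forward and backward final states
        return self.proj(h)                    # (batch, embed_dim), fed in as the "first word"
```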
The first paper presents three strategies, from simple to complex: Topic-Averaged LSTM (TAV-LSTM), Attention-based LSTM (TAT-LSTM), and Multi-Topic-Aware LSTM (MTA-LSTM).
Strategy-wise, TAV-LSTM simply averages the topic embeddings and uses the result as the prefix for training, so the network design is basically the same as the earlier Char-RNN and is easy to implement. TAT-LSTM runs attention over the topics and concatenates the resulting context with the hidden state as an extra feature fed into the decoder. MTA-LSTM additionally maintains a so-called coverage vector that tracks whether each topic's information has actually been fed in during training.
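As a rough sketch of the coverage idea (my own simplified reading, not the paper's exact \(U_f\) formulation): keep one coverage value per topic, spend it whenever attention is paid to that topic, and use what remains to damp topics that have already been covered.

```python
import torch

def coverage_attention(scores, coverage):
    # scores:   (batch, num_topics) unnormalized attention scores at this step
    # coverage: (batch, num_topics) remaining per-topic budget, initialized to all ones
    weights = coverage * torch.softmax(scores, dim=1)              # damp well-covered topics
    weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)  # renormalize
    coverage = torch.clamp(coverage - weights, min=0.0)            # spend budget on attended topics
    return weights, coverage
```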
The authors released an official MTA-LSTM implementation for a very old TensorFlow version. I don't like TF, and the version is too dated anyway, so I had no choice but to feel my way through a PyTorch version myself. For data I directly used the composition and zhihu datasets provided in that repo.
Candidate words are all chosen greedily here; there is no beam search. Mostly out of laziness; I'll revisit it once the unpleasant stuff has passed.
TAV
How TAV works was sketched above, so I won't repeat it. Being lazy again, I modified the earlier Char-RNN model directly.
First, the usual routine: train word vectors. Incidentally, Tencent open-sourced a set of 8M+ word vectors a while back, which would also work. I won't dwell on this; it's straightforward.
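For reference, a minimal sketch of training vectors with gensim (the pre-4.0 API, where the dimension argument is `size`) and saving them in the plain-text format the loading code below expects; the corpus path and hyperparameters are placeholders:

```python
from gensim.models import Word2Vec

# Assumes each line of the corpus file is already tokenized, words separated by spaces
# (the same tokenization as composition.txt / zhihu.txt).
sentences = [line.split() for line in open('corpus.txt', 'r')]

model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
# Plain-text word2vec format, readable by KeyedVectors.load_word2vec_format below.
model.wv.save_word2vec_format('vec.txt', binary=False)
```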
Then comes data preprocessing. First we add four special tokens: PAD, BOS, EOS, and UNK. All standard practice.
```python
import numpy as np
import torch
from gensim.models import KeyedVectors

fvec = KeyedVectors.load_word2vec_format('vec.txt', binary=False)
word_vec = fvec.vectors
vocab = ['<PAD>', '<BOS>', '<EOS>', '<UNK>']
vocab.extend(list(fvec.vocab.keys()))
# prepend four all-zero rows for the special tokens
word_vec = np.concatenate((np.array([[0] * word_vec.shape[1]] * 4), word_vec))
word_vec = torch.tensor(word_vec)
```
Then build the idx-to-word and word-to-idx mappings.
```python
word_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_word = {i: ch for i, ch in enumerate(vocab)}
```
Then read the data and build an iterator.
```python
essays = []
topics = []
with open('composition.txt', 'r') as f:
    for line in f:
        essay, topic = line.replace('\n', '').split(' </d> ')
        essays.append(essay.split(' '))
        topics.append(topic.split(' '))

corpus_indice = list(map(lambda x: [word_to_idx[w] for w in x], essays[:8000]))
topics_indice = list(map(lambda x: [word_to_idx[w] for w in x], topics[:8000]))
length = list(map(lambda x: len(x), corpus_indice))

def tav_data_iterator(corpus_indice, topics_indice, batch_size, num_steps):
    epoch_size = len(corpus_indice) // batch_size
    for i in range(epoch_size):
        raw_data = corpus_indice[i*batch_size: (i+1)*batch_size]
        key_words = topics_indice[i*batch_size: (i+1)*batch_size]
        data = np.zeros((len(raw_data), num_steps+1), dtype=np.int64)
        for j in range(batch_size):
            doc = raw_data[j]
            tmp = [1]          # <BOS>
            tmp.extend(doc)
            tmp.extend([2])    # <EOS>
            tmp = np.array(tmp, dtype=np.int64)
            _size = tmp.shape[0]
            data[j][:_size] = tmp
        key_words = np.array(key_words, dtype=np.int64)
        x = data[:, 0:num_steps]
        y = data[:, 1:]
        mask = np.float32(x != 0)   # 0 is <PAD>
        x = torch.tensor(x)
        y = torch.tensor(y)
        mask = torch.tensor(mask)
        key_words = torch.tensor(key_words)
        yield (x, y, mask, key_words)
```
This is handled crudely too; plenty of details to refine later. As for the mask, I was too lazy to actually use it. It marks which positions are real words to train on and which are padding; instead, I simply set the weight of PAD to 0 in the loss function further down.
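For reference, here is a sketch of what using the mask directly would look like, assuming the mask is flattened in the same time-major order as the targets (e.g. `mask.t().reshape(-1)`):

```python
import torch.nn.functional as F

def masked_cross_entropy(output, target, mask):
    # output: (seq_len * batch, vocab_size) logits from the decoder
    # target: (seq_len * batch,) gold indices
    # mask:   (seq_len * batch,) 1.0 for real tokens, 0.0 for <PAD>
    loss = F.cross_entropy(output, target, reduction='none')  # per-token loss
    return (loss * mask).sum() / mask.sum()                    # average over non-padding tokens only
```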
Then define the network.
```python
class TAVLSTM(nn.Module):
    def __init__(self, hidden_dim, embed_dim, num_layers, weight,
                 num_labels, bidirectional, dropout=0.5, **kwargs):
        super(TAVLSTM, self).__init__(**kwargs)
        self.hidden_dim = hidden_dim
        self.embed_dim = embed_dim
        self.num_layers = num_layers
        self.num_labels = num_labels
        self.bidirectional = bidirectional
        if num_layers <= 1:
            self.dropout = 0
        else:
            self.dropout = dropout
        self.embedding = nn.Embedding.from_pretrained(weight)
        self.embedding.weight.requires_grad = False
        self.rnn = nn.GRU(input_size=self.embed_dim, hidden_size=self.hidden_dim,
                          num_layers=self.num_layers, bidirectional=self.bidirectional,
                          dropout=self.dropout)
        if self.bidirectional:
            self.decoder = nn.Linear(hidden_dim * 2, self.num_labels)
        else:
            self.decoder = nn.Linear(hidden_dim, self.num_labels)

    def forward(self, inputs, topics, hidden=None):
        embeddings = self.embedding(inputs)
        topics_embed = self.embedding(topics)
        topics_embed = topics_embed.mean(dim=1)
        for i in range(embeddings.shape[0]):
            embeddings[i][0] = topics_embed[i]
        states, hidden = self.rnn(embeddings.permute([1, 0, 2]).float(), hidden)
        outputs = self.decoder(states.reshape((-1, states.shape[-1])))
        return (outputs, hidden)

    def init_hidden(self, num_layers, batch_size, hidden_dim, **kwargs):
        hidden = torch.zeros(num_layers, batch_size, hidden_dim)
        return hidden
```
The basic structure is unchanged; the only small change is in forward, where the first word's embedding is replaced by the topic average.
Then define the prediction function:
```python
def predict_rnn(topics, num_chars, model, device, idx_to_word, word_to_idx):
    output = [1]  # start from <BOS>
    topics = [word_to_idx[x] for x in topics]
    topics = torch.tensor(topics).unsqueeze(0)  # batch dimension of 1, matching training
    hidden = torch.zeros(num_layers, 1, hidden_dim)
    if use_gpu:
        hidden = hidden.to(device)
        topics = topics.to(device)
    for t in range(num_chars):
        X = torch.tensor(output).reshape((1, len(output)))
        if use_gpu:
            X = X.to(device)
        pred, hidden = model(X, topics, hidden)
        if pred.argmax(dim=1)[-1] == 2:  # stop at <EOS>
            break
        else:
            output.append(int(pred.argmax(dim=1)[-1]))
    return ''.join([idx_to_word[i] for i in output[1:]])
```
Set the hyperparameters:
```python
embedding_dim = 300
hidden_dim = 256
lr = 1e2
momentum = 0.0
num_epoch = 100
use_gpu = True
num_layers = 1
bidirectional = False
batch_size = 8
device = torch.device('cuda:0')
loss_function = nn.CrossEntropyLoss()

model = TAVLSTM(hidden_dim=hidden_dim, embed_dim=embedding_dim, num_layers=num_layers,
                num_labels=len(vocab), weight=word_vec, bidirectional=bidirectional)
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
if use_gpu:
    model.to(device)
```
Then just train:
```python
since = time.time()
for epoch in range(num_epoch):
    start = time.time()
    num, total_loss = 0, 0
    data = tav_data_iterator(corpus_indice, topics_indice, batch_size, max(length)+1)
    hidden = model.init_hidden(num_layers, batch_size, hidden_dim)
    weight = torch.ones(len(vocab))
    weight[0] = 0   # ignore <PAD> in the loss
    for X, Y, mask, topics in tqdm(data):
        num += 1
        hidden.detach_()
        if use_gpu:
            X = X.to(device)
            Y = Y.to(device)
            mask = mask.to(device)
            topics = topics.to(device)
            hidden = hidden.to(device)
            weight = weight.to(device)
        optimizer.zero_grad()
        output, hidden = model(X, topics, hidden)
        l = F.cross_entropy(output, Y.t().reshape((-1,)), weight)
        l.backward()
        norm = nn.utils.clip_grad_norm_(model.parameters(), 1e-2)
        optimizer.step()
        total_loss += l.item()
    end = time.time()
    # split elapsed time since training started into h / m / s
    s = end - since
    h = math.floor(s / 3600)
    m = math.floor((s - h * 3600) / 60)
    s = s - h * 3600 - m * 60
    if (epoch % 10 == 0) or (epoch == (num_epoch - 1)):
        print('epoch %d/%d, loss %.4f, norm %.4f, time %.3fs, since %dh %dm %ds'
              % (epoch+1, num_epoch, total_loss / num, norm, end-start, h, m, s))
        print(predict_rnn(['妈妈', '希望', '长大', '孩子', '母爱'], 100, model, device,
                          idx_to_word, word_to_idx))
```
For the full details, see my notebook. The gradients still exploded, though, sigh.
TAT
This is just TAV with modifications; go straight to the notebook. The model is a perfect expression of my inner state, which is currently TAT.
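For reference, a minimal sketch of the attention step as I understand it from the paper's description (names and shapes are mine, not the notebook's): score each topic embedding against the current hidden state, softmax over the topics, and concatenate the weighted topic context with the hidden state before the output layer.

```python
import torch
import torch.nn as nn

class TopicAttention(nn.Module):
    """Attend over topic embeddings given the decoder's current hidden state."""
    def __init__(self, hidden_dim, embed_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim + embed_dim, 1)

    def forward(self, hidden, topics_embed):
        # hidden:       (batch, hidden_dim)            current decoder state
        # topics_embed: (batch, num_topics, embed_dim) embeddings of the topic words
        num_topics = topics_embed.shape[1]
        expanded = hidden.unsqueeze(1).expand(-1, num_topics, -1)
        scores = self.attn(torch.cat([expanded, topics_embed], dim=2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)                      # (batch, num_topics)
        context = (weights.unsqueeze(2) * topics_embed).sum(dim=1)  # (batch, embed_dim)
        # concatenate the topic context with the hidden state as the decoder feature
        return torch.cat([hidden, context], dim=1), weights
```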
MTA
I couldn't work out what \(U_f\) in the paper is supposed to mean, so I improvised. Simply put, to give every topic a chance to appear, the natural idea is to adjust the attention weights: push the high ones down a bit and lift the low ones up. So I adjust them at the end of every epoch:

```python
params = model.state_dict()
params['attn.weight'].clamp_(0)
params['attn.weight'] *= 1 / -torch.log(params['attn.weight'] / torch.sum(params['attn.weight']) + 0.000001)
```
In other words, each attention weight gets a factor in front of it, and that factor is \(-\log(p+\lambda)\). The \(\lambda\) is there to avoid hitting zero, and \(p\) is the share of this topic's attention among all topics' attention. It might also be worth trying this update every iteration instead of every epoch to see what happens.
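As a quick sanity check of the stated \(-\log(p+\lambda)\) factor with \(\lambda = 10^{-6}\): \(-\log(0.5 + 10^{-6}) \approx 0.69\) while \(-\log(0.1 + 10^{-6}) \approx 2.30\), so a topic holding only 10% of the attention mass gets scaled up noticeably more than one holding 50%.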
See the notebook for details.
One more note: the full softmax was too slow, so I switched to PyTorch's adaptive softmax, built in since version 0.4.1. It's a softmax speed-up whose paper claims better GPU performance than hierarchical softmax, though I didn't fully understand the paper. There's a blog post by a foreigner that roughly explains the idea, without proving why it's better. Roughly: sort all words by frequency, split them into a high-frequency group and a low-frequency group, split the low-frequency group further into two to four clusters, and first decide which cluster a word falls into.
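For reference, a minimal sketch of how PyTorch's built-in `nn.AdaptiveLogSoftmaxWithLoss` would replace the Linear decoder plus cross entropy; the hidden size, vocabulary size, and cutoffs here are placeholders, and the cutoffs only make sense if the vocabulary is sorted by descending frequency:

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size = 256, 50000            # placeholder sizes
# cutoffs split the frequency-sorted vocabulary into a head cluster and three tail clusters
adaptive = nn.AdaptiveLogSoftmaxWithLoss(in_features=hidden_dim,
                                         n_classes=vocab_size,
                                         cutoffs=[2000, 10000, 30000])

states = torch.randn(32, hidden_dim)           # decoder hidden states, one per target token
targets = torch.randint(0, vocab_size, (32,))  # gold next-word indices
out = adaptive(states, targets)                # replaces decoder Linear + F.cross_entropy
loss = out.loss                                # mean negative log likelihood
log_probs = adaptive.log_prob(states)          # full (32, vocab_size) log-probs, e.g. for generation
```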
Also, the TAT model was tweaked to require that all attention weights be greater than or equal to zero, and MTA is built on top of that.
That's about it. Nothing is going smoothly lately; I need to see a doctor soon, and I hope it's nothing serious.