April 23, 2018 - Posted by Sara Robinson
Can you put a price on "elegant, fine tannins," "ripe aromas of cassis," or "dense and toasty"? It turns out a machine learning model can. In this post I'll explain how I used Keras (tf.keras) to build a wide and deep network that predicts the price of a wine from its description. For those new to Keras, it's a higher-level TensorFlow API for building ML models. A…
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
layers = keras.layers
# This code was tested with TensorFlow v1.7
print("You have TensorFlow version", tf.__version__)
Since our model's output (the prediction) is a numeric price, we'll feed the price values directly into the model for training and evaluation. The full code for this model is available on GitHub; here I'll focus on the key points. First, let's download the data and convert it to a Pandas dataframe:
!wget -q https://storage.googleapis.com/sara-cloud-ml/wine_data.csv
data = pd.read_csv("wine_data.csv")
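Before splitting, it can help to drop incomplete rows and shuffle the data. This is a minimal sketch, not part of the code shown in this post; it assumes the same column names (description, variety, price) used below:
# Optional cleanup (sketch): drop rows missing any field we use, then shuffle
data = data.dropna(subset=['description', 'variety', 'price'])
data = data.sample(frac=1, random_state=42).reset_index(drop=True)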
Next, we'll split it into train and test sets and pull out the features and labels:
train_size = int(len(data) * .8)
# Train features
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]
# Train labels
labels_train = data['price'][:train_size]
# Test features
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]
# Test labels
labels_test = data['price'][train_size:]
Think of the input to a bag of words model as a Scrabble board, except that each tile holds a word (rather than a letter) drawn from our input descriptions.
To build the bag of words vocabulary, we'll use the Keras Tokenizer class:
vocab_size = 12000
tokenize = keras.preprocessing.text.Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(description_train) # only fit on train
Then we'll use the texts_to_matrix function to convert each description to a bag of words vector:
description_bow_train = tokenize.texts_to_matrix(description_train)
description_bow_test = tokenize.texts_to_matrix(description_test)
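To make the representation concrete, here is a toy illustration, not part of the original code, of what texts_to_matrix returns. The example strings are made up, and the exact word indices depend on the fitted vocabulary:
# Toy example: each text becomes a row of length num_words,
# with 1s at the indices of the words it contains
toy = keras.preprocessing.text.Tokenizer(num_words=10)
toy.fit_on_texts(["ripe black currant aromas", "rich toasty aromas"])
print(toy.texts_to_matrix(["ripe toasty aromas"]))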
# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(variety_train)
variety_train = encoder.transform(variety_train)
variety_test = encoder.transform(variety_test)
num_classes = np.max(variety_train) + 1
# Convert labels to one hot
variety_train = keras.utils.to_categorical(variety_train, num_classes)
variety_test = keras.utils.to_categorical(variety_test, num_classes)
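As a quick illustration, not from the original post, here's what this encoding does on a tiny hypothetical list of varieties:
# Toy example: strings -> integer ids -> one-hot rows
toy_encoder = LabelEncoder()
toy_ids = toy_encoder.fit_transform(['Pinot Noir', 'Riesling', 'Pinot Noir'])
print(toy_ids)  # [0 1 0]
print(keras.utils.to_categorical(toy_ids, 2))  # one one-hot row per wine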
Now we're ready to build the wide model.
bow_inputs = layers.Input(shape=(vocab_size,))
variety_inputs = layers.Input(shape=(num_classes,))
merged_layer = layers.concatenate([bow_inputs, variety_inputs])
merged_layer = layers.Dense(256, activation='relu')(merged_layer)
predictions = layers.Dense(1)(merged_layer)
wide_model = keras.Model(inputs=[bow_inputs, variety_inputs], outputs=predictions)
Then we'll compile the model so it's ready to use:
wide_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
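If you want to sanity-check the layer shapes at this point, the standard Keras summary() call prints each layer's output shape and parameter count (an optional step, not in the original code):
wide_model.summary()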
If we were using the wide model on its own, this is where we'd train it with fit() and evaluate it with evaluate(). Since we're going to combine it with the deep model later, we can hold off on training until the two models are joined. Time to build our deep model! For the deep model we'll represent each description as a sequence of word integers that we can feed into an embedding layer; the Keras texts_to_sequences method does this for us:
train_embed = tokenize.texts_to_sequences(description_train)
test_embed = tokenize.texts_to_sequences(description_test)
Now that we've got integerized description vectors, we need to make sure they all have the same length so we can feed them to our model. Keras has a handy method for that too. We'll use pad_sequences to add zeros to each description vector so they're all the same length (I used 170 as the max length so that no descriptions get cut off):
max_seq_length = 170
train_embed = keras.preprocessing.sequence.pad_sequences(train_embed, maxlen=max_seq_length)
test_embed = keras.preprocessing.sequence.pad_sequences(test_embed, maxlen=max_seq_length)
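As a small illustration with made-up sequences (not from the original post), pad_sequences left-pads shorter sequences with zeros by default:
# Toy example: both rows come back with length maxlen
print(keras.preprocessing.sequence.pad_sequences([[3, 7], [5, 2, 9, 4]], maxlen=5))
# [[0 0 0 3 7]
#  [0 5 2 9 4]]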
With our descriptions converted to vectors of the same length, we're ready to create the embedding layer and feed it into the deep model.
deep_inputs = layers.Input(shape=(max_seq_length,))
embedding = layers.Embedding(vocab_size, 8, input_length=max_seq_length)(deep_inputs)
embedding = layers.Flatten()(embedding)
Once the embedding layer is flattened, it's ready to feed into the model and be compiled.
embed_out = layers.Dense(1, activation='linear')(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
deep_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
With both models defined, combining them is simple: we concatenate their outputs and add a final dense layer on top.
merged_out = layers.concatenate([wide_model.output, deep_model.output])
merged_out = layers.Dense(1)(merged_out)
combined_model = keras.Model(wide_model.input + [deep_model.input], merged_out)
combined_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
With that, it's time to run training and evaluation. You can experiment with the number of epochs and the batch size to see what works best for your dataset.
# Training
combined_model.fit([description_bow_train, variety_train] + [train_embed], labels_train, epochs=10, batch_size=128)
# Evaluation
combined_model.evaluate([description_bow_test, variety_test] + [test_embed], labels_test, batch_size=128)
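One way to make the evaluation number easier to read (a sketch, not in the original post): with an 'mse' loss, the square root of the returned loss is roughly the typical price error in dollars.
# Capture the evaluation results: evaluate() returns [loss, accuracy] here
loss, acc = combined_model.evaluate(
    [description_bow_test, variety_test] + [test_embed], labels_test, batch_size=128)
print("Approximate RMSE in dollars:", np.sqrt(loss))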
To generate predictions from our trained model, we can call the predict() method and pass it our test dataset (in a future post I'll cover how to get predictions from raw text input):
predictions = combined_model.predict([description_bow_test, variety_test] + [test_embed])
Then we'll compare the predictions to the actual prices for the first 15 wines in the test dataset:
for i in range(15):
val = predictions[i]
print(description_test[i])
print(val[0], 'Actual: ', labels_test.iloc[i], '\n')
How did the model do? Let's look at three examples from the test set.
Powerful vanilla scents rise from the glass, but the fruit, even in this difficult vintage, comes out immediately. It's tart and sharp, with a strong herbal component, and the wine snaps into focus quickly with fruit, acid, tannin, herb and vanilla in equal proportion. Firm and tight, still quite young, this wine needs decanting and/or further bottle age to show its best.
Predicted: 46.233624 Actual: 45.0
A good everyday wine. It's dry, full-bodied and has enough berry-cherry flavors to get by, wrapped into a smooth texture.
Predicted: 9.694958 Actual: 10.0
Here's a modern, round and velvety Barolo (from Monforte d'Alba) that will appeal to those who love a thick and juicy style of wine. The aromas include lavender, allspice, cinnamon, white chocolate and vanilla. Tart berry flavors backed by crisp acidity and firm tannins give the mouthfeel determination and grit.
Predicted: 41.028854 Actual: 49.0
Pretty good! It turns out there is some relationship between a wine's description and its price. We may not be able to see it intuitively, but our ML model can.