Paper Title
The Impact of Feature Quantity on Recommendation Algorithm Performance: A Movielens-100K Case Study
Paper Authors
Paper Abstract
Recent model-based Recommender Systems (RecSys) algorithms emphasize the use of features, also called side information, in their design, similar to algorithms in Machine Learning (ML). In contrast, some of the most popular traditional RecSys algorithms focus solely on a given user-item-rating relation without including side information. The goal of this case study is to provide a performance comparison and assessment of RecSys and ML algorithms when side information is included. We chose the Movielens-100K data set because it is a standard for comparing RecSys algorithms. We compared six feature sets with varying numbers of features, all generated from the baseline data, and evaluated them on a total of 19 RecSys algorithms, baseline ML algorithms, Automated Machine Learning (AutoML) pipelines, and state-of-the-art RecSys algorithms that incorporate side information. The results show that additional features benefit all algorithms we evaluated. However, the relationship between feature quantity and performance is not monotonic for AutoML and RecSys. In these categories, an analysis of feature importance revealed that the quality of features matters more than their quantity. Throughout our experiments, the average performance on the feature set with the fewest features is about 6% worse, in terms of Root Mean Squared Error (RMSE), than on the feature set with the most features. An interesting observation is that AutoML outperforms matrix-factorization-based RecSys algorithms when additional features are used. Almost all algorithms that can include side information perform better when using the largest number of features. In the other cases, the performance difference is negligible (<1%). The results show a clear positive trend for the effect of feature quantity, as well as an important effect of feature quality, on the evaluated algorithms.
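The comparison above is stated in terms of Root Mean Squared Error and a relative difference between feature sets. The following is a minimal sketch of how those two quantities could be computed; the function names (`rmse`, `relative_difference_pct`) and all rating values are hypothetical illustrations and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length rating sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_difference_pct(rmse_a, rmse_b):
    """Percent by which rmse_a exceeds rmse_b (positive means a is worse)."""
    return (rmse_a - rmse_b) / rmse_b * 100.0


# Hypothetical ratings and predictions, NOT values from the study.
ratings = [4.0, 3.0, 5.0, 2.0]
preds_few_features = [3.0, 3.0, 4.0, 3.0]    # assumed model with few features
preds_many_features = [3.5, 3.0, 4.5, 2.5]   # assumed model with many features

rmse_few = rmse(ratings, preds_few_features)
rmse_many = rmse(ratings, preds_many_features)
print(f"RMSE (few features):  {rmse_few:.4f}")
print(f"RMSE (many features): {rmse_many:.4f}")
print(f"Relative difference:  {relative_difference_pct(rmse_few, rmse_many):.1f}%")
```

Because lower RMSE is better, a positive relative difference here means the first configuration performs worse than the second, which is the sense in which the abstract reports a gap of about 6%.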
