本文共 21354 字,大约阅读时间需要 71 分钟。
这个很全面:
协同过滤的缺点是
热点相关内容 往往变成了 其他同期出现的热点内容 .
先划分类别(比如SVD),再计算相关,效果往往更好.
1. CF算法(Collaborative Filtering)
http://en.wikipedia.org/wiki/Collaborative_filtering
转载:
协同过滤技术可以分为三类:
基于用户(User-based)的协同过滤
基于项目(Item-based)的协同过滤
基于模型(Model- based)的协同过滤
User-based
http://www.guwendong.cn/post/2006/user_based_collaborative_filtering.html
转载:
步骤一,收集可以代表用户兴趣的信息。
步骤二,最近邻搜索
步骤三,生成推荐结果。
Item-based
http://www.guwendong.cn/post/2006/item_based_collaborative_filtering.html
转载:
Item-based 方法需要同样的三个步骤:
1)得到User-item的评分数据;
2)针对项的最近邻搜索,即对项进行相似度计算
3)产生推荐
相对于 User-based 方法,Item-based 方法最大的改进是提高了协同过滤方法的扩展性及性能。
从上一篇中可以看到,在 User-based 方法中,随着用户数量的不断增多,在大数量级的用户范围内进行“最近邻搜索”会成为整个算法的瓶颈。
Item-based 方法通过计算项之间的相似性来代替用户之间的相似性。对于项来讲,它们之间的相似性要稳定很多,因此可以离线完成工作量最大的相似性计算步骤,从而大大降低了在线计算量,提高推荐效率。
在 Item-based 方法中,要对 A 和 B 进行项相似性计算,通常分为两步:
1)找出同时对 A 和 B 打过分的组合;
2)对这些组合进行相似度计算,常用的算法包括:
皮尔森相关系数(Person Correlation Coefficient)
余弦相似性(Cosine-based Similarity)
调整余弦相似性(Adjusted Cosine Similarity)
http://www.cnblogs.com/kuber/archive/2008/06/10/1216846.html
Slope One :简单高效的协同过滤算法(Collaborative Filtering)
转载:
Slope One;和其它类似算法相比, 它的最大优点在于算法很简单, 易于实现, 执行效率高, 同时推荐的准确性相对很高;
基本概念
Slope One的基本概念很简单, 例子1, 用户X, Y和A都对Item1打了分. 同时用户X,Y还对Item2打了分, 用户A对Item2可能会打多少分呢?
User Rating to Item 1 Rating to Item 2
X 5 3
Y 4 3
A 4 ?
根据SlopeOne算法, 应该是:4 - ((5-3) + (4-2))/2 = 2.5.
解释一下. 用户X对Item1的rating是5, 对Item2的rating是3, 那么他可能认为Item2应该比Item1少两分. 同时用户Y认为Item2应该比Item1少1分. 据此我们知道所有对Item1和Item2都打了分的用户认为Item2会比Item1平均少1.5分. 所以我们有理由推荐用户A可能会对Item2打(4-1.5)=2.5分;
加权算法
接下来我们看看加权算法(Weighted Slope One). 如果有100个用户对Item1和Item2都打过分, 有1000个用户对Item3和Item2也打过分. 显然这两个rating差的权重是不一样的. 因此我们的计算方法是
(100*(Rating 1 to 2) + 1000(Rating 3 to 2)) / (100 + 1000)
上面讨论的是用户只对项目的喜好程度打分.还有一种情况下用户也可以对项目的厌恶程度打分. 这时可以使用双极SlopeOne算法(BI-Polar SlopeOne).
下面是实现:
During a lunchtime conversation the other day, a coworker mentioned that he was hacking in his spare time on an entry for the . This got me to thinking about collaborative filtering: why had I never seen a good description of how to do it? I suspect that people who might ordinarily have a casual interest in the subject hear that there are some statistics involved, whereupon they immediately freeze in the mathematical headlights, and turn the conversation to something else, anything else. In early 2005, a researcher named published, with Anna Maclachlan, a paper with the jazzy title of ““. This is an important paper, because it presents a family ofreally simple collaborative filtering schemes. I mean really simple: there are no statistics involved, just a little bit of linear algebra. Unfortunately, because the Lemire/Maclachlan paper is aimed at an academic audience, it’s not trivial to tell by merely reading it how simple the technique it presents is, so what follows is my attempt to translate its ideas into my native language of Attention Deficit Programmerese. To make things more concrete, I’m going to present an implementation in less than 40 lines of Python (and I’ll try to explain any obscure bits of Python as I go). Cool, huh? Regardless of the underlying implementation, collaborative filters tend to try to solve the same broad problem using much the same data.
class SlopeOne(object):
def __init__(self):
self.diffs = {}
self.freqs = {}The self.diffs matrix will hold differential ratings (I’ll describe what these are soon), and the self.freqs matrix will contain the number of times we’ve computed a differential rating for each pair of items. We’ll fill in the first and second levels of the dicts once we know what our data looks like, so let’s talk about the shape of the data. We need to be able to express ratings such as “Alice rated a squid as 1, and a cuttlefish as 4″. This is easy to express with two levels of dicts; the first maps from a user to all of that user’s ratings, and the second maps items to their ratings. In other words, our user data looks like this in Python:
userdata = dict(alice=dict(squid=1, cuttlefish=4)),
bob=dict(squid=3, octopus=2))The time has come to load the data into our matrices; let’s call the method update.
def update(self, data):To fill in the matrices, we must look at each rating from every user.
for user, ratings in data.iteritems():For each user and rating they have submitted, we will be iterating over every rating they have submitted a second time. The first thing we do is make sure that the necessary second-level dict s in each matrix exist.
for item, rating in ratings.iteritems():
self.freqs.setdefault(item, {})
self.diffs.setdefault(item, {})The setdefault call is an unfortunately ugly Pythonic idiom that means “if there’s no value for the given key (the first parameter) in the dict , assign it the given value (the second parameter)”. Inside this loop, we iterate over the user’s ratings again, to compare each rating against every other.
for item2, rating2 in ratings.iteritems():
self.freqs[item].setdefault(item2, 0)
self.diffs[item].setdefault(item2, 0.0)Here, we ensure that there are zeroes present in the matrix for every pair of ratings. (Notice that we’re using an integer zero to count frequencies, and a floating point zero for ratings.) For every pair of ratings, we modify the appropriate cell in self.freqs by adding one to the number of times we’ve seen that pair of ratings.
self.freqs[item][item2] += 1And to update the self.diffs matrix, we compute the difference between the ratings for the two items.
self.diffs[item][item2] += rating - rating2This is why I called self.diffs the “differential rating” matrix. Every time a user rates an item, we update the appropriate cells in this matrix to reflect the accumulated difference between that rating and every other rating the user has made. Once we’re done, we scale each differential rating based on the number of times we’ve seen that pair of items.
for item, ratings in self.diffs.iteritems():
for item2, rating in ratings.iteritems():
ratings[item2] /= self.freqs[item][item2]Now let’s pull the whole thing together, so you can see how simple this update method is.
def update(self, data):
for user, prefs in data.iteritems():
for item, rating in prefs.iteritems():
self.freqs.setdefault(item, {})
self.diffs.setdefault(item, {})
for item2, rating2 in prefs.iteritems():
self.freqs[item].setdefault(item2, 0)
self.diffs[item].setdefault(item2, 0.0)
self.freqs[item][item2] += 1
self.diffs[item][item2] += item - item2
for item, ratings in self.diffs.iteritems():
for item2, rating in ratings.iteritems():
ratings[item2] /= self.freqs[item][item2]That’s a measly 13 lines of code, 4 of which are due to us using dict objects to represent matrices. Nice! (By the way, notice that the update method made no use of the user IDs we passed in. This technique doesn’t care who rated what , just how an item has been rated. Collaborative filtering techniques that don’t use user information directly are called “item-based”, while those that do are called “user-based”.) It’s easy to fill out our matrices:
s = SlopeOne()
s.update(dict(alice=dict(squid=1.0, cuttlefish=4.0)),
bob=dict(squid=1.0, octupus=3.0)))With a little data in place, we’d like to make a prediction. To do this, we’ll pass the ratings a user has made to the predict method; it will return predictions of ratings for items the user hasn’t rated.
def predict(self, userprefs):We begin with two empty dict s; preds will accumulate prediction data, and freqs will accumulate counts.
preds, freqs = {}, {}We iterate over every item the user has rated, for items that have been rated by anyone (the try/except clause below lets us skip over unrated item pairs).
for item, rating in userprefs.iteritems():
for diffitem, diffratings in self.diffs.iteritems():
try:
freq = self.freqs[diffitem][item]
except KeyError:
continueWe make sure there are zeroes present in our dict s before we accumulate anything into them.
preds.setdefault(diffitem, 0.0)
freqs.setdefault(diffitem, 0)The accumulation step is simple.
preds[diffitem] += freq * (diffratings[item] + rating)
freqs[diffitem] += freqWe must filter the result a little before we return it. We return a new dict , containing weighted predictions for every item, but omitting items that the user has rated (no point in predicting those), and omitting items for which we have a cumulative frequency of zero.
return dict([(item, value / freqs[item])
for item, value in preds.iteritems()
if item not in userprefs and freqs[item] > 0])And now let’s see the predict method without all of those intrusive comments, again to see how simple it is.
def predict(self, userprefs):
preds, freqs = {}, {}
for item, rating in userprefs.iteritems():
for diffitem, diffratings in self.diffs.iteritems():
try:
freq = self.freqs[diffitem][item]
except KeyError:
continue
preds.setdefault(diffitem, 0.0)
freqs.setdefault(diffitem, 0)
preds[diffitem] += freq * (diffratings[item] + rating)
freqs[diffitem] += freq
return dict([(item, value / freqs[item])
for item, value in preds.iteritems()
if item not in userprefs and freqs[item] > 0])Our proof-of-concept Weighted Slope One predictor clocks in at 33 lines of Python. If we had a sparse 2D matrix class handy, we could trim six of those lines. That’s not bad! It wouldn’t be difficult to go from this trivial implementation to something truly efficient. From the way we construct the matrices in the update method, we can make a few potentially space saving observations.
这是源码:
I renamed the variables and then reorganized the code a bit.
The(update,SlopeOne,predict) definitions can be compared with the newer (update2,SlopeOne',predict') definitions.
module WeightedSlopeOne (Rating, SlopeOne, empty, predict, update) where import Data.List (foldl',foldl1')import qualified Data.Map as M -- The item type is a polymorphic parameter. Since it goes into a Map-- it must be able to be compared, so item must be an instance of Ord.type Count = Inttype RatingValue = Double-- The Rating is the known (item,Rating) information for a particular "user"type Rating item = M.Map item RatingValue -- The SlopeOne matrix is indexed by pairs of items and is implemented-- as a sparse map of maps.newtype SlopeOne item = SlopeOne (M.Map item (M.Map item (Count,RatingValue))) deriving (Show) -- The SlopeOne' matrix is an un-normalized version of SlopeOnenewtype SlopeOne' item = SlopeOne' (M.Map item (M.Map item (Count,RatingValue))) deriving (Show) empty = SlopeOne M.emptyempty' = SlopeOne' M.empty -- This performs a strict addition on pairs made of two nuumeric typesaddT (a,b) (c,d) = let (l,r) = (a+c, b+d) in l `seq` r `seq` (l, r) -- There is never an entry for the "diagonal" elements with equal-- items in the pair: (foo,foo) is never in the SlopeOne.update :: Ord item => SlopeOne item -> [Rating item] -> SlopeOne itemupdate (SlopeOne matrixInNormed) usersRatings = SlopeOne . M.map (M.map norm) . foldl' update' matrixIn $ usersRatings where update' oldMatrix userRatings = foldl' (\oldMatrix (itemPair, rating) -> insert oldMatrix itemPair rating) oldMatrix itemCombos where itemCombos = [ ((item1, item2), (1, rating1 - rating2)) | (item1, rating1) <- ratings , (item2, rating2) <- ratings , item1 /= item2] ratings = M.toList userRatings insert outerMap (item1, item2) newRating = M.insertWith' outer item1 newOuterEntry outerMap where newOuterEntry = M.singleton item2 newRating outer _ innerMap = M.insertWith' addT item2 newRating innerMap norm (count,total_rating) = (count, total_rating / fromIntegral count) un_norm (count,rating) = (count, rating * fromIntegral count) matrixIn = M.map (M.map un_norm) matrixInNormed -- This version of update2 makes an unnormalize slopeOne' from each-- Rating and combines them using Map.union* operations and addT.update2 :: Ord item => SlopeOne' item -> [Rating item] -> SlopeOne' itemupdate2 s@(SlopeOne' matrixIn) usersRatingsIn | null usersRatings = s | otherwise = SlopeOne' . M.unionsWith (M.unionWith addT) . (matrixIn:) . map fromRating $ usersRatings where usersRatings = filter ((1<) . M.size) usersRatingsIn fromRating userRating = M.mapWithKey expand1 userRating where expand1 item1 rating1 = M.mapMaybeWithKey expand2 userRating where expand2 item2 rating2 | item1 == item2 = Nothing | otherwise = Just (1,rating1 - rating2) predict :: Ord a => SlopeOne a -> Rating a -> Rating apredict (SlopeOne matrixIn) userRatings = let freqM = foldl' insert M.empty [ (item1,found_rating,user_rating) | (item1,innerMap) <- M.assocs matrixIn , M.notMember item1 userRatings , (user_item, user_rating) <- M.toList userRatings , item1 /= user_item , found_rating <- M.lookup user_item innerMap ] insert oldM (item1,found_rating,user_rating) = let (count,norm_rating) = found_rating total_rating = fromIntegral count * (norm_rating + user_rating) in M.insertWith' addT item1 (count,total_rating) oldM normM = M.map (\(count, total_rating) -> total_rating / fromIntegral count) freqM in M.filter (\norm_rating -> norm_rating > 0) normM -- This is a modified version of predict. It also expect the-- unnormalized SlopeOne' but this is a small detailpredict' :: Ord a => SlopeOne' a -> Rating a -> Rating apredict' (SlopeOne' matrixIn) userRatings = M.mapMaybe calcItem (M.difference matrixIn userRatings) where calcItem innerMap | M.null combined = Nothing | norm_rating <= 0 = Nothing | otherwise = Just norm_rating where combined = M.intersectionWith weight innerMap userRatings (total_count,total_rating) = foldl1' addT (M.elems combined) norm_rating = total_rating / fromIntegral total_count weight (count,rating) user_rating = (count,rating + fromIntegral count * user_rating) userData :: [Rating String]userData = map M.fromList [ [("squid", 1.0), ("cuttlefish", 0.5), ("octopus", 0.2)], [("squid", 1.0), ("octopus", 0.5), ("nautilus", 0.2)], [("squid", 0.2), ("octopus", 1.0), ("cuttlefish", 0.4), ("nautilus", 0.4)], [("cuttlefish", 0.9), ("octopus", 0.4), ("nautilus", 0.5)] ] userInfo = M.fromList [("squid", 0.4),("cuttlefish",0.9),("dolphin",1.0)] predictions = predict (update empty userData) userInfo predictions' = predict' (update2 empty' userData) userInfo
The changes to SlopeOne/update/predict below use a different internal data structure for storing the sparse matrix of SlopeOne. Instead of a Map of Map design it uses a Map of List design and keeps the List in distinct ascending form. The list values are a strict (data Tup) type which should help save space compared to the previous inner Map design and yet efficiently provide all the operations needed by update and predict.
Much of the logic of prediction is in the computeRating helper function.
-- The SlopeOne matrix is indexed by pairs of items and is implemented-- as a sparse map of distinct ascending lists. The 'update' and-- 'predict' functions do not need the inner type to actually be a-- map, so the list saves space and complexity.newtype SlopeOne item = SlopeOne (M.Map item [Tup item]) deriving (Show) -- Strict triple tuple type for SlopeOne internalsdata Tup item = Tup { itemT :: !item, countT :: !Count, ratingT :: !RatingValue } deriving (Show) empty :: SlopeOne itemempty = SlopeOne M.empty update :: Ord item => SlopeOne item -> [Rating item] -> SlopeOne itemupdate s@(SlopeOne matrixIn) usersRatingsIn | null usersRatings = s | otherwise = SlopeOne . M.unionsWith mergeAdd . (matrixIn:) . map fromRating $ usersRatings where usersRatings = filter ((1<) . M.size) usersRatingsIn -- fromRating converts a Rating into a Map of Lists, a singleton SlopeOne. fromRating userRatings = M.mapWithKey expand userRatings where expand item1 rating1 = map makeTup . M.toAscList . M.delete item1 $ userRatings where makeTup (item2,rating2) = Tup item2 1 (rating1-rating2) -- 'mergeAdd' is a helper for 'update'.-- Optimized traversal of distinct ascending lists to perform additive merge.mergeAdd :: Ord item => [Tup item] -> [Tup item] -> [Tup item]mergeAdd !xa@(x:xs) !ya@(y:ys) = case compare (itemT x) (itemT y) of LT -> x : mergeAdd xs ya GT -> y : mergeAdd xa ys EQ -> Tup (itemT x) (countT x + countT y) (ratingT x + ratingT y) : mergeAdd xs ysmergeAdd xs [] = xsmergeAdd [] ys = ys -- The output Rating has no items in common with the input Rating and-- only includes positively weighted ratings.predict :: Ord item => SlopeOne item -> Rating item -> Rating itempredict (SlopeOne matrixIn) userRatings = M.mapMaybe (computeRating ratingList) (M.difference matrixIn userRatings) where ratingList = M.toAscList userRatings -- 'computeRating' is a helper for 'predict'.-- Optimized traversal of distinct ascending lists to compute positive weighted rating.computeRating :: (Ord item) => [(item,RatingValue)] -> [Tup item] -> Maybe RatingValuecomputeRating !xa@(x:xs) !ya@(y:ys) = case compare (fst x) (itemT y) of LT -> computeRating xs ya GT -> computeRating xa ys EQ -> helper (countT y) (ratingT y + fromIntegral (countT y) * snd x) xs ys where helper :: (Ord item) => Count -> RatingValue -> [(item,RatingValue)] -> [Tup item] -> Maybe RatingValue helper !count !rating !xa@(x:xs) !ya@(y:ys) = case compare (fst x) (itemT y) of LT -> helper count rating xs ya GT -> helper count rating xa ys EQ -> helper (count + countT y) (rating + ratingT y + fromIntegral (countT y) * (snd x)) xs ys helper !count !rating _ _ | rating > 0 = Just (rating / fromIntegral count) | otherwise = NothingcomputeRating _ _ = Nothing
这个也是很全: