Collaborative Filtering
Published: 2019-05-25


This is a comprehensive overview:

A drawback of collaborative filtering is that

the items "related" to a hot item often turn out to be nothing more than the other items that happened to be hot at the same time.

Partitioning items into categories first (for example, with SVD) and then computing similarity within that space usually works better.
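To make that concrete, here is a minimal sketch of the idea (my own illustration, not from the original post): factor the rating matrix with a truncated SVD, then compare items in the resulting latent space rather than by raw co-occurrence. The toy matrix and k = 2 are arbitrary.

import numpy as np

# Toy user-item rating matrix: rows are users, columns are items (0 = unrated).
ratings = np.array([[5.0, 3.0, 0.0, 1.0],
                    [4.0, 0.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0, 5.0],
                    [1.0, 0.0, 0.0, 4.0]])

# Truncated SVD: keep k latent dimensions, which play the role of "categories".
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
item_factors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per item

def cosine(a, b):
    # Cosine similarity between two latent item vectors.
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Item similarity in the latent space is less dominated by whatever
# happened to be popular at the same moment than raw-column similarity.
print(cosine(item_factors[0], item_factors[3]))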

1. The CF Algorithm (Collaborative Filtering)

http://en.wikipedia.org/wiki/Collaborative_filtering 
Reposted:
Collaborative filtering techniques fall into three categories:
User-based collaborative filtering
Item-based collaborative filtering
Model-based collaborative filtering


User-based 
http://www.guwendong.cn/post/2006/user_based_collaborative_filtering.html 
Reposted:
Step 1: collect information that represents each user's interests.
Step 2: nearest-neighbor search.
Step 3: generate the recommendations. (A minimal sketch of these steps follows.)
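As a rough sketch of those three steps (my own toy example; cosine similarity and a neighbourhood of k = 2 are arbitrary choices, not prescribed by the source):

import math

# Step 1: ratings that represent each user's interests.
ratings = {
    "alice": {"squid": 5, "octopus": 3, "cuttlefish": 4},
    "bob":   {"squid": 4, "octopus": 3},
    "carol": {"squid": 1, "cuttlefish": 5},
}

def cosine_sim(r1, r2):
    # Cosine similarity over the items both users have rated.
    common = set(r1) & set(r2)
    num = sum(r1[i] * r2[i] for i in common)
    den = (math.sqrt(sum(r1[i] ** 2 for i in common)) *
           math.sqrt(sum(r2[i] ** 2 for i in common)))
    return num / den if den else 0.0

def predict_user_based(user, item, k=2):
    # Step 2: nearest-neighbour search among users who rated the item.
    neighbours = sorted(((cosine_sim(ratings[user], r), r[item])
                         for u, r in ratings.items()
                         if u != user and item in r), reverse=True)[:k]
    # Step 3: a similarity-weighted average becomes the predicted rating.
    wsum = sum(abs(sim) for sim, _ in neighbours)
    return sum(sim * val for sim, val in neighbours) / wsum if wsum else None

print(predict_user_based("bob", "cuttlefish"))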

Item-based 
http://www.guwendong.cn/post/2006/item_based_collaborative_filtering.html 

Reposted:
The Item-based method involves the same three steps:
1) obtain the User-item rating data;
2) nearest-neighbor search over items, i.e., compute item-to-item similarities;
3) generate the recommendations.

Compared with the User-based method, the biggest improvement the Item-based method brings is in the scalability and performance of collaborative filtering.

As the previous section showed, in the User-based method the "nearest-neighbor search" over an ever-growing user base becomes the bottleneck of the whole algorithm.

The Item-based method replaces user-to-user similarity with item-to-item similarity. Item similarities are far more stable, so the most expensive step, the similarity computation, can be done offline. This greatly reduces the online workload and makes recommendation much more efficient, as the sketch below illustrates.
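A sketch of that offline/online split (my own illustration; the toy data and the top-50 cutoff are arbitrary):

import math

# Item -> {user: rating}: the transposed view used for item similarity.
item_ratings = {
    "squid":      {"alice": 5, "bob": 4, "carol": 1},
    "octopus":    {"alice": 3, "bob": 3},
    "cuttlefish": {"alice": 4, "carol": 5},
}

def similarity(a, b):
    # Cosine similarity over the users who rated both items.
    ra, rb = item_ratings[a], item_ratings[b]
    common = set(ra) & set(rb)
    num = sum(ra[u] * rb[u] for u in common)
    den = (math.sqrt(sum(ra[u] ** 2 for u in common)) *
           math.sqrt(sum(rb[u] ** 2 for u in common)))
    return num / den if den else 0.0

# Offline step: precompute each item's most similar items (e.g., nightly).
item_sims = {
    a: sorted(((similarity(a, b), b) for b in item_ratings if b != a),
              reverse=True)[:50]
    for a in item_ratings
}

# Online step: recommending is then just cheap table lookups.
def recommend(user_ratings, n=10):
    scores = {}
    for item, rating in user_ratings.items():
        for sim, other in item_sims.get(item, []):
            if other not in user_ratings:
                scores[other] = scores.get(other, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend({"squid": 4.0}))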

In the Item-based method, computing the similarity of items A and B usually takes two steps:
1) find every user who has rated both A and B;
2) compute the similarity over those co-ratings; common measures include:
Pearson correlation coefficient
Cosine similarity
Adjusted cosine similarity
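For reference, a hedged sketch of those three measures over a pair of co-rated vectors (the numbers are made up; user_means holds each co-rater's mean rating across all of their items, which is the correction adjusted cosine adds):

import math

def pearson(xs, ys):
    # Pearson correlation over co-rated pairs (xs[i], ys[i]).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs)) *
           math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den if den else 0.0

def cosine(xs, ys):
    # Plain cosine similarity between the two rating vectors.
    num = sum(x * y for x, y in zip(xs, ys))
    den = (math.sqrt(sum(x * x for x in xs)) *
           math.sqrt(sum(y * y for y in ys)))
    return num / den if den else 0.0

def adjusted_cosine(xs, ys, user_means):
    # Cosine after subtracting each co-rater's own mean rating, which
    # corrects for users who rate everything uniformly high or low.
    dx = [x - m for x, m in zip(xs, user_means)]
    dy = [y - m for y, m in zip(ys, user_means)]
    return cosine(dx, dy)

# Ratings for items A and B from three users who rated both, plus
# each of those users' mean rating over all items they have rated.
a, b, means = [5, 3, 4], [4, 2, 5], [4.0, 2.5, 4.5]
print(pearson(a, b), cosine(a, b), adjusted_cosine(a, b, means))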



http://www.cnblogs.com/kuber/archive/2008/06/10/1216846.html 
Slope One: a simple and efficient collaborative filtering algorithm

Reposted:
Slope One: compared with similar algorithms, its greatest strengths are that it is very simple, easy to implement, and efficient to execute, while its recommendation accuracy remains relatively high.

Basic idea

The basic idea behind Slope One is simple. Example 1: users X, Y, and A have all rated Item 1, and users X and Y have also rated Item 2. What rating might user A give Item 2?

User   Rating for Item 1   Rating for Item 2
X      5                   3
Y      4                   3
A      4                   ?

According to the Slope One algorithm, the answer is: 4 - ((5-3) + (4-3))/2 = 2.5.

Here is why. User X rated Item 1 a 5 and Item 2 a 3, so X evidently thinks Item 2 deserves about two points less than Item 1. Likewise, user Y thinks Item 2 deserves one point less. Averaged over everyone who rated both items, Item 2 scores 1.5 points below Item 1, so we have reason to predict that user A would rate Item 2 at 4 - 1.5 = 2.5.
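The arithmetic is small enough to check directly:

# X rated Item 1 = 5 and Item 2 = 3 (difference 2); Y rated Item 1 = 4
# and Item 2 = 3 (difference 1).  Average difference: 1.5, so A, who
# gave Item 1 a 4, is predicted to rate Item 2 at 4 - 1.5 = 2.5.
diffs = [5 - 3, 4 - 3]
prediction = 4 - sum(diffs) / len(diffs)
print(prediction)   # 2.5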

The weighted algorithm

Next let's look at the weighted algorithm (Weighted Slope One). Suppose 100 users have rated both Item 1 and Item 2, while 1000 users have rated both Item 3 and Item 2. Clearly these two rating differences should not carry the same weight, so we compute:

(100 * (Rating 1 to 2) + 1000 * (Rating 3 to 2)) / (100 + 1000)
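In code this is just a frequency-weighted average. The per-pair predictions below (2.3 and 1.1) are hypothetical stand-ins for "Rating 1 to 2" and "Rating 3 to 2":

# (number of co-raters, prediction for Item 2 derived from that pair)
pairs = [(100, 2.3), (1000, 1.1)]
prediction = sum(n * r for n, r in pairs) / sum(n for n, r in pairs)
print(round(prediction, 3))   # the 1000-user pair dominates, as it should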

The discussion above assumes users rate only how much they like an item. Users can also rate how much they dislike an item; in that case the Bi-Polar Slope One variant can be used.

Here is an implementation:

During a lunchtime conversation the other day, a coworker mentioned that he was hacking in his spare time on an entry for the Netflix Prize. This got me to thinking about collaborative filtering: why had I never seen a good description of how to do it? I suspect that people who might ordinarily have a casual interest in the subject hear that there are some statistics involved, whereupon they immediately freeze in the mathematical headlights, and turn the conversation to something else, anything else.

In early 2005, a researcher named Daniel Lemire published, with Anna Maclachlan, a paper with the jazzy title of "Slope One Predictors for Online Rating-Based Collaborative Filtering". This is an important paper, because it presents a family of really simple collaborative filtering schemes. I mean really simple: there are no statistics involved, just a little bit of linear algebra. Unfortunately, because the Lemire/Maclachlan paper is aimed at an academic audience, it's not trivial to tell by merely reading it how simple the technique it presents is, so what follows is my attempt to translate its ideas into my native language of Attention Deficit Programmerese. To make things more concrete, I'm going to present an implementation in less than 40 lines of Python (and I'll try to explain any obscure bits of Python as I go). Cool, huh?

Regardless of the underlying implementation, collaborative filters tend to try to solve the same broad problem using much the same data.

  1. You have your crowd of users.
  2. You have your big pile of items.
  3. Some of your users have been generous enough to tell you what they think of some of your items.
  4. You want to hawk more items to a user, and you’d prefer to make your suggestions relevant to their likely interests.
Items 1 through 3 suggest that you could use the opinions that people have recorded about shiny doodads they have seen, to give a good guess as to which doodads they haven't seen, but might like. I am not going to explain the mathematics behind the Slope One predictor, because Lemire and Maclachlan's paper pretty much pulls the numerical justification out of a hat. I'm not immersed in the literature of the field, so I'll be lazy and take them at their word when they say their approach works; and hey, that's easy code to write, so let's get hacking! (The scheme I'm going to talk about is referred to as "Weighted Slope One" in the Lemire and Maclachlan paper.)

My Slope One implementation stores its data in two 2D matrices. The X and Y axes are identical in each matrix. If the items to be rated are "cuttlefish", "octopus", and "squid", we end up with two 3x3 matrices, each with the following rows and columns:
[Figure: two matrices whose rows and columns are both labelled cuttlefish, octopus, and squid, with the cells that hold data coloured in.]

The coloured cells above indicate which elements of the matrix contain useful information. I'll describe why most of the matrix is empty later. To keep things simple, let's represent our matrices as two levels of Python dict objects (a dict is simply a hash table, if you're not familiar with Python). The key of each dict is an item ID, so to get the matrix entry for "squid vs cuttlefish", we look in the first-level dict for squid, then the second-level dict for cuttlefish.
class SlopeOne(object):
    def __init__(self):
        self.diffs = {}
        self.freqs = {}
The self.diffs matrix will hold differential ratings (I'll describe what these are soon), and the self.freqs matrix will contain the number of times we've computed a differential rating for each pair of items. We'll fill in the first and second levels of the dicts once we know what our data looks like, so let's talk about the shape of the data. We need to be able to express ratings such as "Alice rated a squid as 1, and a cuttlefish as 4". This is easy to express with two levels of dicts; the first maps from a user to all of that user's ratings, and the second maps items to their ratings. In other words, our user data looks like this in Python:
userdata = dict(alice=dict(squid=1, cuttlefish=4),
                bob=dict(squid=3, octopus=2))
The time has come to load the data into our matrices; let’s call the method update.
def update(self, data):
To fill in the matrices, we must look at each rating from every user.
for user, ratings in data.iteritems():
For each user and rating they have submitted, we will be iterating over every rating they have submitted a second time. The first thing we do is make sure that the necessary second-level dicts in each matrix exist.
for item, rating in ratings.iteritems():
    self.freqs.setdefault(item, {})
    self.diffs.setdefault(item, {})
The setdefault call is an unfortunately ugly Pythonic idiom that means "if there's no value for the given key (the first parameter) in the dict, assign it the given value (the second parameter)". Inside this loop, we iterate over the user's ratings again, to compare each rating against every other.
for item2, rating2 in ratings.iteritems():
    self.freqs[item].setdefault(item2, 0)
    self.diffs[item].setdefault(item2, 0.0)
Here, we ensure that there are zeroes present in the matrix for every pair of ratings. (Notice that we're using an integer zero to count frequencies, and a floating point zero for ratings.) For every pair of ratings, we modify the appropriate cell in self.freqs by adding one to the number of times we've seen that pair of ratings.
self.freqs[item][item2] += 1
And to update the self.diffs matrix, we compute the difference between the ratings for the two items.
self.diffs[item][item2] += rating - rating2
This is why I called self.diffs the "differential rating" matrix. Every time a user rates an item, we update the appropriate cells in this matrix to reflect the accumulated difference between that rating and every other rating the user has made. Once we're done, we scale each differential rating based on the number of times we've seen that pair of items.
for item, ratings in self.diffs.iteritems():
    for item2, rating in ratings.iteritems():
        ratings[item2] /= self.freqs[item][item2]
Now let's pull the whole thing together, so you can see how simple this update method is.
def update(self, data):
    for user, prefs in data.iteritems():
        for item, rating in prefs.iteritems():
            self.freqs.setdefault(item, {})
            self.diffs.setdefault(item, {})
            for item2, rating2 in prefs.iteritems():
                self.freqs[item].setdefault(item2, 0)
                self.diffs[item].setdefault(item2, 0.0)
                self.freqs[item][item2] += 1
                self.diffs[item][item2] += rating - rating2
    for item, ratings in self.diffs.iteritems():
        for item2, rating in ratings.iteritems():
            ratings[item2] /= self.freqs[item][item2]
That's a measly 13 lines of code, 4 of which are due to us using dict objects to represent matrices. Nice! (By the way, notice that the update method made no use of the user IDs we passed in. This technique doesn't care who rated what, just how an item has been rated. Collaborative filtering techniques that don't use user information directly are called "item-based", while those that do are called "user-based".) It's easy to fill out our matrices:
s = SlopeOne()
s.update(dict(alice=dict(squid=1.0, cuttlefish=4.0),
              bob=dict(squid=1.0, octopus=3.0)))
With a little data in place, we'd like to make a prediction. To do this, we'll pass the ratings a user has made to the predict method; it will return predictions of ratings for items the user hasn't rated.
def predict(self, userprefs):
We begin with two empty dicts; preds will accumulate prediction data, and freqs will accumulate counts.
preds, freqs = {}, {}
We iterate over every item the user has rated, for items that have been rated by anyone (the try/except clause below lets us skip over unrated item pairs).
for item, rating in userprefs.iteritems():
    for diffitem, diffratings in self.diffs.iteritems():
        try:
            freq = self.freqs[diffitem][item]
        except KeyError:
            continue
We make sure there are zeroes present in our dicts before we accumulate anything into them.
preds.setdefault(diffitem, 0.0)
freqs.setdefault(diffitem, 0)
The accumulation step is simple.
preds[diffitem] += freq * (diffratings[item] + rating)
freqs[diffitem] += freq
We must filter the result a little before we return it. We return a new dict, containing weighted predictions for every item, but omitting items that the user has rated (no point in predicting those), and omitting items for which we have a cumulative frequency of zero.
return dict([(item, value / freqs[item])
             for item, value in preds.iteritems()
             if item not in userprefs and freqs[item] > 0])
And now let's see the predict method without all of those intrusive comments, again to see how simple it is.
def predict(self, userprefs):
    preds, freqs = {}, {}
    for item, rating in userprefs.iteritems():
        for diffitem, diffratings in self.diffs.iteritems():
            try:
                freq = self.freqs[diffitem][item]
            except KeyError:
                continue
            preds.setdefault(diffitem, 0.0)
            freqs.setdefault(diffitem, 0)
            preds[diffitem] += freq * (diffratings[item] + rating)
            freqs[diffitem] += freq
    return dict([(item, value / freqs[item])
                 for item, value in preds.iteritems()
                 if item not in userprefs and freqs[item] > 0])
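With the two users loaded earlier, a call to predict looks like this (the expected values follow from the normalized differentials, assuming I have traced the arithmetic correctly):

print(s.predict(dict(squid=0.4)))
# {'cuttlefish': 3.4, 'octopus': 2.4}
# cuttlefish: Alice rated it 3.0 above squid, so 0.4 + 3.0 = 3.4
# octopus:    Bob rated it 2.0 above squid, so 0.4 + 2.0 = 2.4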
Our proof-of-concept Weighted Slope One predictor clocks in at 33 lines of Python. If we had a sparse 2D matrix class handy, we could trim six of those lines. That’s not bad! It wouldn’t be difficult to go from this trivial implementation to something truly efficient. From the way we construct the matrices in the update method, we can make a few potentially space saving observations.
  • Some portions of each matrix are entirely empty. They don’t contain zeroes; they contain nothing, because nobody has rated those pairs of items. If Alice rates squid and cuttlefish, while Bob rates squid and octopus, there will be no entries for cuttlefish vs octopus in either matrix.
  • The diagonal of the self.diffs matrix will always be either empty or zero, so there’s no need to store values on the diagonal of the self.freqs matrix.
  • Each entry in the upper right of the self.diffs matrix is the negative of the symmetric entry in the bottom left. The upper right of the self.freqs matrix is symmetric to the bottom left (corresponding entries have identical values).
We could thus represent each matrix as a strictly lower triangular matrix (a small sketch of this idea appears just below). Because much of a matrix is going to be empty for real databases of items and ratings, we would save a lot by storing it in sparse form.

If you want to build a web site that uses collaborative filtering, and you're not a specialist, the Slope One scheme has a lot going for it. It is simple to implement, and the arithmetic is easy to understand (in stark contrast to techniques that use singular value decomposition, cosine correlation or Pearson's correlation coefficient, or other gymnastic statistical approaches). It's also easy to see that updates and predictions can both be performed in data parallel style (predictions are easy to express in mapreduce form). Better yet, it's easy and cheap to update the Slope One data structures on the fly, instead of taking everything offline and performing a heavyweight nightly thrashing of a database.
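Here is a small sketch of the triangular-storage observation (my own illustration, not part of the original post): keep one entry per unordered item pair, canonicalized so the lexically smaller item comes first, and flip the sign of the differential when reading in the other direction.

freqs, diffs = {}, {}

def add_pair(item, item2, rating, rating2):
    if item == item2:
        return                          # the diagonal carries no information
    key = (min(item, item2), max(item, item2))
    sign = 1.0 if key == (item, item2) else -1.0
    freqs[key] = freqs.get(key, 0) + 1
    diffs[key] = diffs.get(key, 0.0) + sign * (rating - rating2)

def get_diff(item, item2):
    # Average differential of item relative to item2, reconstructed
    # from the single stored orientation.
    key = (min(item, item2), max(item, item2))
    sign = 1.0 if key == (item, item2) else -1.0
    return sign * diffs[key] / freqs[key]

add_pair("cuttlefish", "squid", 4.0, 1.0)
print(get_diff("squid", "cuttlefish"))   # -3.0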

Here is the source code:

Chris' version

I renamed the variables and then reorganized the code a bit.

The (update, SlopeOne, predict) definitions can be compared with the newer (update2, SlopeOne', predict') definitions.

module WeightedSlopeOne (Rating, SlopeOne, empty, predict, update) where

import Data.List (foldl', foldl1')
import qualified Data.Map as M

-- The item type is a polymorphic parameter.  Since it goes into a Map
-- it must be able to be compared, so item must be an instance of Ord.
type Count = Int
type RatingValue = Double

-- The Rating is the known (item, Rating) information for a particular "user"
type Rating item = M.Map item RatingValue

-- The SlopeOne matrix is indexed by pairs of items and is implemented
-- as a sparse map of maps.
newtype SlopeOne item = SlopeOne (M.Map item (M.Map item (Count, RatingValue)))
  deriving (Show)

-- The SlopeOne' matrix is an un-normalized version of SlopeOne
newtype SlopeOne' item = SlopeOne' (M.Map item (M.Map item (Count, RatingValue)))
  deriving (Show)

empty = SlopeOne M.empty
empty' = SlopeOne' M.empty

-- This performs a strict addition on pairs made of two numeric types
addT (a, b) (c, d) = let (l, r) = (a + c, b + d) in l `seq` r `seq` (l, r)

-- There is never an entry for the "diagonal" elements with equal
-- items in the pair: (foo, foo) is never in the SlopeOne.
update :: Ord item => SlopeOne item -> [Rating item] -> SlopeOne item
update (SlopeOne matrixInNormed) usersRatings =
    SlopeOne . M.map (M.map norm) . foldl' update' matrixIn $ usersRatings
  where update' oldMatrix userRatings =
          foldl' (\oldMatrix (itemPair, rating) -> insert oldMatrix itemPair rating)
                 oldMatrix itemCombos
          where itemCombos = [ ((item1, item2), (1, rating1 - rating2))
                             | (item1, rating1) <- ratings
                             , (item2, rating2) <- ratings
                             , item1 /= item2 ]
                ratings = M.toList userRatings
        insert outerMap (item1, item2) newRating =
            M.insertWith' outer item1 newOuterEntry outerMap
          where newOuterEntry = M.singleton item2 newRating
                outer _ innerMap = M.insertWith' addT item2 newRating innerMap
        norm (count, total_rating) = (count, total_rating / fromIntegral count)
        un_norm (count, rating) = (count, rating * fromIntegral count)
        matrixIn = M.map (M.map un_norm) matrixInNormed

-- This version of update2 makes an un-normalized SlopeOne' from each
-- Rating and combines them using Map.union* operations and addT.
update2 :: Ord item => SlopeOne' item -> [Rating item] -> SlopeOne' item
update2 s@(SlopeOne' matrixIn) usersRatingsIn
    | null usersRatings = s
    | otherwise =
        SlopeOne' . M.unionsWith (M.unionWith addT) . (matrixIn:) . map fromRating $ usersRatings
  where usersRatings = filter ((1 <) . M.size) usersRatingsIn
        fromRating userRating = M.mapWithKey expand1 userRating
          where expand1 item1 rating1 = M.mapMaybeWithKey expand2 userRating
                  where expand2 item2 rating2
                          | item1 == item2 = Nothing
                          | otherwise = Just (1, rating1 - rating2)

predict :: Ord a => SlopeOne a -> Rating a -> Rating a
predict (SlopeOne matrixIn) userRatings =
  let freqM = foldl' insert M.empty
                     [ (item1, found_rating, user_rating)
                     | (item1, innerMap) <- M.assocs matrixIn
                     , M.notMember item1 userRatings
                     , (user_item, user_rating) <- M.toList userRatings
                     , item1 /= user_item
                     , found_rating <- M.lookup user_item innerMap
                     ]
      insert oldM (item1, found_rating, user_rating) =
        let (count, norm_rating) = found_rating
            total_rating = fromIntegral count * (norm_rating + user_rating)
        in M.insertWith' addT item1 (count, total_rating) oldM
      normM = M.map (\(count, total_rating) -> total_rating / fromIntegral count) freqM
  in M.filter (\norm_rating -> norm_rating > 0) normM

-- This is a modified version of predict.  It also expects the
-- un-normalized SlopeOne', but this is a small detail.
predict' :: Ord a => SlopeOne' a -> Rating a -> Rating a
predict' (SlopeOne' matrixIn) userRatings =
    M.mapMaybe calcItem (M.difference matrixIn userRatings)
  where calcItem innerMap
          | M.null combined = Nothing
          | norm_rating <= 0 = Nothing
          | otherwise = Just norm_rating
          where combined = M.intersectionWith weight innerMap userRatings
                (total_count, total_rating) = foldl1' addT (M.elems combined)
                norm_rating = total_rating / fromIntegral total_count
        weight (count, rating) user_rating =
          (count, rating + fromIntegral count * user_rating)

userData :: [Rating String]
userData = map M.fromList
  [ [("squid", 1.0), ("cuttlefish", 0.5), ("octopus", 0.2)]
  , [("squid", 1.0), ("octopus", 0.5), ("nautilus", 0.2)]
  , [("squid", 0.2), ("octopus", 1.0), ("cuttlefish", 0.4), ("nautilus", 0.4)]
  , [("cuttlefish", 0.9), ("octopus", 0.4), ("nautilus", 0.5)]
  ]

userInfo = M.fromList [("squid", 0.4), ("cuttlefish", 0.9), ("dolphin", 1.0)]

predictions = predict (update empty userData) userInfo

predictions' = predict' (update2 empty' userData) userInfo

2. More optimized storage

The changes to SlopeOne/update/predict below use a different internal data structure for storing the sparse matrix of SlopeOne. Instead of a Map of Map design it uses a Map of List design and keeps the List in distinct ascending form. The list values are a strict (data Tup) type which should help save space compared to the previous inner Map design and yet efficiently provide all the operations needed by update and predict.

Much of the logic of prediction is in the computeRating helper function.

-- The SlopeOne matrix is indexed by pairs of items and is implemented
-- as a sparse map of distinct ascending lists.  The 'update' and
-- 'predict' functions do not need the inner type to actually be a
-- map, so the list saves space and complexity.
newtype SlopeOne item = SlopeOne (M.Map item [Tup item])
  deriving (Show)

-- Strict triple tuple type for SlopeOne internals
data Tup item = Tup { itemT   :: !item
                    , countT  :: !Count
                    , ratingT :: !RatingValue }
  deriving (Show)

empty :: SlopeOne item
empty = SlopeOne M.empty

update :: Ord item => SlopeOne item -> [Rating item] -> SlopeOne item
update s@(SlopeOne matrixIn) usersRatingsIn
    | null usersRatings = s
    | otherwise =
        SlopeOne . M.unionsWith mergeAdd . (matrixIn:) . map fromRating $ usersRatings
  where usersRatings = filter ((1 <) . M.size) usersRatingsIn
        -- fromRating converts a Rating into a Map of Lists, a singleton SlopeOne.
        fromRating userRatings = M.mapWithKey expand userRatings
          where expand item1 rating1 =
                    map makeTup . M.toAscList . M.delete item1 $ userRatings
                  where makeTup (item2, rating2) = Tup item2 1 (rating1 - rating2)

-- 'mergeAdd' is a helper for 'update'.
-- Optimized traversal of distinct ascending lists to perform additive merge.
mergeAdd :: Ord item => [Tup item] -> [Tup item] -> [Tup item]
mergeAdd !xa@(x:xs) !ya@(y:ys) =
  case compare (itemT x) (itemT y) of
    LT -> x : mergeAdd xs ya
    GT -> y : mergeAdd xa ys
    EQ -> Tup (itemT x) (countT x + countT y) (ratingT x + ratingT y) : mergeAdd xs ys
mergeAdd xs [] = xs
mergeAdd [] ys = ys

-- The output Rating has no items in common with the input Rating and
-- only includes positively weighted ratings.
predict :: Ord item => SlopeOne item -> Rating item -> Rating item
predict (SlopeOne matrixIn) userRatings =
    M.mapMaybe (computeRating ratingList) (M.difference matrixIn userRatings)
  where ratingList = M.toAscList userRatings

-- 'computeRating' is a helper for 'predict'.
-- Optimized traversal of distinct ascending lists to compute positive weighted rating.
computeRating :: (Ord item) => [(item, RatingValue)] -> [Tup item] -> Maybe RatingValue
computeRating !xa@(x:xs) !ya@(y:ys) =
  case compare (fst x) (itemT y) of
    LT -> computeRating xs ya
    GT -> computeRating xa ys
    EQ -> helper (countT y) (ratingT y + fromIntegral (countT y) * snd x) xs ys
  where helper :: (Ord item) => Count -> RatingValue
               -> [(item, RatingValue)] -> [Tup item] -> Maybe RatingValue
        helper !count !rating !xa@(x:xs) !ya@(y:ys) =
          case compare (fst x) (itemT y) of
            LT -> helper count rating xs ya
            GT -> helper count rating xa ys
            EQ -> helper (count + countT y)
                         (rating + ratingT y + fromIntegral (countT y) * snd x)
                         xs ys
        helper !count !rating _ _
          | rating > 0 = Just (rating / fromIntegral count)
          | otherwise = Nothing
computeRating _ _ = Nothing

This one is also quite comprehensive:
