Thursday, December 28, 2017

Git usage handbook

1. Viewing the diff after a merge

Suppose Git performs a merge: the current branch is dev and you need to merge in the code from rls:

git pull origin rls

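After the merge there are three versions of the code: the new (merged) code, dev's original code, and rls's code. You need to compare the new code against dev's original code, and also against rls:

git diff dev

git diff rls

Note: once the merge has been committed, dev itself points at the merge commit, and git pull origin rls does not update a local rls branch. In that case the two parents of the merge commit give the same comparisons (HEAD^1 is dev before the merge, HEAD^2 is the rls commit that was merged in):

git diff HEAD^1

git diff HEAD^2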

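2. Forcing one side's version of the pom file during a merge

After a merge you may simply want to keep the local pom file, or simply keep the pom file from rls.

No rebase is needed for this; just run: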

git checkout --ours unisound-commons-time/pom.xml

--ours means use the version from the branch I am currently on.

--theirs means use the version from the other branch.

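3. A case combining 1 and 2

Suppose the merge produced no conflicts, but git diff dev shows a change to a file you did not want modified, so you want to keep the local (dev) version from before the merge.

You can then run git checkout --ours **file**

Here ours and theirs mean local and remote, that is, dev and rls.

Note: --ours and --theirs select between the two sides of a path that is still unmerged (in conflict). If the merge has already been committed cleanly, git checkout HEAD^1 -- **file** restores the pre-merge dev version instead.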

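4. Cherry-picking

Suppose you want to delete a branch but keep some of its commits on another branch. This is where cherry-picking (git cherry-pick) comes in.

5. Creating a branch from a commit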

git branch **_before_merge 679856aa5
git checkout **_before_merge

Viewing the files changed in a given commit

git show --name-only e893e3175

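6. Eclipse line endings

CRLF (carriage return plus line feed) is the line ending used on Windows, while LF (line feed) is the line ending used on Linux, so the two can clash.

You can change the Eclipse settings so that Linux line endings are used locally, or you can set Git's core.autocrlf parameter.

Official documentation: Formatting and Whitespace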

core.autocrlf

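》》1. If you program on Windows while others work on non-Windows systems, you'll probably run into line-ending issues at some point.

How does Git solve this?

Git can automatically convert CRLF line endings to LF when you add a file to the index, and convert LF back to CRLF when you check out. To enable this, set: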

$ git config --global core.autocrlf true

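》》2. If you program on Linux and do not want Git to convert line endings automatically, but you do want Git to fix a file that accidentally arrives with CRLF endings, set core.autocrlf to input.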

If you’re on a Linux or Mac system that uses LF line endings, then you don’t want Git to automatically convert them when you check out files; however, if a file with CRLF endings accidentally gets introduced, then you may want Git to fix it. You can tell Git to convert CRLF to LF on commit but not the other way around by setting core.autocrlf to input:

$ git config --global core.autocrlf input

This setup should leave you with CRLF endings in Windows checkouts, but LF endings on Mac and Linux systems and in the repository.
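To make the two directions of conversion concrete, here is a minimal Python sketch of what the settings above amount to on the byte level. The helper names to_lf and to_crlf are made up for this illustration, and Git's real conversion does more than this (for example it decides whether a file is text at all), so treat it as a picture of the idea rather than of Git's implementation.

```python
def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving existing LF endings alone."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLF."""
    return to_lf(data).replace(b"\n", b"\r\n")


windows_text = b"line one\r\nline two\r\n"

# Roughly what is stored in the repository with autocrlf=true or autocrlf=input.
stored = to_lf(windows_text)

# Roughly what autocrlf=true writes back into a Windows working tree.
checked_out = to_crlf(stored)

print(stored)
print(checked_out)
```

With core.autocrlf set to input only the first step happens, so the working tree keeps LF endings.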

7. Merge conflicts caused by the other side deleting a file

Unmerged paths:
  (use "git add/rm ..." as appropriate to mark resolution)

        deleted by them:    src/main/java/com/example/filevisitor/PrintingFileVisitor.java

Resolving this type of conflict is pretty easy. You just have to tell Git whether you want to keep the file in your current branch using command:

$ git add file_name

or if you want to remove it completely:

$ git rm file_name

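8. Pulling twice

The assumption here is that your branch is going to be merged into dev. First pull your own branch, then pull dev:

git pull origin <your-branch>

git pull origin dev

This saves a lot of unnecessary trouble.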

9. What if you have already committed a file but now want to ignore it and stop tracking its changes?

If you want to ignore a file that you've committed in the past, you'll need to delete the file from your repository and then add a .gitignore rule for it. Using the --cached option with git rm means that the file will be deleted from your repository, but will remain in your working directory as an ignored file.

First change to the directory that contains the .gitignore file
$ echo debug.log >> .gitignore
To check whether a .gitignore entry is taking effect:
$ git check-ignore -v filename
$ git rm --cached debug.log
rm 'debug.log'
$ git commit -m "Start ignoring debug.log"

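Friday, February 7, 2014

Correlation-based distance in clustering

This distance is suitable for clustering when what you care about most is the correlation between two variables.

》》》》Interpreting the tree structure (dendrogram) of hierarchical clustering:

A dendrogram with n samples necessarily contains n-1 fusions before everything ends up in a single branch. Because the tree is binary, the two branches joined at each of these n-1 fusion points can be swapped without changing the meaning of the dendrogram, so there are 2^(n-1) possible reorderings of the dendrogram, where n is the number of leaves (that is, the number of samples).

Each leaf of the dendrogram represents one sample. As we move up the tree, some leaves begin to fuse into branches (a branch can hold several leaves). The earlier a fusion happens (the lower in the tree), the closer the samples in that group are to each other. Every fusion therefore forms a group. So to see how far apart two samples are, look for the point in the tree where the branches containing those two samples are first fused, and read the distance off the vertical axis.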
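The n-1 fusions and the idea of reading a distance off the height of the first common fusion can be seen in a naive agglomerative clustering. The sketch below is only an illustration under assumptions of mine: the samples are 1-D numbers, the distance is the absolute difference, the linkage is single linkage, and the names single_linkage and samples are made up.

```python
def single_linkage(points):
    """Naive agglomerative clustering of 1-D points.

    Returns the fusions in the order they happen, each as
    (height, members of the newly formed group).
    """
    clusters = [[p] for p in points]
    fusions = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        fusions.append((d, merged))
        # Replace the two fused clusters by the new group.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return fusions


samples = [0, 1, 10, 12, 30]
fusions = single_linkage(samples)
for height, members in fusions:
    print(height, members)

# n samples give n-1 fusions, hence 2^(n-1) left/right orderings of the tree.
print(len(fusions), 2 ** len(fusions))
```

The samples 0 and 1 fuse first (lowest in the tree), while 30 only joins at the last and highest fusion, which is what marks it as far from everything else.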

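》》》》How do you judge whether a dendrogram is good? As the figure below shows, bold entries mark pairs of variables that are linearly correlated; the other pairs are unrelated. The correlation matrix is shown below.

We evaluate four correlation-based distance functions: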

Dissimilarity = 1 - Correlation
Dissimilarity = (1 - Correlation)/2
Dissimilarity = 1 - Abs(Correlation)
Dissimilarity = Sqrt(1 - Correlation²)

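The resulting plots are shown below.

Why is this choice of distance a poor one?

In the 1 - Correlation plot, the variables M10B and M02A, and M10A and P00B, end up clustered together even though the table shows them to be uncorrelated (close to 0).

(1 - Correlation)/2 merely rescales the vertical axis; it does not change the actual shape.

1 - Abs(Correlation) works well: uncorrelated variables are not clustered together. Note, for example, that P00A and P00B do not cluster, because they are uncorrelated.

Sqrt(1 - Correlation²) is also good, but it compresses the vertical spread. It suits cases when only a small number of highly correlated clusters are desired.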
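A quick way to see why the four functions behave differently is to evaluate them for a strongly positive, a zero and a strongly negative correlation. This is a small sketch with a made-up helper name, dissimilarities; the three correlation values are illustrative and not taken from the table.

```python
import math


def dissimilarities(r):
    """Evaluate the four correlation-based dissimilarities for one correlation r."""
    return {
        "1 - r": 1 - r,
        "(1 - r) / 2": (1 - r) / 2,
        "1 - |r|": 1 - abs(r),
        "sqrt(1 - r^2)": math.sqrt(1 - r * r),
    }


for r in (0.9, 0.0, -0.9):
    print(r, dissimilarities(r))
```

With 1 - r, an uncorrelated pair (distance 1.0) sits closer together than a strongly negatively correlated pair (distance 1.9), whereas 1 - |r| and sqrt(1 - r^2) treat r = 0.9 and r = -0.9 alike and keep the uncorrelated pair farthest apart.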

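》》》》Summary: variables with no linear correlation do not end up in the same group.

Also, each fusion into a branch creates one group.

Moving up the tree from the bottom, the degree of correlation gradually weakens.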

reference:

http://research.stowers-institute.org/mcm/efg/R/Visualization/cor-cluster/index.htm

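Friday, May 17, 2013

The binomial distribution and its functions in R

Bernoulli distribution: a single trial has only two outcomes, success or failure. The probabilities of success and failure need not be equal. For example, for a single-choice question with 5 candidate answers, the probability of success is 0.2 and the probability of failure is 0.8. The outcome of the trial is denoted X, which is success or failure (1 or 0).

Binomial distribution: a Bernoulli trial is carried out n times, and the trials are iid. Note that the outcome X now denotes the number of successes in the n trials; X is no longer just success or failure, but depends on the number of trials.

So the binomial distribution is simply the Bernoulli trial repeated many times.

Binomial distribution: each trial has two outcomes, success or failure, and again their probabilities need not be equal. X denotes the number of successes in n trials, and its probability mass function is P(X = k) = C(n, k) p^k (1 - p)^(n - k).

R has a function that computes this directly:

dbinom(X,n,p)------for X~B(n,p);

This is just f(X|theta)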

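Here is an example: 12 single-choice questions, each with 5 options. Suppose a student picks answers at random.

1) The probability of getting exactly 4 questions right

P(X=4)=

dbinom(4,size=12,prob=0.2)

2) The probability of getting 4 or fewer questions right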

dbinom(0,12,0.2)+ dbinom(1,12,0.2)+dbinom(2,12,0.2)+…….+dbinom(4,12,0.2)

=sum( dbinom(0:4,12,0.2) )

=pbinom(4,size=12,prob=0.2)

3) The probability of getting more than 4 questions right

=1-pbinom(4,12,0.2)

=sum( dbinom(5:12,12,0.2) )

=pbinom( 4,12,0.2,lower.tail=FALSE )

4) Generate n random numbers, each drawn from the binomial distribution B(size,prob)

rbinom(n,size,prob)
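For readers without R at hand, the quiz example can be cross-checked with a few lines of Python. The functions below borrow the names dbinom and pbinom from R purely for readability; they are my own sketch built on math.comb (Python 3.8 or later), not a port of R's implementation, and lower_tail stands in for R's lower.tail argument.

```python
from math import comb


def dbinom(k, size, prob):
    """P(X = k) for X ~ B(size, prob)."""
    return comb(size, k) * prob**k * (1 - prob) ** (size - k)


def pbinom(k, size, prob, lower_tail=True):
    """P(X <= k), or P(X > k) when lower_tail is False."""
    cdf = sum(dbinom(i, size, prob) for i in range(k + 1))
    return cdf if lower_tail else 1 - cdf


p_exactly_4 = dbinom(4, 12, 0.2)                       # question 1
p_at_most_4 = pbinom(4, 12, 0.2)                       # question 2
p_more_than_4 = pbinom(4, 12, 0.2, lower_tail=False)   # question 3

print(round(p_exactly_4, 4), round(p_at_most_4, 4), round(p_more_than_4, 4))
```

The second and third values add up to 1, which is the identity behind the 1-pbinom(4,12,0.2) form used in question 3.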

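Monday, May 6, 2013

Issues a recommender system should consider

Issues a recommender system should consider:

I. How do we avoid the "repeated recommendation" problem?

If I have already bought an electronic product, the advertising system keeps recommending the same kind of thing. I have already bought it; would I really buy the same product again?

In fact, the characteristics of the product have to be taken into account:

For example, daily necessities such as rice, cooking oil and salt are consumed every day, so recommending them again is fine!

In addition:

Product attributes can be added:

》For example, the purchase cycle

》》The degree of repulsion among products of the same category {What is the degree of repulsion? It is an attribute of the product category. For example, once a user has bought a laptop, the laptop category has a repulsion degree of 0.8 for that user (repulsion describes how receptive the user is to products of this category after already owning one), which amounts to lowering the weight of this category when relevance is computed}

This should solve the phenomenon raised in the discussion, where the recommended items have already been bought. Old-fashioned telephone salespeople kept a record of when the product bought by each successfully sold customer would be used up.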
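One way the two attributes could enter a relevance score is sketched below. Everything in it is an assumption of mine for illustration: the names CategoryRule and adjusted_score, the choice to multiply the score by (1 - repulsion), the rule that the penalty ends once the purchase cycle has elapsed, and the numbers for the two categories.

```python
from dataclasses import dataclass


@dataclass
class CategoryRule:
    purchase_cycle_days: int  # hypothetical: typical days until a repurchase
    repulsion: float          # hypothetical: 0 = no suppression, 1 = full suppression


def adjusted_score(base_score, days_since_purchase, rule):
    """Lower the relevance of a category the user has bought from recently."""
    if days_since_purchase is None:
        return base_score  # never bought in this category
    if days_since_purchase >= rule.purchase_cycle_days:
        return base_score  # the purchase cycle has elapsed
    return base_score * (1 - rule.repulsion)


laptop = CategoryRule(purchase_cycle_days=1095, repulsion=0.8)
rice = CategoryRule(purchase_cycle_days=30, repulsion=0.0)

print(adjusted_score(1.0, 10, laptop))  # bought a laptop 10 days ago
print(adjusted_score(1.0, 10, rice))    # bought rice 10 days ago
```

A laptop bought ten days ago is pushed well down the ranking, while rice keeps its full score, which matches the distinction between durable goods and daily necessities made above.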

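II. External environment data must be considered

Recommendations based on BI are necessarily an analysis of users' past needs. The predicted result also has to factor in external environment data such as the season, the competitive landscape at the time, and special events; these things cannot be concluded from historical data.

They need to be predicted in advance; otherwise, by the time anomalies show up in the data, the cost of making operational adjustments is already high.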

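Thursday, October 18, 2012

Introduction to common python os.path functions

Traverse all the files under a directory and save every file found to a specified file.

import os
# dirpath is the directory to traverse; out is the open file that receives the result
# user-defined function
def listfileindir(dirpath, out):
    out.write(dirpath + '\n')
    filenum = 0
    for name in os.listdir(dirpath):
        filepath = os.path.join(dirpath, name)
        if os.path.isdir(filepath):
            # mark a subdirectory with a trailing backslash, then list its entries
            out.write(' ' + name + '\\' + '\n')
            for child in os.listdir(filepath):
                out.write(' ' + child + '\n')
                filenum = filenum + 1
        else:
            out.write(' ' + name + '\n')
            filenum = filenum + 1
    out.write('all the file num is' + str(filenum))
# input() prompts the user and returns what was typed (raw_input() in Python 2)
dirpath = input('please input the path:')
# call the user-defined function; the with block closes list.txt afterwards
with open('list.txt', 'w') as myfile:
    listfileindir(dirpath, myfile)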
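Note that listfileindir only looks one level below the starting directory. If the whole tree is wanted, os.walk from the standard library does the descent for you. The sketch below is an alternative of my own, not part of the original script: the name list_files_recursive and the throwaway directory layout are made up for the demonstration.

```python
import os
import tempfile


def list_files_recursive(root):
    """Return the paths of all files under root, relative to root, at any depth."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


# Build a small throwaway tree to try the function on.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.txt",
                os.path.join("a", "mid.txt"),
                os.path.join("a", "b", "deep.txt")):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("x")
    files = list_files_recursive(root)

print(len(files), files)
```

The file two levels down (a/b/deep.txt) is found here, whereas the one-level version would list the directory b as if it were a file.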
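-----------The listfileindir function above mainly relies on the os.path module, which is explained in detail below--------------

os.path is covered in the following three parts

[1] Basic operations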

basename('file path')

import os
os.path.basename('/Volumes/1.mp4')  # returns '1.mp4'

dirname('file path')

import os
os.path.dirname('/Volumes/1.mp4')  # returns '/Volumes'

splitdrive('file path')

import os
os.path.splitdrive('/Volumes/1.mp4')  # returns ('', '/Volumes/1.mp4') on POSIX systems

os.path.split('file path')

import os
os.path.split('/Volumes/1.mp4')  # returns ('/Volumes', '1.mp4')
os.path.split('/Volumes/text')  # returns ('/Volumes', 'text')

os.path.splitext('file path')

import os
fname, fextension = os.path.splitext('/Volumes/Leopard/Users/Caroline/Desktop/1.mp4')
print(fname, fextension)  # prints /Volumes/Leopard/Users/Caroline/Desktop/1 .mp4
os.path.splitext('/Volumes/Leopard/Users/Caroline/Desktop/1.mp4')[1:]  # returns ('.mp4',)

os.path.join('a','b','fname.extension')->'a/b/fname.extension'

import os
os.path.join('a','b','1.mp4')  # returns 'a/b/1.mp4'

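[2] Queries: each returns True or False

exists() whether the given path (file or directory) exists

isabs() whether the given path is an absolute path

isdir() whether the given path exists and is a directory

isfile() whether the given path exists and is a file

islink() whether the given path exists and is a symbolic link

ismount() whether the given path exists and is a mount point (the root of a mounted filesystem)

samefile() whether two path names refer to the same file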
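The query functions are easiest to understand by running them against paths whose state is known. The sketch below creates a temporary directory with one file in it; the file name note.txt and the dictionary layout are arbitrary choices for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "note.txt")
    with open(file_path, "w") as handle:
        handle.write("hello")
    missing = os.path.join(folder, "missing.txt")

    checks = {
        "exists(file)": os.path.exists(file_path),
        "exists(missing)": os.path.exists(missing),
        "isdir(folder)": os.path.isdir(folder),
        "isfile(folder)": os.path.isfile(folder),
        "isfile(file)": os.path.isfile(file_path),
        "islink(file)": os.path.islink(file_path),
        "isabs(file)": os.path.isabs(file_path),
        "isabs('note.txt')": os.path.isabs("note.txt"),
        # Two different spellings of the same file.
        "samefile": os.path.samefile(
            file_path, os.path.join(folder, ".", "note.txt")
        ),
    }

for name, value in checks.items():
    print(name, value)
```

isabs() only inspects the string, so it gives an answer even for paths that do not exist, while the other checks consult the filesystem.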

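[3] File information

getatime() returns the time of last access (seconds as a float)

getctime() returns the file creation time on Windows; on Unix it is the time of the last metadata change

getmtime() returns the time of the last modification

getsize() returns the file size (in bytes)

abspath() returns the absolute path

normpath() normalizes the path string, collapsing redundant separators and up-level references so that a//b, a/./b and a/foo/../b all become a/b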

import os
import time
file = '/Volumes/Leopard/Users/Caroline/Desktop/1.mp4'
os.path.getatime(file)  # returns the last access time, e.g. 1318921018.0
os.path.getctime(file)  # returns the creation time (metadata change time on Unix)
os.path.getmtime(file)  # returns the last modification time
time.gmtime(os.path.getmtime(file))  # returns the last modification time as a struct_time
os.path.getsize(file)  # returns the file size (in bytes)
os.path.abspath(file)  # returns the absolute path '/Volumes/Leopard/Users/Caroline/Desktop/1.mp4'
os.path.normpath(file)  # returns '/Volumes/Leopard/Users/Caroline/Desktop/1.mp4'
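The example path above is already in normal form, so normpath() returns it unchanged and its effect is hard to see. The sketch below feeds it a deliberately messy variant of the same path. It uses posixpath, the POSIX flavour of os.path, so that the result is the same on every platform; the messy input string is made up for the demonstration.

```python
import posixpath

messy = "/Volumes//Leopard/./Users/Caroline/../Caroline/Desktop/1.mp4"

# Doubled slashes, "." components and "name/.." pairs are collapsed.
clean = posixpath.normpath(messy)
print(clean)

# A relative path stays relative, and the trailing slash is dropped.
relative = posixpath.normpath("a/b/../c/")
print(relative)
```

normpath() works on the string alone and never touches the filesystem, so when symbolic links are involved the normalized path may refer to a different file than the original.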
