Thursday, July 24, 2014

Review of Beautiful Data.

Recently I read a book called "Beautiful Data" (The Stories Behind Elegant Data Solutions). It's a good book about data and visualization, going from basic to intermediate chapters, easy to read, and with examples in many chapters.

The first chapters are about how the world is constantly watched by many apps that track information, and about studies observing patterns in consumption, text (Twitter, Facebook statuses), and locations. That isn't new, but this perspective lets the authors explain new patterns and, more interestingly, show how they did it.

This isn't a new book; it was published in 2009, so some of the examples, like the one in chapter sixteen, are no longer available.
But the code it shows can still be applied to other datasets for exploratory analysis; I used the dataset about the Wikipedia campaigns vs. the amount obtained.

The book explains that R has many tools for this. Here are some examples:

url = 'http://samarium.wikimedia.org/campaign-vs-amount.csv'
wiki = read.delim(url, sep=",")  # read the comma-separated file into a data frame
plot(wiki)                       # scatter plot matrix of all variable pairs

The command above makes a scatter plot matrix:


This gives a first look at the data: nine variables. Here we can observe that some variables look like factors or dates, not numeric. The columns medium, campaign, stop_date, and start_date should be cast to the correct data type; it's always better to do it.

For this we can use as.factor(variable), as.numeric(variable), and as.Date(variable).
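A minimal sketch of the casting step; the date format "%Y-%m-%d" is an assumption about how the CSV encodes dates:

wiki$medium = as.factor(wiki$medium)
wiki$campaign = as.factor(wiki$campaign)
wiki$start_date = as.Date(wiki$start_date, format="%Y-%m-%d")  # assumed date format
wiki$stop_date = as.Date(wiki$stop_date, format="%Y-%m-%d")    # assumed date format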

hist(wiki$usdmax) # make a histogram 


However, we can see that this doesn't look right, so we should look for extreme values, or filter them out to see how the rest of the data is distributed.


index = which(wiki$usdmax < 800)  # keep only the rows below the extreme values
wikiless = wiki[index,]
hist(wikiless$usdmax)

Later we can see through which medium Wikipedia gets more donations:

plot(wiki$avg, wiki$medium, col=wiki$medium)    # (image 1)
plot(wiki$usdmax, wiki$medium, col=wiki$medium) # (image 2)
summary(wiki$medium)




(image 1)

(image 2)


In the summary we can see which medium had the most donations: via email and via sitenotice.

library(ggplot2)
qplot(wiki$sum, wiki$medium)                      # sum by medium
qplot(wiki$avg, wiki$medium, col=wiki$medium)     # average by medium, colored by medium
qplot(wiki$usdmax, wiki$medium, col=wiki$medium)  # maximum by medium
smoothScatter(log(wiki$count), log(wiki$avg))     # density scatter on log scales
smoothScatter(log(wiki$count), log(wiki$avg),
     colramp=colorRampPalette(c('white', 'blue')))
smoothScatter(log(wiki$count), log(wiki$usdmax),
     colramp=colorRampPalette(c('white', 'deeppink')))
These are simple plots made in an easy way with the ggplot2 library. Next, we can convert the dates to numeric and look at the correlations between the variables:
wiki$stop_date = as.numeric(wiki$stop_date)   # dates as numbers so cor() can use them
wiki$start_date = as.numeric(wiki$start_date)
wikinumeric = wiki[,-c(1,2)]                  # drop the first two (non-numeric) columns
cors = cor(wikinumeric, use='pair')           # pairwise-complete correlations
require(lattice)
levelplot(cors)
# image(cors, col=col.corrgram(7))  # this doesn't work now; use a built-in
# palette instead: rainbow, heat.colors, topo.colors, terrain.colors
image(cors, col = heat.colors(7))
axis(1, at=seq(0,1, length=nrow(cors)), labels=row.names(cors))
All these commands and scripts can be found in chapter seventeen of the book.
The data can be found here:
And the R code here:
https://github.com/j3nnn1/homework/blob/master/withR/vis/beautiful_data.R

Monday, August 26, 2013

Apriori algorithm with R

The apriori algorithm is used to discover association rules. And what are those?

Association rules are about discovering patterns in data, usually transactional data, like sales (each product in a purchase is an item) or temporal events (purchases in sequential order), and they can also be used on text (where each item would be a word).

So what is the trick behind it? The apriori algorithm mainly counts every time an item appears, and then calculates metrics like "confidence" and "support" in each iteration.

Here are a few association rule concepts.

Support: it shows the proportion of transactions in which an item appears, where:
X: the number of times the item appears in the dataset.
N: the total number of transactions.

S(X) = X / N

Confidence: it indicates how accurate a rule is. For a rule X → Y, it is the proportion of transactions containing X that also contain Y:

C(X → Y) = S(X ∪ Y) / S(X)
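A quick worked example with made-up numbers: suppose there are 5 transactions, bread appears in 3 of them, and {bread, milk} appears in 2. Then S(bread) = 3/5 = 0.6, S({bread, milk}) = 2/5 = 0.4, and C(bread → milk) = 0.4 / 0.6 ≈ 0.67.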

So, the transaction formats could be:

Single.
Taking the example of sales, in this format each line represents one product, so there can be several lines with different products that refer to the same transaction.
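A hypothetical sketch of this format, with made-up items; the first column is a transaction ID and the second is one item:

1,bread
1,milk
2,bread
2,butter
2,milk
3,beer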

Basket sparse sequential.

Each line represents a transaction, so you get a sparse format where the number of columns varies by row, instead of a CSV format with equal columns.
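A hypothetical sketch with made-up items; the first column is the transaction ID, as in the read.transactions call used further below:

1,bread,milk
2,bread,butter,milk
3,beer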


Basket.

Each line represents a transaction, but with equal columns (one column per product), so for a large product catalog this can be a nightmare if your machine doesn't have a lot of memory. This format is supported by SPSS (Clementine or Modeler).
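A hypothetical sketch with one flag column per made-up product; the tabular T/F layout is my assumption of the SPSS-style format:

bread,milk,butter,beer
T,T,F,F
T,T,T,F
F,F,F,T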





Well, first we need to install these packages: "arules", "arulesViz", and "arulesSequences".
The arules package reads the sparse basket and single formats; here I used the sparse basket format.

install.packages("arules");
install.packages("arulesViz");
install.packages("arulesSecuences");

We need to define the support and the confidence;
you can edit these values in the file arules.r:

support1 = c(0.2)   # a low support, because I want to see
                    # what happens at this level
support2 = c(0.7)   # a higher support
confidence = c(0.9) # confidence often should be over 0.8

tr = read.transactions("transacciones.basket",
                       sep=',',
                       cols=c(1),       # column 1 holds the transaction IDs
                       format="basket")
image(tr)     # heatmap-like view of the sparse item matrix
summary(tr)   # overview: transaction counts, density, most frequent items
The image plot is like a heatmap where we can see where a cluster is,
or which products are bought the most. If the product list is too big,
this is not useful. On the other hand, "summary" shows us an overview.

itemFrequencyPlot(tr, support=support1)  # bar plot of items with at least that support

The command above makes this graph:

And here we execute the apriori algorithm with the transaction data (tr) and the parameters we defined before:

rules = apriori(tr, parameter=list(supp=support1, conf=confidence))
inspect(rules)                                            # print the discovered rules
plot(rules, method="graph", control=list(type="items"))   # rule graph (arulesViz)
plot(rules, method="grouped")                             # grouped matrix view
