Saturday, November 8, 2014

Antonio Gulli and Machine Learning

These are some visual results from my previous work in Machine Learning


Use of Machine Learning to build Bing Autosuggest



Use of Machine Learning for integration with Facebook



Use of Machine Learning to build Bing News engine



Use of Machine Learning to build Ask.com Universal Results



Use of Machine Learning to build TheDailyBeast



Use of Machine Learning to detect image similarities (a.k.a. my Andy Warhol period)




Saturday, September 20, 2014

Hands on big data - Crash Course on Spark - Start a 6-node cluster - lesson 9

One easy way to start a cluster is to leverage the spark-ec2 scripts created by AMPLab and distributed with Spark


After creating the key pair for AWS, I created the cluster, but had to add -w 600 to extend the timeout
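A sketch of the launch step with spark-ec2 (the key pair, identity file, and cluster name are placeholders; -s 5 assumes five slaves plus one master to reach six nodes):

$ export AWS_ACCESS_KEY_ID=...          # your AWS credentials
$ export AWS_SECRET_ACCESS_KEY=...
$ cd spark/ec2
$ ./spark-ec2 -k my-keypair -i my-keypair.pem -s 5 -w 600 launch my-spark-cluster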


Deploy (about 30 minutes, including all the data copy)



Login
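Hedged sketch, using the same placeholder names as above:

$ ./spark-ec2 -k my-keypair -i my-keypair.pem login my-spark-cluster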


Run the interactive Scala shell, which will connect to the master
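On the master node, something along these lines (the master hostname is a placeholder; spark-ec2 installs Spark under /root/spark):

$ cd /root/spark
$ ./bin/spark-shell --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077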


Run commands
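For example, a quick sanity check that the cluster distributes the work:

// Distribute a large range and sum its doubled values.
val rdd = sc.parallelize(1 to 1000000)
rdd.map(_.toLong * 2).reduce(_ + _)   // expected: 1000001000000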



The instances as shown in the AWS console


Friday, September 19, 2014

Hands on big data - Crash Course on Spark - PageRank - lesson 8

Let's compute PageRank. Below you will find the definition.
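In the simplified form used below (damping factor 0.85), the rank of a page u is

PR(u) = 0.15 + 0.85 * Σ_{v → u} PR(v) / outDegree(v)

where the sum runs over all pages v that link to u.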

Assume we have a pair RDD of (url, neighbors) and one of (url, rank), where each rank is initialized to 1.0 (or uniformly at random). Then, for a fixed number of iterations (or until the ranks stop changing significantly between two consecutive iterations):


we join links and ranks, forming (url, (neighbors, rank)), and pass the result to a flatMap that emits, for each destination, a contribution equal to rank / numNeighbors. Then we reduceByKey on the destination using + as the reduction, and each new rank is computed as 0.15 + 0.85 * (sum of the contributions).
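A minimal sketch of this loop, runnable in the shell, assuming an edge-list input with one "source destination" pair per line (the file path and the iteration count are placeholders):

// Build the (url, neighbors) pair RDD from the edge list.
val lines = sc.textFile("web-Google.txt")       // placeholder path
val links = lines.filter(!_.startsWith("#"))    // skip header comments
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }.distinct().groupByKey().cache()

// Initialize every rank to 1.0.
var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to 10) {
  // (url, (neighbors, rank)) -> one contribution per destination.
  val contribs = links.join(ranks).values.flatMap {
    case (neighbors, rank) => neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // Sum the contributions per destination and apply the damping formula.
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.take(10).foreach(println)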


The problem is that each join needs a full shuffle over the network. One way to reduce this overhead is to partition the links with a HashPartitioner (in this case on 8 partitions) and persist them, so that the repeated joins avoid the shuffle.
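A sketch of the partitioned version, reusing the lines RDD from the previous snippet:

import org.apache.spark.HashPartitioner

// Partition the links once and keep them in memory, so every
// subsequent join with ranks reuses the same partitioning
// instead of shuffling the links again.
val links = lines.filter(!_.startsWith("#"))
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }.distinct().groupByKey()
  .partitionBy(new HashPartitioner(8))
  .persist()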

However, one can directly use the SparkPageRank example distributed with Spark


Here I take a toy dataset


And compute the PageRank
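With the bundled example, the whole run is a one-liner from the Spark home directory (the file name and the number of iterations are placeholders):

$ ./bin/run-example SparkPageRank web-Google.txt 10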


Other datasets are available here:

https://snap.stanford.edu/data/web-Google.html




Wednesday, September 17, 2014

Hands on big data - Crash Course on Spark - Cache & Master - lesson 6

Two very important aspects of Spark are the use of in-memory caching to avoid re-computations and the ability to connect to a master from the spark-shell

Caching
Caching is pretty simple: just call .persist() at the end of the desired computation. There are also options to persist a computation on disk, or to replicate it among multiple nodes. There is also the option to persist objects in in-memory shared file systems such as Tachyon
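A minimal sketch in the shell (the input file is a placeholder):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("bible.txt")   // placeholder input
  .flatMap(_.split(" "))
  .persist()                           // MEMORY_ONLY by default

words.count()   // first action computes and caches the RDD
words.count()   // second action is served from the cache

// The options mentioned above correspond to storage levels, e.g.:
//   StorageLevel.MEMORY_AND_DISK   spill to disk when memory is full
//   StorageLevel.MEMORY_ONLY_2    replicate each partition on two nodes
//   StorageLevel.OFF_HEAP         experimental Tachyon-backed storage in Spark 1.x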

Connect a master
There are two ways to connect to a master. The first is programmatic, via SparkConf:

import org.apache.spark.{SparkConf, SparkContext}
// master can be e.g. "local[8]" or "spark://host:7077" (placeholders)
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
The second is from the command line; this starts a shell connected to a local master with 8 cores:

$ ./bin/spark-shell --master local[8]


Tuesday, September 16, 2014

Hands on big data - Crash Course on Spark - Word count - lesson 5

Let's do a simple word count exercise. Spark will automatically parallelize the code


1. Load the Bible as a text file
2. Create a flatMap that maps every line to its words
3. Map the words to tuples (word, counter), where the counter simply starts at 1
4. Reduce by key, where the reduce operation is just +
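In the shell, the four steps become (the input path is a placeholder):

// 1. Load the text
val lines = sc.textFile("bible.txt")
// 2. flatMap: each line yields its words
val words = lines.flatMap(line => line.split(" "))
// 3. Map every word to a (word, 1) tuple
val pairs = words.map(word => (word, 1))
// 4. Reduce by key, with + as the reduction
val counts = pairs.reduceByKey(_ + _)

counts.take(10).foreach(println)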



Pretty simple, isn't it?