- use Scrapy to collect page-relationship information and build a PageRank dataset
- use Hadoop and the dataset collected by Scrapy to implement the PageRank algorithm
We use Scrapy to collect the PageRank dataset. The related code is located in the `scrapy/` directory.
- install scrapy first

  ```
  pip install scrapy
  ```

- run scrapy inside `scrapy/`

  ```
  cd scrapy
  scrapy crawl pagerank
  ```

- change start_urls and allowed_domains (optional)
  the default is start_urls=['https://github.com/']; we can change it with:

  ```
  scrapy crawl pagerank -a start='https://stackoverflow.com/'
  ```

  we can also restrict the crawl to a single domain:

  ```
  scrapy crawl pagerank -a start='https://stackoverflow.com/' -a domain='stackoverflow.com'
  ```

  Be careful: `domain` should match `start`; if you don't set `start`, it should match the default start_urls.
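The spider's source in `scrapy/` is not reproduced here, but as a rough sketch, this is how the `-a start` / `-a domain` arguments typically reach a Scrapy spider (the class body below is an assumption for illustration, not the repo's actual code):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor

class PageRankSpider(scrapy.Spider):
    name = 'pagerank'

    def __init__(self, start='https://github.com/', domain=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a start=... and -a domain=... arrive as constructor keyword arguments
        self.start_urls = [start]
        if domain:
            # OffsiteMiddleware drops requests outside allowed_domains
            self.allowed_domains = [domain]

    def parse(self, response):
        for link in LinkExtractor().extract_links(response):
            # record the edge response.url -> link.url for the transition data,
            # then keep crawling
            yield response.follow(link, callback=self.parse)
```

This is also why `domain` must match `start`: if they disagree, every outgoing request is filtered as off-site and the crawl stalls immediately.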
The data collected by the spider is stored in `keyvalue`, `transition` and `pr0`.

`keyvalue` records each URL and its key, separated by '\t':
```
0\thttps://github.com/\n
1\thttps://github.com/features\n
2\thttps://github.com/business\n
3\thttps://github.com/explore\n
...
```
`transition` records the relationships between pages:

```
0\t0,1,2,3,4,5,6,7,8,9\n
...
```
The page with id 0 points to the pages with ids 0,1,2,3,4,5,6,7,8,9; the source id and its target list are separated by '\t', and the target ids by ','.
`pr0` holds the initial PageRank value for each page:

```
0\t1\n
1\t1\n
2\t1\n
...
```
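With these layouts, a small single-machine script can replay the computation and later serve as a sanity check for the Hadoop output. This is only a sketch under assumptions: the damping factor `beta = 0.85` and the update rule `pr' = beta * (M . pr) + (1 - beta) * pr` are not taken from the Java code.

```python
# local_pagerank.py -- single-machine replay, assuming the tab/comma file
# layouts shown above; handy for sanity-checking the Hadoop output.
# beta = 0.85 and the update rule are assumptions, not read from the project.

def load_transition(path='transition'):
    graph = {}
    with open(path) as f:
        for line in f:
            src, _, targets = line.rstrip('\n').partition('\t')
            graph[src] = targets.split(',') if targets else []
    return graph

def load_pr(path='pr0'):
    with open(path) as f:
        return {page: float(value) for page, value in
                (line.rstrip('\n').split('\t') for line in f)}

def iterate(graph, pr, beta=0.85):
    """One update: pr' = beta * (transition . pr) + (1 - beta) * pr."""
    new_pr = dict.fromkeys(pr, 0.0)
    for src, targets in graph.items():
        for dst in targets:
            # each page passes its rank evenly to the pages it links to
            if dst in new_pr:
                new_pr[dst] += beta * pr.get(src, 0.0) / len(targets)
    for page in new_pr:
        new_pr[page] += (1 - beta) * pr[page]
    return new_pr

if __name__ == '__main__':
    graph, pr = load_transition(), load_pr()
    for _ in range(10):          # same iteration count passed to Hadoop below
        pr = iterate(graph, pr)
    for page, value in sorted(pr.items(), key=lambda kv: -kv[1])[:10]:
        print(f'{page}\t{value:.6f}')
```

Run it from the `scrapy/` directory after a crawl, since the default paths point at the files the spider writes there.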
Later, we will put `transition` and `pr0` into HDFS and use Hadoop to calculate PageRank.
We use Hadoop to implement the PageRank algorithm.
- compile and package the hadoop project with maven

  ```
  cd hadoop
  mvn package -DskipTests
  ```

- put `transition`, `pr0` and `keyvalue` produced by the spider into HDFS
  ```
  hdfs dfs -mkdir -p pagerank/transition/
  hdfs dfs -put ../scrapy/transition pagerank/transition/
  hdfs dfs -mkdir -p pagerank/pagerank/pr0
  hdfs dfs -put ../scrapy/pr0 pagerank/pagerank/pr0/
  hdfs dfs -mkdir -p pagerank/keyvalue/
  hdfs dfs -put ../scrapy/keyvalue pagerank/keyvalue/
  hdfs dfs -mkdir -p pagerank/cache
  ```

- deal with MySQL
  - create the MySQL database

    ```
    mysql -uroot -p
    create database PageRank;
    use PageRank;
    create table output(page_id VARCHAR(250), page_url VARCHAR(250), page_rank DOUBLE);
    GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
    FLUSH PRIVILEGES;
    ```

  - download the MySQL connector and put it into HDFS
    ```
    hdfs dfs -mkdir /mysql
    hdfs dfs -put mysql-connector-java-*.jar /mysql/
    ```

- run the hadoop project
  ```
  hadoop jar page-rank.jar Driver pagerank/transition pagerank/pagerank/pr pagerank/keyvalue pagerank/cache/unit 10
  ```

  - `Driver` is the entry class
  - `pagerank/transition` is the HDFS dir of `transition`
  - `pagerank/pagerank/pr` is the HDFS base dir of `pr0`
  - `pagerank/keyvalue` is the HDFS dir of `keyvalue`
  - `pagerank/cache/unit` is the dir for intermediate results
  - `10` is the number of iterations
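Once the job finishes, the final ranks should land in the MySQL `output` table. A quick way to inspect them from Python (a sketch; `pymysql` is just one client choice, and the credentials are the ones created above):

```python
# check_output.py -- peek at the top-ranked pages written to MySQL.
# Assumes the PageRank database and output table created above; pymysql
# (pip install pymysql) is one client option among many.
import pymysql

conn = pymysql.connect(host='localhost', user='root',
                       password='password', database='PageRank')
try:
    with conn.cursor() as cur:
        cur.execute('SELECT page_url, page_rank FROM output '
                    'ORDER BY page_rank DESC LIMIT 10')
        for url, rank in cur.fetchall():
            print(f'{rank:.6f}\t{url}')
finally:
    conn.close()
```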