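"2021 Anhui Province Big Data and Artificial Intelligence Application Competition": Big Data (Online Round), Undergraduate Group, Part 2: Big Data Preprocessing, Problem Review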

Author: Internet

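Undergraduate Group Problems, Part 2: Big Data Preprocessing

Problem Data

Note: all text files are encoded in UTF-8.

Data 1: calls.txt, call records (the rows printed under this heading are the userPhone.txt rows, i.e. phone ID, phone number, name; the call-record rows are printed under Data 3)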

7,18000696806,赵贺彪
8,15151889601,张倩
9,13269361119,王世昌
10,15032293356,张涛
11,17731088562,张阳
12,15338595369,李进全
13,15733218050,杜泽文
14,15614201525,任宗阳
15,15778423030,梁鹏
16,18641241020,郭美彤
17,15732648446,刘飞飞
18,13341109505,段光星
19,13560190665,唐会华
20,18301589432,杨力谋
21,13520404983,温海英
22,18332562075,朱尚宽
23,18620192711,刘能宗
24,13566666666,刘柳
25,13666666666,邓二
26,13799999999,菜中路


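Explanation of the sample data:
calls.txt, call records
Sample: 18620192711,15733218050,1506628174,1506628265,650000,810000
Fields:
caller phone number, receiver phone number, start timestamp, end timestamp (both Unix time in seconds), caller province code, receiver province code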
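
To make the field layout concrete, here is a minimal sketch of parsing one calls.txt line in plain Java. The class name `CallRecord` and its field names are illustrative assumptions, not part of the contest material; only the six-column layout comes from the description above.

```java
/** Hypothetical holder for one parsed calls.txt line (all names are illustrative). */
final class CallRecord {
    final String caller;
    final String receiver;
    final long startEpochSeconds;
    final long endEpochSeconds;
    final String callerProvinceCode;
    final String receiverProvinceCode;

    private CallRecord(String caller, String receiver, long start, long end,
                       String callerProvinceCode, String receiverProvinceCode) {
        this.caller = caller;
        this.receiver = receiver;
        this.startEpochSeconds = start;
        this.endEpochSeconds = end;
        this.callerProvinceCode = callerProvinceCode;
        this.receiverProvinceCode = receiverProvinceCode;
    }

    /** Parses one comma-separated line; throws IllegalArgumentException if it is malformed. */
    static CallRecord parse(String line) {
        String[] f = line.trim().split(",");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length + ": " + line);
        }
        return new CallRecord(f[0], f[1], Long.parseLong(f[2]), Long.parseLong(f[3]), f[4], f[5]);
    }

    /** Call duration in seconds: end timestamp minus start timestamp. */
    long durationSeconds() {
        return endEpochSeconds - startEpochSeconds;
    }

    public static void main(String[] args) {
        CallRecord r = parse("18620192711,15733218050,1506628174,1506628265,650000,810000");
        System.out.println(r.caller + " -> " + r.receiver + ", " + r.durationSeconds() + " s");
    }
}
```

Validating the field count up front makes a malformed line fail with a clear message instead of an `ArrayIndexOutOfBoundsException` deep inside a mapper.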

Data 2: location.txt, province code table

1,110000,北京市
2,120000,天津市
3,130000,河北省
4,140000,山西省
5,150000,内蒙古自治区
6,210000,辽宁省
7,220000,吉林省
8,230000,黑龙江省
9,310000,上海市
10,320000,江苏省
11,330000,浙江省
12,340000,安徽省
13,350000,福建省
14,360000,江西省
15,370000,山东省
16,410000,河南省
17,420000,湖北省
18,430000,湖南省
19,440000,广东省
20,450000,广西壮族自治区
21,460000,海南省
22,500000,重庆市
23,510000,四川省
24,520000,贵州省
25,530000,云南省
26,540000,西藏自治区
27,610000,陕西省
28,620000,甘肃省
29,630000,青海省
30,640000,宁夏回族自治区
31,650000,新疆维吾尔自治区
32,710000,台湾省
33,810000,香港特别行政区
34,820000,澳门特别行政区


Explanation of the sample data:
location.txt, province code table
Sample: 1,110000,北京市
Fields:
location ID, province code, province name

Data 3: userPhone.txt, phone-number-to-name table (the rows printed under this heading are the calls.txt call records)

18620192711,15733218050,1506628174,1506628265,650000,810000
18641241020,15733218050,1509757276,1509757464,330000,620000
15778423030,15614201525,1495290451,1495290923,370000,420000
13341109505,15151889601,1492661762,1492662200,330000,460000
13341109505,13666666666,1470111026,1470111396,360000,230000
15032293356,13799999999,1495937181,1495937360,500000,630000
15733218050,13341109505,1452601976,1452602401,620000,530000
13269361119,13269361119,1487640690,1487641023,450000,430000
13799999999,15338595369,1511928814,1511929111,540000,230000
15733218050,15778423030,1542457633,1542457678,450000,530000
13341109505,17731088562,1484364844,1484365342,460000,360000
18332562075,15778423030,1522426275,1522426473,140000,120000
13560190665,18301589432,1485648596,1485648859,620000,820000
15733218050,13520404983,1538992531,1538992605,130000,150000
15778423030,13566666666,1484008721,1484009210,810000,330000
13566666666,17731088562,1541812913,1541813214,220000,360000
15778423030,15733218050,1464198621,1464198803,630000,340000
15151889601,13341109505,1467441052,1467441538,640000,440000
18620192711,13666666666,1510997876,1510998253,450000,610000
13341109505,18641241020,1509074946,1509075201,710000,310000
17731088562,13341109505,1471571270,1471571706,430000,630000
13520404983,13560190665,1476626194,1476626683,500000,440000
15338595369,13341109505,1523996031,1523996059,420000,460000
15151889601,13341109505,1489658199,1489658394,330000,500000
13560190665,15338595369,1510890681,1510891129,410000,520000
15733218050,13566666666,1503498540,1503498726,420000,310000
17731088562,13560190665,1470571255,1470571708,540000,330000
15338595369,15614201525,1496767879,1496768364,520000,500000
17731088562,15778423030,1494602567,1494602784,500000,420000
15778423030,18641241020,1517445007,1517445358,450000,530000
13566666666,17731088562,1464697765,1464697894,360000,620000
15778423030,13799999999,1525543218,1525543493,500000,820000
13341109505,13520404983,1521861238,1521861421,500000,130000
13566666666,13560190665,1513918160,1513918538,340000,210000
15032293356,18620192711,1485688388,1485688537,540000,530000
13799999999,13341109505,1531196363,1531196438,230000,320000
15338595369,15151889601,1512125514,1512125978,540000,810000
18332562075,13560190665,1523311951,1523312239,650000,410000
15778423030,15032293356,1467953782,1467954054,810000,540000
15151889601,15733218050,1530848147,1530848231,310000,150000
13269361119,18301589432,1541271874,1541272273,310000,310000
15032293356,15338595369,1520833915,1520834201,450000,360000
15778423030,13269361119,1452817391,1452817596,820000,410000
13520404983,18332562075,1474563316,1474563593,710000,540000
18301589432,15778423030,1473596284,1473596528,620000,310000
15732648446,15151889601,1535584645,1535585117,530000,310000
18301589432,13269361119,1511910316,1511910341,340000,320000
13560190665,18641241020,1533379659,1533379717,120000,710000
15338595369,18332562075,1474152847,1474153092,330000,500000
13520404983,17731088562,1504907456,1504907617,820000,510000
15732648446,18301589432,1521692836,1521692977,220000,370000
15032293356,15614201525,1471445293,1471445756,360000,530000
18641241020,15778423030,1517192728,1517193050,210000,610000
17731088562,15733218050,1493420249,1493420555,370000,820000
18620192711,13799999999,1477952709,1477953088,310000,140000
13666666666,13799999999,1541066076,1541066541,230000,640000
13269361119,17731088562,1540060141,1540060511,150000,540000
18332562075,13799999999,1489772390,1489772817,540000,710000
13799999999,15732648446,1503882021,1503882332,530000,520000
13566666666,15614201525,1504983084,1504983241,820000,140000
18641241020,15032293356,1463447030,1463447080,330000,640000
18301589432,13566666666,1493646451,1493646796,310000,510000
15732648446,15032293356,1537185125,1537185619,430000,810000
15338595369,13341109505,1493411872,1493411891,370000,150000
15778423030,17731088562,1540631847,1540632271,320000,500000
13666666666,15614201525,1545200734,1545200959,360000,640000
15032293356,13799999999,1455000970,1455001084,460000,650000
18641241020,18620192711,1529968498,1529968626,410000,510000
17731088562,15732648446,1455361378,1455361505,440000,650000
18301589432,13666666666,1518564232,1518564421,210000,640000
15733218050,18620192711,1515672794,1515673149,360000,360000
13520404983,18620192711,1521620546,1521620913,820000,370000
18332562075,18641241020,1498131159,1498131300,820000,230000
13666666666,18301589432,1491354142,1491354544,220000,710000
18301589432,15614201525,1511731560,1511732015,810000,620000
13269361119,13666666666,1539065031,1539065096,810000,810000
15778423030,18641241020,1518364528,1518364995,130000,610000
15733218050,15032293356,1491974898,1491975316,340000,810000
13269361119,15733218050,1543514850,1543514946,410000,460000
13341109505,13666666666,1482223100,1482223577,220000,410000
15338595369,13341109505,1495958992,1495959292,330000,420000
13341109505,18641241020,1511010003,1511010292,540000,620000
18620192711,13269361119,1462453298,1462453559,320000,360000
13666666666,13799999999,1518047527,1518047967,640000,420000
13341109505,13666666666,1474872886,1474872907,360000,510000
13666666666,18641241020,1473575493,1473575663,150000,520000
15151889601,15732648446,1509418483,1509418891,510000,540000
13560190665,13520404983,1467696946,1467697103,150000,460000
13520404983,15614201525,1510958686,1510959064,320000,610000
15778423030,15614201525,1470012457,1470012660,210000,210000
15778423030,17731088562,1542680029,1542680382,630000,520000
18332562075,15338595369,1453896030,1453896522,640000,370000
15032293356,18620192711,1488286898,1488287248,530000,150000
18641241020,15733218050,1489804133,1489804185,150000,630000
15733218050,13666666666,1506782751,1506782854,220000,500000
13520404983,17731088562,1487421622,1487421784,230000,330000
15151889601,13269361119,1538113862,1538113902,370000,630000
15778423030,17731088562,1466691118,1466691412,540000,530000
15032293356,13520404983,1521151509,1521151701,520000,430000
15614201525,13666666666,1464083166,1464083352,330000,650000


Explanation of the sample data:
userPhone.txt, phone-number-to-name table
Sample: 26,13799999999,菜中路
Fields:
phone ID, phone number, name
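
Both lookup files (location.txt and userPhone.txt) use the same three-column layout, so one helper can load either into a map keyed by the second column. The sketch below is a plain-Java illustration that works on in-memory lines; the class name `LookupTables`, the method `fromLines` and the fallback string `"unknown"` are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns "id,key,value" lines into a key -> value map. */
final class LookupTables {

    /** Maps the second column to the third column; blank or short lines are skipped. */
    static Map<String, String> fromLines(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] f = line.trim().split(",");
            if (f.length < 3) {
                continue; // not a complete id,key,value row
            }
            table.put(f[1], f[2]);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> provinces =
                fromLines(Arrays.asList("1,110000,北京市", "33,810000,香港特别行政区"));
        Map<String, String> names =
                fromLines(Arrays.asList("25,13666666666,邓二", "26,13799999999,菜中路"));
        System.out.println(provinces.get("810000"));
        // A number missing from the table falls back to a default instead of null.
        System.out.println(names.getOrDefault("10000000000", "unknown"));
    }
}
```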

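Task 1: Telecom call data processing and analysis (10 points)

Convert the call-record table into a new format according to the requirements below.
Replace the caller and receiver phone numbers in the call-record table with names.
Convert the start time and end time to the format yyyy-MM-dd HH:mm:ss, for example 2017-03-29 10:58:12.
Compute the call duration in seconds: call duration = end time - start time.
Replace the caller and receiver province codes with province names.

1. Replace phone numbers with names.
2. Convert the call start and end timestamps to dates.
3. Compute the call duration in seconds.
4. Replace province codes with province names.
5. Sample of the final data:

邓二,张倩,2018-03-29 10:58:12,2018-03-29 10:58:42,30秒,黑龙江省,上海市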
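
Requirements 2 and 3 depend on treating the timestamps as Unix time in seconds, not milliseconds. The sketch below shows one way to do the conversion with `java.time`; the class name `CallTimeFormat` and the choice of the `Asia/Shanghai` time zone are assumptions, since the task statement does not name a time zone.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for the timestamp and duration requirements. */
final class CallTimeFormat {
    private static final DateTimeFormatter PATTERN =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats a Unix timestamp given in seconds as yyyy-MM-dd HH:mm:ss in the given zone. */
    static String format(long epochSeconds, ZoneId zone) {
        return PATTERN.withZone(zone).format(Instant.ofEpochSecond(epochSeconds));
    }

    /** Call duration in seconds. */
    static long durationSeconds(long startEpochSeconds, long endEpochSeconds) {
        return endEpochSeconds - startEpochSeconds;
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Asia/Shanghai"); // assumed zone, not stated in the task
        long start = 1506628174L;
        long end = 1506628265L;
        System.out.println(format(start, zone) + "," + format(end, zone) + ","
                + durationSeconds(start, end) + "秒");
    }
}
```

With the older `java.util.Date` API the same value has to be multiplied by 1000 first, because `new Date(long)` expects milliseconds.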

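Run screenshot: (image not included)

Output directory screenshot: (image not included)

Result file screenshot: (image not included)

Code (it must show how requirements 1 to 4 above are implemented):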

package com.BMC.MapReduce.ahjs.ShengSaiChuSai;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Hashtable;

public class Test01 {

    // Sample call record: 18620192711,15733218050,1506628174,1506628265,650000,810000
    // Map task
    public static class MyMapper extends Mapper<LongWritable, Text,Text, NullWritable>{

        // Lookup table: province code -> province name
        Hashtable<String ,String> location=null;

        // Lookup table: phone number -> name
        Hashtable<String ,String> userPhone=null;

        @Override
        protected void setup(Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
            // Get the file system
            FileSystem fs = FileSystem.get(context.getConfiguration());

            // Load location.txt (id,province code,province name) into a Hashtable.
            // The files are UTF-8, so the charset is given explicitly, and the
            // readers are closed by try-with-resources.
            location = new Hashtable<>();
            try (FSDataInputStream in = fs.open(new Path("/MRinput/input04_ahjs_file/location.txt"));
                 BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] split = line.split(",");
                    String id = split[1];
                    String name = split[2];
                    location.put(id, name);
                }
            }

            // Load userPhone.txt (id,phone number,name) into a Hashtable.
            userPhone = new Hashtable<>();
            try (FSDataInputStream in = fs.open(new Path("/MRinput/input04_ahjs_file/userPhone.txt"));
                 BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] split = line.split(",");
                    String phone = split[1];
                    String name = split[2];
                    userPhone.put(phone, name);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {

            /**
             * 1. Replace phone numbers with names.
             * 2. Convert the call start and end timestamps to dates.
             * 3. Compute the call duration in seconds.
             * 4. Replace province codes with province names.
             * 5. Sample of the final data:
             * 邓二,张倩,2018-03-29 10:58:12,2018-03-29 10:58:42,30秒,黑龙江省,上海市
             * Call-record input, sample line:
             *      18620192711,15733218050,1506628174,1506628265,650000,810000
             * Fields:
             * caller phone number, receiver phone number, start timestamp, end timestamp,
             * caller province code, receiver province code
             */
            // Split the line into fields
            String[] splits = value.toString().split(",");
            // Caller phone number
            String caller = splits[0];
            // Receiver phone number
            String receiver = splits[1];
            // Start timestamp (seconds)
            String startTime = splits[2];
            // End timestamp (seconds)
            String endTime = splits[3];
            // Caller province code
            String callerPID = splits[4];
            // Receiver province code
            String receiverPID = splits[5];

            // 1. Replace phone numbers with names ("姓名为空" means "name missing")
            String callerName = userPhone.getOrDefault(caller, "姓名为空");
            String receiverName = userPhone.getOrDefault(receiver, "姓名为空");

            // 2. Convert the start and end timestamps to dates
            // Formatter for yyyy-MM-dd HH:mm:ss
            SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            // The timestamps are in seconds, but Date expects milliseconds
            String startTimeFormat = simpleDateFormat.format(new Date(Long.parseLong(startTime) * 1000L));
            String endTimeFormat = simpleDateFormat.format(new Date(Long.parseLong(endTime) * 1000L));

            // 3. Call duration in seconds
            Long time = Long.parseLong(endTime) - Long.parseLong(startTime);

            // 4. Replace province codes with province names ("地址为空" means "address missing")
            String callerPName = location.getOrDefault(callerPID, "地址为空");
            String receiverPName = location.getOrDefault(receiverPID, "地址为空");


            // Emit the converted record (no trailing comma, matching the sample)
            context.write(
                    new Text(callerName+","
                            +receiverName+","
                            +startTimeFormat+","
                            +endTimeFormat+","
                            +time+"秒"+","
                            +callerPName+","
                            +receiverPName),NullWritable.get());

        }
    }



    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJobName("Test01");

        job.setJarByClass(Test01.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job,new Path("/MRinput/input04_ahjs_file/input_calls"));

        // Delete the output path recursively if it already exists
        FileSystem fileSystem = FileSystem.get(job.getConfiguration());
        if (fileSystem.exists(new Path("/MRoutput/output_anjs_test01"))){
            fileSystem.delete(new Path("/MRoutput/output_anjs_test01"), true);
        }

        FileOutputFormat.setOutputPath(job,new Path("/MRoutput/output_anjs_test01"));

        job.waitForCompletion(true);

    }
}

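Task 2: Use MapReduce to compute, for every phone number in calls.txt, the outgoing call duration, outgoing call count, incoming call duration and incoming call count. Output format: phone number, outgoing call duration, outgoing call count, incoming call duration, incoming call count. (12 points)

Sample of the output format: append the unit 秒 (seconds) to each duration and the unit 次 (times) to each count.
13269361119,65秒,5次,864秒,5次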
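
Before moving to MapReduce, the aggregation can be checked on a few lines in plain Java. The sketch below is an illustration only: the class name `CallStats` and the `long[4]` layout are assumptions, and it processes an in-memory list instead of HDFS input. Each record contributes its duration once to the caller's outgoing totals and once to the receiver's incoming totals.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory version of the per-number call statistics. */
final class CallStats {

    /** Returns phone -> {outgoing seconds, outgoing count, incoming seconds, incoming count}. */
    static Map<String, long[]> aggregate(List<String> lines) {
        Map<String, long[]> stats = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split(",");
            if (f.length != 6) {
                continue; // skip blank or malformed lines
            }
            long seconds = Long.parseLong(f[3]) - Long.parseLong(f[2]);
            long[] caller = stats.computeIfAbsent(f[0], k -> new long[4]);
            caller[0] += seconds; // durations are summed, not overwritten
            caller[1]++;
            long[] receiver = stats.computeIfAbsent(f[1], k -> new long[4]);
            receiver[2] += seconds;
            receiver[3]++;
        }
        return stats;
    }

    /** Formats one row with the units required by the task: 秒 (seconds) and 次 (times). */
    static String format(String phone, long[] s) {
        return phone + "," + s[0] + "秒," + s[1] + "次," + s[2] + "秒," + s[3] + "次";
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "18620192711,15733218050,1506628174,1506628265,650000,810000",
                "18641241020,15733218050,1509757276,1509757464,330000,620000");
        for (Map.Entry<String, long[]> e : aggregate(lines).entrySet()) {
            System.out.println(format(e.getKey(), e.getValue()));
        }
    }
}
```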

Result screenshots:

Run screenshot: (image not included)

Output directory screenshot: (image not included)

Result file screenshot: (image not included)

Task code:

package com.BMC.MapReduce.ahjs.ShengSaiChuSai;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class Test02 {

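    /**
     * Sample call record: 18620192711,15733218050,1506628174,1506628265,650000,810000
     * For every phone number in calls.txt, compute the outgoing call duration, outgoing call count,
     * incoming call duration and incoming call count.
     * Output format: phone number, outgoing duration, outgoing count, incoming duration, incoming count
     */
    // Map task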
    public static  class  MyMapper extends Mapper<LongWritable, Text,Text,Text>{
        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
            String[] split = value.toString().split(",");
            String caller = split[0];
            String receiver = split[1];
            String startTime = split[2];
            String endTime = split[3];

            // Call duration in seconds
            long time =  Long.parseLong(endTime)-Long.parseLong(startTime) ;

            // Outgoing side: key = caller number, value = "duration,call"
            context.write(new Text(caller),new Text(time+","+"call"));
            // Incoming side: key = receiver number, value = "duration,receiver"
            context.write(new Text(receiver),new Text(time+","+"receiver"));

        }
    }

    //Reduce Job
    public static class MyReducer extends Reducer<Text,Text,Text,Text>{
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
            // Totals: outgoing duration, outgoing count, incoming duration, incoming count
            long callTime=0;
            int callCount=0;
            long receiverTime=0;
            int receiverCount=0;
            for (Text value : values) {
                String[] split = value.toString().split(",");
                if ("call".equals(split[1])){
                    // Accumulate outgoing duration and count
                    callTime += Long.parseLong(split[0]);
                    callCount++;
                }else if ("receiver".equals(split[1])){
                    // Accumulate incoming duration and count
                    receiverTime += Long.parseLong(split[0]);
                    receiverCount++;
                }
            }
            // Append the units required by the task: 秒 (seconds) and 次 (times)
            context.write(key,new Text(
                    callTime+"秒,"+callCount+"次,"+receiverTime+"秒,"+receiverCount+"次"
            ));
        }

    }
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // Use "," as the separator between the output key and value
        conf.set("mapred.textoutputformat.separator",",");

        Job job = Job.getInstance(conf);

        job.setJobName("Test02");
        job.setJarByClass(Test02.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job,new Path("/MRinput/input04_ahjs_file/input_calls"));
        // Delete the output path recursively if it already exists
        FileSystem fileSystem = FileSystem.get(job.getConfiguration());
        if (fileSystem.exists(new Path("/MRoutput/output_anjs_test02"))){
            fileSystem.delete(new Path("/MRoutput/output_anjs_test02"), true);
        }
        FileOutputFormat.setOutputPath(job,new Path("/MRoutput/output_anjs_test02"));

        job.waitForCompletion(true);
    }

}

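Task 3: Use MapReduce to find, among the receiver provinces in calls.txt, the three records with the highest number of received calls (8 points). Output format: province, receiver number, number of received calls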
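
The task statement can be read in more than one way. The sketch below assumes the reading "for each receiver province, list up to three receiver numbers with the most received calls", and works on in-memory lines in plain Java. The class name `TopReceivers`, the tie-break by phone number and the use of province codes instead of names are all assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical in-memory version of the per-province top-N ranking. */
final class TopReceivers {

    /** Returns province code -> up to n rows "province,receiver,count", highest count first. */
    static Map<String, List<String>> topPerProvince(List<String> lines, int n) {
        // province code -> (receiver number -> number of received calls)
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split(",");
            if (f.length != 6) {
                continue; // skip blank or malformed lines
            }
            counts.computeIfAbsent(f[5], k -> new TreeMap<>()).merge(f[1], 1, Integer::sum);
        }

        // Higher count first; ties are broken by phone number so the order is deterministic.
        Comparator<Map.Entry<String, Integer>> byCountDesc = (a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };

        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            List<String> rows = e.getValue().entrySet().stream()
                    .sorted(byCountDesc)
                    .limit(n)
                    .map(x -> e.getKey() + "," + x.getKey() + "," + x.getValue())
                    .collect(Collectors.toList());
            result.put(e.getKey(), rows);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "18620192711,15733218050,1506628174,1506628265,650000,810000",
                "15338595369,15151889601,1512125514,1512125978,540000,810000",
                "15733218050,15032293356,1491974898,1491975316,340000,810000");
        topPerProvince(lines, 3).values().forEach(rows -> rows.forEach(System.out::println));
    }
}
```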

Result screenshots:

Run screenshot: (image not included)

Output directory screenshot: (image not included)

Result file screenshot: (image not included)

Full code:

package com.BMC.MapReduce.ahjs.ShengSaiChuSai;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.*;


public class Test03 {
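    /**
     * Find, among the receiver provinces in calls.txt, the three records with the most received calls.
     * Output format: province, receiver number, number of received calls
     * Sample call record: 18620192711,15733218050,1506628174,1506628265,650000,810000
     */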
    public static class MyMapper extends Mapper<LongWritable, Text,Text,Text>{
        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {

            String[] split = value.toString().split(",");

            String receiver = split[1];
            String receiverPID = split[5];

           // context.write(new Text(receiverPID),new Text(receiver+","+1));
            context.write(new Text(receiverPID), new Text(receiver));
        }
    }



    public static class MyReducer extends Reducer<Text,Text,Text,IntWritable>{
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {

            String provinceId = key.toString();

            // Count received calls per receiver number, then keep the top three
            // Hashtable: receiver number -> number of received calls
            Hashtable<String, Integer> ht = new Hashtable<String, Integer>();

            // Walk the values and build the counts
            for (Text value : values) {
                String receiver = value.toString();
                Integer receiverCount= ht.getOrDefault(receiver, 0);
                receiverCount++;
                ht.put(receiver, receiverCount);
            }

            // Copy the entries into a list so they can be sorted
            ArrayList<Map.Entry<String, Integer>> list = new ArrayList<>(ht.entrySet());

            // Sort by count, highest first
            Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
                // Descending order by count
                @Override
                public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                    return o2.getValue().compareTo(o1.getValue());
                }
            });

            // Emit the first three entries
            int count = 1;
            for (Map.Entry<String, Integer> mapping : list) {
                if (count <= 3) {
                    String receiver = mapping.getKey();
                    Integer receiverCount= mapping.getValue();
                    context.write(new Text(provinceId + "," + receiver), new IntWritable(receiverCount));
                }
                count ++;
            }

        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        conf.set("mapred.textoutputformat.separator",",");

        Job job = Job.getInstance(conf);
        job.setJobName("Test03");
        job.setJarByClass(Test03.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        // The mapper emits Text values
        job.setMapOutputValueClass(Text.class);


        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        // The reducer emits IntWritable counts
        job.setOutputValueClass(IntWritable.class);

        // Set the input and output paths
        FileInputFormat.addInputPath(job,new Path("/MRinput/input04_ahjs_file/input_calls"));

        FileSystem fs = FileSystem.get(job.getConfiguration());
        if (fs.exists(new Path("/MRoutput/output_anjs_test03"))){
            fs.delete(new Path("/MRoutput/output_anjs_test03"),true);
        }
        FileOutputFormat.setOutputPath(job,new Path("/MRoutput/output_anjs_test03"));

        job.waitForCompletion(true);

    }
}


Source: https://blog.csdn.net/dazuo_001/article/details/120487113