Monash FIT1043 Assignment 2: Coursework Walkthrough

2019-09-19 10:02:37 Author: Internet

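Task:

Use Python to investigate a Tropical Atmosphere Ocean dataset.

Analysis:

Task A: Analyse the dataset: check its number of rows and columns, find the minimum and maximum of each column, list the number of records per month, remove missing values, visualise sea surface temperature across months, explore the precipitation measurements, and investigate daily relative humidity. After reading in the dataset, printing its info and shape gives the basic information. Then compute the required minimum, maximum (min and max) or mean for the relevant columns, and use plot to draw the corresponding figures.

Task B: Find a dataset of your own and run k-means clustering on it with at least two clusters (k >= 2, as the assignment text states), using a different colour for each cluster in the result.

Topics covered:

Python data analysis, mathematical statistics, k-means clustering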

FIT1043 Assignment 2: Description

Due date: Friday 27th September 2019, 11:55pm
Aim
The aim of this assignment is to investigate and visualise data using Python as a data science tool.
It will test your ability to:

• read a data file in Python and extract related data from it;
• use various graphical and non-graphical tools to perform data exploration, data wrangling
and data analysis;
• use basic tools for managing and processing data; and
• communicate your findings in your report.
Data
The dataset we will use comes from the TAO (Tropical Atmosphere Ocean) project, by the
Pacific Marine Environmental Lab of the U.S. National Oceanic and Atmospheric Administration.
This monitors the atmosphere in the tropical Pacific Ocean.
• The Tropical Atmosphere Ocean dataset we chose (TAO_2006.csv file) contains
atmosphere data from a specific monitoring site: (2°N, 165°E).
• We chose to investigate environment data from January until September 2006, where the
measurements were taken every 10 minutes.
• The dataset contains information about Timestamp, date (YYYYMMDD) and time
(HHMMSS) of measurements, Precipitation (PREC), Air Temperature (AIRT), Sea Surface
Temperature (SST), Relative Humidity (RH), and the Quality (Q) of measurements.

• The file is available on Moodle and is publicly available from pmel.noaa.gov.
Hand-in Requirements
Please hand in a PDF file [1] containing your answers and a Jupyter notebook file (.ipynb)
containing your Python code to all the questions respectively:
● PDF file should contain:

Answers to the questions. In order to justify your answers to all the questions, make
sure to
a. Include screenshots/images of the graphs you generate (You will need to use
screen-capture functionality to create appropriate images.)
b. Include copy/paste of your Python code (not images of your code but the actual
text).
● Ipynb file should contain:
A copy of your working Python code to answer the questions.
● You will need to submit two separate files (the PDF file and the ipynb file). Zip, rar or any
other similar file compression format are not acceptable and will have a penalty of 10%.
[1] You can use Word or other word processing software to format your submission. Just save the final
copy to a PDF before submitting.
Supportive Material/Code:
• Material: In order to complete your assignment, you may want to use regressiondemo.py
code used in week 5 tutorial. If you use this code, you do not need to upload the
regressiondemo.py file in your final submission.
• Code: If “YYYYMMDD” is in datetime format, you can extract month and day from it using
method .dt and create a new column for month and day as follows:

your_dataframe['Month'] = your_dataframe['YYYYMMDD'].dt.month
your_dataframe['Day'] = your_dataframe['YYYYMMDD'].dt.day
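
The hint above only works once the column really is in datetime format. A minimal sketch of that conversion follows, assuming the CSV stores YYYYMMDD as integers such as 20060101 (the storage format is an assumption, and the three rows are made up for illustration):

```python
import pandas as pd

# Made-up stand-in for a few rows of TAO_2006.csv; the integer encoding
# of YYYYMMDD is an assumption about how the file stores dates.
df = pd.DataFrame({"YYYYMMDD": [20060101, 20060215, 20060902]})

# Parse the integers into datetime64 values so the .dt accessor is available.
df["YYYYMMDD"] = pd.to_datetime(df["YYYYMMDD"].astype(str), format="%Y%m%d")

# Same extraction as in the hint above.
df["Month"] = df["YYYYMMDD"].dt.month
df["Day"] = df["YYYYMMDD"].dt.day
print(df)
```

If the column is read as text instead, the `.astype(str)` step is harmless and the same `format` string applies.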
Python Availability
You will need to use Python to complete the assignment. You can do this by either:

• running a Jupyter Notebook on a computer in the labs; or
• installing Python (we recommend Anaconda) on your own machine.
Assignment Tasks:
There are two tasks that you need to complete for this assignment. Students that complete only
Tasks A1-A9 can only get a maximum of Distinction. Students that attempt tasks A10 and B
will be showing critical analysis skills and a deeper understanding of the task at hand and can
achieve the highest grade. You need to use Python to complete the tasks.
Task A: Data Wrangling and Analysis on TAO dataset
In this task, you are required to explore the dataset and do some data analysis on the Tropical
Atmosphere Ocean dataset. Have a look at the csv file (TAO_2006.csv) and then answer a series
of questions about the data using Python.
A1. Dataset size
How many rows and columns exist in this dataset?
A2. Min/Max values in each column
Find maximum and minimum values for Precipitation (PREC), Air temperature (AIRT), Sea surface
temperature (SST) and Relative humidity (RH) in this dataset.
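
One possible way to approach A1 and A2 is sketched below. The small frame is made-up data standing in for the real file, and the column names PREC, AIRT, SST and RH are assumed to match the CSV headers:

```python
import pandas as pd

# Made-up values; in the assignment this frame would come from
# pd.read_csv("TAO_2006.csv").
df = pd.DataFrame({
    "PREC": [0.0, 1.2, 0.4],
    "AIRT": [27.1, 28.3, 26.9],
    "SST": [29.0, 29.4, 28.8],
    "RH": [78.0, 85.5, 81.2],
})

# A1: shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# A2: one table holding the minimum and maximum of each measurement column.
extremes = df[["PREC", "AIRT", "SST", "RH"]].agg(["min", "max"])
print(extremes)
```

On the real file, expect the missing-value codes (-9.99 and -99.9) to appear as minima until they are removed in A4.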
A3. Number of records in each month
List the number of records in each month. In which two months are the number of records at their
lowest? Why?
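
Counting records per month could look like the sketch below. The six dates are made up, and the Month column is derived the same way as in the supportive code earlier in the brief:

```python
import pandas as pd

# Made-up dates standing in for the parsed YYYYMMDD column.
dates = pd.to_datetime([
    "2006-01-01", "2006-01-02",
    "2006-02-01",
    "2006-03-05", "2006-03-06", "2006-03-07",
])
df = pd.DataFrame({"YYYYMMDD": dates})
df["Month"] = df["YYYYMMDD"].dt.month

# Number of rows per month, ordered by month number.
records_per_month = df["Month"].value_counts().sort_index()
print(records_per_month)

# The two months with the fewest records.
two_lowest = records_per_month.nsmallest(2)
print(two_lowest)
```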
A4. Missing values
There are some missing values: -9.990000 and -99.900000 represent missing values.

• How many rows contain missing values (-9.990000 or -99.900000) in this dataset?
• List the months with no missing values in them.
• Remove the records with missing values.
Note: Use the dataset with missing values removed from here onwards.
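
The three A4 steps can be sketched as follows, on made-up rows. The sketch assumes the missing-value codes only occur in the four measurement columns and that a Month column has already been derived:

```python
import pandas as pd

MISSING_CODES = [-9.99, -99.9]  # codes named in the task text

# Made-up rows; Month is assumed to have been extracted already.
df = pd.DataFrame({
    "Month": [1, 1, 2, 2, 3],
    "PREC": [0.0, -9.99, 0.3, 0.1, 0.0],
    "AIRT": [27.0, 27.5, 28.0, 27.2, -99.9],
    "SST": [29.0, 29.1, 29.3, 29.2, 29.4],
    "RH": [80.0, 82.0, 79.0, 81.0, -99.9],
})
measure_cols = ["PREC", "AIRT", "SST", "RH"]

# True for every row where at least one measurement holds a missing code.
row_has_missing = df[measure_cols].isin(MISSING_CODES).any(axis=1)
n_missing_rows = int(row_has_missing.sum())

# Months that never appear among the rows with missing values.
months_with_missing = set(df.loc[row_has_missing, "Month"])
clean_months = sorted(set(df["Month"]) - months_with_missing)

# Keep only complete rows for the later tasks.
df_clean = df.loc[~row_has_missing].reset_index(drop=True)
print(n_missing_rows, clean_months, len(df_clean))
```

`isin` compares floats exactly. If the parsed values turn out to differ slightly from the codes, a tolerance-based check such as `numpy.isclose` may be a safer choice.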
A5. Investigating Sea surface temperature (SST) in different months
Now look at the sea surface temperature (SST) column and answer the following questions
• Using a boxplot, visualize the distribution of SST over different months.
• Describe the trend of median SST over different months.
• Which month has the highest median SST? Which month has the lowest?
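
A rough sketch of the A5 boxplot and the median comparison is shown below. The SST values are made up and only three months are included, so the resulting trend says nothing about the real data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up SST readings for three months.
df = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "SST": [29.0, 29.2, 29.1, 29.4, 29.6, 29.5, 28.8, 28.9, 29.0],
})

# Median SST per month, used to describe the trend and find the extremes.
median_sst = df.groupby("Month")["SST"].median()
highest_month = median_sst.idxmax()
lowest_month = median_sst.idxmin()

# One box per month; groupby iterates in ascending month order.
groups = [values.to_numpy() for _, values in df.groupby("Month")["SST"]]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels([str(month) for month in median_sst.index])
ax.set_xlabel("Month")
ax.set_ylabel("SST")
ax.set_title("SST distribution by month (made-up data)")
```

In a Jupyter notebook the figure renders inline. `fig.savefig("sst_boxplot.png")` is one way to get an image for the PDF report (the file name is only an example).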
A6. Exploring precipitation measurements (PREC)
Now look at the Precipitation column and answer the following questions
• Precipitation values in this dataset show rain rates. Plot Precipitation values over different
timestamps.
• Due to measurement error, there are some counter-intuitive values in Precipitation
column. Identify those values and replace them with zero.
Note: Use the dataset from the previous task (Task A6) to complete Tasks A7-A9.
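
The clean-up step in A6 might be sketched as below. It assumes that the "counter-intuitive" values are negative rain rates, which is an interpretation rather than something the brief states, and it uses made-up timestamps and values:

```python
import pandas as pd

# Made-up 10-minute readings; negative rain rates are treated as the
# counter-intuitive values (an assumption).
df = pd.DataFrame({
    "Timestamp": pd.date_range("2006-01-01", periods=5, freq="10min"),
    "PREC": [0.0, 1.5, -0.3, 0.2, -1.1],
})

# Identify the suspicious rows, then overwrite them with zero.
negative = df["PREC"] < 0
n_replaced = int(negative.sum())
df.loc[negative, "PREC"] = 0.0
print(n_replaced, df["PREC"].tolist())
```

Plotting the series before and after the replacement, for example with `df.plot(x="Timestamp", y="PREC")`, makes the effect visible.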
A7. Relationship between variables
• Compute the pairwise correlation of the columns precipitation, air temperature and sea surface
temperature. Which two features have the least linear association?
• Now let’s look at the relationship between air temperature and relative humidity. Plot the
values of these features against each other. Is there any relationship between these two
features? Describe it.
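
The correlation part of A7 can be sketched as follows. The data are synthetic and deliberately built so that AIRT and SST move together while PREC is independent, so the outcome illustrates the mechanics only:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: SST is tied to AIRT, PREC is unrelated to both.
rng = np.random.default_rng(0)
n = 200
airt = rng.normal(27.5, 1.0, n)
df = pd.DataFrame({
    "PREC": rng.exponential(0.5, n),
    "AIRT": airt,
    "SST": airt + 1.5 + rng.normal(0.0, 0.3, n),
})

# Pairwise Pearson correlation matrix.
corr = df.corr()
print(corr)

# Least linear association = smallest absolute correlation over all pairs.
strength = {(a, b): abs(corr.loc[a, b]) for a, b in combinations(corr.columns, 2)}
weakest_pair = min(strength, key=strength.get)
print(weakest_pair)
```

Comparing absolute values matters here: a correlation of -0.8 is a stronger linear association than one of 0.1.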
A8. Predicting quality of measurements (Q)
We now want to build a predictive model to predict the quality of measurements (Q) in the dataset
based on four features: Precipitation (PREC), Air temperature (AIRT), Sea surface temperature
(SST) and Relative humidity (RH).
• Divide the dataset into a 75% training set and a 25% testing set and train a decision tree
model.
• Using test set, compute the confusion matrix and accuracy.
• Considering accuracy only, do you think that this is a good model? What other metric(s)
should we consider as well? Why? Elaborate your answer.
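
A minimal sketch of the A8 workflow with scikit-learn follows. The features are synthetic, and the Q labels (codes 1 and 2, with class 2 making up 10% of rows) are hypothetical and not taken from the real file:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features; the Q codes below are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "PREC": rng.exponential(0.5, n),
    "AIRT": rng.normal(27.5, 1.0, n),
    "SST": rng.normal(29.0, 0.5, n),
    "RH": rng.normal(80.0, 5.0, n),
})
# Imbalanced label: the top 10% of RH readings get code 2, the rest code 1.
df["Q"] = np.where(df["RH"] > df["RH"].quantile(0.9), 2, 1)

X = df[["PREC", "AIRT", "SST", "RH"]]
y = df["Q"]

# 75% / 25% split; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)    # rows: true class, columns: predicted
acc = accuracy_score(y_test, y_pred)
per_class_recall = recall_score(y_test, y_pred, average=None)
print(cm, acc, per_class_recall)
```

With imbalanced classes, a model that always predicts the majority class already reaches high accuracy, which is why per-class recall is printed alongside it here.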
A9. Investigating daily relative humidity (RH)
We will now investigate the trend in the daily relative humidity over time. For this, you will need to
aggregate the median relative humidity by day.
• Fit a linear regression using Python to this data (i.e., relative humidity over different days)
and plot the linear fit.
• Use the linear fit to predict median relative humidity on 2nd September 2006.
• Can you think of a better model that fits all of the aggregated data to capture the trend in
relative humidity over time? Describe the model you suggested and explain why it is
better suited for this task.
• Use your new model to predict median relative humidity on 2nd September 2006 and
compare with the prediction of your previous linear fit.
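
The aggregation and linear fit in A9 could be sketched like this. The readings are synthetic (30 days, six per day, with an artificial upward drift), and `numpy.polyfit` is used here in place of whatever the week 5 regressiondemo.py provides:

```python
import numpy as np
import pandas as pd

# Synthetic readings: six per day for 30 days with an artificial drift.
rng = np.random.default_rng(0)
days = pd.date_range("2006-01-01", periods=30, freq="D").repeat(6)
day_number = (days - days[0]).days.to_numpy()
rh = 75.0 + 0.2 * day_number + rng.normal(0.0, 2.0, len(days))
df = pd.DataFrame({"YYYYMMDD": days, "RH": rh})

# Aggregate: median relative humidity per day.
daily = df.groupby("YYYYMMDD")["RH"].median()

# Regress the daily median on the number of days since the first record.
x = (daily.index - daily.index[0]).days.to_numpy()
slope, intercept = np.polyfit(x, daily.to_numpy(), deg=1)

# Evaluate the fitted line on the target date.
target = pd.Timestamp("2006-09-02")
x_target = (target - daily.index[0]).days
prediction = slope * x_target + intercept
print(slope, intercept, prediction)
```

On this synthetic drift the line extrapolates to a value above 100%, which relative humidity cannot reach. That is the kind of limitation the "better model" question is probing.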
A10. Filling in missing values
Rather than removing the missing values in task A4, fill in the missing values (for the RH column
only) using an appropriate regression model.
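
One way to read A10 is regression imputation: fit a model on the rows where RH is present and predict RH where it is missing. The sketch below uses synthetic data and makes two assumptions: AIRT and SST are chosen as predictors, and they are not themselves missing in the rows being filled.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data in which RH depends linearly on AIRT (an artificial relation).
rng = np.random.default_rng(0)
n = 200
airt = rng.normal(27.5, 1.0, n)
sst = rng.normal(29.0, 0.5, n)
rh = 160.0 - 3.0 * airt + rng.normal(0.0, 1.0, n)
df = pd.DataFrame({"AIRT": airt, "SST": sst, "RH": rh})

# Mark every tenth RH reading as missing using the dataset's code.
df.loc[::10, "RH"] = -99.9

missing = df["RH"] == -99.9
features = ["AIRT", "SST"]  # assumed predictors

# Train on complete rows only, then fill the gaps with predictions.
model = LinearRegression().fit(df.loc[~missing, features], df.loc[~missing, "RH"])
df.loc[missing, "RH"] = model.predict(df.loc[missing, features])
print(int(missing.sum()), df["RH"].min(), df["RH"].max())
```

On the real file, rows where a predictor is also missing would need separate handling before this step.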
Task B: K-means Clustering on Other Data
We demonstrated the k-means clustering algorithm in week 7. Your task in this part is to find an
interesting dataset and apply k-means clustering to it using Python. Kaggle, a private
company which runs data science competitions, provides a list of their publicly available datasets:
https://www.kaggle.com/datasets
In particular you need to:
• choose two numerical features in your dataset and apply k-means clustering on your data
into k clusters in Python, where k>=2.
• visualise the data as well as the results of the k-means clustering. Ideally each cluster is
shown in a different colour.
• describe your findings about the identified clusters.
• investigate/suggest some appropriate measures to evaluate the quality of your clusters.
You can search online for this task.
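
The steps listed above can be sketched with scikit-learn as follows. The two features are synthetic blobs rather than a real dataset, k = 3 is an arbitrary choice, and the silhouette score is only one possible quality measure:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic two-feature data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in centres])

# Fit k-means with k = 3 (any k >= 2 is allowed by the task).
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
score = silhouette_score(X, labels)
print(score)

# Colour each point by its cluster and mark the centroids.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           c="red", marker="x", s=80)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
```

With real features on different scales, standardising them first (for example with `sklearn.preprocessing.StandardScaler`) is usually worth considering, since k-means relies on Euclidean distance.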
Please note you cannot use the same data set used in tutorials in this unit.
Please include a link to your dataset in your report. You may wish to:
• provide the direct link to the public dataset from the internet, or
• place the data file in your Monash student - google drive and provide its link in the
submission.
Good Luck!

Source: https://blog.csdn.net/weixin_45596679/article/details/101012887