20220703 Web Scraping & Data Processing
Author: Internet
1.
Yesterday I fetched the data; today I noticed that the single column of DataFrame data was stored in one row, which makes splitting it into columns awkward, so I searched online. When converting a list to a DataFrame, it is normally stored as a single row and needs to be transposed; after the transpose the values do show up as one comma-separated record per row. The code is as follows:
```python
data = get_data()
df = pd.DataFrame(data=[data], index=['a']).T
print(df.head())
```
What if we want to convert the list into a dict first and then store it as a DataFrame? (Reference: https://blog.csdn.net/linxent/article/details/104345845)
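One way to do this is to split each comma-joined record and zip it into a dict per row, then build the DataFrame directly, so no transpose is needed. This is a minimal sketch: the `total` rows and the column names here are hypothetical stand-ins for the real scraped data.

```python
import pandas as pd

# Hypothetical scraped rows, standing in for the `total` list
# built by get_data() below: one comma-joined record per entry.
total = ["proj1,123,Alice,Monday", "proj2,456,Bob,Tuesday"]

# Zip each split record into a dict keyed by column name,
# then build the DataFrame from the list of dicts directly.
columns = ["project", "tel", "person", "updated"]
records = [dict(zip(columns, row.split(","))) for row in total]
df = pd.DataFrame(records)
print(df)
```

Building from dicts gives named columns up front, instead of the numbered columns that `str.split(expand=True)` produces.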
```python
# Assumed to be set up earlier in the script: `wait` is a
# selenium WebDriverWait bound to the browser driver.
from time import sleep

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def get_data():
    j = 1
    total = []
    while j <= 3:
        sleep(1)
        lst, lst1, lst2, lst3 = [], [], [], []
        for i in range(1, 11):
            Project_name = wait.until(EC.presence_of_element_located((By.XPATH,
                "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[4]/div[2]/table/tbody/tr[%s]/td[1]" % i)))
            Stat_tel = wait.until(EC.presence_of_element_located((By.XPATH,
                "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[3]/table/tbody/tr[%s]/td[2]/div/span" % i)))
            Recent_person = wait.until(EC.presence_of_element_located((By.XPATH,
                "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[3]/table/tbody/tr[%s]/td[5]" % i)))
            Last_Updated = wait.until(EC.presence_of_element_located((By.XPATH,
                "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[3]/table/tbody/tr[%s]/td[6]/div/span" % i)))
            lst.append(Project_name.text)
            lst1.append(Stat_tel.text)
            lst2.append(Recent_person.text)
            lst3.append(Last_Updated.text)
        len_a, len_b, len_c, len_d = len(lst), len(lst1), len(lst2), len(lst3)
        # The original chained comparison (len_a != len_c != len_d) misses
        # some mismatch cases; compare all four lengths directly instead.
        if not (len_a == len_b == len_c == len_d):
            print("Scraped column counts differ")
        for i in range(len_a):
            total.append(lst[i] + "," + lst1[i] + "," + lst2[i] + "," + lst3[i])
        # Paginate by clicking the j-th page button.
        fanye = wait.until(EC.presence_of_element_located((By.XPATH,
            "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/ul/li[%s]" % j)))
        fanye.click()
        print("Page %s scraped" % j)
        # Alternative: type the page number into the jump-to-page input.
        # fanye = wait.until(EC.presence_of_element_located((By.XPATH,
        #     "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/span[2]/div/input")))
        # fanye.send_keys(j)
        sleep(2)
        # fanye.send_keys(Keys.ENTER)
        j += 1
    sleep(1)
    return total

# def data_clean():
data = get_data()
df = pd.DataFrame(data=[data], index=['a']).T
print(df.head())
df1 = df.join(df['a'].str.split(',', expand=True))
print(df1)
```
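The split-and-join step at the end can be seen in isolation with toy data in place of the scraped `total` list; the two-row input here is hypothetical:

```python
import pandas as pd

# Toy stand-in for the scraped `total` list of comma-joined records.
total = ["a,1,x,Mon", "b,2,y,Tue"]

# One row per record after the transpose, all in column 'a' ...
df = pd.DataFrame(data=[total], index=['a']).T
# ... then split 'a' on commas into new numbered columns 0..3.
df1 = df.join(df['a'].str.split(',', expand=True))
print(df1)
```

The result keeps the original column `a` alongside four numbered columns, one per comma-separated field.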
The method above does split the single column apart. The remaining problem is in the extracted columns: the timestamps show up as labels like "上午" (morning) or "星期一" (Monday) rather than standard date strings. I'll fix that later.
Source: https://www.cnblogs.com/dion-90/p/16439688.html