Biotrainee (生信技能树): data frame (data.frame) exercises, part 1

Date: 2023-06-13 09:17:24

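Introduction:

Biotrainee (生信技能树) exercise collection: http://www.biotrainee.com/thread-1754-1-1.html by Jimmy

Vectors (vector) and data frames (data.frame) are the two most commonly used and most important data types when using R for bioinformatics analysis. A programming language takes practice: skill comes with repetition, there is no shortcut, and what you learn but never use is soon forgotten.

Today I worked through the first set of data frame exercises; I will do the others when I have time.

Exercises: https://www.r-exercises.com/2016/01/04/data-frame-exercises/

Solutions: https://www.r-exercises.com/2016/01/04/data-frame-exercises-solutions/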

Exercises

Exercise 1

Create the following data frame, afterwards invert Sex for all individuals.

My answer

Basic=data.frame(
  Age=c(25,31,23,52,76,49,26),
  Height=c(177,163,190,179,163,183,164),
  Weight=c(57,69,83,75,70,83,53),
  Sex=c('F','F','M','M','F','M','F')
)
rownames(Basic)=c('Alex','Lilly','Mark','Oliver','Martha','Lucas','Caroline')
Basic$Sex=c('M','M','F','F','M','F','M')  # assign into the data frame; a bare Sex=... only creates a separate vector

Reference answer

Name <- c("Alex", "Lilly", "Mark", "Oliver", "Martha", "Lucas", "Caroline")
Age <- c(25, 31, 23, 52, 76, 49, 26)
Height <- c(177, 163, 190, 179, 163, 183, 164)
Weight <- c(57, 69, 83, 75, 70, 83, 53)
Sex <- as.factor(c("F", "F", "M", "M", "F", "M", "F"))
df <- data.frame (row.names = Name, Age, Height, Weight, Sex)
levels(df$Sex) <- c("M", "F")
df

Analysis

This was my first encounter with factors (factor()) and the ordering of levels; worth studying.
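
The level-renaming trick in the reference answer is easier to see in isolation. The sketch below assumes a small made-up vector (`sex` and `inverted` are hypothetical names, not part of the exercise). `factor()` sorts its levels alphabetically, so assigning `c("M", "F")` to `levels()` relabels every "F" as "M" and every "M" as "F".

```r
# Hypothetical toy vector, used only for illustration
sex <- factor(c("F", "F", "M"))
levels(sex)              # levels are sorted alphabetically: "F" "M"

# Assigning new level names relabels the values position by position:
# the first level ("F") becomes "M", the second ("M") becomes "F"
inverted <- sex
levels(inverted) <- c("M", "F")
as.character(inverted)   # "M" "M" "F"
```

Note that this relabels the levels rather than reordering them, which is why it inverts the values.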

Exercise 2

Create this data frame (make sure you import the variable Working as character and not factor).

Add this data frame column-wise to the previous one.

a) How many rows and columns does the new data frame have?

b) What class of data is in each column?

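My answer

Basic2=data.frame(
  Working=c('Yes','No','No','Yes','Yes','No','Yes'),
  stringsAsFactors=FALSE  # keep Working as character (before R 4.0 the default was factor)
  )
rownames(Basic2)=c('Alex','Lilly','Mark','Oliver','Martha','Lucas','Caroline')
# Basic3=merge(Basic,Basic2)  # struck out: I did not know how to combine two data frames that share row names, so I wrote this as a placeholder
Basic3=cbind(Basic,Basic2)  # column-wise join, as in the reference answer
ncol(Basic3);nrow(Basic3)
sapply(Basic3,class)  # was class(col(Basci3)): a typo, and col() returns column indices, not column classes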

Reference answer

Name <- c("Alex", "Lilly", "Mark", "Oliver", "Martha", "Lucas", "Caroline")
Working <- c("Yes", "No", "No", "Yes", "Yes", "No", "Yes")

dfa <- data.frame(row.names = Name, Working)
dfa <- cbind (df,dfa)
dim(dfa)
#or:
nrow(dfa)
ncol(dfa)

sapply(dfa, class)
str(dfa)   

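Analysis

cbind() combines two data frames column-wise directly.

Instead of calling ncol() and nrow() separately, dim(Basic3) returns both at once.

sapply() applies a function to each element of its input; here it applies class() to every column of the data frame.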
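
For combining two data frames that share row names, a minimal sketch with two tiny hypothetical data frames (`left` and `right` are made-up names) shows the difference between joining by position with `cbind()` and matching by row name with `merge()`.

```r
left  <- data.frame(Age = c(25, 31), row.names = c("Alex", "Lilly"))
right <- data.frame(Working = c("Yes", "No"), row.names = c("Alex", "Lilly"),
                    stringsAsFactors = FALSE)

# cbind() pairs rows by position, so both inputs must be in the same row order
by_position <- cbind(left, right)
dim(by_position)            # rows and columns in one call
sapply(by_position, class)  # class of every column

# merge() with by = "row.names" matches rows by name instead;
# the row names come back as a column called Row.names
by_name <- merge(left, right, by = "row.names")
```

`cbind()` is the simpler choice when the rows are already aligned, as they are in this exercise.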

Exercise 3

Check what class of data is the (built-in data set) state.center and convert it to data frame.

My answer

class(state.center)
as.data.frame(state.center)

Reference answer

class (state.center)
df <- as.data.frame(state.center)

Exercise 4

Create a simple data frame from 3 vectors. Order the entire data frame by the first column.

My answer

df1=data.frame(
  a=rnorm(10,0,1),
  b=rnorm(10,0,2),
  c=rnorm(10,0,3)
)
# I did not know how to sort

Reference answer

# Example vectors

v <- c(45:41, 30:33)
b <- LETTERS[rep(1:3, 3)]
n <- round(rnorm(9, 65, 5))
df <- data.frame(Age = v, Class = b, Grade = n)

df[with (df, order(Age)),]

#or:

df[order(df$Age), ]  

Analysis

order() returns the permutation of row indices that sorts a vector; worth studying.
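
A short sketch of how `order()` is used for sorting, on a hypothetical toy data frame (`scores` and its columns are made up for illustration): it returns indices, and those indices are then used to subset the rows.

```r
# Hypothetical toy data
scores <- data.frame(Name = c("a", "b", "c", "d"),
                     Class = c("B", "A", "B", "A"),
                     Grade = c(70, 65, 80, 65),
                     stringsAsFactors = FALSE)

# order() returns the row indices that would sort the vector
idx <- order(scores$Grade)          # 2 4 1 3
ascending <- scores[idx, ]

# decreasing = TRUE reverses the direction
descending <- scores[order(scores$Grade, decreasing = TRUE), ]

# Several keys: sort by Class first, then by Grade within each class
by_class <- scores[order(scores$Class, scores$Grade), ]
```

The trailing comma in `scores[idx, ]` matters: the indices select rows, and the empty slot after the comma keeps all columns.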

Exercise 5

Create a data frame from a matrix of your choice, change the row names so every row says id_i (where i is the row number) and change the column names to variable_i (where i is the column number). I.e., for column 1 it will say variable_1, and for row 2 will say id_2 and so on.

My answer

ma=matrix(1:12,3,4)
nrow(ma);ncol(ma)
rownames(ma)=paste('id',1:3,sep = '_')
colnames(ma)=paste('variable',1:4,sep = '_')

Reference answer

matr <- matrix(1:20, ncol = 5)
df <- as.data.frame(matr)
# paste0() joins without a separator; paste() would insert a space ("variable_ 1")
colnames(df) <- paste0("variable_", 1:ncol(df))
rownames(df) <- paste0("id_", 1:nrow(df))

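Analysis

When naming rows or columns, or anywhere else the number of rows or columns is needed, use nrow() and ncol() instead of hard-coding the counts.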
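
When building such names, the separator matters: `paste()` inserts a space by default, while `paste0()` or `sep = "_"` gives the `id_1` form the exercise asks for. A minimal sketch on a hypothetical 2 x 3 matrix (`m` and `d` are made-up names):

```r
m <- matrix(1:6, nrow = 2)          # hypothetical 2 x 3 matrix
d <- as.data.frame(m)

paste("id_", 1)                      # "id_ 1"  (default separator is a space)
paste("id", 1, sep = "_")            # "id_1"
paste0("id_", 1)                     # "id_1"   (paste0 uses no separator)

# seq_len() builds 1..n from the dimensions, so no counts are hard-coded
rownames(d) <- paste0("id_", seq_len(nrow(d)))
colnames(d) <- paste0("variable_", seq_len(ncol(d)))
```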

Exercise 6

For this exercise, we’ll use the (built-in) dataset VADeaths.

a) Make sure the object is a data frame, if not change it to a data frame.

b) Create a new variable, named Total, which is the sum of each row.

c) Change the order of the columns so total is the first variable.

My answer

class(VADeaths)
dfv=as.data.frame(VADeaths)
dfv$Total=rowSums(dfv)
# I found rowSums by searching
# I did not know how to reorder the columns

Reference answer

class(VADeaths)
df <- as.data.frame(VADeaths)

df$Total <- df[, 1] + df[, 2] + df[, 3] + df[, 4]
# or:
df$Total <- rowSums(df[1:4])

df <- df[, c(5, 1:4)]

Analysis

Reordering works by subsetting: take a new subset of the original data frame, listing the columns (or rows) in the order you want.
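
As a sketch of that idea, the columns can be listed by position or by name. The variable names `deaths`, `by_position` and `by_name` are illustrative only; selecting by name is an alternative I am suggesting, not part of the reference answer.

```r
deaths <- as.data.frame(VADeaths)
deaths$Total <- rowSums(deaths)

# By position: put column 5 (Total) first, then columns 1 to 4
by_position <- deaths[, c(5, 1:4)]

# By name: the same result without hard-coding positions
by_name <- deaths[, c("Total", setdiff(names(deaths), "Total"))]
```

Selecting by name keeps working if columns are later added or removed, whereas positions would need to be updated.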

Exercise 7

For this exercise we’ll use the (built-in) dataset state.x77.

a) Make sure the object is a data frame, if not change it to a data frame.

b) Find out how many states have an income of less than 4300.

c) Find out which is the state with the highest income.

My answer

class(state.x77)
dfs=as.data.frame(state.x77)
table(dfs$Income<4300)
dfsh=dfs[dfs$Income==max(dfs$Income),]
rownames(dfsh)

Reference answer

class (state.x77)
df <- as.data.frame(state.x77)

nrow(subset(df, df$Income < 4300))

row.names(df)[(which(max(df$Income) == df$Income))]

Analysis

which() returns the positions of the TRUE values in a logical vector; worth studying.
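
A minimal sketch of the counting and lookup steps on the same built-in dataset (the variable names are illustrative; `which.max()` is an alternative I am adding, not something either answer uses):

```r
states <- as.data.frame(state.x77)

# A logical comparison gives TRUE/FALSE per row; sum() counts the TRUEs
n_low <- sum(states$Income < 4300)

# which() converts a logical vector into the positions of its TRUE values
pos <- which(states$Income == max(states$Income))
richest <- rownames(states)[pos]

# which.max() is a shortcut that returns the position of the first maximum
richest_alt <- rownames(states)[which.max(states$Income)]
```

`which(x == max(x))` returns every position that ties for the maximum, while `which.max(x)` returns only the first.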

Exercise 8

With the dataset swiss, create a data frame of only the rows 1, 2, 3, 10, 11, 12 and 13, and only the variables Examination, Education and Infant.Mortality.

a) The infant mortality of Sarine is wrong, it should be a NA, change it.

b) Create a row that will be the total sum of the column, name it Total.

c) Create a new variable that will be the proportion of Examination (Examination / Total)

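My answer

class(swiss)
dfs2=swiss[c(1,2,3,10,11,12,13),c('Examination','Education','Infant.Mortality')]
dfs2['Sarine','Infant.Mortality']=NA
dfs2['Total',]=colSums(dfs2,na.rm=TRUE)  # na.rm=TRUE so the NA for Sarine does not turn the total into NA
# Original attempt: dfs2$Examination[1:(nrow(dfs2)-1)]/rowSums(dfs2[nrow(dfs2)-1,])
# It divides by the row sum of the last data row, not by the Examination total.
newvariable=dfs2$Examination[1:(nrow(dfs2)-1)]/dfs2['Total','Examination']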

Reference answer

df <- swiss[c(1:3, 10:13), c("Examination", "Education", "Infant.Mortality")]

df[4,3] <- NA

df["Total",] <- c(sum(df$Examination), sum(df$Education), sum(df$Infant.Mortality, na.rm = TRUE))

df$proportion <- round(df$Examination / df["Total", "Examination"], 3)

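Analysis

For the final proportion I overcomplicated a simple task because I wanted to avoid the Total/Total entry. The Examination total can be selected directly with df["Total", "Examination"]; there is no need to recompute it with rowSums(dfs2[nrow(dfs2)-1,]) (which in fact sums a single row rather than the Examination column). round() sets the number of decimal places to keep.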
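
The whole exercise can be sketched in a few lines. This is one possible way to do it, using a hypothetical variable name `part` and `colSums()` with `na.rm = TRUE` in place of the three separate `sum()` calls of the reference answer.

```r
part <- swiss[c(1:3, 10:13), c("Examination", "Education", "Infant.Mortality")]
part["Sarine", "Infant.Mortality"] <- NA

# colSums() with na.rm = TRUE skips the NA instead of returning NA for that column
part["Total", ] <- colSums(part, na.rm = TRUE)

# Divide every row by the single cell that holds the Examination total;
# the Total row itself then comes out as exactly 1
part$proportion <- round(part$Examination / part["Total", "Examination"], 3)
```

Addressing the cell by row and column name avoids relying on the Total row being in a particular position.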

Exercise 9

Create a data frame with the datasets state.abb, state.area, state.division, state.name, state.region. The row names should be the names of the states.

a) Rename the column names so only the first 3 letters after the full stop appear (e.g. States.abb will be abb).

My answer

dfstate=data.frame(state.abb,state.area,state.division,state.region,row.names = state.name)
# I did not know how to take a substring

Reference answer

df <- data.frame(state.abb, state.area, state.division, state.region, row.names = state.name)

names(df) <- substr(names(df), 7, 9)

Analysis

substr() extracts part of a string by start and stop position; worth studying.
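
A small sketch of `substr()` on the column names from this exercise. The regular-expression variant with `sub()` is an alternative I am adding for comparison, not part of the reference answer.

```r
nms <- c("state.abb", "state.area", "state.division", "state.region")

# substr(x, start, stop) extracts characters start..stop (1-based, inclusive);
# "state." is 6 characters long, so the part after the dot starts at 7
short <- substr(nms, 7, 9)          # "abb" "are" "div" "reg"

# Alternative that does not rely on counting characters:
# strip everything up to the first dot, then keep the first 3 letters
short_alt <- substr(sub("^[^.]*\\.", "", nms), 1, 3)
```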

Exercise 10

Add the previous data frame column-wise to state.x77

a) Remove the variable div.

b) Also remove the variables Life Exp, HS Grad, Frost, abb, and are.

c) Add a variable to the data frame which should categorize the level of illiteracy:

[0,1) is low, [1,2) is some, [2, inf) is high.

d) Find out which state from the west, with low illiteracy, has the highest income, and what that income is.

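My answer

names(dfstate)=substr(names(dfstate),7,9)  # rename as in Exercise 9 so that abb, are, div and reg exist
dfstate2=cbind(state.x77,dfstate)
# a)
dfstate2=dfstate2[,!(colnames(dfstate2)=='div')]  # negate a logical index with !, not with a minus sign

# b)
# dfstate2=dfstate2[,!(colnames(dfstate2)==('Life Exp'|'HS Grad'|'Frost'|'abb'|'are'))]
# The line above raised an error that I blamed on the spaces and could not fix, so I switched to %in%.
# (The actual cause: | does not work on character strings.)
dfstate2=dfstate2[,!(colnames(dfstate2)%in% c('Life Exp','HS Grad','Frost','abb','are'))]

# c) I did not know how to classify values by interval; solved after reading the reference answer
dfstate2$illi=cut(dfstate2$Illiteracy,c(0,1,2,Inf),right=FALSE,labels=c('Low Illiteracy','Some Illiteracy','High Illiteracy'))

# d)
dfstate3=dfstate2[dfstate2$reg=='West'&dfstate2$illi=='Low Illiteracy',]
rownames(dfstate3[dfstate3$Income==max(dfstate3$Income),])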

Reference answer

dfa <- cbind(state.x77, df)

#a)
dfa$div <- NULL

#b)
dfa <- subset(dfa, ,-c(4, 6, 7, 9, 10))

# c)
dfa$illi <- ifelse(dfa$Illiteracy < 1,"Low Illiteracy",
                 ifelse(dfa$Illiteracy >= 1 & dfa$Illiteracy < 2, "Some Illiteracy",
                 "High Illiteracy")
                 )
# Or:

dfa$illi <- cut(dfa$Illiteracy,
                c(0, 1, 2, 3),
                include.lowest = TRUE,
                right = FALSE,
                labels = c("Low Illiteracy", "Some Illiteracy", "High Illiteracy"))

# d)

sub <- subset(dfa, illi == "Low Illiteracy" & reg == "West")
max <- max(sub$Income)
stat <- row.names(sub)[which (sub$Income == max)]
cat("Highest income from the West is", max , "the state where it's from is", stat, "\n")

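Analysis

1. Part b uses subset(); worth studying.

2. Part c turns value ranges into a factor. ifelse() is easy to follow, while cut() is built for converting numeric values to a factor and generalizes well, so it is worth learning. The four breaks 0, 1, 2, 3 split the range 0 to 3 into three intervals. right controls which side of each interval is closed: right = TRUE (the default) gives left-open, right-closed intervals (a, b], and right = FALSE gives left-closed, right-open intervals [a, b). include.lowest = TRUE additionally includes the outermost break value that would otherwise be excluded (the lowest break when right = TRUE, the highest when right = FALSE). labels names the three levels.

3. Part d uses cat() to print a complete sentence: ## Highest income from the West is 5149 the state where it's from is Nevada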
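
The effect of `right` in `cut()` is easiest to see on a handful of boundary values. The sketch below uses a made-up vector `x` and short hypothetical labels, with `Inf` as the last break so that the top interval is open-ended like the [2, inf) in the exercise.

```r
x <- c(0, 0.5, 1, 1.9, 2, 2.8)      # hypothetical illiteracy values

# right = FALSE makes the intervals left-closed: [0,1), [1,2), [2,Inf)
level <- cut(x,
             breaks = c(0, 1, 2, Inf),
             right = FALSE,
             labels = c("low", "some", "high"))
as.character(level)   # "low" "low" "some" "some" "high" "high"

# Default right = TRUE gives (0,1], (1,2], (2,Inf]; the value 0 then falls
# outside every interval unless include.lowest = TRUE
default <- cut(x, breaks = c(0, 1, 2, Inf))
is.na(default[1])     # TRUE
```

Using `Inf` as the last break avoids having to know the maximum of the data in advance.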

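Final notes

Judging from two days of writing and test-running code, about 90% of my errors came from forgetting one of three things: c(), quotes (''), and commas (,). Forgetting c() means writing the elements directly without creating a vector. Forgetting quotes turns a string into a variable name, and since that variable does not exist, R reports an error. Forgetting commas mostly happens when selecting rows or columns of a data frame (writing only the row or column condition, without the comma that marks which dimension it applies to), and also when separating columns while creating a data frame. So when an error appears, check c(), '' and , first; that may well solve the problem.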
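
The three points can be put side by side in a short sketch with made-up data (`ages`, `sex` and `people` are hypothetical names); the incorrect forms appear only in comments.

```r
# 1. Forgetting c(): several values must be combined into one vector
ages <- c(25, 31, 23)               # not: ages <- 25, 31, 23

# 2. Forgetting quotes: unquoted text is looked up as a variable name
sex <- c("F", "M", "F")             # not: c(F, M, F); M is not defined,
                                    # and F silently means FALSE

# 3. Forgetting commas: between columns in data.frame(), and in [rows, cols]
people <- data.frame(Age = ages, Sex = sex)
people[people$Age > 24, ]           # the comma says "these rows, all columns"
people[, "Age"]                     # the comma says "all rows, this column"
```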

I will post the other exercises when I have time.