I want to split a column of strings into two columns. The first column should hold all the capitalized (all-uppercase) words; each string starts with one or two such words. Here is an example of the data frame:

mydataframe <- data.frame(species= c("ACTINIDIACEAE Actinidia arguta", 
           "ANACARDIACEAE Attilaea abalak E.Martínez & Ramos", 
           "LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg."),
           trait= c(1,2,4))

I tried separate() with the following regular expression: "\\s+(?=[A-Z]+)". This does not work. For strings with more than one capitalized word, it splits the first capitalized word from the second and drops the rest of the string. Here is the code:

library(tidyr)  # provides separate() and re-exports %>%
mydataframe <- mydataframe %>%
  separate(species, into = c("family", "sp"), sep = "\\s+(?=[A-Z]+)")

This is the result of the code:

family         sp                trait
ACTINIDIACEAE  Actinidia arguta  1
ANACARDIACEAE  Attilaea abalak   2
LEGUMINOSAE    CAESALPINIOIDEAE  4
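
One way to see what the pattern is doing is to list every split point with base strsplit(). This is only a sketch: it assumes perl = TRUE so that the lookahead is supported, and it copies the species strings into a standalone vector named species (a name introduced here for illustration). separate() with two names in into keeps just the first two pieces and drops the rest, which matches the result above.

```r
# Same strings as in mydataframe$species, copied so the snippet runs alone.
species <- c("ACTINIDIACEAE Actinidia arguta",
             "ANACARDIACEAE Attilaea abalak E.Martínez & Ramos",
             "LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg.")

# Split at every run of whitespace that is followed by an upper-case letter.
pieces <- strsplit(species, "\\s+(?=[A-Z]+)", perl = TRUE)

# Number of pieces per string: the pattern also splits before "Attilaea",
# "E.Martínez", "Ramos", "Biancaea" and "O.Deg.", not only between
# all-uppercase words.
lengths(pieces)
# 2 4 4
```

The lookahead only checks that the next character is upper case, so any word starting with a capital letter triggers a split.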

I want the following format:

family                        sp                   trait
ACTINIDIACEAE                 Actinidia arguta     1
ANACARDIACEAE                 Attilaea abalak      2
LEGUMINOSAE CAESALPINIOIDEAE  Biancaea decapetala  4
  • I'm glad you're giving us small sample data, but please try it before posting the question. In this case, Error: arguments imply differing number of rows: 3, 4. Likely solved easily by reducing the length of trait=, but in general reprex questions should not error (outside of the context of the question).
    – r2evans
    Commented Jun 23, 2023 at 19:36
  • Sorry for the mistake. I edited the post and now it should work. Commented Jun 23, 2023 at 19:38

1 Answer

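I think we can use (base R) strcapture for this. Because the leading .* is greedy, the pattern finds the last run of two or more upper-case letters that is followed by whitespace and then a word containing at least one lower-case letter. Everything up to and including that run goes into family, and the rest goes into sp.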

library(dplyr)  # for mutate() and %>%
mydataframe %>%
  mutate(strcapture("(.*[A-Z]{2,})\\s+(\\S*[a-z].*)", species, list(family="", sp="")))
#                                                          species trait                       family                                 sp
# 1                                 ACTINIDIACEAE Actinidia arguta     1                ACTINIDIACEAE                   Actinidia arguta
# 2               ANACARDIACEAE Attilaea abalak E.Martínez & Ramos     2                ANACARDIACEAE Attilaea abalak E.Martínez & Ramos
# 3 LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg.     4 LEGUMINOSAE CAESALPINIOIDEAE  Biancaea decapetala (Roth) O.Deg.
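
If the author names should also be dropped, as in the format requested in the question, a variant along these lines might work. It is a sketch under two assumptions that are not stated in the original post: the family is the leading run of all-uppercase words, and the species is always the next two words (a capitalized genus followed by a lower-case epithet). The names species, pattern and parts are introduced here for illustration.

```r
# Same strings as in mydataframe$species, copied so the snippet runs alone.
species <- c("ACTINIDIACEAE Actinidia arguta",
             "ANACARDIACEAE Attilaea abalak E.Martínez & Ramos",
             "LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg.")

# Group 1: one or more all-uppercase words at the start of the string.
# Group 2: a capitalized word followed by a lower-case word (assumed binomial).
pattern <- "^([A-Z]+(?:\\s+[A-Z]+)*)\\s+([A-Z][a-z]+\\s+[a-z]+)"

# perl = TRUE is needed for the non-capturing group (?: ... ).
parts <- strcapture(pattern, species, list(family = "", sp = ""), perl = TRUE)
parts
```

Strings that do not fit the assumed shape (for example an epithet containing a hyphen) would come back as NA, so the result is worth checking before it replaces the original column.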
