
I've been able to remove all special characters that appear in the first column of a fixed-length file, but as a result all subsequent columns have shifted to the left by the number of characters deleted. The file is space-separated. Line 1 of the input below is corrupt; line 2 is what it should look like. The string 000022000362700 starts at position 49 on both lines. The problem is that after the 3 special characters are removed, the field moves to position 46.

GAVISCON LIQUID PEPPERMINT �OT        000022000362700   159588000007979400  50001584182        0006S020000
GAVISCON LIQUID PEPPERMINT OT           000022000362700   159588000007979400  50001584182        0006S020000

The command I'm using is as follows:

cat file.txt | grep '[^ - ~]' | sed's/[^ - ~]//g'

This produces the following output:

    GAVISCON LIQUID PEPPERMINT OT        000022000362700   159588000007979400  50001584182        0006S020000

Removing the special characters moves every field to the right of the changed field to the left, which changes the field start positions.

I've been searching for a while now and cannot find any solution for this issue.

How should I proceed?

  • Are the columns tab separated?
    – heemayl
    Commented Apr 21, 2015 at 11:04
  • Please edit your question and show us an example of your input and your desired output. Also, what OS are you running on? Is this Linux?
    – terdon
    Commented Apr 21, 2015 at 11:15
  • Are you really using [^ - ~] (including the spaces)? What is that supposed to do?
    – Janis
    Commented Apr 21, 2015 at 11:17
  • 0xEF 0xBF 0xBD is the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER (U+FFFD). You are apparently showing us the Latin-1 interpretation of this data. There are multiple errors here -- ideally, you should fix the input so this character isn't in the input stream in the first place (and probably strive to use Unicode all the way instead of legacy 8-bit encodings).
    – tripleee
    Commented Apr 28, 2017 at 12:27

2 Answers


Use this command:

sed -r 's/(\^|-|~)/ /g' file.txt
  • sed -r

    -r, --regexp-extended
    use extended regular expressions in the script

  • / / replaces each match with a space (or any other string)

  • (\^|-|~)

    • 1st Capturing group (\^|-|~)

      • 1st Alternative: \^

        \^ matches the character ^ literally

      • 2nd Alternative: -

        - matches the character - literally

      • 3rd Alternative: ~

        ~ matches the character ~ literally

Another variant is this (thanks @Costas):

sed 's/[-~^]/ /g' file.txt
  • [-~^]

    • [-~^] match a single character present in the list below

      -~^ a single character in the list -~^ literally

  • (\^|-|~)? Maybe [-~^]?
    – Costas
    Commented Apr 21, 2015 at 12:51
  • @Costas yes, that works also
    – A.B.
    Commented Apr 21, 2015 at 12:54
  • sed - r is not recognised. Get illegal option --r. I thought we would need some type of awk command to manipulate columns?
    Commented Apr 21, 2015 at 13:25
  • When using only [], -r is not needed.
    – heemayl
    Commented Apr 21, 2015 at 13:33
  • @heemayl sorry, my mistake
    – A.B.
    Commented Apr 21, 2015 at 13:35

sed's/[^ - ~]//g' is probably not the command you used, because it would just complain about an invalid command. Always copy and paste!

I presume you actually ran sed 's/[^ -~]//g'. This replaces any character that isn't a printable ASCII character by an empty string. In other words, this removes all characters that are not printable ASCII characters. (Note that this is true in the default locale, i.e. under LC_ALL=C, but it is not the case in many other locales.)

To keep the columns lined up, replace each non-printable character by a space.

sed 's/[^ -~]/ /g'
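
The difference between deleting and replacing can be checked on a made-up sample line. This is only a minimal sketch: the sample text, the three bytes (octal 357 277 275) standing in for the corrupt data, and the helper name field_pos are all assumptions for illustration. awk's index reports the 1-based byte position of the numeric field under LC_ALL=C.

```shell
# Hypothetical sample: three non-ASCII bytes (octal 357 277 275) before "OT",
# followed by 8 spaces of padding and the numeric field.
line=$(printf 'PEPPERMINT \357\277\275OT%8s000022000362700' '')

# Helper (name chosen for this sketch): print where the numeric field starts.
field_pos() {
    LC_ALL=C awk '{ print index($0, "000022000362700") }'
}

pos_original=$(printf '%s\n' "$line" | field_pos)
pos_deleted=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[^ -~]//g' | field_pos)
pos_spaced=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[^ -~]/ /g' | field_pos)

echo "original: $pos_original"
echo "deleted:  $pos_deleted"
echo "spaced:   $pos_spaced"
```

With this sample the field sits at byte 25 in the original and in the space-padded output, and at byte 22 after deletion.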

Because of your grep command, only the lines that contain a non-printable character will appear in the output. You don't need that grep. Pass all lines to sed; the lines that don't need to be modified will appear at the right place in the output.

<file.txt LC_ALL=C sed 's/[^ -~]/ /g' >new-file.txt
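
To see the effect of dropping the grep, here is a small sketch with a made-up two-line file. The file contents and the temporary file are assumptions for illustration only. It counts how many lines survive each pipeline.

```shell
# Hypothetical two-line input: only the first line contains non-ASCII bytes.
tmp=$(mktemp)
printf 'BAD \357\277\275LINE\nGOOD LINE\n' > "$tmp"

# With grep in front, lines without special characters are dropped.
with_grep=$(LC_ALL=C grep '[^ -~]' "$tmp" | LC_ALL=C sed 's/[^ -~]/ /g' | wc -l | tr -d ' ')

# With sed alone, every line is kept and only the bad bytes are replaced.
without_grep=$(LC_ALL=C sed 's/[^ -~]/ /g' "$tmp" | wc -l | tr -d ' ')

echo "lines with grep:    $with_grep"
echo "lines without grep: $without_grep"
rm -f "$tmp"
```

Only one of the two lines comes out of the grep pipeline, while sed on its own keeps both.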

This adds spaces in the middle of the columns, e.g. you'll end up with

GAVISCON LIQUID PEPPERMINT    OT        000022000362700   159588000007979400  50001584182        0006S020000

If you want the spaces to end up at the right of the column, i.e.

GAVISCON LIQUID PEPPERMINT OT           000022000362700   159588000007979400  50001584182        0006S020000

you'll need a different approach, where you indicate where columns stop. Although this can be done in sed, it's a lot easier in awk. Here's how you can remove non-printable characters from the first column and keep the data from the other columns starting at position 49.

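<file.txt LC_ALL=C awk '{
    # Take the first column (positions 1-48) and strip non-printable bytes.
    first_column = substr($0, 1, 48);
    gsub(/[^ -~]/, "", first_column);
    # Pad the column back to 48 characters, then append the rest unchanged.
    printf "%-48s%s\n", first_column, substr($0, 49)
}' >new-file.txt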
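
A quick way to check the awk approach is to feed it a reconstructed line and look at where the numeric field lands. In this sketch the input line is an assumption: it is rebuilt with printf so that the first column is exactly 48 bytes wide, with three bytes (octal 357 277 275) standing in for the special characters.

```shell
# Hypothetical corrupt line: 32 bytes of text, 16 spaces of padding, then the
# numeric field starting at byte 49.
line=$(printf 'GAVISCON LIQUID PEPPERMINT \357\277\275OT%16s%s' '' '000022000362700   159588000007979400')

# Same awk program as above, reading the sample line from a pipe.
fixed=$(printf '%s\n' "$line" | LC_ALL=C awk '{
    first_column = substr($0, 1, 48);
    gsub(/[^ -~]/, "", first_column);
    printf "%-48s%s\n", first_column, substr($0, 49)
}')

# Report the 1-based position of the numeric field after the clean-up.
pos=$(printf '%s\n' "$fixed" | LC_ALL=C awk '{ print index($0, "000022000362700") }')
echo "$fixed"
echo "field starts at: $pos"
```

The field should still start at position 49, with the padding moved to the right of "OT".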
