unix - remove (i) empty space between single characters and (ii) more than X consecutive instances of a word

I would like to

(i) remove the blank space between characters, but only where those characters are single (each one is a one-character word). For instance,

Down [Enter] p s -- a u x [Delete]

should become

Down [Enter] ps -- aux [Delete]

(ii) reduce any run of more than X consecutive repetitions of a word to X copies, where a run ends at the first thing that is not that word. For example, with X=2,

 [Delete] [Delete] [Delete] [Delete] [Delete] [Delete] ab initio [Delete] [Delete] [Delete] [Delete] [Delete] [Delete] ab definitio

becomes

 [Delete] [Delete] ab initio [Delete] [Delete] ab definitio

Thanks!

asked Nov 9 at 19:03

Pau

Welcome to SO. Stack Overflow is a question and answer site for professional and enthusiast programmers. The goal is that you add some code of your own to your question to show at least the research effort you made to solve this yourself.
– Cyrus
Nov 9 at 19:16

unix awk sed

1 Answer
You did not get many responses. I think the main reason is that the post combines two different questions, both non-trivial. Normally it helps to show your own effort, but I understand your effort might have been thinking for hours about where to start.

The first question, removing spaces between single characters, can be done with a loop in sed:

    echo 'Down [Enter] p s -- a u x [Delete] ' |
       sed -r ':a;s/( [^ ]|\r) ([^ ])( |$)/\1\2\r\3/;ta; s/\r//g'

Output:

    Down [Enter] ps -- aux [Delete]

Explanation:
With a direct approach, a u x is changed into au x after the first replacement, and the remaining space is then skipped because au no longer looks like a single character. You need to go over the line more than once and remember that the letter u in au x was a singleton in the original string.

To remember the places where a replacement has been made, we insert a \r (carriage return) as a temporary marker and remove it afterwards.

:a; Label to return to for the next replacement.
( [^ ]|\r) A space followed by a non-space character, OR our temporary \r marker.
 ([^ ]) A space followed by a single non-space character.
( |$) A space or end-of-line.
/\1\2\r\3/ Replace with the two remembered groups, then the marker, then the space (when the match was not at the end of the line).
ta Go back to the label :a when something was replaced.
s/\r//g Remove our temporary markers.
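The same loop can be sketched in a form that does not depend on sed understanding the \r escape, which may help on BSD-style sed implementations. This is only a sketch under assumptions: the function name squeeze_singles is hypothetical, the carriage return is produced by printf and pasted into the script literally, the input is assumed to contain no carriage returns of its own, and -E is assumed to be accepted by the local sed (it is on current GNU and BSD versions). The commands are put on separate lines because some sed implementations read everything after :a up to the newline as the label name.

```shell
#!/bin/sh
# squeeze_singles: hypothetical helper, reads stdin and writes stdout.
# Joins runs of single-character words, e.g. "a u x" -> "aux".
squeeze_singles() {
    # A literal carriage return, used as the temporary marker.
    cr=$(printf '\r')
    # One command per line so the label is parsed the same way everywhere.
    sed -E ":a
s/( [^ ]|$cr) ([^ ])( |\$)/\\1\\2$cr\\3/
ta
s/$cr//g"
}

printf '%s\n' 'Down [Enter] p s -- a u x [Delete]' | squeeze_singles
```

Like the one-liner above, this only joins single characters that are preceded by a space, so a single character at the very start of a line is left alone.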

The second question is difficult too. The next solution is close but incorrect:

    for (( X=2; X<8; X++)); do
      echo "X=$X (incorrect solution)"
      echo 'some some some some some some some some some some some input' |
         sed -r 's/([^ ]+[ ]+)(\1{'${X}'})(\1+)/\2/g'
    done

The problem is when the repeated string also appears in another place, as in
'some some some input some some some' or, worse, 'some some some input input input'.

I do not see an easy fix for the sed solution, but awk will help here.

To count repeated words, my solution treats each word as one record.

    for (( X=2; X<8; X++)); do
       echo "X=$X"
       echo 'some some some some some some some some some some some input some some some some' |
          # A regular expression in RS is a common extension (gawk, mawk), not guaranteed by POSIX.
          awk -v x="$X" 'BEGIN {RS="[ \n]"; ORS=""; repeated=1}
             # Count how often the current word has been seen in a row.
             { if (last==$0)
                 repeated++;
               else
                 repeated=1;
             }
             {last=$0}
             # Print only the first x copies of each run.
             repeated <= x {print $0" "}
             END {print "\n"}
          '
    done
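If the local awk does not accept a regular expression as RS, the same counting idea can be sketched with ordinary field splitting, one input line at a time. This is an illustrative variant and not the method above: limit_repeats is a hypothetical name, its first argument is X, the count restarts on every line, and runs of blanks in the input are collapsed to single spaces in the output.

```shell
#!/bin/sh
# limit_repeats X: hypothetical helper, reads stdin and writes stdout.
# Keeps at most X consecutive copies of each word on a line.
limit_repeats() {
    awk -v x="$1" '{
        out = ""; sep = ""; last = ""; count = 0
        for (i = 1; i <= NF; i++) {
            word = $i ""              # force a string, not a numeric, comparison
            if (i > 1 && word == last)
                count++
            else
                count = 1
            last = word
            if (count <= x) {         # keep only the first x copies of a run
                out = out sep word
                sep = " "
            }
        }
        print out
    }'
}

echo 'some some some input some some some' | limit_repeats 2
```

Because the counter resets whenever a different word appears, a word that shows up again later in the line is treated as a new run.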

answered Nov 10 at 21:21

Walter A


Dear Walter. You are absolutely right, my apologies. It is my first post here. The solution I had come up with involves Python, while I would prefer to stick to the usual Unix tools. I was looking into tr but I did not see a way. Thanks for your reply. There's a problem with (i), because your solution is removing all 'r' from the file, at least with my sed (I'm using OpenBSD here, not Linux). As for (ii), I will try it now.
– Pau
Nov 11 at 17:29

(i) In my sed the carriage return \r works. The marker should be a unique character, perhaps Q for a test. Try replacing the \r with control-v control-m.
– Walter A
Nov 11 at 20:21
