Read text stream codepoint by codepoint


























I'm trying to read Unicode codepoints from a text file in Java. The InputStreamReader class returns the stream's contents int by int, which I hoped would do what I want, but it does not compose surrogate pairs.



My test program:



import java.io.*;
import java.nio.charset.*;

class TestChars {
    public static void main(String[] args) {
        InputStreamReader reader =
            new InputStreamReader(System.in, StandardCharsets.UTF_8);
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
        }
    }
}


This behaves as follows:



$ java TestChars 
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code d83c is `HIGH SURROGATES D83C', ?.
Code df55 is `LOW SURROGATES DF55', ?.
Code a is `LINE FEED (LF)',
.


My problem is that the two surrogates making up the pizza emoji are read separately. I would like to read the symbol into a single int and be done with it.



Question: Is there a reader(-like) class that will automatically compose surrogate pairs to characters while reading? (And, presumably, throws an exception if the input is malformed.)



I know I could compose the pairs myself, but I would prefer to avoid reinventing the wheel.


































  • The int value returned by read() is a UTF-16 char value, not a Unicode codepoint. The only reason it is type int is so it can also return -1. The code is doing exactly what it is supposed to do, i.e. return UTF-16 surrogate pairs.
    – Andreas
    Nov 12 '18 at 22:45








  • I understand that this class does not do what I want, which is why my question was whether there was another standard class that does do what I want.
    – Isabelle Newbie
    Nov 12 '18 at 22:49
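
To make the point of the first comment concrete, here is a minimal sketch of the difference between UTF-16 code units and code points. The class name CodeUnitDemo and the constant PIZZA are made up for this illustration; the String and Character methods are standard library calls.

```java
// Minimal sketch: UTF-16 code units versus Unicode code points.
class CodeUnitDemo {
    // The pizza emoji U+1F355, written as its surrogate pair (illustrative constant).
    static final String PIZZA = "\uD83C\uDF55";

    public static void main(String[] args) {
        // length() counts UTF-16 code units (chars), not code points.
        System.out.println(PIZZA.length());                            // 2
        System.out.println(PIZZA.codePointCount(0, PIZZA.length()));   // 1

        // charAt() exposes the individual surrogates; Reader.read()
        // hands out these same code units one at a time.
        System.out.println(Integer.toHexString(PIZZA.charAt(0)));      // d83c
        System.out.println(Integer.toHexString(PIZZA.charAt(1)));      // df55

        // codePointAt() composes the pair into a single int.
        System.out.println(Integer.toHexString(PIZZA.codePointAt(0))); // 1f355
    }
}
```

This matches the d83c and df55 values in the question's output.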























java unicode














edited Nov 12 '18 at 22:35
Isabelle Newbie

asked Nov 12 '18 at 22:23
Isabelle Newbie
























2 Answers
































If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:



import java.io.*;

class cptest {
    public static void main(String[] args) {
        // BufferedReader.lines() yields each line without its terminator;
        // String.codePoints() composes surrogate pairs into single ints.
        try (BufferedReader br =
                 new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {
            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }

    private static void print(int cp) {
        String s = new String(Character.toChars(cp));
        System.out.println("Character " + cp + ": " + s);
    }
}


will produce



$ java cptest <<< "keyboard ⌨. pizza 🍕"
Character 107: k
Character 101: e
Character 121: y
Character 98: b
Character 111: o
Character 97: a
Character 114: r
Character 100: d
Character 32:
Character 9000: ⌨
Character 46: .
Character 32:
Character 112: p
Character 105: i
Character 122: z
Character 122: z
Character 97: a
Character 32:
Character 127829: 🍕




























  • Thank you. This looks like a reasonable way of making the Java library take care of the details. And since my application will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
    – Isabelle Newbie
    Nov 13 '18 at 20:14
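
The lookahead mentioned in this comment could be sketched roughly as follows, assuming each line is first converted to an int array of code points. The class LineCodePoints and the helper toCodePoints are hypothetical names, and a StringReader stands in for the real input. Note that String.codePoints() passes unpaired surrogates through unchanged rather than throwing, so malformed input is not rejected by this approach.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch: per-line code point arrays make lookahead a simple index operation.
class LineCodePoints {
    // Hypothetical helper: one int per code point, surrogate pairs already composed.
    static int[] toCodePoints(String line) {
        return line.codePoints().toArray();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real input stream here.
        try (BufferedReader br =
                 new BufferedReader(new StringReader("pizza \uD83C\uDF55!\n"))) {
            String line;
            while ((line = br.readLine()) != null) {
                int[] cps = toCodePoints(line);
                for (int i = 0; i < cps.length; i++) {
                    // -1 marks "no further code point on this line".
                    int next = (i + 1 < cps.length) ? cps[i + 1] : -1;
                    System.out.printf("U+%04X (next: %s)%n", cps[i],
                            next < 0 ? "none" : String.format("U+%04X", next));
                }
            }
        }
    }
}
```

Lookahead across line boundaries would need extra handling, since readLine() drops the line terminators.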

































You can wrap the Reader instance with a simple class that decodes surrogate pairs:



import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

    private Reader reader;

    public CodepointStream(Reader reader) {
        this.reader = reader;
    }

    // Returns the next code point, or -1 at end of input.
    public int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0)
            return unit0; // EOF

        // Anything that is not a high surrogate (including a stray
        // low surrogate) is returned unchanged.
        if (!Character.isHighSurrogate((char)unit0))
            return unit0;

        int unit1 = reader.read();
        if (unit1 < 0)
            return unit1; // EOF in mid-pair: the dangling high surrogate is dropped

        if (!Character.isLowSurrogate((char)unit1))
            throw new RuntimeException("Invalid surrogate pair");

        return Character.toCodePoint((char)unit0, (char)unit1);
    }

    public void close() throws IOException {
        reader.close();
        reader = null;
    }
}


The main function needs to be slightly modified:



import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
    public static void main(String[] args) {
        CodepointStream reader = new CodepointStream(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            // Report failures (e.g. an invalid surrogate pair) instead of hiding them.
            System.err.println("Error: " + e);
        }
    }
}


Then your output becomes:



> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.
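
Since the question also asks for an exception on malformed input, a stricter variant of the same idea might look like the sketch below. It is an illustration rather than part of the answer: the class name StrictCodePointReader is invented, and throwing IOException for a stray low surrogate or a truncated pair is a design assumption, not something the standard library prescribes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch of a stricter wrapper: rejects unpaired surrogates instead of passing them on.
class StrictCodePointReader {
    private final Reader reader;

    StrictCodePointReader(Reader reader) {
        this.reader = reader;
    }

    // Returns the next code point, or -1 at end of input.
    int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0) {
            return -1; // EOF
        }
        char c0 = (char) unit0;
        if (Character.isLowSurrogate(c0)) {
            throw new IOException("Low surrogate without preceding high surrogate");
        }
        if (!Character.isHighSurrogate(c0)) {
            return unit0; // ordinary BMP character
        }
        int unit1 = reader.read();
        if (unit1 < 0 || !Character.isLowSurrogate((char) unit1)) {
            throw new IOException("High surrogate not followed by low surrogate");
        }
        return Character.toCodePoint(c0, (char) unit1);
    }

    public static void main(String[] args) throws IOException {
        StrictCodePointReader in =
                new StrictCodePointReader(new StringReader("a\uD83C\uDF55"));
        for (int cp = in.read(); cp != -1; cp = in.read()) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Whether a checked IOException or an unchecked exception fits better depends on how the caller wants to handle bad input.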




























  • Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
    – Isabelle Newbie
    Nov 13 '18 at 20:16












  • Thanks for the hint regarding the Character class. I've updated the code accordingly.
    – Codo
    Nov 13 '18 at 20:35
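
For reference, the arithmetic that Character.toCodePoint hides can be written out by hand. The earlier revision of the answer is not shown here, so the constants below are only a guess at the kind of "magic constants" meant; they are the standard UTF-16 values. SurrogateMath and compose are names invented for this sketch.

```java
// Sketch: composing a surrogate pair by hand versus Character.toCodePoint.
class SurrogateMath {
    // Each surrogate carries 10 bits of (code point - 0x10000):
    // the high surrogate the upper 10, the low surrogate the lower 10.
    static int compose(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD83C';
        char low = '\uDF55';
        System.out.println(Integer.toHexString(compose(high, low)));               // 1f355
        System.out.println(compose(high, low) == Character.toCodePoint(high, low)); // true
    }
}
```

For the pizza emoji: (0xD83C - 0xD800) << 10 is 0xF000, adding 0xDF55 - 0xDC00 = 0x355 gives 0xF355, and adding 0x10000 gives 0x1F355.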











edited Nov 13 '18 at 1:10

answered Nov 13 '18 at 0:44
Shawn












  • Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
    – Isabelle Newbie
    Nov 13 '18 at 20:14


















  • Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
    – Isabelle Newbie
    Nov 13 '18 at 20:14
















Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14




Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14













1














You can wrap the Reader instance with a simple class the decodes surrogate pairs:



import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

private Reader reader;

public CodepointStream(Reader reader) {
this.reader = reader;
}

public int read() throws IOException {
int unit0 = reader.read();
if (unit0 < 0)
return unit0; // EOF

if (!Character.isHighSurrogate((char)unit0))
return unit0;

int unit1 = reader.read();
if (unit1 < 0)
return unit1; // EOF

if (!Character.isLowSurrogate((char)unit1))
throw new RuntimeException("Invalid surrogate pair");

return Character.toCodePoint((char)unit0, (char)unit1);
}

public void close() throws IOException {
reader.close();
reader = null;
}
}


The main functions needs to be slightly modified:



import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
public static void main(String args) {
CodepointStream reader = new CodepointStream(
new InputStreamReader(System.in, StandardCharsets.UTF_8));
try {
System.out.print("> ");
int code = reader.read();
while (code != -1) {
String s =
String.format("Code %x is `%s', %s.",
code,
Character.getName(code),
new String(Character.toChars(code)));
System.out.println(s);
code = reader.read();
}
} catch (Exception e) {
}
}
}


Then your output becomes:



> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.





share|improve this answer























  • Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
    – Isabelle Newbie
    Nov 13 '18 at 20:16












  • Thanks for the hint regarding the Character class. I've updated the code accordingly.
    – Codo
    Nov 13 '18 at 20:35
















1














You can wrap the Reader instance with a simple class the decodes surrogate pairs:



import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

private Reader reader;

public CodepointStream(Reader reader) {
this.reader = reader;
}

public int read() throws IOException {
int unit0 = reader.read();
if (unit0 < 0)
return unit0; // EOF

if (!Character.isHighSurrogate((char)unit0))
return unit0;

int unit1 = reader.read();
if (unit1 < 0)
return unit1; // EOF

if (!Character.isLowSurrogate((char)unit1))
throw new RuntimeException("Invalid surrogate pair");

return Character.toCodePoint((char)unit0, (char)unit1);
}

public void close() throws IOException {
reader.close();
reader = null;
}
}


The main functions needs to be slightly modified:



import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
public static void main(String args) {
CodepointStream reader = new CodepointStream(
new InputStreamReader(System.in, StandardCharsets.UTF_8));
try {
System.out.print("> ");
int code = reader.read();
while (code != -1) {
String s =
String.format("Code %x is `%s', %s.",
code,
Character.getName(code),
new String(Character.toChars(code)));
System.out.println(s);
code = reader.read();
}
} catch (Exception e) {
}
}
}


Then your output becomes:



> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.





share|improve this answer























  • Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
    – Isabelle Newbie
    Nov 13 '18 at 20:16












  • Thanks for the hint regarding the Character class. I've updated the code accordingly.
    – Codo
    Nov 13 '18 at 20:35














1












1








1






You can wrap the Reader instance with a simple class the decodes surrogate pairs:



import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

private Reader reader;

public CodepointStream(Reader reader) {
this.reader = reader;
}

public int read() throws IOException {
int unit0 = reader.read();
if (unit0 < 0)
return unit0; // EOF

if (!Character.isHighSurrogate((char)unit0))
return unit0;

int unit1 = reader.read();
if (unit1 < 0)
return unit1; // EOF

if (!Character.isLowSurrogate((char)unit1))
throw new RuntimeException("Invalid surrogate pair");

return Character.toCodePoint((char)unit0, (char)unit1);
}

public void close() throws IOException {
reader.close();
reader = null;
}
}


The main functions needs to be slightly modified:



import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
public static void main(String args) {
CodepointStream reader = new CodepointStream(
new InputStreamReader(System.in, StandardCharsets.UTF_8));
try {
System.out.print("> ");
int code = reader.read();
while (code != -1) {
String s =
String.format("Code %x is `%s', %s.",
code,
Character.getName(code),
new String(Character.toChars(code)));
System.out.println(s);
code = reader.read();
}
} catch (Exception e) {
}
}
}


Then your output becomes:



> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.





share|improve this answer














You can wrap the Reader instance with a simple class the decodes surrogate pairs:



import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

private Reader reader;

public CodepointStream(Reader reader) {
this.reader = reader;
}

public int read() throws IOException {
int unit0 = reader.read();
if (unit0 < 0)
return unit0; // EOF

if (!Character.isHighSurrogate((char)unit0))
return unit0;

int unit1 = reader.read();
if (unit1 < 0)
return unit1; // EOF

if (!Character.isLowSurrogate((char)unit1))
throw new RuntimeException("Invalid surrogate pair");

return Character.toCodePoint((char)unit0, (char)unit1);
}

public void close() throws IOException {
reader.close();
reader = null;
}
}


The main functions needs to be slightly modified:



import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
public static void main(String args) {
CodepointStream reader = new CodepointStream(
new InputStreamReader(System.in, StandardCharsets.UTF_8));
try {
System.out.print("> ");
int code = reader.read();
while (code != -1) {
String s =
String.format("Code %x is `%s', %s.",
code,
Character.getName(code),
new String(Character.toChars(code)));
System.out.println(s);
code = reader.read();
}
} catch (Exception e) {
}
}
}


Then your output becomes:



> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.
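On the question's aside about malformed input: as far as I know, an InputStreamReader constructed with a Charset silently substitutes U+FFFD for bytes it cannot decode, so broken UTF-8 never reaches the surrogate check as an error. One possible way to get an exception instead is to pass the reader a CharsetDecoder configured with CodingErrorAction.REPORT. The sketch below assumes that approach; StrictUtf8Demo, strictReader and isMalformed are made-up names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class StrictUtf8Demo {

    // Builds a reader that reports malformed UTF-8 instead of replacing it.
    static Reader strictReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Returns true if reading the bytes to the end raises a decoding error.
    static boolean isMalformed(byte[] bytes) throws IOException {
        try (Reader reader = strictReader(new ByteArrayInputStream(bytes))) {
            while (reader.read() != -1) {
                // just drain the input
            }
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] good = "pizza \uD83C\uDF55".getBytes(StandardCharsets.UTF_8);
        byte[] bad = { 'p', (byte) 0xFF, 'z' }; // 0xFF never occurs in valid UTF-8
        System.out.println(isMalformed(good));
        System.out.println(isMalformed(bad));
    }
}
```

A reader built this way can be passed to CodepointStream in place of the plain InputStreamReader; decoding errors then surface as an IOException subclass from read().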














edited Nov 13 '18 at 20:34

























answered Nov 12 '18 at 22:52









Codo












  • Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
    – Isabelle Newbie
    Nov 13 '18 at 20:16












  • Thanks for the hint regarding the Character class. I've updated the code accordingly.
    – Codo
    Nov 13 '18 at 20:35



































