Read text stream codepoint by codepoint
I'm trying to read Unicode codepoints from a text file in Java. The InputStreamReader class returns the stream's contents int by int, which I hoped would do what I want, but it does not compose surrogate pairs.
My test program:
import java.io.*;
import java.nio.charset.*;

class TestChars {
    public static void main(String[] args) {
        InputStreamReader reader =
            new InputStreamReader(System.in, StandardCharsets.UTF_8);
        try {
            System.out.print("> ");
            // read() returns one UTF-16 code unit at a time, or -1 at end of stream.
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }
}
This behaves as follows:
$ java TestChars
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code d83c is `HIGH SURROGATES D83C', ?.
Code df55 is `LOW SURROGATES DF55', ?.
Code a is `LINE FEED (LF)',
.
My problem is that the surrogate pairs making up the pizza emoji are read separately. I would like to read the symbol into a single int and be done with it.
Question: Is there a reader(-like) class that will automatically compose surrogate pairs to characters while reading? (And, presumably, throws an exception if the input is malformed.)
I know I could compose the pairs myself, but I would prefer avoiding reinventing the wheel.
java unicode
asked Nov 12 '18 at 22:23 by Isabelle Newbie, edited Nov 12 '18 at 22:35
The int value returned by read() is a UTF-16 char value, not a Unicode codepoint. The only reason it is type int is so it can also return -1. The code is doing exactly what it is supposed to do, i.e. return UTF-16 surrogate pairs.
– Andreas
Nov 12 '18 at 22:45
I understand that this class does not do what I want, which is why my question was whether there was another standard class that does do what I want.
– Isabelle Newbie
Nov 12 '18 at 22:49
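To make the point about UTF-16 code units concrete, here is a small sketch that prints the code units Java uses for a given code point. The class name SurrogateDemo and the helper utf16Units are made up for illustration; the only library calls are Character.toChars and Character.toCodePoint.

```java
class SurrogateDemo {
    // Hypothetical helper: the UTF-16 code units of one code point,
    // as lowercase hex separated by spaces.
    public static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // KEYBOARD is in the Basic Multilingual Plane: one code unit.
        System.out.println(utf16Units(0x2328));   // prints "2328"
        // SLICE OF PIZZA is a supplementary character: two code units.
        System.out.println(utf16Units(0x1F355));  // prints "d83c df55"
        // Combining the pair gives back the code point.
        int cp = Character.toCodePoint((char) 0xd83c, (char) 0xdf55);
        System.out.println(Integer.toHexString(cp)); // prints "1f355"
    }
}
```

The two values printed for the pizza emoji are exactly the d83c and df55 that show up as separate reads in the question's output.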
2 Answers
If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:
import java.io.*;

class cptest {
    public static void main(String[] args) {
        // Decode stdin as UTF-8 and read it line by line.
        try (BufferedReader br =
                 new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {
            // String.codePoints() combines surrogate pairs into single codepoints.
            // Note that lines() strips the line terminators.
            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }

    private static void print(int cp) {
        String s = new String(Character.toChars(cp));
        System.out.println("Character " + cp + ": " + s);
    }
}
will produce
$ java cptest <<< "keyboard ⌨. pizza 🍕"
Character 107: k
Character 101: e
Character 121: y
Character 98: b
Character 111: o
Character 97: a
Character 114: r
Character 100: d
Character 32:
Character 9000: ⌨
Character 46: .
Character 32:
Character 112: p
Character 105: i
Character 122: z
Character 122: z
Character 97: a
Character 32:
Character 127829: 🍕
answered Nov 13 '18 at 0:44 by Shawn, edited Nov 13 '18 at 1:10
Thank you. This looks like a reasonable way of making the Java library take care of the details. And since my application will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14
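For the lookahead mentioned in the comment above, one possible approach is to turn each line into an int array of codepoints and index into it. This is only a sketch: the class LineLookahead and the helper doubledAt are invented for illustration, and the "same codepoint twice in a row" check just stands in for whatever lookahead the real application needs.

```java
import java.util.ArrayList;
import java.util.List;

class LineLookahead {
    // Hypothetical helper: codepoint indices at which a codepoint is
    // immediately followed by an identical one.
    public static List<Integer> doubledAt(String line) {
        // One array element per codepoint, surrogate pairs already combined.
        int[] cps = line.codePoints().toArray();
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            if (cps[i] == cps[i + 1]) { // look one codepoint ahead
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "pizza " followed by two pizza emoji, written as escapes so the
        // source file encoding does not matter.
        String line = "pizza \uD83C\uDF55\uD83C\uDF55";
        System.out.println(doubledAt(line)); // prints "[2, 6]"
    }
}
```

The indices count codepoints, not chars, so the two emoji sit at positions 6 and 7 even though they occupy four chars in the String.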
You can wrap the Reader instance with a simple class that decodes surrogate pairs:
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {
    private Reader reader;

    public CodepointStream(Reader reader) {
        this.reader = reader;
    }

    // Returns the next Unicode codepoint, or -1 at end of stream.
    public int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0)
            return unit0; // EOF
        // Anything that is not a high surrogate is returned unchanged.
        if (!Character.isHighSurrogate((char)unit0))
            return unit0;
        // A high surrogate must be followed by a low surrogate.
        int unit1 = reader.read();
        if (unit1 < 0)
            return unit1; // EOF in the middle of a pair
        if (!Character.isLowSurrogate((char)unit1))
            throw new RuntimeException("Invalid surrogate pair");
        return Character.toCodePoint((char)unit0, (char)unit1);
    }

    @Override
    public void close() throws IOException {
        reader.close();
        reader = null;
    }
}
The main function needs to be slightly modified:
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
    public static void main(String[] args) {
        // Wrap the UTF-8 reader so that read() yields whole codepoints.
        CodepointStream reader = new CodepointStream(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }
}
Then your output becomes:
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.
answered Nov 12 '18 at 22:52 by Codo, edited Nov 13 '18 at 20:34
Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
– Isabelle Newbie
Nov 13 '18 at 20:16
Thanks for the hint regarding the Character class. I've updated the code accordingly.
– Codo
Nov 13 '18 at 20:35
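The question also asks for an exception on malformed input. Here is a sketch of one way to get that, assuming "malformed" means invalid UTF-8 bytes in the stream. As far as I know, an InputStreamReader built from a Charset replaces bad byte sequences with U+FFFD rather than reporting them, while a CharsetDecoder configured with CodingErrorAction.REPORT makes read() throw a MalformedInputException. The class StrictUtf8 and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Hypothetical helper: a UTF-8 reader that reports bad input
    // instead of silently substituting a replacement character.
    public static Reader strictReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    public static boolean isWellFormed(byte[] bytes) {
        try (Reader r = strictReader(new ByteArrayInputStream(bytes))) {
            while (r.read() != -1) {
                // Drain the reader; only decoding errors matter here.
            }
            return true;
        } catch (MalformedInputException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] good = "pizza \uD83C\uDF55".getBytes(StandardCharsets.UTF_8);
        byte[] bad = { 'a', (byte) 0xFF, 'b' }; // 0xFF never occurs in UTF-8
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

A reader from strictReader could be passed to either answer's code in place of the plain InputStreamReader, so that decoding errors surface as exceptions before any surrogate handling happens.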