Read text stream codepoint by codepoint

I'm trying to read Unicode codepoints from a text file in Java. The InputStreamReader class returns the stream's contents int by int, which I hoped would do what I want, but it does not compose surrogate pairs.

My test program:

import java.io.*;
import java.nio.charset.*;

class TestChars {
    public static void main(String[] args) {
        InputStreamReader reader =
            new InputStreamReader(System.in, StandardCharsets.UTF_8);
        try {
            System.out.print("> ");
            // read() returns one UTF-16 code unit per call, or -1 at end of input
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            // errors are ignored in this test program
        }
    }
}

This behaves as follows:

$ java TestChars
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE',  .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE',  .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE',  .
Code d83c is `HIGH SURROGATES D83C', ?.
Code df55 is `LOW SURROGATES DF55', ?.
Code a is `LINE FEED (LF)', 
.

My problem is that the two surrogates making up the pizza emoji are read separately. I would like to read the symbol into a single int and be done with it.

Question: Is there a reader(-like) class that will automatically compose surrogate pairs to characters while reading? (And, presumably, throws an exception if the input is malformed.)

I know I could compose the pairs myself, but I would prefer avoiding reinventing the wheel.

asked Nov 12 '18 at 22:23 by Isabelle Newbie (edited Nov 12 '18 at 22:35)

The int value returned by read() is a UTF-16 char value, not a Unicode codepoint. The only reason it is type int is so it can also return -1. The code is doing exactly what it is supposed to do, i.e. return UTF-16 surrogate pairs.
– Andreas
Nov 12 '18 at 22:45

I understand that this class does not do what I want, which is why my question was whether there was another standard class that does do what I want.
– Isabelle Newbie
Nov 12 '18 at 22:49
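
To make the distinction in the comment above concrete, here is a minimal sketch showing that the pizza emoji is a single code point but two UTF-16 code units. The class name SurrogateDemo and the helper units are made up for illustration; only String and Character methods from the standard library are used.

```java
class SurrogateDemo {
    // Describes a string as "<UTF-16 code units>/<code points>".
    static String units(String s) {
        return s.length() + "/" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F355 SLICE OF PIZZA lies outside the BMP, so it needs a surrogate pair.
        String pizza = new String(Character.toChars(0x1F355));

        // Number of char values versus number of code points.
        System.out.println(units(pizza));

        // The two code units that Reader.read() hands out one at a time.
        System.out.printf("%x %x%n", (int) pizza.charAt(0), (int) pizza.charAt(1));

        // The single code point that codePointAt() composes from them.
        System.out.printf("%x%n", pizza.codePointAt(0));
    }
}
```

The two code units printed here should match the d83c and df55 values in the question's output, which is why a char-based read() can never return 1f355 in one call.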

Tags: java, unicode


2 Answers

If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:

import java.io.*;

class cptest {
    public static void main(String[] args) {
        try (BufferedReader br =
                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {
            // Split the input into lines, then flatten each line into its code points.
            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }

    private static void print(int cp) {
        String s = new String(Character.toChars(cp));
        System.out.println("Character " + cp + ": " + s);
    }
}

will produce

$ java cptest <<< "keyboard ⌨. pizza 🍕"
Character 107: k
Character 101: e
Character 121: y
Character 98: b
Character 111: o
Character 97: a
Character 114: r
Character 100: d
Character 32:  
Character 9000: ⌨
Character 46: .
Character 32:  
Character 112: p
Character 105: i
Character 122: z
Character 122: z
Character 97: a
Character 32:  
Character 127829: 🍕

answered Nov 13 '18 at 0:44 by Shawn (edited Nov 13 '18 at 1:10)

Thank you. This looks like a reasonable way of making the Java library take care of the details. And as my application will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14
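
The lookahead mentioned in this comment could look roughly like the sketch below: turn each line into an int array of code points and index into it. The class LineCodePoints and its peek helper are hypothetical names, not part of the answer. Note also that lines() strips line terminators, which is why the answer's output has no LINE FEED entry.

```java
class LineCodePoints {
    // Converts one line into an array holding one int per Unicode code point.
    static int[] toCodePoints(String line) {
        return line.codePoints().toArray();
    }

    // Returns the code point `offset` positions away from index i,
    // or -1 if that position lies outside the line.
    static int peek(int[] cps, int i, int offset) {
        int j = i + offset;
        return (j >= 0 && j < cps.length) ? cps[j] : -1;
    }

    public static void main(String[] args) {
        String pizza = new String(Character.toChars(0x1F355));
        int[] cps = toCodePoints("pizza " + pizza + "!");
        for (int i = 0; i < cps.length; i++) {
            int next = peek(cps, i, 1); // one code point of lookahead
            System.out.println(Integer.toHexString(cps[i]) + " -> "
                    + (next < 0 ? "end" : Integer.toHexString(next)));
        }
    }
}
```

Because the array is indexed by code point rather than by char, a lookahead of one always lands on a whole character, even directly before or after the emoji.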


You can wrap the Reader instance with a simple class that decodes surrogate pairs:

import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

    private Reader reader;

    public CodepointStream(Reader reader) {
        this.reader = reader;
    }

    // Returns the next Unicode code point, or -1 at end of input.
    public int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0)
            return unit0; // EOF

        // Anything that is not a high surrogate is returned unchanged.
        if (!Character.isHighSurrogate((char)unit0))
            return unit0;

        int unit1 = reader.read();
        if (unit1 < 0)
            return unit1; // EOF right after a high surrogate

        if (!Character.isLowSurrogate((char)unit1))
            throw new RuntimeException("Invalid surrogate pair");

        // Combine the high and low surrogate into one code point.
        return Character.toCodePoint((char)unit0, (char)unit1);
    }

    public void close() throws IOException {
        reader.close();
        reader = null;
    }
}

The main function needs to be slightly modified:

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
    public static void main(String[] args) {
        CodepointStream reader = new CodepointStream(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try {
            System.out.print("> ");
            // read() now returns whole code points instead of UTF-16 code units
            int code = reader.read();
            while (code != -1) {
                String s =
                        String.format("Code %x is `%s', %s.",
                                code,
                                Character.getName(code),
                                new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            // report invalid surrogate pairs and I/O errors instead of hiding them
            System.err.println("Error: " + e);
        }
    }
}

Then your output becomes:

> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE',  .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE',  .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE',  .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)', 
.

answered Nov 12 '18 at 22:52 by Codo (edited Nov 13 '18 at 20:34)

Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
– Isabelle Newbie
Nov 13 '18 at 20:16

Thanks for the hint regarding the Character class. I've updated the code accordingly.
– Codo
Nov 13 '18 at 20:35
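
On the "throws an exception if the input is malformed" part of the question, a stricter variant might look like the sketch below. It is not taken from either answer. It assumes that a reader built from a Charset quietly replaces bad bytes with U+FFFD, so it builds the reader from a CharsetDecoder configured with CodingErrorAction.REPORT instead, and it also rejects a high surrogate at end of input and a lone low surrogate. The names StrictCodePoints, strictUtf8, next and isWellFormed are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictCodePoints {
    // Wraps a byte stream in a UTF-8 reader that reports malformed bytes
    // as an IOException instead of substituting a replacement character.
    static Reader strictUtf8(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT));
    }

    // Reads one code point, or -1 at end of input; rejects unpaired surrogates.
    static int next(Reader reader) throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0 || !Character.isSurrogate((char) unit0)) {
            return unit0; // EOF or an ordinary BMP character
        }
        // Only a high surrogate may start a pair; a lone low surrogate is an error.
        int unit1 = Character.isHighSurrogate((char) unit0) ? reader.read() : -1;
        if (unit1 < 0 || !Character.isLowSurrogate((char) unit1)) {
            throw new IOException("Unpaired surrogate: " + Integer.toHexString(unit0));
        }
        return Character.toCodePoint((char) unit0, (char) unit1);
    }

    // Hypothetical helper for quick checks: true if the bytes decode cleanly.
    static boolean isWellFormed(byte[] utf8) {
        try (Reader r = strictUtf8(new ByteArrayInputStream(utf8))) {
            while (next(r) != -1) {
                // just consume the input
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "pizza \uD83C\uDF55".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in valid UTF-8
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

Throwing IOException rather than RuntimeException is a design choice here: it keeps byte-level and surrogate-level errors on the same checked path, so a caller handles both in one catch block.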

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53270963%2fread-text-stream-codepoint-by-codepoint%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:

import java.io.*;



class cptest {

    public static void main(String args) {

        try (BufferedReader br =

                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {

            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);

        } catch (Exception e) {

            System.err.println("Error: " + e);

        }

    }

    private static void print(int cp) {

        String s = new String(Character.toChars(cp));

        System.out.println("Character " + cp + ": " + s);

    }

}

will produce

$ java cptest <<< "keyboard ⌨. pizza 🍕"

Character 107: k

Character 101: e

Character 121: y

Character 98: b

Character 111: o

Character 97: a

Character 114: r

Character 100: d

Character 32:  

Character 9000: ⌨

Character 46: .

Character 32:  

Character 112: p

Character 105: i

Character 122: z

Character 122: z

Character 97: a

Character 32:  

Character 127829: 🍕

edited Nov 13 '18 at 1:10

answered Nov 13 '18 at 0:44

Shawn

3,5731613

Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14

add a comment |

If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:

import java.io.*;



class cptest {

    public static void main(String args) {

        try (BufferedReader br =

                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {

            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);

        } catch (Exception e) {

            System.err.println("Error: " + e);

        }

    }

    private static void print(int cp) {

        String s = new String(Character.toChars(cp));

        System.out.println("Character " + cp + ": " + s);

    }

}

will produce

$ java cptest <<< "keyboard ⌨. pizza 🍕"

Character 107: k

Character 101: e

Character 121: y

Character 98: b

Character 111: o

Character 97: a

Character 114: r

Character 100: d

Character 32:  

Character 9000: ⌨

Character 46: .

Character 32:  

Character 112: p

Character 105: i

Character 122: z

Character 122: z

Character 97: a

Character 32:  

Character 127829: 🍕

edited Nov 13 '18 at 1:10

answered Nov 13 '18 at 0:44

Shawn

3,5731613

Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14

add a comment |

If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:

import java.io.*;



class cptest {

    public static void main(String args) {

        try (BufferedReader br =

                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {

            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);

        } catch (Exception e) {

            System.err.println("Error: " + e);

        }

    }

    private static void print(int cp) {

        String s = new String(Character.toChars(cp));

        System.out.println("Character " + cp + ": " + s);

    }

}

will produce

$ java cptest <<< "keyboard ⌨. pizza 🍕"

Character 107: k

Character 101: e

Character 121: y

Character 98: b

Character 111: o

Character 97: a

Character 114: r

Character 100: d

Character 32:  

Character 9000: ⌨

Character 46: .

Character 32:  

Character 112: p

Character 105: i

Character 122: z

Character 122: z

Character 97: a

Character 32:  

Character 127829: 🍕

edited Nov 13 '18 at 1:10

answered Nov 13 '18 at 0:44

Shawn

3,5731613

If you take advantage of String having a method that returns a stream of codepoints, you don't have to deal with surrogate pairs yourself:

import java.io.*;



class cptest {

    public static void main(String args) {

        try (BufferedReader br =

                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {

            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);

        } catch (Exception e) {

            System.err.println("Error: " + e);

        }

    }

    private static void print(int cp) {

        String s = new String(Character.toChars(cp));

        System.out.println("Character " + cp + ": " + s);

    }

}

will produce

$ java cptest <<< "keyboard ⌨. pizza 🍕"

Character 107: k

Character 101: e

Character 121: y

Character 98: b

Character 111: o

Character 97: a

Character 114: r

Character 100: d

Character 32:  

Character 9000: ⌨

Character 46: .

Character 32:  

Character 112: p

Character 105: i

Character 122: z

Character 122: z

Character 97: a

Character 32:  

Character 127829: 🍕

edited Nov 13 '18 at 1:10

answered Nov 13 '18 at 0:44

Shawn

3,5731613

edited Nov 13 '18 at 1:10

answered Nov 13 '18 at 0:44

Shawn

3,5731613

answered Nov 13 '18 at 0:44

Shawn

3,5731613

answered Nov 13 '18 at 0:44

Shawn

3,5731613

Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14

add a comment |

Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14

Thank you. This looks like a reasonable way of making the Java library take care of the details. And as I in my application I will also need some lookahead within the same line, reading a whole line into a String will make that easier as well.
– Isabelle Newbie
Nov 13 '18 at 20:14

add a comment |

You can wrap the Reader instance with a simple class the decodes surrogate pairs:

import java.io.Closeable;

import java.io.IOException;

import java.io.Reader;



public class CodepointStream implements Closeable {



    private Reader reader;



    public CodepointStream(Reader reader) {

        this.reader = reader;

    }



    public int read() throws IOException {

        int unit0 = reader.read();

        if (unit0 < 0)

            return unit0; // EOF



        if (!Character.isHighSurrogate((char)unit0))

            return unit0;



        int unit1 = reader.read();

        if (unit1 < 0)

            return unit1; // EOF



        if (!Character.isLowSurrogate((char)unit1))

            throw new RuntimeException("Invalid surrogate pair");



        return Character.toCodePoint((char)unit0, (char)unit1);

    }



    public void close() throws IOException {

        reader.close();

        reader = null;

    }

}

The main functions needs to be slightly modified:

import java.io.InputStreamReader;

import java.nio.charset.StandardCharsets;



public final class App {

    public static void main(String args) {

        CodepointStream reader = new CodepointStream(

                new InputStreamReader(System.in, StandardCharsets.UTF_8));

        try {

            System.out.print("> ");

            int code = reader.read();

            while (code != -1) {

                String s =

                        String.format("Code %x is `%s', %s.",

                                code,

                                Character.getName(code),

                                new String(Character.toChars(code)));

                System.out.println(s);

                code = reader.read();

            }

        } catch (Exception e) {

        }

    }

}

Then your output becomes:

> keyboard ⌨. pizza 🍕

Code 6b is `LATIN SMALL LETTER K', k.

Code 65 is `LATIN SMALL LETTER E', e.

Code 79 is `LATIN SMALL LETTER Y', y.

Code 62 is `LATIN SMALL LETTER B', b.

Code 6f is `LATIN SMALL LETTER O', o.

Code 61 is `LATIN SMALL LETTER A', a.

Code 72 is `LATIN SMALL LETTER R', r.

Code 64 is `LATIN SMALL LETTER D', d.

Code 20 is `SPACE',  .

Code 2328 is `KEYBOARD', ⌨.

Code 2e is `FULL STOP', ..

Code 20 is `SPACE',  .

Code 70 is `LATIN SMALL LETTER P', p.

Code 69 is `LATIN SMALL LETTER I', i.

Code 7a is `LATIN SMALL LETTER Z', z.

Code 7a is `LATIN SMALL LETTER Z', z.

Code 61 is `LATIN SMALL LETTER A', a.

Code 20 is `SPACE',  .

Code 1f355 is `SLICE OF PIZZA', 🍕.

Code a is `LINE FEED (LF)', 

.

edited Nov 13 '18 at 20:34

answered Nov 12 '18 at 22:52

Codo

50.7k11110148

Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
– Isabelle Newbie
Nov 13 '18 at 20:16

Thanks for the hint regarding the Character class. I've updated the code accordingly.
– Codo
Nov 13 '18 at 20:35

add a comment |

You can wrap the Reader instance with a simple class the decodes surrogate pairs:

import java.io.Closeable;

import java.io.IOException;

import java.io.Reader;



public class CodepointStream implements Closeable {



    private Reader reader;



    public CodepointStream(Reader reader) {

        this.reader = reader;

    }



    public int read() throws IOException {

        int unit0 = reader.read();

        if (unit0 < 0)

            return unit0; // EOF



        if (!Character.isHighSurrogate((char)unit0))

            return unit0;



        int unit1 = reader.read();

        if (unit1 < 0)

            return unit1; // EOF



        if (!Character.isLowSurrogate((char)unit1))

            throw new RuntimeException("Invalid surrogate pair");



        return Character.toCodePoint((char)unit0, (char)unit1);

    }



    public void close() throws IOException {

        reader.close();

        reader = null;

    }

}

The main functions needs to be slightly modified:

import java.io.InputStreamReader;

import java.nio.charset.StandardCharsets;



public final class App {

    public static void main(String args) {

        CodepointStream reader = new CodepointStream(

                new InputStreamReader(System.in, StandardCharsets.UTF_8));

        try {

            System.out.print("> ");

            int code = reader.read();

            while (code != -1) {

                String s =

                        String.format("Code %x is `%s', %s.",

                                code,

                                Character.getName(code),

                                new String(Character.toChars(code)));

                System.out.println(s);

                code = reader.read();

            }

        } catch (Exception e) {

        }

    }

}

Then your output becomes:

> keyboard ⌨. pizza 🍕

Code 6b is `LATIN SMALL LETTER K', k.

Code 65 is `LATIN SMALL LETTER E', e.

Code 79 is `LATIN SMALL LETTER Y', y.

Code 62 is `LATIN SMALL LETTER B', b.

Code 6f is `LATIN SMALL LETTER O', o.

Code 61 is `LATIN SMALL LETTER A', a.

Code 72 is `LATIN SMALL LETTER R', r.

Code 64 is `LATIN SMALL LETTER D', d.

Code 20 is `SPACE',  .

Code 2328 is `KEYBOARD', ⌨.

Code 2e is `FULL STOP', ..

Code 20 is `SPACE',  .

Code 70 is `LATIN SMALL LETTER P', p.

Code 69 is `LATIN SMALL LETTER I', i.

Code 7a is `LATIN SMALL LETTER Z', z.

Code 7a is `LATIN SMALL LETTER Z', z.

Code 61 is `LATIN SMALL LETTER A', a.

Code 20 is `SPACE',  .

Code 1f355 is `SLICE OF PIZZA', 🍕.

Code a is `LINE FEED (LF)', 

.

edited Nov 13 '18 at 20:34

answered Nov 12 '18 at 22:52

Codo

50.7k11110148

Thank you. I accepted the other answer because it gave a way of making the Java library do the work. Otherwise this is similar to what I would have come up with. Note that you could use Character.toCodePoint and related methods to get rid of the magic constants.
– Isabelle Newbie
Nov 13 '18 at 20:16

Thanks for the hint regarding the Character class. I've updated the code accordingly.
– Codo
Nov 13 '18 at 20:35

add a comment |

You can wrap the Reader instance with a simple class the decodes surrogate pairs:

import java.io.Closeable;

import java.io.IOException;

import java.io.Reader;



public class CodepointStream implements Closeable {



    private Reader reader;



    public CodepointStream(Reader reader) {

        this.reader = reader;

    }



    public int read() throws IOException {

        int unit0 = reader.read();

        if (unit0 < 0)

            return unit0; // EOF



        if (!Character.isHighSurrogate((char)unit0))

            return unit0;



        int unit1 = reader.read();

        if (unit1 < 0)

            return unit1; // EOF



        if (!Character.isLowSurrogate((char)unit1))

            throw new RuntimeException("Invalid surrogate pair");



        return Character.toCodePoint((char)unit0, (char)unit1);

    }



    public void close() throws IOException {

        reader.close();

        reader = null;

    }

}

The main functions needs to be slightly modified:

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
    public static void main(String[] args) {
        CodepointStream reader = new CodepointStream(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                        String.format("Code %x is `%s', %s.",
                                code,
                                Character.getName(code),
                                new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
        }
    }
}

Then your output becomes:

> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE',  .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE',  .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE',  .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)', 
.
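
The wrapper above passes a lone low surrogate through unchanged and returns EOF when the input ends right after a high surrogate. If malformed input should be reported instead, a stricter variant could look like the sketch below. It is only an illustration under my own assumptions: the names CodepointDemo, readCodePoint and readAll are hypothetical, and throwing IOException for unpaired surrogates is a design choice, not something the answer prescribes. It reads from a StringReader so it can be tried without a terminal.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CodepointDemo {

    // Reads one code point from the reader, or returns -1 at end of input.
    // Any unpaired surrogate is reported as an IOException.
    public static int readCodePoint(Reader reader) throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0)
            return -1; // EOF

        char c0 = (char) unit0;
        if (!Character.isSurrogate(c0))
            return unit0; // BMP character: one UTF-16 code unit

        if (Character.isLowSurrogate(c0))
            throw new IOException("Low surrogate without preceding high surrogate");

        int unit1 = reader.read();
        if (unit1 < 0 || !Character.isLowSurrogate((char) unit1))
            throw new IOException("High surrogate without following low surrogate");

        return Character.toCodePoint(c0, (char) unit1);
    }

    // Collects every code point until end of input.
    public static List<Integer> readAll(Reader reader) throws IOException {
        List<Integer> result = new ArrayList<>();
        for (int cp = readCodePoint(reader); cp != -1; cp = readCodePoint(reader))
            result.add(cp);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // 'a', the keyboard symbol, and the pizza emoji as a surrogate pair.
        List<Integer> cps = readAll(new StringReader("a\u2328\uD83C\uDF55"));
        for (int cp : cps)
            System.out.printf("%x %s%n", cp, Character.getName(cp));
    }
}
```

Whether to throw, skip, or substitute U+FFFD for broken pairs depends on how tolerant the caller needs to be; throwing is simply the easiest behaviour to test.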

edited Nov 13 '18 at 20:34

answered Nov 12 '18 at 22:52

Codo

