Load a JSON with raw_unicode_escape encoded strings

I have a JSON file where strings are encoded in raw_unicode_escape (the file itself is UTF-8). How do I parse it so that strings will be UTF-8 in memory?

For individual properties, I could use the following code, but the JSON is very big and manually converting every string after parsing isn't an option.

# Contents of file 'file.json' ('\u00c3\u00a8' is the mojibake form of 'è')
# { "name": "\u00c3\u00a8" }
import json

with open('file.json', 'r') as input:
    j = json.load(input)
    j['name'] = j['name'].encode('raw_unicode_escape').decode('utf-8')
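
For illustration, the per-property fix could in principle be generalized with json.load's object_hook, which the parser calls with every decoded JSON object. This is only a sketch (fix_mojibake is a hypothetical helper, and it assumes the affected strings appear as direct values of JSON objects, not inside arrays); it also still parses the whole document in one go, so it doesn't meet the incremental requirement below.

import json

def fix_mojibake(obj):
    # Called by json.load with every decoded JSON object (dict):
    # re-encode each mojibake string value and decode it as UTF-8.
    return {
        key: value.encode('raw_unicode_escape').decode('utf-8')
        if isinstance(value, str) else value
        for key, value in obj.items()
    }

with open('file.json', 'r') as input:
    j = json.load(input, object_hook=fix_mojibake)

print(j['name'])  # è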

Since the JSON can be quite huge, the approach has to be "incremental" and I cannot read the whole file ahead of time, save it in a string and then do some processing.

Finally, I should note that the JSON is actually stored in a zip file, so instead of open() it's ZipFile.open().

python

asked Nov 10 at 18:39, edited Nov 12 at 10:56
Samuele Pilleri
  • You want your Python strings to be Python (Unicode) strings, plain and simple. You have no control over how Python manages its internal memory.
    – tripleee
    Nov 10 at 19:09

  • Are the lines extremely long and do they contain valid JSON fragments? In other words, could you process a line at a time, perhaps with some provisions for returning the data to the format you want?
    – tripleee
    Nov 10 at 21:12

  • Simply using json.load on { "name": "\u00c3\u00a8" } should decode those characters perfectly fine. That encoding is part of the JSON spec, and will be decoded by a compliant JSON decoder. "Raw Unicode escapes" are a red herring, they're not your problem.
    – deceze
    Nov 11 at 0:18

  • Okay, again: your problem is not JSON. Your problem is that the JSON has not encoded the characters correctly and you have JSON-encoded mojibake, which you can fix with that workaround of yours. But the real fix should be wherever that JSON is coming from. Is that possible? Do you control the encoding side? Or can you at least contact that developer and ask them to fix their encoding?
    – deceze
    Nov 11 at 3:35

  • I concur with deceze. If you can fix the thing which produces garbage like this, or preprocess the file separately to fix it up, you don't need to fix the reader.
    – tripleee
    Nov 11 at 9:44
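
To make concrete the mojibake mechanism deceze describes above, here is a minimal sketch of how such a file is plausibly produced (an assumption about the producer, not something confirmed in the thread):

import json

# 'è' encoded as UTF-8 is b'\xc3\xa8'. A producer that mistakenly decodes
# those bytes as Latin-1 gets the two characters 'Ã¨', which json.dumps
# then escapes as \u00c3\u00a8 -- exactly the file's content.
name = 'è'.encode('utf-8').decode('latin-1')
print(json.dumps({"name": name}))  # {"name": "\u00c3\u00a8"}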

1 Answer

Since codecs.open('file.json', 'r', 'raw_unicode_escape') works somehow, I took a look at its source code and came up with a solution.

>>> import json
>>> from codecs import getreader
>>>
>>> with open('file.json', 'r') as input:
...     reader = getreader('raw_unicode_escape')(input)
...     j = json.loads(reader.read().encode('raw_unicode_escape'))
...     print(j['name'])
...
è


Of course, that will work even if input is another type of file-like object, like a file inside a zip archive in my case.
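
For instance, here is a minimal sketch using the standard library's zipfile (the archive and member names are placeholders): ZipFile.open() yields a binary file-like object, which is exactly what the codecs stream reader expects.

import json
from codecs import getreader
from zipfile import ZipFile

with ZipFile('archive.zip') as archive:        # placeholder archive name
    with archive.open('file.json') as input:   # binary file-like object
        reader = getreader('raw_unicode_escape')(input)
        j = json.loads(reader.read().encode('raw_unicode_escape'))

print(j['name'])  # è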

Eventually, I've turned down the hypothesis of an incremental encoder (it doesn't make sense with JSON), but for those interested I suggest taking a look at this answer as well as codecs.iterencode().

answered Nov 12 at 10:56, edited Nov 15 at 9:56 (accepted)
Samuele Pilleri
