Load a JSON with raw_unicode_escape encoded strings

I have a JSON file where the strings are encoded in raw_unicode_escape (the file itself is UTF-8). How do I parse it so that the strings come out as properly decoded Unicode strings in memory?



For individual properties, I could use the following code, but the JSON is very big and manually converting every string after parsing isn't an option.



import json

# Contents of file 'file.json' ('\u00c3\u00a8' is the mojibake form of 'è')
# { "name": "\u00c3\u00a8" }
with open('file.json', 'r') as input:
    j = json.load(input)
j['name'] = j['name'].encode('raw_unicode_escape').decode('utf-8')


Since the JSON can be quite huge, the approach has to be "incremental": I cannot read the whole file ahead of time, store it in a string, and then do the processing.



Finally, I should note that the JSON is actually stored in a zip file, so instead of open() it's ZipFile.open().
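
For reference, a minimal sketch of what that zip variant looks like (the archive name 'data.zip' is made up; ZipFile.open() returns a binary file-like object, so it is wrapped for text access here):

import io
import json
import zipfile

with zipfile.ZipFile('data.zip') as archive:        # hypothetical archive name
    with archive.open('file.json') as raw:          # binary file-like object
        j = json.load(io.TextIOWrapper(raw, encoding='utf-8'))

This naive version still produces the mojibake strings, of course; the question is how to avoid fixing them one by one.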

python

asked Nov 10 at 18:39, edited Nov 12 at 10:56 – Samuele Pilleri

  • You want your Python strings to be Python (Unicode) strings, plain and simple. You have no control over how Python manages its internal memory.
    – tripleee, Nov 10 at 19:09

  • Are the lines extremely long and do they contain valid JSON fragments? In other words, could you process a line at a time, perhaps with some provisions for returning the data to the format you want?
    – tripleee, Nov 10 at 21:12

  • Simply using json.load on { "name": "\u00c3\u00a8" } should decode those characters perfectly fine. That encoding is part of the JSON spec, and will be decoded by a compliant JSON decoder. "Raw Unicode escapes" are a red herring, they're not your problem.
    – deceze, Nov 11 at 0:18

  • Okay, again: your problem is not JSON. Your problem is that the JSON has not encoded the characters correctly and you have JSON-encoded mojibake, which you can fix with that workaround of yours. But the real fix should be wherever that JSON is coming from. Is that possible? Do you control the encoding side? Or can you at least contact that developer and ask them to fix their encoding?
    – deceze, Nov 11 at 3:35

  • I concur with deceze. If you can fix the thing which produces garbage like this, or preprocess the file separately to fix it up, you don't need to fix the reader.
    – tripleee, Nov 11 at 9:44
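
As deceze's comments point out, the \u escapes themselves are valid JSON: a compliant parser decodes them, and the result is mojibake, which the byte-level round trip then repairs. A minimal demonstration:

import json

# A compliant parser happily decodes the \uXXXX escapes...
s = json.loads(r'{ "name": "\u00c3\u00a8" }')['name']
print(s)  # 'Ã¨' -- the UTF-8 bytes of 'è' mistaken for code points

# ...so the remaining problem is undoing the mojibake:
print(s.encode('raw_unicode_escape').decode('utf-8'))  # 'è'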

1 Answer

Since codecs.open('file.json', 'r', 'raw_unicode_escape') works somehow, I took a look at its source code and came up with a solution.



>>> import json
>>> from codecs import getreader
>>>
>>> with open('file.json', 'r') as input:
...     reader = getreader('raw_unicode_escape')(input)
...     j = json.loads(reader.read().encode('raw_unicode_escape'))
...     print(j['name'])
...
è


Of course, that will work even if input is another type of file-like object, like a file inside a zip archive in my case.
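
The zip case then looks something like this (a sketch, assuming a hypothetical archive 'data.zip'; ZipFile.open() returns a binary stream, which is exactly what a codecs stream reader expects):

import json
import zipfile
from codecs import getreader

with zipfile.ZipFile('data.zip') as archive:             # hypothetical archive name
    with archive.open('file.json') as input:             # binary file-like object
        reader = getreader('raw_unicode_escape')(input)  # bytes -> str, resolving \uXXXX
        j = json.loads(reader.read().encode('raw_unicode_escape'))

print(j['name'])  # è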



Eventually I turned down the idea of an incremental encoder (it doesn't make sense for JSON), but for those interested I suggest taking a look at this answer as well as codecs.iterencode().
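
For illustration, the encoding leg of the fix can run incrementally over chunks of text with codecs.iterencode(); what is missing is an incremental JSON parser to feed the output into. A small sketch with made-up chunks:

import codecs

# Mojibake text arriving in pieces, e.g. from an incremental read
chunks = ['{ "name": "Ã', '¨" }']

# The encoding step itself is incremental; the join is only to show the result
fixed = b''.join(codecs.iterencode(chunks, 'raw_unicode_escape'))
print(fixed.decode('utf-8'))  # { "name": "è" }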

answered Nov 12 at 10:56, edited Nov 15 at 9:56 – Samuele Pilleri (accepted)