Load a JSON with raw_unicode_escape encoded strings
I have a JSON file where strings are encoded in raw_unicode_escape
(the file itself is UTF-8). How do I parse it so that strings will be UTF-8 in memory?
For individual properties, I could use the following code, but the JSON is very big and manually converting every string after parsing isn't an option.
import json

# Contents of file 'file.json' ('\u00c3\u00a8' is 'è'):
# { "name": "\u00c3\u00a8" }
with open('file.json', 'r') as input:
    j = json.load(input)
    j['name'] = j['name'].encode('raw_unicode_escape').decode('utf-8')
Since the JSON can be quite huge, the approach has to be "incremental": I cannot read the whole file into a string up front and then post-process it.
Finally, I should note that the JSON is actually stored in a zip file, so instead of open() it's ZipFile.open().
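For context, the file comes out of the archive roughly like this (the archive name 'archive.zip' is just a placeholder); json.load accepts the binary stream on Python 3.6+, but the strings still come out mangled:

import json
import zipfile

with zipfile.ZipFile('archive.zip') as zf:      # placeholder archive name
    with zf.open('file.json') as input:         # binary file-like object
        j = json.load(input)                    # parses fine, but j['name'] is mojibake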
python
asked Nov 10 at 18:39, edited Nov 12 at 10:56 · Samuele Pilleri
You want your Python strings to be Python (Unicode) strings, plain and simple. You have no control over how Python manages its internal memory.
– tripleee
Nov 10 at 19:09
Are the lines extremely long and do they contain valid JSON fragments? In other words, could you process a line at a time, perhaps with some provisions for returning the data to the format you want?
– tripleee
Nov 10 at 21:12
Simply using json.load on { "name": "\u00c3\u00a8" } should decode those characters perfectly fine. That encoding is part of the JSON spec, and will be decoded by a compliant JSON decoder. "Raw Unicode escapes" are a red herring; they're not your problem.
– deceze♦
Nov 11 at 0:18
Okay, again: your problem is not JSON. Your problem is that the JSON has not encoded the characters correctly and you have JSON-encoded mojibake, which you can fix with that workaround of yours. But the real fix should be wherever that JSON is coming from. Is that possible? Do you control the encoding side? Or can you at least contact that developer and ask them to fix their encoding?
– deceze♦
Nov 11 at 3:35
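A minimal round trip illustrating the mojibake deceze describes, using the JSON literal from the question:

import json

s = json.loads(r'{ "name": "\u00c3\u00a8" }')['name']
print(s)  # Ã¨ -- the UTF-8 bytes of 'è' read back as Latin-1 code points
print(s.encode('raw_unicode_escape').decode('utf-8'))  # è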
I concur with deceze. If you can fix the thing which produces garbage like this, or preprocess the file separately to fix it up, you don't need to fix the reader.
– tripleee
Nov 11 at 9:44
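A one-off preprocessing pass along the lines tripleee suggests might look like this (the file names are placeholders; it assumes the file fits in memory for this single fix-up run, and that every string in it is mojibake of the same form):

import json

def fix(value):
    # Recursively re-decode every string in the parsed document.
    if isinstance(value, str):
        return value.encode('raw_unicode_escape').decode('utf-8')
    if isinstance(value, list):
        return [fix(v) for v in value]
    if isinstance(value, dict):
        return {fix(k): fix(v) for k, v in value.items()}
    return value

with open('file.json', 'r') as src:
    j = fix(json.load(src))
with open('file_fixed.json', 'w', encoding='utf-8') as dst:
    json.dump(j, dst, ensure_ascii=False)   # write real UTF-8, no escapes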
1 Answer
Since codecs.open('file.json', 'r', 'raw_unicode_escape') works somehow, I took a look at its source code and came up with a solution.
>>> import json
>>> from codecs import getreader
>>>
>>> with open('file.json', 'r') as input:
...     reader = getreader('raw_unicode_escape')(input)
...     j = json.loads(reader.read().encode('raw_unicode_escape'))
...     print(j['name'])
...
è
Of course, this will work even if input is another type of file-like object, such as a file inside a zip archive in my case.
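For instance, the same reader wrapped around a zip member might look like this (the archive name is an assumption). Decoding with raw_unicode_escape turns the \uXXXX escapes into the mojibake code points, and re-encoding maps those code points back to the original UTF-8 bytes, which json.loads then decodes correctly:

import json
import zipfile
from codecs import getreader

with zipfile.ZipFile('archive.zip') as zf:      # placeholder archive name
    with zf.open('file.json') as input:         # binary file-like object
        reader = getreader('raw_unicode_escape')(input)
        j = json.loads(reader.read().encode('raw_unicode_escape'))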
Eventually, I turned down the hypothesis of an incremental encoder (it doesn't make sense with JSON), but for those interested I suggest taking a look at this answer as well as codecs.iterencode().
answered Nov 12 at 10:56, edited Nov 15 at 9:56 · Samuele Pilleri