python - How to use threading to help parse a large file


Alright, the file is 410k lines of code. Right now I parse it in 1.4 seconds, but I need it to be faster. There are a couple of weird things about the file though...

The file is structured like this (thanks, ARM): it is ARM fromelf output.
I parse it into a map where the key is the name of the structure, which in some cases can be duplicated due to ARM generating warnings. The values in that case are the fields that follow.

Is there a way I can use threads to split the task across multiple threads, all adding data to the same map?

P.S. I'm not looking for someone to do it for me; I provided the example of the file structure so you understand that I can't simply process each line on its own, but rather have to process [start:finish] chunks based on the structure.

Per request, a sample of what I'm parsing:

    ; structure, table , size 0x104 bytes, inputfile.cpp
    |table.tablesize|                        equ    0        ;  int
    |table.data|                             equ    0x4      ;  array[64] of myclasshandle
    ; end of structure table

    ; structure, box2 , size 0x8 bytes, inputfile.cpp
    |box2.|                                  equ    0        ;  anonymous
    |box2..|                                 equ    0        ;  anonymous
    |box2...min|                             equ    0        ;  point2
    |box2...min.x|                           equ    0        ;  short
    |box2...min.y|                           equ    0x2      ;  short
    |box2...max|                             equ    0x4      ;  point2
    |box2...max.x|                           equ    0x4      ;  short
    |box2...max.y|                           equ    0x6      ;  short
    ; warning: duplicate name (box2..) present in (inputfile.cpp) and in (inputfile.cpp)
    ; please use --qualify option
    |box2..|                                 equ    0        ;  anonymous
    |box2...left|                            equ    0        ;  unsigned short
    |box2...top|                             equ    0x2      ;  unsigned short
    |box2...right|                           equ    0x4      ;  unsigned short
    |box2...bottom|                          equ    0x6      ;  unsigned short
    ; end of structure box2

    ; structure, myclasshandle , size 0x4 bytes, inputfile.cpp
    |myclasshandle.handle|                   equ    0        ;  pointer myclass
    ; end of structure myclasshandle

    ; structure, point2 , size 0x4 bytes, defects.cpp
    |point2.x|                               equ    0        ;  short
    |point2.y|                               equ    0x2      ;  short
    ; end of structure point2

    ; structure, __fpos_t_struct , size 0x10 bytes, c:\program files\ds-5\bin\..\include\stdio.h
    |__fpos_t_struct.__pos|                  equ    0        ;  unsigned long long
    |__fpos_t_struct.__mbstate|              equ    0x8      ;  anonymous
    |__fpos_t_struct.__mbstate.__state1|     equ    0x8      ;  unsigned int
    |__fpos_t_struct.__mbstate.__state2|     equ    0xc      ;  unsigned int
    ; end of structure __fpos_t_struct

    end

You would be better off optimizing your parser code, or writing the parser in a different language.

In the standard Python implementation ("CPython"), the only way to multiprocess is to use the multiprocessing module, which relies on multiple Unix processes rather than threads (threading doesn't help with compute-bound tasks because of the global interpreter lock). You can use shared-memory objects and shared dictionaries (see Managers) for inter-process communication, but that communication is costly and can consume the advantage of multitasking.
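
To make the Manager idea concrete, here is a minimal sketch, assuming the file has already been split into byte ranges; the parsing itself is elided and the two ranges are placeholders. The point is only the mechanics: every store into the managed dict is a message to the manager process, which is exactly where the inter-process cost comes from.

    from multiprocessing import Manager, Process

    def worker(start, end, shared):
        # ... parse bytes [start, end) of the file here (elided) ...
        # each assignment below is one round trip to the manager process
        shared[f"section-at-{start}"] = ["field lines would go here"]

    if __name__ == "__main__":
        with Manager() as manager:
            shared = manager.dict()
            jobs = [Process(target=worker, args=(s, e, shared))
                    for s, e in ((0, 500_000), (500_000, 1_000_000))]   # placeholder ranges
            for job in jobs:
                job.start()
            for job in jobs:
                job.join()
            result = dict(shared)   # snapshot back into an ordinary dict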

If the individual threads don't require information about global structures during the parse, each one could create its own dictionary, and you could merge the dictionaries at the end (the sketch after the list below takes that approach). It's easy enough to send a (picklable) Python object from one process to another, but consider the following: your task is to parse a textual representation and create an internal representation. Pickling and unpickling an object consists of taking the internal representation, producing a string from it, and parsing that string at the other end of the communications channel. In other words, your parsing task generates another parsing task, with the additional overhead of serialization. That's unlikely to be much of a win, except that the unpickler may be faster than the parser you have written. Which brings us back to optimizing the parser.

The one part of the parallelization problem that is straightforward is splitting the tasks between processes. Assuming the chunks to be parsed (start:finish) are not huge -- that is, the 410k lines consist of, say, several thousand such subtasks -- there is a simple strategy (a rough sketch in code follows the list):

  1. Find the size of the file, and divide it by the number of tasks (see below).
  2. Give each task a byte range: [task_number * task_size, (task_number + 1) * task_size).
  3. Each task does the following:
    1. Open the file (so each task has its own file descriptor).
    2. Seek to the start byte position.
    3. Read and discard until the end of the line.
    4. Read lines, discarding them, until the start of a section is found.
    5. Loop:
      1. Parse the section.
      2. Read until the first line of the start of the next section.
      3. If the position of the first character of that start line is within the range, continue the loop.
    6. Report the result.
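
Here is a rough sketch of that strategy, using multiprocessing for the reason given above. The file name, the number of tasks, and the section-matching rules (headers starting with "; structure,", field lines starting with "|", terminators starting with "; end of structure") are assumptions taken from the sample in the question, not a drop-in implementation; duplicated structure names are kept by storing a list of field lists per name.

    import os
    from multiprocessing import Pool

    FILENAME = "structures.txt"   # assumed input file name
    NUM_TASKS = 4                 # tuning parameter

    def parse_range(byte_range):
        start, end = byte_range
        structures = {}
        with open(FILENAME, "rb") as f:
            # steps 3.1-3.3: own file handle, seek, discard the partial line we land in
            if start > 0:
                f.seek(start - 1)
                f.readline()   # if start-1 is a newline this consumes only that newline,
                               # so a section header starting exactly at `start` is kept
            # step 3.4: skip lines until a section header that starts inside our range
            while True:
                pos = f.tell()
                line = f.readline()
                if not line or pos >= end:
                    return structures          # no section starts inside this range
                if line.startswith(b"; structure,"):
                    break
            # step 3.5: parse sections; keep going while each new header starts in range
            while True:
                # 3.5.1: parse one section (its header line is in `line`)
                name = line.split(b",")[1].strip().decode()
                fields = []
                for body in iter(f.readline, b""):
                    if body.startswith(b"; end of structure"):
                        break
                    if body.startswith(b"|"):
                        fields.append(body.decode().rstrip())
                structures.setdefault(name, []).append(fields)   # duplicate names kept
                # 3.5.2: read until the first line of the next section
                while True:
                    pos = f.tell()
                    line = f.readline()
                    if not line:
                        return structures
                    if line.startswith(b"; structure,"):
                        break
                # 3.5.3: continue only if that header starts within our byte range
                if pos >= end:
                    return structures

    if __name__ == "__main__":
        # steps 1 and 2: find the file size and hand each task a byte range
        size = os.path.getsize(FILENAME)
        task_size = size // NUM_TASKS + 1
        ranges = [(i * task_size, (i + 1) * task_size) for i in range(NUM_TASKS)]
        with Pool(NUM_TASKS) as pool:
            partials = pool.map(parse_range, ranges)
        # step 3.6: merge the per-process dictionaries
        merged = {}
        for partial in partials:
            for name, entries in partial.items():
                merged.setdefault(name, []).extend(entries)

Each worker builds an ordinary private dictionary, so there is no shared state during the parse; the only serialization cost is returning one dictionary per byte range.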

The problem with this simple algorithm is that it assumes the cost of the parse is strictly proportional to the number of characters parsed, and that all the threads execute at the same speed. Since neither of those assumptions is likely to hold, it is quite possible that some threads finish considerably before others and then spin their wheels waiting for more work.

This can be avoided by splitting the file into smaller pieces and having each thread take the next available piece when it finishes the one it is working on. (Of course, you then have to coordinate the work queue, but that's only one synchronization per chunk of work, so it's not a lot of overhead.) However, I didn't recommend that above because the input file is not so huge that it can't be divided into a reasonable number of pieces up front. Since the actual start and end of each piece of work have to be found by an actual scan, there is some overhead associated with every chunk of work, and the more chunks there are, the more overhead. If the chunks are small enough, some of them will contain no actual work at all. Getting the tuning parameters right requires more knowledge about the size of the work units than is revealed in the question.
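
As a sketch of that variant (reusing parse_range() and FILENAME from the sketch above): cut the file into many fixed-size chunks and let a Pool hand the next chunk to whichever worker finishes first. The chunk size is the tuning parameter just described, and the right value depends on how large the individual sections turn out to be.

    import os
    from multiprocessing import Pool

    CHUNK_SIZE = 256 * 1024    # tuning parameter: smaller chunks balance load better,
                               # but every chunk pays the scan-to-first-section overhead

    if __name__ == "__main__":
        size = os.path.getsize(FILENAME)
        ranges = [(offset, min(offset + CHUNK_SIZE, size))
                  for offset in range(0, size, CHUNK_SIZE)]
        merged = {}
        with Pool() as pool:   # one worker process per CPU by default
            for partial in pool.imap_unordered(parse_range, ranges):
                for name, entries in partial.items():
                    merged.setdefault(name, []).extend(entries)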

